DNA - storing masses of data
It’s one thing to store a PC’s worth of data - records, finances etc., but what about on a much bigger scale? Tim Revell spoke to Tim Cutts, head of scientific computing at the Wellcome Sanger Institute in Cambridge, where a large amount of the human genome was sequenced. First up, Tim R asked Tim C how DNA and genomes relate to the subject of data storage...
Tim C - Your genome is actually data. It is the software which your cells use to build you and run you. In order to understand human disease well, we need to sequence genomes, compare them with what people’s medical state might be, and from that we can determine what mistakes might be in the genome which lead to a particular disorder. That’s what the Sanger Institute does.
Tim R - When you say sequencing, that’s reading and understanding this software that is written in our genome?
Tim C - Right. And the technology that’s being used to do that has become enormously faster, very quickly. The original human genome project took around ten years and cost a billion dollars. It was a massive international project. The Sanger Institute now has the capability of sequencing about 60 whole human genomes a day.
Tim R - That’s a lot of software to get through!
Tim C - It is an awful lot of software for us to get through.
Tim R - If you have ever tried to read software, it’s actually quite difficult unless someone’s very nicely written next to it what all the lines of code mean. But with DNA there aren’t notes written next to each line of code, so how do you go about understanding what’s written in the DNA?
Tim C - The first thing we have to do is gather a lot of samples because, if we only have one and we compare with what your medical condition might be, we will find that there are lots of places where you are different from someone else. But we don’t know which of those changes actually causes your issue so we have to do it lots of times for many people, and then we get an idea of well, we’ve seen that one before, now we know where we’re going. So that’s where the scale comes from, we have to compare a lot of these things in order to find an answer.
Tim R - What are the challenges in that? There’s a lot of information in a genome and you’ve got a lot of genomes.
Tim C - The first one is that the genome is quite large - it’s about 3 billion letters long. And we also have to sequence it multiple times in order to get a good sense of what each genome looks like, so we end up with about 50 gigabytes of data for each genome.
Tim R - Wow! That’s a lot of information.
Tim C - Then you’re doing 60 of those a day and that starts to build up into quite a large dataset.
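To put rough numbers on that build-up, here is a back-of-the-envelope sketch using only the two figures quoted above (about 50 gigabytes per genome, about 60 genomes a day). The decimal unit conversion (1,000 gigabytes to a terabyte) and the function names are assumptions made for illustration, not the Institute’s own accounting.

```python
# Figures quoted in the interview (approximate).
GB_PER_GENOME = 50
GENOMES_PER_DAY = 60

# Assumption: decimal units, 1 TB = 1000 GB.
GB_PER_TB = 1000


def daily_output_tb(gb_per_genome=GB_PER_GENOME, genomes_per_day=GENOMES_PER_DAY):
    """Raw sequencing output per day, in terabytes."""
    return gb_per_genome * genomes_per_day / GB_PER_TB


def yearly_output_tb(days=365):
    """Raw sequencing output over a year of continuous running, in terabytes."""
    return daily_output_tb() * days


print(f"{daily_output_tb():.1f} TB per day")    # 3.0 TB per day
print(f"{yearly_output_tb():.0f} TB per year")  # 1095 TB per year
```

On these assumptions the sequencers alone would produce roughly 3 terabytes a day, or about a petabyte a year, before any analysis results are stored.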
Tim R - With that dataset, what do you do? I mean, that’s not just humans reading each line and hoping to find a pattern, how do you extract any sort of meaning from that?
Tim C - As we were hearing earlier, we also have a supercomputer to do it. We have to do very similar sorts of calculations. We’re quite lucky: our calculations are mostly of the kind where each individual processor can work on a separate problem at the same time - a so-called embarrassingly parallel problem. But that’s essentially what we do and those things are running, just as we heard earlier. We have programmers whose job it is to write the code to do that analysis.
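As an illustration of what “embarrassingly parallel” means, the sketch below compares several made-up DNA samples against a made-up reference, with each comparison running independently of the others. The sequences, the function names and the use of a thread pool are all assumptions chosen to keep the example small; a real analysis would be far more involved and would spread CPU-heavy work across separate processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical reference sequence, invented for this example.
REFERENCE = "ACGTACGTAC"


def count_differences(sample: str) -> int:
    """Count the positions where a sample differs from the reference."""
    return sum(1 for a, b in zip(sample, REFERENCE) if a != b)


def analyse_all(samples: list[str], workers: int = 4) -> list[int]:
    """Analyse every sample independently; no task needs another's result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with samples.
        return list(pool.map(count_differences, samples))


samples = ["ACGTACGTAC", "ACGTTCGTAC", "TCGTACGTAA"]
print(analyse_all(samples))  # [0, 1, 2]
```

Because no task waits on any other, adding more workers speeds things up almost in proportion, which is why this kind of problem suits a large cluster.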
Tim R - If you were able to understand it all, what would that tell us about being human or what would that be useful for?
Tim C - The big goal for us at the moment really is to design better precision medicine. That’s where we would really like to get to. So if I can find out how you differ from another person I can then say right, you need a particular kind of medicine. That one won’t work for you but that one will and that’s where we’re really trying to get to at the moment.
Tim R - What do you need to get to that?
Tim C - We need a very, very large amount of storage. We currently have 50 petabytes - a petabyte is a thousand terabytes. Many of you will have roughly a 1 terabyte hard disk in your PC at home, so that’s where the 50,000 home PCs comes from. The systems that we’re using have to be extremely fast. Hard disks are actually quite slow; they can only read data at about 100 megabytes a second. So feeding a supercomputer with data at the right speed means that, not just for capacity reasons, you need to use an awful lot of disks just to get the data into the processors. They’re very hungry and they’re very quick.
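The point about disk speed can be made concrete with a small calculation. The sketch below takes the 100 megabytes a second quoted here and the 50 gigabytes per genome mentioned earlier; the target rate of 10 gigabytes a second is a purely hypothetical figure picked for illustration, as are the function names and the decimal unit conversion.

```python
import math

# Figures quoted in the interview (approximate).
DISK_MB_PER_S = 100
GENOME_GB = 50

# Assumption: decimal units, 1 GB = 1000 MB.
MB_PER_GB = 1000


def seconds_to_read(size_gb, disk_mb_per_s=DISK_MB_PER_S):
    """Time for a single disk to read size_gb of data."""
    return size_gb * MB_PER_GB / disk_mb_per_s


def disks_needed(target_gb_per_s, disk_mb_per_s=DISK_MB_PER_S):
    """Disks that must be read in parallel to sustain a target rate."""
    return math.ceil(target_gb_per_s * MB_PER_GB / disk_mb_per_s)


print(seconds_to_read(GENOME_GB))  # 500.0 seconds for one genome from one disk
print(disks_needed(10))            # 100 disks for a hypothetical 10 GB/s
```

So a single disk would take over eight minutes to deliver one genome’s data, and sustaining even the hypothetical 10 gigabytes a second would take a hundred disks reading at once.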
Tim R - Tim, the Sanger was a forerunner to much of this big data processing. What problems did you run into?
Tim C - We quite rapidly discovered that we were trying to do things at a scale that the equipment that we were buying wasn’t really designed for. So we had to work very closely with the vendors to improve both their hardware and their software to do what we needed it to do.
Tim R - What sort of things did you need it to do? Solving those problems - were they very technical or did it turn out to be useful later on as well?
Tim C - A lot of them became widespread solutions, particularly the software solutions. For example, when you’re dividing up the work to run on the large supercomputer, that works a bit like a post office queue - “cashier number one, please”. The machine says I’m ready for work and you give it work. But we found that we were giving it so many tasks to do that it just couldn’t cope with the number we were giving it. So we worked very closely with that particular company and they improved the software and that’s in use in supercomputing centres all over the world.
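The post office queue idea can be sketched in a few lines. This is only a toy model of the general pattern, in which idle workers pull the next task from a shared queue; it is not the vendor’s scheduling software, which the interview does not name, and the function and worker names are hypothetical.

```python
import queue
import threading


def run_queue(tasks, n_workers=3):
    """Hand out tasks to whichever worker is free; return who did what."""
    work = queue.Queue()
    for task in tasks:
        work.put(task)

    handled = {}
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                # "I'm ready for work": take the next task, if any remain.
                task = work.get_nowait()
            except queue.Empty:
                return  # queue is empty, this worker is finished
            with lock:
                handled[task] = name

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled


handled = run_queue(range(10))
print(sorted(handled))  # every task 0..9 was handled exactly once
```

Which worker picks up which task varies from run to run. The scaling problem described above arises when the queue itself has to track far more tasks than it was designed for.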