Storing secrets in DNA
Last month scientists used CRISPR technology to sneak a strand of DNA into bacteria. The strand encoded information that, once the bug’s DNA was sequenced, could be read back into digital form as a short movie - a famous early film of a galloping horse.
And just last week, a team of researchers announced they had managed to encode a piece of malware into DNA. When this DNA was fed into a sequencing machine, it hacked into the computer programme responsible for analysing the DNA coming from the sequencer.
Nick Goldman at the European Bioinformatics Institute was one of the first people to have the idea of storing digital information in the form of DNA, around five years ago. Kat Arney asked him where the idea came from and how it is done.
Nick - The idea came from meetings where we were talking about the problems of storing all the information in cellular DNA that we now, as scientists, are learning to read and interpret to understand about molecular biology and health, and all the different aspects of genome research. But our work at the institute here includes storing all that information, making it available for the scientific community and that means we need large data centres and we have trouble managing those.
At the end of a long day of meetings we were having a joke in the bar about wasn’t there some other way of storing information, and DNA itself is a way of storing information and that inspired us to hatch this plan of doing an experiment to show that you could store any information in DNA just as you can store any information in your computer or on your phone or any digital device with a memory.
Kat - Because the information that’s in my phone or in my computer that’s in binary – it’s ones and zeros - and the DNA code is four. You’ve got A,C,T and G in nature, that’s the code you’re using.
Nick - In nature, there’s a very specific code in which the As, Cs, Gs and Ts hold the information of how a protein should be made, and a number of other genomic functions as well, controlling what goes on in cells. So that code is no use to us. That code doesn’t store the zeros and ones which we use to make MP3 files or PDFs or whatever it is.
So one of the things we had to do was invent our own code where the input would be the 0s and 1s from a file on the computer. And the output would be a series of As, Cs, Gs and Ts which, combined with the code we had just invented, would allow us to store that zero and one information - the binary information - and later on to recover it again.
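The idea of a code that turns 0s and 1s into As, Cs, Gs and Ts and back can be sketched in a few lines of Python. This is a minimal illustration only: the two-bits-per-letter table below (`BITS_TO_BASE`) and the function names are assumptions made for the example, not the code the team actually invented, which the interview does not spell out.

```python
# Hypothetical mapping chosen for illustration: each DNA letter carries two bits.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn the bytes of a file into a string of DNA letters."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(dna: str) -> bytes:
    """Recover the original bytes from a string of DNA letters."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    strand = encode(b"horse")
    print(strand)          # 20 letters: four per byte
    print(decode(strand))  # the original bytes again
```

A mapping this naive has no protection against errors, which is the problem the conversation turns to next.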
Kat - So then to de-encrypt it, so you take any string of DNA and it will spit out this binary computer code at the other end.
Nick - Not any string of DNA. We put in all sorts of special constraints that would help with error correction and would reduce some of the problems we knew would occur as a consequence of the techniques in molecular biology we would use. So in fact, not any string of DNA would decode to meaningful series of zeros and ones, but essentially any one that was holding information in our code could be converted back.
In a sense, it’s very straightforward. If everything worked with no errors, it would be very, very easy and very fast to do that. And of course, the devil is in the detail and what we actually spent all our effort doing was devising a system that would be robust to the various errors we knew would happen along the way.
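One way to picture a "special constraint" is a rule that no letter may appear twice in a row, since runs of the same letter are a known source of reading errors. The sketch below is an assumption for illustration, not a description of the team's code: it takes digits in base 3 (0, 1 or 2) and, for each digit, picks one of the three letters that differ from the previous letter. The names `trits_to_dna`, `dna_to_trits` and the `start` parameter are hypothetical.

```python
BASES = "ACGT"


def trits_to_dna(trits, start="A"):
    """Encode base-3 digits so that no two neighbouring letters are equal."""
    prev = start  # assumed virtual letter "before" the strand
    out = []
    for t in trits:
        choices = [b for b in BASES if b != prev]  # the three allowed letters
        prev = choices[t]
        out.append(prev)
    return "".join(out)


def dna_to_trits(dna, start="A"):
    """Decode a strand; reject any strand that breaks the no-repeat rule."""
    prev = start
    trits = []
    for base in dna:
        if base == prev:
            raise ValueError("repeated letter: not a valid strand in this code")
        choices = [b for b in BASES if b != prev]
        trits.append(choices.index(base))
        prev = base
    return trits


if __name__ == "__main__":
    strand = trits_to_dna([0, 0, 2, 1])
    print(strand, dna_to_trits(strand))
```

With a rule like this, a strand such as "CAAT" is rejected outright, which matches the point that not any string of DNA decodes to meaningful zeros and ones.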
Kat - And how robust is it? What’s your sort of error rate, and how does that compare with, say, other systems where you really don’t want errors?
Nick - We mostly did a proof of principle experiment so far. And I'm proud to say our error rate is 0, but we were very conservative with the way we devised the code. We didn’t make a code in the way you would for a mass production system. It’s not like in mobile phones, where they know there will be a certain amount of loss and we’re kind of used to that.
We didn’t want to do a demonstration which had a bit of lost information and showed that it was kind of possible. So we were really very, very careful with how robust we made our code to errors, with the intention of getting it entirely right. In the real world for mobile phones, we tolerate a certain amount of distortion and we can live with that. But computer data centres talk about 99.99999 per cent accuracy.
And so, what we’re working towards will be codes that will try and achieve about that level. I mean, 100 per cent accuracy is not realistic, but in the future, for large amounts of information, the idea will be to get exceedingly low error rates. And we’re optimistic that should be possible.
Kat - The thing about this is like you know, you’ve got 0s and 1s binary code, and we’re talking about letters. We’re talking about As, Cs, Ts, and Gs of DNA. But these are molecules. These are chemicals – adenine, thymine, cytosine and guanine - and DNA is a molecule. How do you turn that binary information into a molecule? So, we’re literally talking like a string of DNA that has this sequence. How do you make that?
Nick - Well, there are companies who do exactly that and it’s a difficult process. Humans are not as good at doing that as the cells of living organisms are, but we’re getting there. Roughly speaking now, the state of the art is machines that can make molecules that are about 200 letters long. One of the problems we had to overcome was that one of those is not nearly enough information to encode an MP3 or PDF.
And so, we had to devise a system which would use many fragments of DNA which in combination held all the information we needed. And this introduced yet another problem, which is that when you read them back, they don’t come in any specific order, so we had to add a kind of indexing procedure to it. So that when we did read each piece back, before we decoded the information in that piece, we first read the bit that said, “I belong in file no. 23 and I belong – the information is at position 797 in that file.”
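The indexing procedure can be sketched as follows. This is an illustration under stated assumptions: each fragment is modelled as a Python tuple of (file number, position, payload), whereas in real DNA the index would itself have to be written in letters; the function names and the chunk size are hypothetical.

```python
import random


def split_into_fragments(file_id, payload, chunk_size):
    """Cut one long sequence into short pieces, each tagged with where it belongs."""
    return [
        (file_id, pos, payload[pos:pos + chunk_size])
        for pos in range(0, len(payload), chunk_size)
    ]


def reassemble(fragments):
    """Rebuild every file from fragments that arrive in no particular order."""
    files = {}
    for file_id, pos, chunk in fragments:
        files.setdefault(file_id, {})[pos] = chunk
    return {
        file_id: "".join(chunks[pos] for pos in sorted(chunks))
        for file_id, chunks in files.items()
    }


if __name__ == "__main__":
    pool = split_into_fragments(23, "ACGTACGTTGCA", 5)
    pool += split_into_fragments(24, "TTGACCA", 5)
    random.Random(0).shuffle(pool)  # reading back gives the pieces in mixed order
    print(reassemble(pool))
```

Because every piece carries its own file number and position, the order in which the sequencer returns the pieces does not matter.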
Kat - And how do you then read it back out again?
Nick - So reading it is actually slightly easier. Humans got better at that. And that’s all been driven by genome sequencing projects because there’s a huge demand to read the genomes of living organisms particularly humans, but any living organism has a genome for us to read. And there’s been a number of changes in the technology over the last few years on how we do that and there’s more changes in the pipeline that we know are coming in the next year or so.
But the aim is always the same, that given pieces of DNA we can read those and read back the sequence of letters. Normally, we’re doing that to understand our genomes, look at the differences between one human and the other, to understand the causes of genetic diseases, and so on. But for our purposes, exactly the same technology gives us back the series of letters which is what we need to decode to recover information.
Kat - And it sounds wonderful but not quite as simple as the read, write of a computer’s hard drive.
Nick - It’s not quite as simple, but bear in mind there’s a lot of stuff going on in your computer and in the disk drive that they don’t tell you about because no one wants to know. No one wants to know that there’s constantly errors on their hard disk drive and that there’s a very complex machine in there, moving components around and whizzing things around in circles, and reading them. You don’t need to think about that when you use a hard disk drive because they are now very reliable and so on.
So partly, it is more complicated but a lot of that is simply, we’re learning as humans, as scientists to manipulate DNA and write it and read it. But we’re still not as good at that as we’re going to be in the future. Of course, one of our aims on this project is to make DNA information storing devices which we can use much more like a hard disk drive or USB stick, or something like that.
Kat - So here, you’ve got a huge data centre. I’ve seen it – it’s enormous boxes with blinking lights and all these kinds of things. What would a future DNA storage centre look like? It would just be loads of little tubes?
Nick - Probably or even smaller than that if you could – in the future, we may be thinking you’d be able to read these on something more like a computer chip, so perhaps it would be a small device, maybe like a bank of hundreds or thousands of USB stick type devices. But inside each one, as well as electronic components, there’d be amounts of DNA and there’d be some microfluidics going on inside there that would move the samples to the region where the reader would be.
Perhaps if you needed to recover the information, you’d go in there and pick one off the shelf, or maybe a robot would do it, like in a tape archive centre. It’s mostly robots that go and select the tape. We’d have a robot that would select the little USB stick type device and plug it into a machine that would take a tiny sample out, read the DNA, and use a computer to process that information.
Kat - So one day, all this information that we’re getting about genomes from around the world – humans, animals, pathogens – all that information about DNA, from DNA, could be stored in DNA.
Nick - Yes, strangely and of course, all the other information in the world that we’re storing. All of Facebook’s archives of everything anyone has ever posted, any large data application you can think of. And we’ve had a lot of interest in this technology because the storage of the information has a much smaller energy requirement so it could save money in the long term.
And also, a much smaller amount of space is needed. DNA is an incredibly dense information storage medium and so, the amount of information you can store in something the size of your finger is thousands of times as much as you can store in an equivalent amount of hard disk.
Kat - Nick Goldman from the European Bioinformatics Institute.