Could DNA be the information storage solution we desperately need?
Well yes, of course it can. Since the first single-celled organisms appeared on the planet 3.5 billion years ago, deoxyribonucleic acid, or DNA, has contained the information to initiate and sustain life in almost every living thing. Although interesting, this isn’t news. Instead, the exciting, recent developments involve using this already-proven method for our own digital storage needs. So how can this ancient molecule be a viable option to solve our current information archiving crisis?
Inside the cells of every type of organism sits a recipe book for all the proteins it can make. Proteins are vital for cell survival, from the enzymes required for chemical reactions to the hormones and receptors required for cell communication and the structural components that hold cells, tissues and organs together. Each recipe, or gene, can be turned on and off by each specialised cell type depending on the organism’s needs at the time. As such, DNA contains not only the information to make a protein, but also the control systems that determine when and how it is synthesised. Additionally, the DNA in an individual is inherited from both parents, so the genetic data includes an important source of variation and determines physical characteristics such as eye colour and susceptibility to some diseases. DNA, and the information contained within it, is therefore critical to the survival of living things.
The data are programmed within DNA as a string of 4 bases – Adenine, Thymine, Guanine and Cytosine – and every cell contains the mechanisms required to read and translate this code to make proteins. These bases are ordered and fixed into a long strand, which then pairs up with a complementary strand and twists to form the famous double helix structure. Analogous to digital information being encoded in binary as 1s and 0s, genetic information is stored in the order of As, Ts, Gs and Cs along each backbone of the double helix. The number of bases, and therefore the amount of information stored in each cell, varies between species, with the smallest bacterial genome containing 1.1×10⁵ base pairs (bp) and the largest known genome, that of the whisk fern, containing 2.5×10¹¹ bp. (By the way, it’s interesting to note that the size of a genome and the complexity of the organism don’t have a direct relationship, a phenomenon known as the C-value paradox.) Humans sit roughly in the middle of this scale, with each cell containing 3 billion base pairs. To put this into context, each human cell contains around 2 m of DNA – the equivalent of 1.5 GB of data – packed into a cell nucleus roughly 6 thousandths of a millimetre (6×10⁻⁶ m) in diameter. If we look at the human organism as a whole, composed of 50–100 trillion cells (estimates vary), there’s enough DNA in a person to stretch to the Sun and back several hundred times. Moreover, the data capacity of the average human is a jaw-dropping 150 zettabytes (1 ZB = 10²¹ bytes).
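The 1.5 GB figure can be checked with a quick back-of-envelope calculation. This is only a sketch: it assumes 2 bits are enough to label one of the 4 bases, and that each cell carries two full copies of the 3-billion-bp genome (one from each parent).

```python
# Back-of-envelope: how much digital data does one human cell hold?
# Assumptions (stated above, not exact biology): 2 bits per base pair,
# two copies of the ~3-billion-bp genome per cell.
BITS_PER_BASE = 2
HAPLOID_BP = 3_000_000_000   # ~3 billion base pairs per genome copy
COPIES_PER_CELL = 2          # one set inherited from each parent

bits_per_cell = BITS_PER_BASE * HAPLOID_BP * COPIES_PER_CELL
gigabytes_per_cell = bits_per_cell / 8 / 1e9
print(f"{gigabytes_per_cell:.1f} GB per cell")  # → 1.5 GB per cell
```

Multiplying that by 100 trillion cells gives roughly the 150 ZB per person quoted above.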
So, with this ready-made information storage mechanism already present 100 trillion times over within our own bodies, along with the proteins required to synthesise and translate the code, what’s to stop us adapting this technology for our own purposes? In the last 25 years, man-made technologies for writing (synthesising) and reading (sequencing) DNA have become much faster, cheaper and more widely available. However, both still suffer from high error rates and remain too expensive for universal use. On the other hand, unless all forms of life on the planet were to die out overnight, the medium will never become obsolete, and we know individual DNA molecules can survive for thousands of years without damage. Do the benefits of this innovative mechanism make it a viable contender to solve our current data storage problem?
We are in a global data crisis. Current estimates suggest we will have 44 trillion GB of information in our digital archives by 2020 – a tenfold increase since 2013. This includes everything from Facebook profiles and Instagram posts to every scientific paper and blog entry. The thing is, if this were all stored on silicon microchips (such as those inside memory sticks), by 2040 we would need a predicted 10–100 times more silicon than is expected to be available. Instead, a large amount of the digital archive is still stored on magnetic tape, which holds information at a higher density, although it is slower to retrieve when needed. Although this is more sustainable than silicon, it will soon become far too expensive and energy-hungry to build and maintain the facilities needed to store the miles of tape required. Assuming we want to keep it all, we are in desperate need of a better way to save our masses of data.
It was while discussing this issue in 2011 that Nick Goldman and his colleagues came up with a solution. Goldman works at the European Bioinformatics Institute where they deal with massive amounts of data every day as laboratories around the world constantly submit DNA sequences from thousands of species. Scribbling on napkins in a bar after a day’s work, the team realised that the “sci-fi idea” of using DNA itself to store the data was not just brilliant, but it could actually work! They then designed a simple experiment and by coding 739kB of data including Shakespeare’s sonnets, a picture of the building where they worked, an audio clip of Martin Luther King’s “I Have a Dream” speech and a PDF of Watson and Crick’s paper unveiling the structure of DNA, they proved that the idea did in fact work.
The process is as follows. A digital file made up of binary 1s and 0s is converted into a base-3 code of 0s, 1s and 2s. The first digit is arbitrarily assigned to one of the four DNA bases, and each subsequent 0, 1 or 2 is assigned to one of the three bases that differ from the previous one. For example, if the first base was T, the next 0, 1 and 2 would represent A, G and C; if that base was a G, it would be followed by an A, T or C. This ensures there are no repeated letters, which are a common source of sequencing errors. Overall, each byte of digital data (a string of eight 1s and 0s, equivalent to a character of text) ends up represented by around 5 DNA bases. A sample is made up of many short strands of around 100 bases, offset from each other by 25, 50 or 75 bases, along with a short index sequence to help order the fragments correctly. This results in 4 overlapping copies of the entire code, which allows errors to be corrected during sequencing.
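The rotating code can be sketched in a few lines of Python. This is illustrative only, not Goldman and colleagues’ exact scheme (which also uses Huffman coding to average about 5 bases per byte, plus indexing; here each byte is simply padded to a fixed 6 trits). Each trit selects one of the three bases that differ from the previous base, so the output never repeats a letter:

```python
# Sketch of a rotation-based trit-to-base code (illustrative, not the
# exact published scheme): each trit (0, 1 or 2) picks one of the three
# bases that differ from the previously written base, so the resulting
# DNA string never contains two identical bases in a row.
BASES = "ACGT"

def byte_to_trits(byte):
    """Convert one byte (0-255) into 6 base-3 digits (3^6 = 729 >= 256)."""
    trits = []
    for _ in range(6):
        trits.append(byte % 3)
        byte //= 3
    return trits[::-1]

def trits_to_dna(trits, prev="A"):
    """Map each trit to a base, always avoiding the previous base."""
    dna = []
    for t in trits:
        choices = [b for b in BASES if b != prev]  # 3 candidates
        prev = choices[t]
        dna.append(prev)
    return "".join(dna)

encoded = trits_to_dna(byte_to_trits(ord("A")))
print(encoded)  # 6 bases with no two adjacent bases identical
```

Decoding simply reverses the lookup: knowing the previous base, each received base maps back to a unique trit.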
Scientists have a long history of programming data into DNA, mainly out of curiosity, just to see if it could be done. In 1988, the artist Joe Davis, working in a lab at Harvard, became the first person to convert binary data into a DNA sequence, encoding a 35-bit image of a Germanic rune representing life and female Earth, which was then inserted into the genome of an E. coli bacterium. Since then, scientists around the world have converted information from all media into DNA and been surprised by the results. My favourite example actually predates this: in 1986, the husband-and-wife team Susumu and Midori Ohno converted Chopin’s Nocturne Op. 55 No. 1 into a DNA sequence using their own coding method. They found the sequence to be similar to part of the gene coding for a DNA polymerase, one of the key enzymes involved in DNA synthesis. This completed a very satisfying “circle of life”: DNA contains all the information to essentially create Chopin, who then composes a sequence used to create DNA.
Looking forward to how this technology will be used in the future, the most obvious benefit is the compact nature of DNA. Data can be stored in DNA at a density of 10¹⁹ bits per cm³, compared to 10¹⁶ bits per cm³ in a standard flash memory stick and 10¹³ bits per cm³ in a hard disk. Allegedly we could store all of the world’s digital information in 1 kg of DNA. It’s hard to visualise something so tiny, but according to Goldman, a vial of DNA around the size of your little finger can hold as much information as 1 million CDs. Using this analogy, the entire digital archive could fit in the back of a van.
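The 1 kg claim can be sanity-checked with rough chemistry. This is a sketch under stated assumptions: an average nucleotide mass of about 330 g/mol for single-stranded DNA, a theoretical 2 bits per nucleotide, and no redundancy or indexing overhead.

```python
# Rough check of the theoretical capacity of 1 kg of DNA.
# Assumptions: ~330 g/mol per nucleotide (single strand), 2 bits
# stored per nucleotide, zero redundancy or addressing overhead.
AVOGADRO = 6.022e23          # molecules per mole
NT_MASS_G_PER_MOL = 330.0    # approximate average nucleotide mass
BITS_PER_NT = 2

nucleotides_per_kg = 1000.0 / NT_MASS_G_PER_MOL * AVOGADRO
zettabytes = nucleotides_per_kg * BITS_PER_NT / 8 / 1e21
print(f"~{zettabytes:.0f} ZB per kg")
```

The result comes out in the hundreds of zettabytes, comfortably above the 44 trillion GB (44 ZB) archive mentioned earlier, so the order of magnitude of the claim holds up even before real-world overheads cut into it.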
Furthermore, as long as the DNA is kept cool, dry and in the dark, samples can be kept for thousands of years without much effort. The DNA of ancient plants and animals has been retrieved and sequenced from fossils up to 700,000 years old, and it is hoped that with carefully prepared samples this could be extended. Underground seed-preservation facilities already exist in the Arctic Circle, and the correct conditions can be maintained at a fraction of the cost required to power a magnetic tape storage facility. Unlike most digital data storage, the information is less subject to bit errors, where digits in a binary code switch (i.e. a 1 becomes a 0) over time as the physical state of the storage medium degrades. This means there is less need for backups, which consume large amounts of time and energy in most data storage centres. Additionally, DNA is unlikely ever to become an obsolete format, since we’ll always have a scientific interest in our genome. Even in an apocalyptic situation, Goldman hopes that future humans will find this DNA, realise it’s completely different from any natural DNA ever found, and investigate further.
However, there are still many issues with the technology, meaning it’s unlikely to become mainstream any time soon. First and foremost is the cost. In 2013 it was estimated that coding 1 MB of data as DNA cost $12,400, with an additional $220 per MB to read it back. With any luck, the technology to synthesise DNA will develop in a similar way to sequencing, which advanced even faster than Moore’s law (the rate of computer progress) predicted. The Human Genome Project – the global effort to sequence all of the DNA in a human, running from 1990 to 2003 – cost $3 billion and required hundreds of scientists with very sophisticated lab equipment. Today the same process costs around $1,000 and can be completed in a couple of hours using a device that plugs into the USB port of a laptop.
The technologies used for both synthesis and sequencing still have high error rates, but the coding methods discussed earlier help to limit this. By making multiple copies of each sequence and by forming codes with no repeated bases, the machines make fewer mistakes, and any errors that do occur can be identified and fixed more easily. This still isn’t perfect – in Goldman’s original tests some sequence was indeed lost – but the error rates are far lower than the same machines achieve when reading natural human DNA.
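The fourfold redundancy described earlier lends itself to simple majority voting at each position. The sketch below uses invented, already-aligned reads; real DNA storage pipelines must first align the overlapping fragments and typically layer on more sophisticated error-correcting codes.

```python
# Sketch: recover a sequence from four noisy, already-aligned reads by
# taking a majority vote at each position. The reads and the errors in
# them are invented for illustration.
from collections import Counter

def majority_consensus(reads):
    consensus = []
    for column in zip(*reads):                      # one position at a time
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)

reads = [
    "ACGTACGT",
    "ACGTACGT",
    "ACCTACGT",   # sequencing error at position 2
    "ACGTAGGT",   # sequencing error at position 5
]
print(majority_consensus(reads))  # → ACGTACGT
```

With four copies, any single-read error at a position is outvoted three to one, which is why isolated sequencing mistakes rarely survive into the decoded file.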
Finally, unlike microchips, DNA cannot be read like the random access memory (RAM) in a computer, which is quick and simple to access. To retrieve the data from a DNA molecule, the whole sequence must be decoded. This can be mitigated by preparing small, well-labelled samples indexed by the information each contains, but that still requires a database, which can become tricky at the volumes of data we could be concerned with.
In July 2016, researchers including a team at Microsoft and the University of Washington stored a massive 200MB of data in DNA, the highest volume of information used in this field to date. This included the 100 top literary classics, the Universal Declaration of Human Rights, a database of known seeds and an HD music video. The team are now working on applying computer science tricks such as error correction calculations to help advance the technology and improve accuracy.
It has to be said that this won’t be a conventional method for data storage any time soon, and we’ll definitely still be buying Blu-rays for a while longer. However, at the rate at which we’re producing data and at which the technology is progressing, I’d keep your eyes open in the next few years for something which could be a very exciting advance for both biology and digital archiving. So yes, DNA could be used for information storage in the modern sense, just as it has been doing in organisms for the last few billion years...