DNA: The next big thing for data storage?
Remember floppy disks?
Maybe you don’t. Or maybe you’ve blocked them from your consciousness. Floppies were memory storage disks introduced by IBM in the 1970s; the fancy 5 ¼-inch ones held–sit down for this–360 kilobytes of data.
What could you put on one of those magical disks, which were so cool that you could put them in your pocket and walk from computer to computer? Does anyone even know what a kilobyte is anymore? Well, you could not put a typical mp3 song on one–those puppies suck up 3 megabytes. No, we’re talking somewhere on the order of a few Word documents.
Anyway, so floppies had their limits, and data storage eventually moved on to CDs, then DVDs, then Zip drives, Jaz disks, and then SD cards. USB flash drives reigned supreme–after all, you could clip one (or a dozen, you spaz) to your key chain–until companies like Amazon, Google, and Dropbox unveiled cloud storage to the masses in 2011.
Hand over $9.99 a month, get 2 terabytes of storage. Yeah, the cloud is cool and all, but it has a few problems–chief among them is that we may soon run out of it. Because, you see, data stored on the cloud are not actually on a “cloud” (which would be totally cool if true, BTW). They are parked in data centers spread around the world, mostly in Europe, the U.S., and the Asia-Pacific, often in places where electricity is cheap, because those data centers need to be kept cold.
“When we think of cloud storage, we think of these infinite stores of data,” Hyunjun Park, CEO and co-founder of data storage company Catalog, told Digital Trends. “But the cloud is really just someone else’s computer.”
That’s right–while you can supersize your cloud account in one click, humanity is nearing the apocalyptic day when digital storage will run dry. Plus, even if there were a place to put all your Insta photos, those “places” break down–rotating disks generally work for just five years before kicking the bucket, for example; tapes, a little longer. So, basically, in other words, to wit, more or less: we’re doomed.
Or are we?
Because in the race to place more data onto ever-smaller pieces of hardware, scientists like Park have turned to one of the smallest pieces of hardware imaginable: deoxyribonucleic acid.
Measuring storage in petabytes per gram
Clelland et al. demonstrated data storage on DNA in 1999, when they encoded and then recovered a 23-character message from a piece of DNA. In 2013, the message length shot up to 739 kB. While computer memory is stored as 1s and 0s, DNA storage maps those bits onto the four bases: here, A is coded as 00, C is 01, G is 10, and T is 11.
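The two-bits-per-base mapping is simple enough to try at home. Here is a minimal sketch in Python, assuming each byte is written as four bases with the most significant bit pair first (that ordering, and the function names, are illustrative choices, not something the researchers specified):

```python
# Two-bit-per-base mapping from the text: A=00, C=01, G=10, T=11.
BASES = "ACGT"  # the index of each letter equals its 2-bit value


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_bytes(seq: str) -> bytes:
    """Invert bytes_to_dna; the sequence length must be a multiple of four."""
    if len(seq) % 4:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    strand = bytes_to_dna(b"Hi")
    print(strand)                # CAGACGGC
    print(dna_to_bytes(strand))  # b'Hi'
```

A naive mapping like this can produce long runs of a single base, which is one reason practical schemes add a screening step on top.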
The appeal of DNA, explained Yaniv Erlich and Dina Zielinski in Science, is its capacity, density, and durability. Remember that floppy? Its lifespan was about the same as your car battery’s. And all the music it could (not) store? With DNA, the lifespan is measured in centuries and storage is measured in petabytes–that’s a million gigs–per gram. Put another way, 100 million HD movies stored on DNA would take up the same amount of space as a pencil eraser, noted Synbiobeta.
But, as Erlich and Zielinski found, the process has some big red flags that need to be worked through:
- The repetition in coding procedures generated a massive loss of information content, since many of the nucleotides carry basically the same data.
- Repetition is not scalable. As the files grow, even a small dropout probability can mean corruption of the stored info down the line.
- There are small gaps in the retrieved information.
Their response? The “DNA Fountain.” The fountain, as the team described, preprocesses a binary file into a series of non-overlapping segments, then iterates over two computational steps, and finally packages the output of each iteration as a “droplet” before transmission over a noisy channel. By sequencing and decoding the oligo pool, the team recovered the entire input file error-free.
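For the curious, here is a toy sketch of the fountain idea in Python. It is not the authors' implementation. It assumes the two computational steps are (1) a Luby-transform-style step that XORs a seed-selected subset of segments and (2) a screening step that rejects sequences with skewed GC content or long single-base runs. The 2-byte seed, the degree choices, the thresholds, and all function names are illustrative, and the error-correcting codes a real system would add are left out.

```python
import random
from typing import Dict, Iterable, Iterator, List, Set, Tuple

BASES = "ACGT"  # 2-bit mapping: A=00, C=01, G=10, T=11


def to_dna(data: bytes) -> str:
    return "".join(BASES[(b >> s) & 3] for b in data for s in (6, 4, 2, 0))


def from_dna(seq: str) -> bytes:
    vals = [BASES.index(c) for c in seq]
    return bytes(
        (vals[i] << 6) | (vals[i + 1] << 4) | (vals[i + 2] << 2) | vals[i + 3]
        for i in range(0, len(vals), 4)
    )


def chosen_segments(seed: int, k: int) -> List[int]:
    """Re-derive, from the seed alone, which segments a droplet combines."""
    rng = random.Random(seed)
    degree = min(k, rng.choice((1, 1, 2, 2, 3, 4)))  # toy degree distribution
    return rng.sample(range(k), degree)


def passes_screen(seq: str, max_run: int = 3) -> bool:
    """Reject oligos with skewed GC content or long single-base runs."""
    gc = sum(c in "GC" for c in seq) / len(seq)
    if not 0.4 <= gc <= 0.6:
        return False
    return not any(base * (max_run + 1) in seq for base in BASES)


def droplets(segments: List[bytes], max_seed: int = 1 << 16) -> Iterator[str]:
    """Yield screened oligos: a 2-byte seed, then the XOR of the chosen segments."""
    k = len(segments)
    for seed in range(max_seed):
        payload = bytes(len(segments[0]))
        for i in chosen_segments(seed, k):
            payload = bytes(a ^ b for a, b in zip(payload, segments[i]))
        oligo = to_dna(seed.to_bytes(2, "big") + payload)
        if passes_screen(oligo):
            yield oligo


def recover(oligos: Iterable[str], k: int) -> List[bytes]:
    """Peeling decoder: consume oligos until all k segments are known."""
    known: Dict[int, bytes] = {}
    pending: List[Tuple[Set[int], bytes]] = []
    for oligo in oligos:
        raw = from_dna(oligo)
        seed = int.from_bytes(raw[:2], "big")
        pending.append((set(chosen_segments(seed, k)), raw[2:]))
        progress = True
        while progress:
            progress = False
            remaining = []
            for idxs, payload in pending:
                # Strip every already-known segment out of this droplet.
                for i in [j for j in idxs if j in known]:
                    payload = bytes(a ^ b for a, b in zip(payload, known[i]))
                    idxs = idxs - {i}
                if len(idxs) == 1:
                    known[next(iter(idxs))] = payload
                    progress = True
                elif idxs:
                    remaining.append((idxs, payload))
            pending = remaining
        if len(known) == k:
            return [known[i] for i in range(k)]
    raise ValueError("ran out of droplets before decoding finished")
```

The point of the design is that no single droplet is special: the decoder just keeps reading oligos until it has enough to peel out every segment, so losing a few along the way does not corrupt the file.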
So how much does it all cost? That’s a huge issue, but one which is at least in the process of being solved. As Erlich and Zielinski noted, “The reduction of DNA synthesis exceeds Moore’s law, meaning that large scale DNA storage might be economically feasible in the next three years.” (Erlich and Zielinski published the DNA Fountain report in September 2016, so that prediction was somewhat ambitious.) Currently, using DNA to store 1 minute of high-quality stereo sound would cost almost $100,000, according to Wired.
Park is at the forefront of making the DNA storage effort more feasible. Catalog, Wired added, is “building a machine that will write a terabyte of data a day, using 500 trillion molecules of DNA.” Park says the cost of the Catalog approach could come down by separating the writing of DNA from the process of encoding it, and by cheaply creating large quantities of short DNA molecules–30 base pairs, rather than the 200 base pairs used in previous attempts. Using this process, DNA storage could approach tape storage cost-wise within a few years.
Turning to oPools for low error rates, low dropout rates
IDT is at the leading edge of research into this new form of data storage, said Adam Clore, the company’s director of synthetic biology.
“We have high-quality DNA that can be reused–our quality is unsurpassed–and we synthesize at a scale large enough so there can be quite a bit of savings in there,” said Clore. “If other companies make enough to be used once, and we make a million times more, we have a million-fold advantage.”
IDT’s foothold in the DNA storage effort is oPools Oligo Pools, which offer high fidelity, uniformity, low error rates, and low dropout rates.
How does it work? As noted, scientists translate the 1s and 0s of data into As, Ts, Cs, and Gs, then synthesize this code onto a molecule. To retrieve the data, PCR hunts for the targeted section of the sequence, which is then replicated, sequenced, decoded, and adjusted for errors. Because the process is error prone, redundancy is used to ensure the correct data are read, a step earlier methods did not use.
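One simple way redundancy can clean up errors is a majority vote across several reads of the same oligo. The sketch below is only an illustration of that idea, assuming the reads are already aligned, have equal length, and contain substitution errors only. Real pipelines also have to cope with insertions and deletions and typically rely on error-correcting codes, and the `consensus` function name is hypothetical.

```python
from collections import Counter
from typing import List


def consensus(reads: List[str]) -> str:
    """Column-wise majority vote over equal-length reads of the same oligo."""
    if not reads:
        raise ValueError("need at least one read")
    if len({len(r) for r in reads}) != 1:
        raise ValueError("reads must have equal length (indels are not handled)")
    # zip(*reads) walks the reads position by position.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))


if __name__ == "__main__":
    reads = ["ACGTAC", "ACGTTC", "ACCTAC", "ACGTAC", "GCGTAC"]
    print(consensus(reads))  # ACGTAC
```

Each of the three noisy reads above has a different single-base error, so the vote at every position still lands on the original base.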
But while DNA data storage is possible, there are fundamental problems that must be overcome before the method can go mainstream.
Currently, said Clore, oPools prices are roughly a penny per base. Each base can hold at most two bits of data, and with anyone in America able to walk into Best Buy and purchase a 1TB hard drive for $50, the economics are just not there. Added to that is the cost of reading data stored on DNA: currently, that requires an Illumina sequencer, which will run you about a quarter-million dollars, plus $4,000 to $5,000 per run.
“The cost is astronomical,” he said. “Even with the cheapest systems, the cost (to use DNA) would have to come down 1,000-fold.” Further, he added, reading data from DNA takes much longer than doing the same thing from a regular hard disk or solid-state disk drive.
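A back-of-the-envelope calculation shows how wide the synthesis gap is at the penny-per-base price quoted above. The sketch assumes two bits per base (the ceiling for a four-letter alphabet), a decimal terabyte, no redundancy, and no reuse of material, so treat the result as an order-of-magnitude illustration only. The function name is hypothetical.

```python
def synthesis_cost_usd(n_bytes: float,
                       usd_per_base: float = 0.01,
                       bits_per_base: float = 2.0) -> float:
    """Rough synthesis cost: one base per `bits_per_base` bits, flat price per base."""
    bases = n_bytes * 8 / bits_per_base
    return bases * usd_per_base


if __name__ == "__main__":
    one_terabyte = 1e12  # bytes, decimal definition (assumption)
    dna_cost = synthesis_cost_usd(one_terabyte)
    hard_drive_cost = 50.0  # the $50 drive mentioned in the text
    print(f"DNA synthesis for 1 TB: about ${dna_cost:,.0f}")
    print(f"Ratio to a $50 hard drive: about {dna_cost / hard_drive_cost:,.0f}x")
```

Under these assumptions one terabyte needs roughly 4 trillion bases, or about $40 billion of synthesis. Cheaper synthesis methods and reuse would shrink that number, which is why the figure here should not be read as the gap Clore describes for the cheapest systems.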
But that’s not to say there is not a use for DNA storage, he added. DNA storage could prove to be perfect for someone who needs to store data for long periods of time and access it infrequently–and by long periods, we are talking dozens, if not hundreds of years. Because DNA is stable for millennia, you can store data on it practically forever. The fact that IDT makes enough material to be reused many times may also drive down costs. And because it is nature which provides the data reader, not Best Buy, customers won’t need the equivalent of a crazy old disc reader next to their desktops hundreds of years in the future to get to it.
A breakthrough in data storage: archival photos of Marlon Brando
If DNA data storage ever does go mainstream, you might have Marlon Brando to thank. Or at least, Olgica Milenkovic’s admiration for him.
Dr. Milenkovic, a professor of electrical and computer engineering at the University of Illinois at Urbana-Champaign, is one of the pioneers of the effort to store data on DNA. Her first paper, in 2015, used IDT’s gBlocks Gene Fragments to demonstrate random access and information rewriting on a DNA-based storage system. Her second used gBlocks and an Oxford Nanopore Sequencer 7.0 to encode text-oriented files from Wikipedia, then extended that to images.
Brando made DNA storage history with Milenkovic’s third paper. In that study, her team used oPools to encode eight Brando images in a more cost-effective approach to storage.
“I was very biased,” she said of the decision to use Brando. “I always loved his movies.”
“We had some errors when we synthesized the oPools,” said Pan. “We removed most of the errors in the oPools without resorting to costly coding redundancy. Instead, we used machine learning methods such as discoloration detection and image inpainting via neural networks.”
‘With oPools, you get a sense that every oligo is there’
Jeff Manthey, IDT’s senior bioinformatics developer, began working with IDT founder Dr. Joseph Walder on oPools DNA data storage in 2016.
“Joe had the idea of how to leverage IDT’s ability to make large quantities of high-quality oligos and then, through a process of reuse, fabricate the distinct constructs that make up bits of info into DNA fragments,” he said. “One of the problems we ran into was when dealing with large amounts of oligos at low quantities, the ability to effectively aliquot and dispense is more difficult, and it is harder to ensure you are getting adequate coverage across every oligo that might be present.”
The work at IDT showed that oPools could be used effectively for long-term storage of a small document without the need to worry about corruption.
“With oPools, you get a sense that every oligo is there–you use oPools if you want the highest quality assurance,” he said. In contrast, with cheaper oligos, it is practically necessary to verify that each document is encoded correctly and present without any errors.
Another issue, said Manthey, is eventual decoding. There is not yet a standard, so anyone encoding data must retain their own encoding strategy, and must keep it somewhere other than in the DNA itself.
“The code may not be in a form that can be easily run,” he said. “For example, try installing a computer game from the mid-90s on a modern computer or laptop. And as companies are bought out or merged, the data stored in DNA along with the means to decode needs to survive those transitions. Unless encoding/decoding standards are established for the act of writing and reading DNA formatted data, that risk will remain.”
In the summer of 2019, Park’s Catalog announced that it had stuffed all 16 GB of Wikipedia onto DNA strands that amounted to about a drop of liquid. Catalog’s DNA writing machine now writes data at 4 megabits per second, though the company hopes to speed that up by at least 1,000 times. The unceasing accumulation of data, combined with the market for conventional DNA sequencing products, means the cost of DNA data storage should drop.
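To get a feel for those write speeds, here is a quick calculation. It assumes decimal units (16 GB = 16 × 10^9 bytes, 1 megabit = 10^6 bits) and a constant write rate, neither of which the announcement spells out. The function name is hypothetical.

```python
def write_time_seconds(n_bytes: float, megabits_per_second: float) -> float:
    """Time to write `n_bytes` at a constant rate given in megabits per second."""
    return n_bytes * 8 / (megabits_per_second * 1e6)


if __name__ == "__main__":
    wikipedia = 16e9  # bytes (assumes decimal gigabytes)
    now = write_time_seconds(wikipedia, 4)
    goal = write_time_seconds(wikipedia, 4 * 1000)
    print(f"At 4 Mbit/s: about {now / 3600:.1f} hours")
    print(f"At 1,000 times that: about {goal:.0f} seconds")
```

Under these assumptions, the Wikipedia-sized payload takes roughly nine hours at today's rate and about half a minute at the hoped-for rate.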
The Catalog breakthrough shows that DNA data storage technology is progressing–to a point. For now, hold on to your thumb drives and cloud storage account.
“We are getting closer,” said Clore, “but there are still some big hurdles that need to be figured out.”