On Monday, Google released a tool called DeepVariant that uses deep learning—the machine learning technique that now dominates AI—to assemble full human genomes.
And now, engineers at Google Brain and Verily (Alphabet’s life sciences spin-off) have taught a deep learning model to take raw sequencing data and line up the billions of As, Ts, Cs, and Gs that make you you.
Today, you can get your whole genome for just $1,000 (quite a steal compared to the $1.5 million it cost to sequence James Watson’s in 2008).
But the data from today’s machines still yield only incomplete, patchy, and glitch-riddled genomes. Errors can be introduced at each step of the process, which makes it difficult for scientists to distinguish the natural mutations that make you you from random artifacts, especially in repetitive sections of a genome.
See, most modern sequencing technologies work by taking a sample of your DNA, chopping it up into millions of short snippets, and then using fluorescently-tagged nucleotides to produce reads—the list of As, Ts, Cs, and Gs that correspond to each snippet. Then those millions of reads have to be grouped into abutting sequences and aligned with a reference genome.
That’s the part that gives scientists so much trouble. Assembling those fragments into a usable approximation of the actual genome is still one of the biggest rate-limiting steps for genetics.
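To make the alignment step concrete, here is a minimal Python sketch of placing short reads against a reference sequence. It is a toy illustration only: the sequences, the `align_read` helper, and the brute-force, gap-free scoring are assumptions made for this example, and real aligners rely on indexed, gap-aware algorithms to cope with millions of reads.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read, reference):
    """Return (position, mismatches) for the best ungapped placement.

    Every possible offset is tried; ties go to the leftmost position.
    """
    best_pos, best_mismatches = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = hamming(read, window)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


# Hypothetical toy data: a 20-base "reference genome" and three reads.
reference = "ACGTTAGCCATGGATCCGTA"
reads = ["TTAGCC", "CATGGA", "GATCAGTA"]  # the last read differs by one base

alignments = [align_read(read, reference) for read in reads]
for read, (pos, mismatches) in zip(reads, alignments):
    print(f"{read:>10} -> position {pos}, mismatches {mismatches}")
```

The third read lands with a single mismatch. Whether that mismatch is a real variant or a sequencing error is exactly the question a variant caller has to answer.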
DeepVariant works by transforming the task of variant calling—figuring out which base pairs actually belong to you and not to an error or other processing artifact—into an image classification problem. It takes layers of data and turns them into channels, like the colors on your television set.
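The idea of treating stacked reads as an image can be sketched in a few lines of Python. This is a simplified illustration, not DeepVariant's actual encoding: the four channels chosen here (base identity, base quality, strand, and mismatch with the reference), the `encode_pileup` helper, and the toy reads are all assumptions made for the example.

```python
# Hypothetical mapping from base letter to a pixel intensity.
BASE_VALUE = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}
MAX_QUALITY = 40  # assumed cap used to scale base quality into [0, 1]


def encode_pileup(reads, reference, start, width):
    """Turn aligned reads into a rows x width x 4 nested list.

    reads: list of (position, bases, qualities, is_reverse_strand).
    Each row is one read; each column is one reference position
    in the window [start, start + width).
    """
    image = []
    for pos, bases, quals, is_reverse in reads:
        row = [[0.0, 0.0, 0.0, 0.0] for _ in range(width)]
        for i, (base, qual) in enumerate(zip(bases, quals)):
            col = pos + i - start
            if 0 <= col < width:
                row[col] = [
                    BASE_VALUE[base],                           # which base was read
                    min(qual, MAX_QUALITY) / MAX_QUALITY,       # confidence in that base
                    1.0 if is_reverse else 0.0,                 # strand of the read
                    1.0 if base != reference[pos + i] else 0.0, # differs from reference?
                ]
        image.append(row)
    return image


# Toy data: two reads overlapping a window of the reference.
reference = "ACGTTAGCCATG"
reads = [
    (2, "GTTAG", [40, 40, 30, 40, 40], False),
    (4, "TCGCC", [40, 20, 40, 40, 40], True),  # low-quality C where the reference has A
]
image = encode_pileup(reads, reference, start=2, width=8)
print(len(image), len(image[0]), len(image[0][0]))  # rows, columns, channels
```

A tensor like this has the same shape as a small multi-channel picture, so an image classifier can be trained to decide whether the pattern at a given column looks like a real variant or an artifact.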
After the FDA contest, the team transitioned the model to TensorFlow, Google’s artificial intelligence engine, and continued tweaking its parameters, changing the three compressed data channels into seven raw data channels. That allowed them to reduce the error rate by a further 50 percent. In an independent analysis conducted this week by the genomics computing platform DNAnexus, DeepVariant vastly outperformed GATK, Freebayes, and Samtools, sometimes reducing errors by as much as 10-fold.
DeepVariant is now open source and available here: https://github.com/google/deepvariant
Google competes with many other vendors on many fronts. But while its competitors are focused on battling for today’s market opportunities, Google is busy in a solitary race to control the battlefield of the future: the human body.
The human body is the ultimate data center.