What makes proteins such a huge data challenge is that they exist in a nearly infinite variety of combinations. Consider that the 19,599 genes in the human genome can, in turn, produce some 200,000 types of RNA. Each RNA strand then encodes up to 200,000 proteins. That multiplicative expansion gives you 40 billion proteins and helps explain why you need the "big" in Big Data.
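The arithmetic behind that 40 billion figure is straightforward multiplication; a one-line sketch of it (the underlying biology is, of course, far messier than this):

```python
# The article's combinatorial estimate: ~200,000 RNA types,
# each encoding up to ~200,000 proteins.
rna_types = 200_000
proteins_per_rna = 200_000

total_proteins = rna_types * proteins_per_rna
print(f"{total_proteins:,}")  # 40,000,000,000
```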
And then it gets really complex. There can be hundreds of different variants of proteins in a single cell. Biotech firms and researchers use mass spectrometers to identify pieces of proteins. But as sophisticated as the spectrometers are, their dynamic range—or capacity to collect and read different protein concentrations in a sample—is exceeded by the proteins’ complexity by a factor of three or four. “The dynamic range of proteins in a cell is huge,” says Dan Rhodes, head of medical bioinformatics for Life Technologies.
Finding the right protein for your needs has been likened to spotting a bee from space, says Katleen Verleysen, a scientist and the CEO of Pronota, a Ghent, Belgium, proteomics start-up. “It’s a huge challenge, to be quite honest,” she says. “Most protein biomarkers out there were found by accident. There are very limited examples of companies that said, ‘We want to take serendipity out of the equation.’”
Pronota is one of them and has $36.5 million in venture funding, including some from Johnson & Johnson’s VC arm; it is heading for a second funding round. Pronota has four tests in the pipeline, including one that would quickly tell an emergency-room doctor whether a patient who is short of breath is suffering from a heart attack or a lung infection.
Whether or not proteomics yields a big win for investors, it is already expanding the immense potential of cloud computing. “Proteomics creates so much data, you have to have massive computational resources,” says Fintan Steele, a biochemist and spokesman for SomaLogic, in Boulder, Colo. Steele’s company is trying to leapfrog current mass spectrometry to vastly expand the number of biomarkers. “What drives it is the cloud,” he says. “There’s not an easy way to do it if you don’t have the ability to suck the data down.”
And how. If mapping the genome consumed terabytes, proteomics will easily reach into petabytes. (That’s trillions vs. quadrillions of bytes.) Life Technologies recently used cloud computing to analyze RNA sequencing data from 5,000 patients. The process of sequencing that RNA generated about 20 terabytes of data—an estimated 15 computer years of work if your desktop machine were trying to crunch it all. Life Technologies did it in a week. “We envision our sequencing technology being widely deployed,” says Rhodes. “Imagine hundreds of thousands or millions of patients. We are at the very beginning.”
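A rough back-of-envelope check of that speedup, assuming "15 computer years" means 15 years of single-machine work finished in one week (an illustrative estimate, not a figure from Life Technologies):

```python
# 15 years of single-machine compute finished in one week implies
# a parallel speedup on the order of 780x.
DAYS_PER_YEAR = 365.25

compute_days = 15 * DAYS_PER_YEAR   # ~5,479 machine-days of work
wall_clock_days = 7                 # finished in one week in the cloud

speedup = compute_days / wall_clock_days
print(f"Implied parallelism: ~{speedup:.0f}x")
```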
Although the price of sequencing is still relatively high, it is falling. That should tip the cost-benefit analysis in favor of investments in proteomics. The U.S. spends $70 billion on oncology drugs annually and wastes about 40% of it because the drugs don't work. If it instead spent $5 billion to map those patients, says Andrews, perhaps $25 billion could be saved by not treating them with the wrong drugs.
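The arithmetic behind that claim: a 40% waste rate on $70 billion is roughly $28 billion a year, so a $5 billion mapping investment leaves a net figure in the same ballpark as Andrews's "perhaps $25 billion." A quick back-of-envelope sketch using the article's numbers:

```python
# Back-of-envelope on the article's oncology figures.
oncology_spend = 70e9   # annual U.S. oncology drug spend
waste_rate = 0.40       # share spent on drugs that don't work

wasted = oncology_spend * waste_rate
print(f"Wasted annually: ${wasted / 1e9:.0f}B")  # $28B

mapping_cost = 5e9      # hypothetical mapping investment from the article
net_recoverable = wasted - mapping_cost
print(f"Net recoverable: ${net_recoverable / 1e9:.0f}B")  # $23B,
# roughly in line with the article's "perhaps $25 billion"
```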
By itself, beefing up brute computing force is not enough to understand the complex pathways of proteins. The most significant problem is sifting through the deafening amount of protein noise in each cell to find the signals that are important. The bad proteins often represent a small fraction of a cell’s total. “At the end of the day you have to use complex statistical models to identify which changes in proteins drive the separation of the group that has the disease and doesn’t have the disease,” Andrews says. “There are still bottlenecks in processing power.”
And some progress. Pronota plans to introduce tests for early diagnosis of sepsis, which the company says costs the U.S. $16 billion annually in addition to causing 225,000 deaths. That one diagnostic tool could generate sales of $1.5 billion worldwide, Verleysen says. Proteome Sciences recently created a panel of 16 protein biomarkers to monitor disease progression in Alzheimer’s patients.
Ultimately, this kind of individualized medicine based on your personal proteomic and genomic map could lead to what SomaLogic calls a wellness chip, something that could monitor changes in your protein makeup over time by analyzing a drop of blood. All these promising technologies will depend on science’s ability to understand the data our bodies are capable of producing.
For hundreds of years, progress in medicine has been restrained by the lack of information. Today, the problem is almost the reverse. The torrent of information released at the molecular level is beyond the means of medical science to analyze. “Stated simply, what we are trying to do is turn data into useful, actionable knowledge,” says Rhodes of Life Technologies. If only it were really that simple.