Scientists have learned that DNA can tell us a great deal about our risk factors for certain diseases. Since the sequencing of the human genome was completed in 2003, they have been able to figure out which mutations will make us vulnerable to, say, cancer or hair loss. As computing power has increased, they have been able to sequence the 19,599 genes in individual patients to provide a DNA map that helps drug companies target specific mutations. Herceptin, for example, targets patients whose tumors are what is known as HER2-positive, meaning they overexpress a growth-factor receptor that drives breast cancer.
Although genetic information continues to improve medicine, there is data about the human body that goes one step beyond the genome. Genes by themselves don’t directly make us who we are. Instead, they produce proteins, which are dispatched into the body to execute the genetic will. If the genes are the blueprints, the proteins are the working parts, controlling every cell in your body. And just as the genes collectively make up the genome and have given rise to the science of genomics, so too do all your body’s proteins make up your proteome, which has its corresponding discipline: proteomics.
It’s a far more complex field than genomics, studying how proteins are structured and expressed, how they change and communicate. When you tie genome sequencing to proteome sequencing, it adds billions of data points across millions of patients. That’s both good and bad.
With a fire hose of information that big, you can develop better drugs and look for better biomarkers: anything in a patient’s blood, urine or saliva—from proteins to enzymes to red-cell count—that indicates the presence of a disease. But fire hoses are hard to handle, and that’s where Big Data comes in.
The combination of massive computer power and sophisticated algorithms that can manage staggeringly complex problems—from predicting precisely where a tornado will touch down to making your Web search more efficient—is the next great wave of data processing. Companies like Roche, Illumina, Life Technologies, Pronota and Proteome Sciences are expanding their bioinformatics platforms to develop new diagnostics and new drugs based on them. Sometimes a diagnostic and a drug are developed in tandem, a model known as Dx/Rx. These new proteomics-derived agents are designed to target everything from sepsis to Alzheimer’s disease to cancer and offer the opportunity to deliver bespoke medicine, tailored to your molecular structure. It’s Savile Row biology.
Proteomics is, in some ways, a massive pattern-matching process. It works like this: Take 100 people who have lung cancer and 100 people who don’t. What is the difference in their genomic and proteomic profiles? If you can identify the specific proteins that signal the cancer cells to grow, those pathways can be switched off with targeted drugs. If you are one of the unlucky 100 who have lung cancer, this kind of Big Data crunching can let doctors search proteomic data, compare it against your genome, which has all your personal mutations, and create a treatment map. The same thing will go for each of the other 99, all of whom have the same disease as you but all of whom might have arrived there by a slightly different genomic and proteomic route. “This global profiling of signal pathways will transform how we deal with cancer,” says Ian Pike, chief operating officer of Proteome Sciences, a 20-year-old company based in Cobham, England.
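The group comparison described above boils down to a per-protein statistical test. Here is a minimal sketch in Python, with made-up numbers; the data, the tiny group sizes and the choice of Welch's t statistic are illustrative assumptions, not the methods any of these companies actually use:

```python
import statistics

# Hypothetical abundance readings (arbitrary units) for one candidate
# protein, measured in a disease group and a control group.
disease = [9.1, 8.7, 9.4, 10.2, 8.9]
control = [5.2, 4.8, 5.5, 5.1, 4.9]

def welch_t(a, b):
    """Welch's t statistic: how strongly two sample means differ,
    scaled by the samples' variability."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (mean_a - mean_b) / (var_a / len(a) + var_b / len(b)) ** 0.5

t = welch_t(disease, control)
print(f"t = {t:.1f}")  # a large |t| flags this protein as a candidate marker
```

In practice this test runs across every protein measured in every patient at once, which is where the scale comes from.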
But that kind of data crunching plays out on a scale that makes the genome project seem like a math quiz. “What people haven’t appreciated is that the genome is not so dynamic. It tells you your likelihood of getting disease, not whether you actually have it,” says Christopher Pearce, Proteome Sciences’ CEO. You are born with one set of genes, in other words. Proteins are in a constant state of flux.
Laying Down a Marker
The area of proteomics currently attracting heavy investment is biomarkers, which can predict with greater accuracy who is susceptible to a particular disease, help doctors diagnose and treat it earlier and track whether those treatments are working. “The next 10 years will dwarf the previous 60” in terms of what advanced sequencing can produce, says Ronnie Andrews, head of medical sciences at Life Technologies, which designs bioinformatics software platforms.
The market for biomarkers alone was about $13.5 billion in 2010, according to BCC Research, and could surpass $33 billion by 2015. Life Technologies, located in Carlsbad, Calif., is a $3.8 billion company that supplies scientists with instrumentation for gene synthesis, cell lines and more for use in genomic medicine and molecular diagnostics. Its proteomics portfolio helped make it attractive to Thermo Fisher Scientific, which is acquiring the company for $13.6 billion.
What makes proteins such a huge data challenge is that they exist in a nearly infinite variety of combinations. Consider that those 19,599 genes in the human body can, in turn, produce some 200,000 types of RNA. Each RNA strand then encodes up to 200,000 proteins. That’s a geometric expansion that gives you 40 billion proteins and helps explain why you need that big in Big Data.
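The arithmetic behind that expansion is simple to check, using the article's own figures:

```python
# The article's figures: ~200,000 RNA types, each encoding up to
# 200,000 protein variants.
rna_types = 200_000
proteins_per_rna = 200_000

total_proteins = rna_types * proteins_per_rna
print(f"{total_proteins:,}")  # 40,000,000,000 — the 40 billion proteins
```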
And then it gets really complex. There can be hundreds of different variants of proteins in a single cell. Biotech firms and researchers use mass spectrometers to identify pieces of proteins. But as sophisticated as the spectrometers are, their dynamic range—or capacity to collect and read different protein concentrations in a sample—is exceeded by the proteins’ complexity by a factor of three or four. “The dynamic range of proteins in a cell is huge,” says Dan Rhodes, head of medical bioinformatics for Life Technologies.
Finding the right protein for your needs is like spotting a bee from space, says Katleen Verleysen, a scientist and the CEO of Pronota, a proteomics start-up in Ghent, Belgium. “It’s a huge challenge, to be quite honest,” she says. “Most protein biomarkers out there were found by accident. There are very limited examples of companies that said, ‘We want to take serendipity out of the equation.’”
Pronota is one of them and has $36.5 million in venture funding, including some from Johnson & Johnson’s VC arm; it is heading for a second funding round. Pronota has four tests in the pipeline, including one that would quickly tell an emergency-room doctor whether a patient who is short of breath is suffering from a heart attack or a lung infection.
Whether or not proteomics yields a big win for investors, it is already expanding the immense potential of cloud computing. “Proteomics creates so much data, you have to have massive computational resources,” says Fintan Steele, a biochemist and spokesman for SomaLogic, in Boulder, Colo. Steele’s company is trying to leapfrog current mass spectrometry to vastly expand the number of biomarkers. “What drives it is the cloud,” he says. “There’s not an easy way to do it if you don’t have the ability to suck the data down.”
And how. If mapping the genome consumed terabytes, proteomics will easily reach into petabytes. (That’s trillions vs. quadrillions of bytes.) Life Technologies recently used cloud computing to analyze RNA sequencing data from 5,000 patients. The process of sequencing that RNA generated about 20 terabytes of data—an estimated 15 computer years of work if your desktop machine were trying to crunch it all. Life Technologies did it in a week. “We envision our sequencing technology being widely deployed,” says Rhodes. “Imagine hundreds of thousands or millions of patients. We are at the very beginning.”
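Those figures imply a rough effective parallelism that is easy to back out (decimal terabytes assumed; all the input numbers are the article's):

```python
# The article's scale: 20 TB of RNA-seq data for 5,000 patients,
# an estimated 15 desktop-years of computation, finished in one week.
desktop_years = 15
weeks_per_year = 52
patients = 5_000
total_tb = 20

speedup = desktop_years * weeks_per_year  # desktop work-weeks done in one week
gb_per_patient = total_tb * 1000 / patients

print(speedup)         # 780 — on the order of hundreds of machines' worth
print(gb_per_patient)  # 4.0 GB of sequence data per patient
```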
Although the price of sequencing is still relatively high, it is falling. That should tip the cost-benefit analysis in favor of investments in proteomics. The U.S. spends $70 billion on oncology drugs annually and wastes about 40% of it because the drugs don’t work. If it instead spent $5 billion to map those patients, says Andrews, perhaps $25 billion could be saved by not treating them with the wrong drugs.
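The back-of-the-envelope math works out roughly as Andrews describes; the dollar figures are his estimates, and the subtraction here is just an illustration:

```python
# Andrews's cost-benefit math, as reported: $70B annual U.S. oncology
# drug spend, ~40% wasted on drugs that don't work, versus a proposed
# $5B spent mapping patients.
annual_spend = 70e9
waste_fraction = 0.40
mapping_cost = 5e9

wasted = annual_spend * waste_fraction
net = wasted - mapping_cost

print(f"wasted: ${wasted / 1e9:.0f}B")  # $28B lost each year
print(f"net:    ${net / 1e9:.0f}B")     # in the ballpark of Andrews's ~$25B
```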
By itself, beefing up brute computing force is not enough to understand the complex pathways of proteins. The most significant problem is sifting through the deafening amount of protein noise in each cell to find the signals that are important. The bad proteins often represent a small fraction of a cell’s total. “At the end of the day you have to use complex statistical models to identify which changes in proteins drive the separation of the group that has the disease and doesn’t have the disease,” Andrews says. “There are still bottlenecks in processing power.”
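One reason those statistical models have to be careful: when thousands of proteins are tested at once, chance alone generates false “markers.” A toy calculation (the protein count and thresholds here are illustrative, not from the article):

```python
# Testing many proteins at a naive p < 0.05 cutoff produces false
# positives by chance alone, so corrected thresholds are used.
n_proteins = 10_000
alpha = 0.05

expected_false_hits = n_proteins * alpha  # expected false hits with no real signal
bonferroni_alpha = alpha / n_proteins     # a corrected per-protein threshold

print(expected_false_hits)  # 500.0 false "markers" from noise alone
print(bonferroni_alpha)     # the far stricter per-protein cutoff
```

Corrections like this trade sensitivity for reliability, which is part of why real biomarkers are so hard to pin down.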
And some progress. Pronota plans to introduce tests for early diagnosis of sepsis, which the company says costs the U.S. $16 billion annually in addition to causing 225,000 deaths. That one diagnostic tool could generate sales of $1.5 billion worldwide, Verleysen says. Proteome Sciences recently created a panel for 16 protein biomarkers in Alzheimer’s disease patients to monitor the progression of the disease.
Ultimately, this kind of individualized medicine based on your personal proteomic and genomic map could lead to what SomaLogic calls a wellness chip, something that could monitor changes in your protein makeup over time by analyzing a drop of blood. All these promising technologies will depend on science’s ability to understand the data our bodies are capable of producing.
For hundreds of years, progress in medicine has been restrained by the lack of information. Today, the problem is almost the reverse. The torrent of information released at the molecular level is beyond the means of medical science to analyze. “Stated simply, what we are trying to do is turn data into useful, actionable knowledge,” says Rhodes of Life Technologies. If only it were really that simple.