Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, Illinois
Correspondence and requests for reprints should be addressed to Yves A. Lussier, M.D., Section of Genetic Medicine, Department of Medicine, University of Chicago, 5801 South Ellis Avenue, Chicago, IL 60637. E-mail: lussier{at}uchicago.edu
ABSTRACT
The recent completion of the Human Genome Project has made possible a high-throughput "systems approach" for accelerating the elucidation of molecular underpinnings of human diseases, and subsequent derivation of molecular-based strategies to more effectively prevent, diagnose, and treat these diseases. Although altered phenotypes are among the most reliable manifestations of altered gene functions, research using systematic analysis of phenotype relationships to study human biology is still in its infancy. This article focuses on the emerging field of high-throughput phenotyping (HTP) phenomics research, which aims to capitalize on novel high-throughput computation and informatics technology developments to derive genomewide molecular networks of genotype-phenotype associations, or "phenomic associations." The HTP phenomics research field faces the challenge of technological research and development to generate novel tools in computation and informatics that will allow researchers to amass, access, integrate, organize, and manage phenotypic databases across species and enable genomewide analysis to associate phenotypic information with genomic data at different scales of biology. Key state-of-the-art technological advancements critical for HTP phenomics research are covered in this review. In particular, we highlight the power of computational approaches to conduct large-scale phenomics studies.
Key Words: computational genomics; gene-disease associations; phenomics; phenotype
This article focuses on the emerging field of phenomics, which aims to capitalize on novel high-throughput computation and informatics technologies to derive genomewide molecular networks of genotype-phenotype associations, or "phenomic associations." Currently, such large-scale high-throughput phenotyping (HTP) phenomic studies are limited due to our lack of knowledge about the relationships between molecular-level genotypes and their organism-level phenotypic manifestations.
To address this challenge in HTP phenomics research, several technological advancements will be critical in enabling the collection, organization, and computable encoding of large-scale, high-throughput phenotypes, and will be discussed in this review. In this article, we first provide a detailed analysis of the challenges facing HTP phenomics research, followed by an introduction of the current state of high-throughput phenotypic data collection (1), representation and encoding of phenotypes for computation, development of phenomic databases, and genomewide HTP phenomic analyses. In this last section, WHOLE GENOME HTP PHENOMIC ANALYSES, we will also explore the feasibility of using computational phenomics approaches to enhance our understanding of genotype-phenotype relations and networks across different biological scales, from molecular biology to systems medicine.
CURRENT CHALLENGES FOR HTP PHENOMICS
One of the main factors hindering the progress of phenotypic discovery research is the limited accurate and timely access to comprehensive gene-phenotype networks associated with knowledge about biology and diseases. There are several obstacles restricting such access, as discussed in the following sections.
Lack of Understanding of Gene-Phenotype Relationships
In the emerging field of phenomics, the pace of developing computable phenotypic databases and deriving networks of relationships among phenotypes and genes for use in constructing genotype-phenotype databases trails behind the rapid evolution of genomic databases. Currently, although many genomic databases of model organisms contain some phenotypic information, phenotypes are often coded at different levels of granularity, in different formats, and with different aims. Here, "granularity" refers to the level of detail at which phenotypes are defined (e.g., "chronic obstructive lung disease" is less detailed than "centriacinar emphysema"). For example, PhenomicDB (2) allows only comparative genomic studies containing limited queries of textual (uncoded) phenotypic information associated with genes of interest. In contrast, state-of-the-art phenome-oriented methods require organization and encoding of phenotypes to genes before conducting combined genotypic/phenotypic analyses. However, most such phenotypic databases are manually curated, and are thus limited in their breadth for high-throughput computing. Although mining the wealth of scientific literature has permitted high-throughput genotype-phenotype analyses, such efforts have yielded limited success due to the lack of expressiveness and granularity of text-mining technology. To overcome these obstacles in developing phenotypic databases, our research group developed PhenoGO, a large-scale, ontology-anchored gene-phenotype network that we engineered and optimized for integration, classification, and analysis of well-encoded phenotypes. As shown in Figure 1, PhenoGO currently has the largest collection of relationship networks among phenotypes, genes, and the Gene Ontology (GO).
Scarcity of Phenotypic Discovery Methods, Theories, and Predictions
There is a scarcity of phenotypic discovery methods, theories, and predictions to exploit the rich and untapped phenotypic data repositories in current genetic model organism databases and, soon, the databases of the National Institutes of Health (NIH) "Whole Genome Association" studies.
HIGH-THROUGHPUT COLLECTION OF PHENOTYPES
Over the past few years, several advances using experimental or imaging methods have made it possible to gather phenotypic information from different organisms in a high-throughput fashion. However, gene-phenotype analyses are currently limited to quantitative trait loci (QTL) studies requiring carefully curated pedigrees of individuals. For example, to map large-scale QTL to phenotypes, Solberg and colleagues (4) developed a protocol to collect multiple phenotypic measurements for high-throughput parallel phenotyping in populations of mice, and significantly reduced the high cost of genotyping in relation to the amount of information that can be derived from each phenotypic measurement. This protocol led to the detection of statistically significant variations among several inbred strains of mice from a population of over 2,500. However, because this method relies heavily on pedigree, it cannot be readily applied to clinical records and genetic databases because the pedigree associated with phenotypes is often absent. In other arenas, advances in imaging technologies, such as preclinical magnetic resonance imaging, have facilitated high-throughput phenotype imaging and reduced both the financial cost and time to characterize each individual animal (5). Similarly, advances in microcomputed tomographic scanning technology have brought down the expense of high-precision imaging. This technology has been applied to "virtual histology," saving both the time and cost of phenotyping murine embryos while retaining image fidelity (6). In addition, genome-scale RNAi screens have been widely used in invertebrate systems for cellular-level phenotyping, and are now increasingly applied to more complex organisms (7).
REPRESENTATION OF PHENOTYPES FOR HIGH-THROUGHPUT ANALYSES
Although technological advancements have accelerated the pace of collecting phenotypic data, the task of coding and interpreting the output of high-throughput data collection is still left largely to humans, a labor-intensive and rate-limiting process in establishing phenotypic databases. Image-processing technologies, such as those used to automatically analyze imaging data from zebrafish (8), will play an increasingly important role in automating the evaluation and quantification of the massive amounts of phenotypic data. However, automated and accurate encoding and integration of heterogeneous phenotypic data remains challenging.
Applications of ontologies are now becoming a prevalent topic in the biomedical informatics field, largely due to the successful launch of GO. Scientists have invested considerable effort in establishing standards for the integration of phenotypes using ontologies. Since the launch of GO, a number of other ontology-based databases have been developed and have demonstrated the power of ontologies as the best standards for accelerating the data integration and analysis processes of biological and genomic data, which generally use unconstrained text and are too complicated to interpret. GO (9), which has been very successful in annotating genes with molecular functions, processes, and cellular locations, provides a good resource for the association of genes with cellular phenotypes. The Cell Type Ontology (CTO) (10) includes over 680 cell types covering the prokaryotic, fungal, animal, and plant worlds (11). The Mouse Genome Informatics (MGI) databases (12) and Rat Genome Database (13) contain genes, phenotypes coded in the Mammalian Phenotype Ontology (MPO) (12), unstructured phenotypic narratives, and references to PubMED. In the clinical domain, the Systematized Nomenclature of Medicine (SNOMED) (14), which is part of the Unified Medical Language System (UMLS), contains over a half million clinical concepts, such as disease, anatomy, morphology, functions, drugs, procedures, and treatments. To provide a unified framework for representing attributes of phenotypes requiring the composition of more than one code in any given ontology, the GO consortium has also initiated the development of the Phenotype Attribute Ontology (15) to reduce the structural barriers that limit the reuse of phenotypic databases. GO, SNOMED, CTO, and MPO are arranged as directed acyclic graphs (16), a data structure that allows standardized computational methods to process data in high throughput. 
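To illustrate why a directed acyclic graph lends itself to high-throughput processing, the following minimal Python sketch stores a toy ontology as child-to-parent links and collects every ancestor of a term, so that an annotation made at a fine granularity can also be retrieved under each broader concept. The term names and the hierarchy are illustrative assumptions and are not excerpts from GO, SNOMED, CTO, or MPO.

```python
from collections import deque

# Hypothetical toy ontology: each term maps to its direct parents.
# A term may have more than one parent, which is what makes this a
# directed acyclic graph rather than a tree.
PARENTS = {
    "centriacinar emphysema": ["emphysema"],
    "emphysema": ["chronic obstructive lung disease",
                  "disorder of lung parenchyma"],
    "chronic obstructive lung disease": ["lung disease"],
    "disorder of lung parenchyma": ["lung disease"],
    "lung disease": [],
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    queue = deque(parents.get(term, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited through another parent path
        seen.add(current)
        queue.extend(parents.get(current, []))
    return seen

def is_subsumed_by(term, broader, parents=PARENTS):
    """True if `broader` is `term` itself or one of its ancestors."""
    return term == broader or broader in ancestors(term, parents)
```

With such a structure, a query for a coarse concept can return genes annotated to any of its more specific descendants, which a plain text match on the coarse term would miss.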
These foundational ontology initiatives in both the biological and medical communities have set the stage for increasing the productivity of phenotypic research. However, many phenotypes stored in model organism databases remain buried in narratives or coded in terminologies specific to a community that are not cross-indexed with widespread standards.
In addition to the challenges associated with experimental methods for gathering phenotypic information are those associated with automatically encoding phenotypes collected in heterogeneous, unstructured forms. Although there has been a recent growth in text-mining research geared toward capturing gene-phenotype relationships from the literature (1, 17-21), it has failed to provide deep semantic and nested levels of associations from which ternary or higher order relations (e.g., a cell-type-dependent specific gene function) across concepts can be derived. Alternatively, some natural language processing (NLP) techniques can provide a deeper level of semantic relationship and a nested level of associations across concepts, allowing for more sophisticated computational studies. The Medical Language Extraction and Encoding NLP system (MedLEE), developed by Friedman and colleagues (22), was the first and most general NLP system, shown to be as accurate as clinicians in extracting phenotypic information from clinical reports (24). It has been evaluated in many different fields of clinical medicine, as evidenced by results of numerous independent evaluations (23-28). NLP systems are generally designed to extract phenotypes, but not to encode them in ontologies. Friedman and colleagues (29) and Tulipano and colleagues (30) have extended the capabilities of MedLEE to accurately encode phenotypes from clinical and imaging reports in comprehensive terminologies, such as the UMLS and SNOMED. Other NLP systems have also been shown to be robust but restricted in the specific task of extracting phenotypes from medical records (30-32). Although a few commercial NLP systems are currently available, to our knowledge they are incapable of encoding concepts from narratives in clinical reports. Rather, they classify concepts into simple classifications such as International Classification of Diseases Clinical Modification (ICD-9-CM), containing about 15,000 diseases (33).
In contrast to clinical and imaging narratives, co-occurrence-based text-mining systems abound for mining the scientific literature, as reviewed by Jensen (34). However, they do not encode in terminologies, and thus generally are useful only for specifically designed studies and are not reusable in more general settings. Lussier and colleagues (35) and Friedman and colleagues (36) have recently completed BioMedLEE, the first NLP system for coding phenotypes in the scientific literature, which also allows for mining semantic relationships between genes and phenotypes that could not be captured by co-occurrence-based or statistics-based text-mining systems. BioMedLEE was successfully applied in high throughput over thousands of scientific abstracts and amassed the largest collection of gene-phenotype associations in PhenoGO (35), which will be described further in the following section.
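The limitation of co-occurrence can be made concrete with a small sketch. The Python code below counts gene and phenotype terms that appear in the same sentence; the lexicons, gene names, and sentences are invented placeholders, and the substring matching is a deliberate simplification rather than the method of any system cited here.

```python
from collections import Counter
from itertools import product

# Hypothetical lexicons, invented for illustration only.
GENES = {"genea", "geneb"}
PHENOTYPES = {"emphysema", "cleft palate"}

def cooccurrences(sentences, genes=GENES, phenotypes=PHENOTYPES):
    """Count (gene, phenotype) pairs that appear in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        text = sentence.lower()
        found_genes = [g for g in genes if g in text]
        found_phenotypes = [p for p in phenotypes if p in text]
        # Every gene found is paired with every phenotype found,
        # regardless of what the sentence actually asserts.
        counts.update(product(found_genes, found_phenotypes))
    return counts

sentences = [
    "Mutations in GeneA cause emphysema.",
    "GeneB was not associated with emphysema in this cohort.",
]
pairs = cooccurrences(sentences)
```

Both sentences yield one pair each, although the second one negates the association. A semantic NLP system is meant to distinguish such cases and to attach context, which is what a pure co-occurrence count cannot do.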
Representing and encoding phenotypes in ontologies is an essential, yet insufficient step for automating the integration of coded phenotypes across heterogeneous databases. Indeed, many different terminologies and ontologies offer overlapping representations, sometimes at different levels of granularity. Cimino and Barnett first conceived lexical methods for creating translation tables across heterogeneous medical terminologies (37). Others thereafter have incrementally improved these techniques (37-49). For example, the UMLS has an extensive number of related tools such as MetaMap (MMTx), which can map terms to concepts in the UMLS Metathesaurus (38). Lussier and Li (50) and Sarkar and colleagues (51) pioneered the automated translation between heterogeneous phenotypic terminologies. An alternative approach to integrating phenotypes across terminologies is to rely on large-scale metathesauri designed specifically for that purpose, such as the NIH UMLS (52), or the National Cancer Institute's Metathesaurus (53), which includes hundreds of distinct biomedical terminologies that have been semiautomatically mapped to one another. Although the automated terminology integration approaches are limited in accuracy, they are scalable to any pair of terminologies and can be conducted in real time. In contrast, the metathesauri are more accurate but are rate-limited because many terminologies have not yet been integrated and, perhaps more important, because the mappings may be out of synchronization with newer versions of the source terminologies.
In summary, it is noteworthy that automated coding and harmonization technologies are not widely available and remain largely the province of bioinformatics networks and research groups, whereas the metathesauri are freely available. Comprehensive dissemination of technologies and training will be required in the future for phenotypic datasets to be computer processable in real time (54).
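A lexical translation table of the kind described above can be sketched in a few lines. The Python example below normalizes each term (lowercasing, removing punctuation and a few stopwords, sorting the words) and links codes whose normalized strings are identical. The codes, terms, and stopword list are hypothetical, and this is a minimal sketch of the general idea rather than the procedure of Cimino and Barnett or of MetaMap.

```python
import re

# Words ignored when comparing terms; an illustrative choice.
STOPWORDS = {"of", "the", "and"}

def normalize(term):
    """Lowercase, strip punctuation, drop stopwords, and sort the words."""
    words = re.findall(r"[a-z0-9]+", term.lower())
    return " ".join(sorted(w for w in words if w not in STOPWORDS))

def translation_table(source, target):
    """Map source codes to target codes whose normalized terms are identical."""
    index = {}
    for code, term in target.items():
        index.setdefault(normalize(term), []).append(code)
    return {code: index.get(normalize(term), [])
            for code, term in source.items()}

# Hypothetical codes and terms from two made-up terminologies.
terminology_a = {"A1": "Emphysema, centriacinar",
                 "A2": "Hypoplasia of the mandible"}
terminology_b = {"B7": "centriacinar emphysema",
                 "B9": "small lower jaw"}

table = translation_table(terminology_a, terminology_b)
```

Here "A1" maps to "B7" because word order and punctuation are normalized away, whereas "A2" finds no match even though "small lower jaw" is a close lay equivalent. This illustrates why purely lexical mapping scales to any pair of terminologies but remains limited in accuracy.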
DEVELOPMENT OF PHENOTYPIC DATABASES
In the process of associating phenotypes with genes, data integration plays a key role in correlating heterogeneous phenotypic data with genomic data at different scales. The current efforts to organize phenotypic information for high-throughput phenomics studies focus on both manual and computational methods for gathering phenotypes and their related genomic information. Both methods have their distinctive advantages and disadvantages. Although manual methods provide more accurate gene-phenotype relations, they are more time and labor consuming. In contrast, computational methods are able to generate large networks of gene-phenotype relationships in a relatively short amount of time, but generally lack accuracy when compared with the results of manual methods.
Manually Curated Databases
There are several databases that contain manually curated phenotypic information, including OMIM (3), the Online Mendelian Inheritance in Animals (OMIA) (55), and all model organism databases. The OMIM and OMIA databases contain unstructured phenotypic narratives and references to PubMED, from which it is computationally difficult to extract coded phenotypic data. Similarly, although many genomic databases of model organisms contain some phenotypic information, phenotypes are often coded at different levels of granularity, in different formats, and with different aims (56). Realizing the difficulties of using phenotypic narratives in organizing phenotypic information, the MGI database (12) chose to use coded and computable phenotypes in the MPO, as described above, to organize phenotypes in different mouse strains (12). Although phenotypic narratives can be more nuanced and detailed, coded phenotypes are classified in the MPO and are readily computable. The contents of these different databases are summarized in Figure 1, in which we present the quantity of distinct phenotypes (breadth) and the quantity of gene-phenotype associations (depth) for OMIM and MGI. Of the curated databases, MGI remains the best-organized database with the most variety of coded phenotypes and coded binary and ternary relationships (Figure 2).
In contrast, Lussier and colleagues used NLP over the scientific literature combined with the GO database to amass and encode phenotypes in high throughput (35). The resulting database, PhenoGO (http://www.PhenoGO.org), contains the largest number of gene-phenotype associations (Figure 1), and provides the broadest variety of binary and ternary relationships between genes, GO concepts, and phenotypes (Figure 2 and Table 1). The PhenoGO database also differs from other gene-phenotype databases in that it provides ternary relationships, such as the biological process of a specific gene in a particular phenotypic context. For example, the PhenoGO database refines GO concepts through the assignment of phenotypic information, such as the cell type, tissue, and organ, to GO-gene annotations. The addition of such phenotypic context to gene expression information could be a crucial step for understanding the development and the molecular underpinnings of the pathophysiology of diseases, as not all potential biological processes associated with a gene are possible in every cell type. Currently, PhenoGO consists of 532,406 phenotype-GO relations, with 33,224 distinct genes in 10 species, 5,680 unique GO concepts and 4,650 unique phenotypes coded in SNOMED, MPO, CTO, and UMLS. Manual evaluation of a random sample of gene-GO-(phenotype or disease) relationships revealed a precision (positive predictive value) of 85% (95% confidence interval [CI], 82-89%) and a recall of 76% (95% CI, 69-83%). To our knowledge, this is the first system that offers a level of precision not too far from that of manual curation.
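The difference between a ternary relationship and its binary projection can be shown with a small data-structure sketch. In the Python code below, each record ties a gene and a GO concept to a phenotypic context; the record layout, gene names, and values are placeholders chosen for illustration and do not reproduce the PhenoGO schema or its data.

```python
from collections import defaultdict
from typing import NamedTuple

class Annotation(NamedTuple):
    gene: str
    go_concept: str
    phenotype: str  # context, e.g. a cell type, tissue, or organ

# Hypothetical records; the identifiers are placeholders, not PhenoGO data.
ANNOTATIONS = [
    Annotation("geneA", "apoptosis", "hepatocyte"),
    Annotation("geneA", "cell proliferation", "fibroblast"),
    Annotation("geneB", "apoptosis", "hepatocyte"),
]

def processes_in_context(annotations, gene, phenotype):
    """GO concepts recorded for `gene` within one phenotypic context."""
    return {a.go_concept for a in annotations
            if a.gene == gene and a.phenotype == phenotype}

def binary_projection(annotations):
    """Collapse ternary records to gene -> GO concepts, losing the context."""
    projected = defaultdict(set)
    for a in annotations:
        projected[a.gene].add(a.go_concept)
    return dict(projected)
```

Querying the ternary records for "geneA" in "hepatocyte" returns only "apoptosis", whereas the binary projection lists both processes for "geneA" with no indication of the cell type in which each was observed.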
95%). Term co-occurrence and statistical NLP generally produce up to 75% precision, whereas semantic NLP, such as BioMedLEE and MedLEE, can reach above 85% precision. To illustrate the differences and similarity between these genome-phenome networks, Table 2 provides an example of the subset of a manually curated network (MGI) and additional computed phenotypes (found in the PhenoGO database).
Text-based HTP phenomics is designed to predict gene-disease associations; however, its methods vary broadly. To overcome the limitations of manual annotation to create phenotypic datasets, others in the field conducted high-throughput phenotype-genotype analyses by mining text on phenotype-genotype relationships from the scientific literature (57, 58), with limitations of text mining as described above. Korbel and colleagues conducted an analysis that combined data mining of the MEDLINE abstracts to extract terms of prokaryotic traits, and comparative genome analysis to identify phenotype-genotype associations (57). Approximately 2,700 significant gene-trait associations were identified. Gene2Disease was constructed over OMIM using text-mining methods coupled with analysis of the chromosomal locations of diseases (62). However, in these two systems, the integration of phenotypes relies on the juxtaposition of the original lexical string of text in the same field across species. Thus, a textual search for a concept may miss synonyms, as well as related or subsumed concepts. Although these literature-based approaches allow scientists to browse phenotypes and their associated genes and to conduct comparative genomics analyses among different organisms, their analyses are merely functional genomics studies constrained to datasets organized according to textual terms containing phenotypic information. In addition, the resultant binary textual relationships lack context.
Lussier and coworkers pioneered ontology-anchored HTP phenomics in clinical databases. They integrated the Quick Medical Reference (QMR) with OMIM, from which relationships among genes, diseases, and traits of diseases were generated. Clustering of genes with traits of diseases demonstrated classification of diseases according to genes (63) and enabled association studies of environmental factors, such as drug intake and smoking, found in QMR with genes found in OMIM. This study was followed up with the GenesTrace method, a large-scale integrative study of ontology-anchored phenotypes from the UMLS and their statistical and semantic relationships to GO and model organism databases (64). We were able to infer approximately 3 million phenotype-gene associations among 22,040 phenotypic concepts in the UMLS and 16,894 gene products annotated using GO and its associated databases (64). Inferences were validated by comparing them to known gene-disease relationships, as defined in OMIM's Morbidmap. Approximately 30% of the predictions could be found in OMIM, and conversely, 9% of OMIM's relationships were found in GenesTrace (64). In addition, our methods provided direct links to clinically significant diseases through established terminologies or ontologies. These observations demonstrate the significance of exploiting the existing manually curated relationships in biomedical resources as a tool for the discovery of potentially valuable new gene-disease relationships. Recently, Butte and Kohane (65) conducted the first ontology-anchored HTP phenomics study with phenotypically related concepts in UMLS (66) and microarray gene expression data from the NCBI's Gene Expression Omnibus (67) using a term presence/absence method. Significantly expressed genes above a threshold were correlated with UMLS phenotypic concepts via a resampling-based multiple testing simulation generating 64,003 relations among 281 biomedical concepts and 7,466 genes (65).
This study provided an HTP phenomic method for identifying genes related to phenotype and environment.
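The general idea of relating a concept's presence or absence to gene expression through resampling can be sketched as a simple permutation test. The Python example below is a generic illustration under stated assumptions: the expression values and labels are invented, the test statistic (absolute difference in group means) is our choice, and the code does not reproduce the actual procedure used by Butte and Kohane.

```python
import random

def permutation_p_value(expression, has_concept, n_permutations=2000, seed=0):
    """Estimate how often a random relabeling of samples gives a difference
    in mean expression at least as large as the observed one."""
    rng = random.Random(seed)

    def mean_difference(labels):
        present = [x for x, flag in zip(expression, labels) if flag]
        absent = [x for x, flag in zip(expression, labels) if not flag]
        return abs(sum(present) / len(present) - sum(absent) / len(absent))

    observed = mean_difference(has_concept)
    labels = list(has_concept)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)  # keeps group sizes, breaks any real association
        if mean_difference(labels) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical data: expression of one gene in 16 samples, and whether each
# sample's annotation mentions a given concept.
expression = [9.1, 8.7, 9.4, 8.9, 9.0, 9.3, 8.8, 9.2,
              5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.4, 4.7]
has_concept = [True] * 8 + [False] * 8
p = permutation_p_value(expression, has_concept)
```

When thousands of genes are tested against hundreds of concepts, the resulting p-values still require a multiple-testing correction before any relation is reported as significant.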
Although HTP phenomics is in its early stages, there is sufficient evidence through validations that it is promising. In 2001, Jimenez-Sanchez and colleagues established a proof-of-concept study for HTP phenomics by manually relating about 1,000 disease-related genes to their molecular functions and observed that the frequency distribution of lethality of genes according to their molecular function recapitulates current knowledge about these molecular families (68). Since that proof of concept, GenesTrace provided additional evidence that integrating and systematically analyzing genome-phenome networks can accurately predict disease genes. More precisely, the GenesTrace study was based on patterns of GO annotations of genes (9) with the UMLS clinical knowledge base (66). Using the 1,407 single gene diseases of the OMIM (69) dataset as a control, GenesTrace predicted 124 distinct genes in the context of being related to their specific disease concept, and 290 distinct genes were erroneously associated with concepts, for a precision of 30% and recall of 8.8% (64). Kohane and Butte also merged the UMLS knowledge base, this time with microarray datasets, and accurately predicted novel findings corroborated in a new microarray study (65). Recently, Aerts and colleagues predicted gene phenotypes through a fusion of a large amount of heterogeneous genetic and clinical knowledge bases, including text mining of the literature (70). This technique, called "Endeavor" data fusion, identified a novel gene involved in craniofacial development and likely associated with DiGeorge-like birth defects. This prediction was further corroborated in zebrafish embryos that showed an underdeveloped lower jaw. The properties of these studies are summarized in Table 1 and their respective dataset size in Figure 1.
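The precision and recall reported for GenesTrace follow directly from the counts given above. The short Python check below assumes that the recall denominator is the 1,407 single-gene diseases of the control set (one expected gene per disease); that reading is our assumption, although it reproduces the reported figures.

```python
def precision(true_positives, false_positives):
    """Fraction of predictions that were correct."""
    return true_positives / (true_positives + false_positives)

def recall(true_positives, total_known):
    """Fraction of known relationships that were recovered."""
    return true_positives / total_known

correct = 124      # genes predicted in the right disease context
erroneous = 290    # genes associated with the wrong concept
known = 1407       # single-gene diseases in the OMIM control set (assumed denominator)

p = precision(correct, erroneous)   # 124 / 414
r = recall(correct, known)          # 124 / 1407
print(f"precision = {p:.1%}, recall = {r:.1%}")
# prints "precision = 30.0%, recall = 8.8%"
```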
FUTURE CHALLENGES
HTP phenomics research faces the challenge of technological research and development to generate novel tools in computation and informatics to amass, access, integrate, organize, and manage phenotypic databases across species and enable genomewide analysis to associate phenotypic information with genomic data at different scales of biology. Currently, the lack of high-throughput technologies to access well-networked and integrated phenotypes from heterogeneous sources and across multiple scales of biology has prevented the effective usage of phenotypic information. Therefore, HTP phenomics research that aims to unlock gene-disease relationships will play a key role in a "systems approach" to molecular medicine and individualized medicine. In this review, we highlighted the state of the art in computational approaches to conduct large-scale phenomics studies. Among various strategies that could facilitate computational phenomics, ontologies have proved to be particularly effective at integrating and organizing a large number of phenotypic concepts on a computable platform. The success of GO underscores the importance of ontologies in phenotypic research. Similarly, NLP techniques have increasingly shown their unique and efficient capacity to associate genes with the narrative phenotypic descriptions in the literature, which are often unconstrained and unstructured and could not otherwise be handled by other technologies. Although novel computational approaches have been proposed for conducting high-throughput association analysis, they generally lack a common benchmark, so their results are often difficult to compare. Because the NIH recently recognized the urgent need for a well-organized resource of human phenotypes and diseases, it launched Whole Genome Association studies.
The Whole Genome Association studies will link genetic data with the rich phenotypic datasets of large-scale clinical studies accumulated over several generations of patients to generate large-scale common sharable datasets. Such unified efforts will accelerate the process of identifying the genetic and environmental factors associated with human disease. They will also provide a framework to use HTP phenomics methods in conjunction with methods based on quantitative trait loci. The emerging field of HTP phenomics is likely to focus on therapeutic predictions and to delineate gene-disease associations.
ACKNOWLEDGMENTS
The authors thank Lee Sam for his advice on improving the manuscript.
FOOTNOTES
Supported in part by NIH/NLM grants 1K22 LM008308-01, R01 LM007659, and by the National Center for the Multiscale Analysis of Genomic and Cellular Networks (U54CA121852-01A1).
Conflict of Interest Statement: Y.A.L., as a member of the Columbia University Center for Advanced Technology, was mandated by his Department chairman to provide scientific advice to the executive board of John Wiley and Sons. He has received no stipends or payments; however, he did receive a sponsored research contract described below. Wiley will not benefit directly from this paper; indirectly, however, this research is related to its interests since mining the scientific literature is part of Wiley's potential future markets. In 2004-2005, Y.A.L. received a research contract from John Wiley and Sons to conduct studies on "Ontology-anchored Methods for Computable Biomedical Excerpts." In the section of the manuscript pertaining to the Natural Language Processing and the text mining, he has not mentioned any publication that pertains to this contract, as he has not published our results yet. The research contract is completed and follow-up studies will help display clinical phenotypes in large clinical warehouses. Y.A.L. has a patent pending for computational terminology tools that the Columbia Center for Advanced Technology and the Columbia Science Venture groups are marketing. He is unaware of any companies currently interested; however, in the past, companies related to the mining of phenomic data were approached. He has received no money for these patents. Y.L. does not have a financial relationship with a commercial entity that has an interest in the subject of this manuscript.
(Received in original form July 15, 2006; accepted in final form August 21, 2006)