In January of 2005, Dr. Bumgarner and a large team of investigators from the UW submitted a $2.6M/year grant application to the NIH National Centers for Biological Computation (NCBC) Program. This center is intended to bring together a group of computational scientists and biologists to create and test tools to infer biological pathways. While we won’t know if the Center is funded until late in 2005, this site is intended to provide a brief description of the center and updated links to research efforts of the assembled team.
A fundamental issue facing modern biology is the development of methods and tools to convert genomic and functional genomics data into mechanistic understanding. Genome sequences and computational methods have provided us with tools to identify and annotate genes and other functional sequences with varying degrees of accuracy. This sequence information in turn enables tools such as microarrays, 2-D gels, protein mass spectrometry and yeast 2-hybrid assays to measure RNA and protein levels and physical interactions on genome-wide or near genome-wide scales. Genome information has also enabled the development of high-throughput genetic mapping technologies that allow one to rapidly trace a given phenotype or trait to specific genetic regions. Empowered by the wealth of available genome information and the new tools of functional genomics, biologists in almost every field are now producing genome-wide data of a variety of types (transcriptomics, proteomics, deletion and knock-down screens).
However, in most cases there is still a major gap between the collection of large-scale genomic/genetic data and the inference of biological mechanism for a particular state, disease, or clinical outcome. This gap between large-scale data collection and biological understanding has become extremely apparent in the last several years during which an increasing number of microarray and proteomics based works have been published in which little contribution to new biological understanding has been made.
The proposed center seeks to bridge this gap by creating a series of publicly available algorithms, tools, databases and methodologies that will assist in the inference of biological pathways via the integration of researchers’ data sets with predicted physical interaction networks and functional annotations. In recent years, there has been an emerging paradigm in which it has become apparent that to obtain mechanistic understanding of large-scale expression and other genomics data, it is necessary to view this data in the context of an underlying model of the interactions between the biomolecules whose levels/properties are being measured and other biomolecules in the system that influence these levels/properties[1-6]. Hence, there is a growing need to gather, predict and represent genome-wide physical interaction networks (for many genomes) and to make these networks and associated analytical and visualization tools publicly available to a broad cross section of researchers. In addition, there are still a large number of genes whose function is completely unknown. Hence, tools to more accurately predict functional annotations on a genome wide scale are sorely needed for research to progress.
The UW "Center for Pathway Inference" will bring together researchers from Computer Science, several basic biology departments (Medicine, Microbiology, Lab Medicine, Genome Sciences and others), Statistics and Biostatistics, and Biomedical Health Informatics to build a team of scientists and instructors who will not only create and distribute the software tools and databases that will enable "Systems Biology" but who are also committed to creating and training scientists capable of working in this emerging discipline. As required by the request for applications, the center will consist of two principal computational cores that interact tightly with several "driving biological projects". In addition, four other cores (Infrastructure, Training and Education, Dissemination, and Administration) will support the research and training mission of the center. Each of the Core activities is described briefly below.
This core, headed by Ram Samudrala (Assistant Professor, Microbiology), will focus on improving algorithms and methods for predicting protein-protein, protein-RNA and protein-DNA interactions and for predicting gene functional annotations. A major activity of Core 1 will be to expand and improve on Dr. Samudrala’s pre-existing "Bioverse".
"Bioverse" is a database and web server of predicted/measured functional, structural and interaction annotations for all genes in more than 50 genomes. At present, it is the only publicly available source of protein-protein interactions for the human genome. In addition, it is the only available source of predicted physical interaction networks for 45 of the other 49 genomes represented in Bioverse, including several versions of the rice genome and numerous pathogenic bacteria. The methods and underlying data structure of Bioverse will be modified to incorporate protein-RNA and protein-DNA interaction predictions from Core 1 and measurements gathered in Core 2 and the DBP’s.
Dr. Larry Ruzzo (Professor, Computer Science and Engineering, Co-Investigator) will focus on the development of computational tools that assist in key steps in the discovery and characterization of non-coding RNAs (ncRNAs) and on tools for sequence-based RNA motif and RNA-protein interaction predictions. In the past several years, dramatic discoveries have greatly expanded both the number of known ncRNAs and the breadth of their biological roles, including significant regulatory and developmental roles. In short, ncRNAs are much more biologically significant than previously realized. Dr. Ruzzo’s goal is to develop computational tools that assist in key steps in the discovery and characterization of ncRNAs and in motif prediction for RNA/protein binding. While there have been computational advances in this area, the state of the art in computational tools to describe and search for ncRNAs still lags considerably behind many other sequence analysis tasks. In particular, available models for ncRNA families are limited, and current tools for ncRNA model inference and sequence annotation are often challenging to use, inaccurate, infeasibly slow or all three. In addition, tools to accurately predict protein binding motifs in RNA need to be developed.
The overall goal of Core 1 is to produce improved tools and methods to accurately predict structures of, functions of and physical interactions between bio-molecules on a genome-wide scale and to represent those interactions in a web accessible database.
Roger Bumgarner, Associate Professor Microbiology, PI.
Co-Investigators Josh Akey, Assistant Professor Genome Sciences and Ka Yee Yeung, Research Assistant Professor, Microbiology
The expression informatics core will focus on:
- The development of tools and methods to analyze expression QTL data. We will develop open source software for the analysis of eQTL data, with an emphasis on methods applicable to outbred populations such as humans. The analysis of eQTL data presents unique computational challenges given the size and complexity of the experiments. However, we believe that with the appropriate methodological and statistical tools, the inherent complexity of eQTL data can be used to extract biologically meaningful information and provide important insights into the genetics of gene expression.
- Data mining of public data sources to supplement the physical interaction networks of core 1 with putative protein-DNA interactions derived from CHIP-to-chip data and eQTL/expression annotation of these putative interactions.
- The development of publicly accessible tools and algorithms to map "omics data" (expression array, proteomics, large-scale screening data, etc) onto the physical interaction networks generated in core 1.
- Tools to assist in the identification of relevant transcriptional modules and regulatory pathways from the integration of researchers’ data with the networks in Bioverse and other sources of public data (e.g. related expression data).
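As a rough illustration of the first item in the list above, the simplest form of an eQTL scan regresses a gene's expression level on allele dosage at each marker. The sketch below is only a toy under stated assumptions: the function name `additive_eqtl_scan`, the marker names and all numbers are hypothetical, and a plain additive least-squares model stands in for whatever methods the core would actually develop for outbred populations.

```python
def additive_eqtl_scan(genotypes, expression):
    """Regress one gene's expression on allele dosage (0/1/2) at each marker.

    genotypes:  dict mapping marker name -> list of dosages, one per individual
    expression: list of expression values for one gene, same individual order
    Returns (marker, slope, r_squared) tuples, strongest association first.
    """
    n = len(expression)
    mean_y = sum(expression) / n
    ss_y = sum((y - mean_y) ** 2 for y in expression)
    results = []
    for marker, dosages in genotypes.items():
        mean_x = sum(dosages) / n
        ss_x = sum((x - mean_x) ** 2 for x in dosages)
        if ss_x == 0 or ss_y == 0:
            continue  # monomorphic marker or constant expression: nothing to fit
        s_xy = sum((x - mean_x) * (y - mean_y)
                   for x, y in zip(dosages, expression))
        slope = s_xy / ss_x                    # expression change per allele copy
        r_squared = s_xy ** 2 / (ss_x * ss_y)  # fraction of variance explained
        results.append((marker, slope, r_squared))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Hypothetical toy data: six individuals, three markers, one gene.
genotypes = {
    "snp_A": [0, 0, 1, 1, 2, 2],
    "snp_B": [0, 1, 0, 2, 1, 0],
    "snp_C": [1, 1, 1, 1, 1, 1],  # monomorphic, skipped by the scan
}
expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.1]

hits = additive_eqtl_scan(genotypes, expression)
for marker, slope, r2 in hits:
    print(f"{marker}: slope={slope:.3f}, r^2={r2:.3f}")
```

A real analysis would add significance testing, multiple-testing correction across many genes and markers, and handling of population structure, none of which is attempted here.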
Specifically, Core 2 will focus on tools that will allow researchers to view their data in the context of the predicted interaction networks and other public data sources. This data integration coupled with a web server will allow external researchers to generate testable transcriptional regulatory hypotheses. For example, for any gene or collection of genes of interest, it will be possible to ask:
- What other genes are co-expressed with this gene (or these genes) in a given subset of experiments (both from the researcher and in our database)? AND
- Do the co-expressed genes identified in the previous question share one or more common expression QTLs? AND
- What are the predicted transcription factors within the QTLs? AND
- Do any of these transcription factors bind upstream of any of the co-expressed genes? AND
- Are the genes of interest closely interconnected in a predicted or known physical interaction network?
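The chain of questions above can be sketched as a sequence of set operations over a few lookup tables. This is only a minimal sketch on invented toy data: every gene, QTL and transcription factor name, every function name and the 0.9 correlation cutoff are hypothetical, and nothing here reflects the actual Bioverse schema or web server.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb) if va and vb else 0.0


def coexpressed(seed, expression, threshold=0.9):
    """Genes whose profile correlates with the seed gene at or above threshold."""
    return {g for g, profile in expression.items()
            if g != seed and pearson(expression[seed], profile) >= threshold}


def shared_eqtls(genes, eqtls_by_gene):
    """eQTL regions common to every gene in the set."""
    sets = [eqtls_by_gene.get(g, set()) for g in genes]
    return set.intersection(*sets) if sets else set()


def tfs_in_regions(regions, tfs_by_region):
    """Transcription factors encoded within any of the given QTL regions."""
    return set().union(*(tfs_by_region.get(r, set()) for r in regions))


def binding_tfs(tfs, genes, targets_by_tf):
    """TFs with a binding site upstream of at least one gene in the set."""
    return {tf for tf in tfs if targets_by_tf.get(tf, set()) & genes}


def is_connected(genes, edges):
    """True if the genes form one component using only edges among them."""
    genes = set(genes)
    if not genes:
        return False
    adjacency = {g: set() for g in genes}
    for a, b in edges:
        if a in genes and b in genes:
            adjacency[a].add(b)
            adjacency[b].add(a)
    start = next(iter(genes))
    seen, stack = {start}, [start]
    while stack:
        for neighbor in adjacency[stack.pop()] - seen:
            seen.add(neighbor)
            stack.append(neighbor)
    return seen == genes


# Hypothetical toy data standing in for researcher and database content.
expression = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 10.5],
    "geneC": [1.1, 2.0, 2.9, 4.2, 5.0],
    "geneD": [5, 1, 4, 2, 3],
}
eqtls_by_gene = {"geneA": {"qtl1", "qtl2"}, "geneB": {"qtl1"},
                 "geneC": {"qtl1", "qtl3"}, "geneD": {"qtl2"}}
tfs_by_region = {"qtl1": {"TF1", "TF2"}, "qtl2": {"TF3"}}
targets_by_tf = {"TF1": {"geneB", "geneX"}, "TF2": {"geneD"}, "TF3": {"geneA"}}
interactions = [("geneA", "geneB"), ("geneB", "geneC"), ("geneC", "geneD")]

module = coexpressed("geneA", expression) | {"geneA"}
regions = shared_eqtls(module, eqtls_by_gene)
candidates = tfs_in_regions(regions, tfs_by_region)
regulators = binding_tfs(candidates, module, targets_by_tf)
connected = is_connected(module, interactions)
print(sorted(module), sorted(regions), sorted(regulators), connected)
```

Each step narrows the previous result, so the output is a small, testable hypothesis: a co-expressed module, the eQTL region its members share, and the transcription factors in that region that also bind upstream of a module member.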
Core 2 will collaborate with Core 1 to design and implement a web server that allows external users to map expression data or gene lists onto any of the parameters in Bioverse. Core 2 will also develop tools for improved pattern recognition and for the analysis of eQTL data in outbred populations. Core 2 will work in close collaboration with the DBP’s and external researchers to develop and test our methods and visualization tools to assure that the software we develop is meeting the biological needs.
In order to adequately test our ability to infer pathways and to create testable hypotheses, and to drive the development of the software tools and user interfaces, Cores 1 and 2 of the Center will collaborate with a highly selected set of thematically integrated driving biological projects (DBP’s) in Core 3. These DBP’s were specifically selected to meet the following criteria.
- Each DBP has strong computational requirements to accomplish the biology of interest
- Each DBP will generate data and requirements that will test Core 1 and Core 2’s predictions and software.
- In addition, each DBP’s biological problem requires an improved understanding of the host innate immune response and will generate data that will shed light on the host innate immune response
While the tools developed in Cores 1 and 2 are generally applicable to a broad range of biological problems, we have intentionally chosen DBP’s that are thematically focused in the general area of understanding the host innate immune response. The creation of a core of thematically and scientifically integrated DBP’s offers a number of advantages relative to a more arbitrary selection of DBPs.
- Each core will generate data that can be sensibly integrated with data from the other cores. Each DBP will generate data that relates to the host innate immune response. While the approaches used and challenges to the innate immune response are quite different (e.g. bacterial LPS, IL-1 and HCV), in many cases the downstream pathways that are influenced are quite similar (e.g. NF-κB responsive genes). The integration of data from all the individual DBPs with each other and other sources of publicly available host response data will provide a unique resource for dissecting and understanding the innate immune response. For example, it has been recently shown that HCV core protein signals through TLR2 and TLR4 similar to some bacterial LPS’s. A comparison of the similarities and differences in the host response to HCV core protein, IL-1 and bacterial LPS coupled with structural modeling of the interactions might shed light on the underlying signaling mechanism. In short, the collection and joint analysis of interrelated data sets will be a better test of the ability of Core 1 and Core 2 to generate useful, testable hypotheses.
- Communication between the PI’s of the DBP’s and Cores 1 and 2 will be both synergistic and simplified relative to communication between and with PI’s of unrelated biological projects.
- By integrating the DBPs with each other and Cores 1 and 2, the center is in a better position to spin out DBP’s that can jointly apply for other institute-specific center grants. For example, the current collection of researchers and research projects could easily be turned into a P-50 center application focused on innate immunity or (after preliminary data is available) a center focused on genetic influences on infectious diseases.
The proposed DBP’s are:
- DBP - 1 - Genetics of the Host Innate Immune Response - Mark Wurfel – PI. This DBP will map quantitative trait loci (QTL) that influence inter-individual differences in gene and protein expression in response to a well-defined innate immune stimulus, CpG oligodeoxynucleotides (CpG ODN). These studies will identify expression QTL (eQTL) that control innate immune responses in humans and develop a uniquely powerful dataset to study the mechanisms by which genetic variation mediates gene and protein expression levels. This core will generate data that will drive the development of methods and algorithms to identify expression QTL’s (Josh Akey’s work in Core 2) and will provide a rich set of data that will be used to annotate predicted protein-DNA interactions (a portion of Dr. Bumgarner’s work in Core 2).
- DBP - 2 - Host Response to Structurally Characterized Variants of LPS - Rich Darveau, Professor Oral Biology PI. This core will characterize the host response to a number of structurally characterized and related LPS variants. Some of these variants have been shown previously to bind to and signal through the same toll-like receptor (TLR) but elicit dramatically different innate immune responses (as measured by a small number of assays). This DBP will use genome-wide expression measurements to fully characterize the host transcriptional response and targeted experiments to identify the structure/function/interaction relationships between LPS variants and host genes that result in the differences in response. This core will make extensive use of the capabilities of Cores 1 and 2 to map the observed expression changes to the predicted physical interaction network of Core 1 and to generate testable hypotheses for why the host response varies as a function of LPS structure.
- DBP - 3 - Characterization of K-protein/RNA interactions in response to interleukin-1 (IL-1) in human and mouse B and T cell cultures - Karol Bomsztyk, Professor Medicine PI. K-protein interacts with a large repertoire of RNA’s and together with other RNA-binding proteins likely serves as an architectural element that governs transcriptome organization. RNA’s that are co-precipitated with K-protein will be identified using DNA microarrays. DBP3 will test the hypothesis that K-protein binding properties define distinct classes of transcripts, and that these binding classes correspond to different mRNA fates. This DBP serves as a model for the generation of other protein-RNA interaction data and hence represents an important technological and computational advancement that could be used to generate global protein-RNA data sets and to study transcriptome organization. Core 1 will develop computational tools to map RNA-protein interactions and to define transcriptome organization. Until now, Core 1’s construction of the intracellular networks has been limited to protein-protein interactions. This DBP will generate a unique set of RNA-protein interaction data that will be used by Core 1 to expand the physical interaction network of Bioverse to include the rich RNA-protein interaction repertoire (Dr. Samudrala’s work). It will also provide a rich substrate for the development of algorithms to better predict RNA structure and RNA-protein binding motifs (Dr. Ruzzo’s work).
- DBP - 4 - Identification of Host Factors that Regulate HCV Replication - Stephen Polyak, Res. Assistant Professor Lab Medicine, PI. This project will investigate the host response to HCV genotype 1b infection. The overall goal of this project is to better understand how HCV genotype 1b modulates the host innate immune response to result in a persistent infection that is refractory to treatment by interferon.
From the Center’s standpoint, the primary goal of Core 3 is to work with Cores 1 and 2 to drive the development of software and databases needed to understand the biology of interest and to assure that such tools are generalizable to other areas of study. From the researcher’s standpoint, the DBP projects are asking fundamental questions about the nature of the host innate immune response that require the computational approaches and tools developed in Cores 1 and 2.
In addition to the science in Cores 1-3, the center will be supported by a strong infrastructure core (Core 4), a training and education core (Core 5), a dissemination core (Core 6) and an administrative core (Core 7).
Ram Samudrala, PI
Core 4 will consist predominately of a systems administrator and a database administrator. The infrastructure core will be primarily responsible for maintaining production computer systems, databases and software developed in Cores 1 and 2 (the cores will maintain their own development systems). In addition, in years 02 and on, Core 4 will provide a technical writer whose role will be to document code and to produce (in collaboration with Core 6) professional training materials and user manuals.
The primary goal of Core 4 is to provide the systems and technical writing support necessary to deploy software that is useable by external researchers and researchers within the DBP’s.
Ira Kalet, PI
The overall goal of the Education Core is to build on the existing strong educational programs at the University of Washington in Biomedical and Health Informatics, Computer Science and Engineering and the biological sciences, by adding key components that will enhance cross-disciplinary education between the information and computing sciences and the biomedical sciences.
Core 5 will:
- establish, publicize and conduct a Distinguished Lecturer series at the University of Washington, focusing on the design, use and dissemination of biomedical informatics software
- establish, publicize and conduct an Annual Workshop on Biomedical Software
- establish, publicize and conduct a new research seminar at the University of Washington, for graduate credit, on biomedical software design, which will be open to students in any related graduate program, and which will focus on the software engineering aspects of major biomedical software projects
- establish, publicize and conduct a Summer seminar series for undergraduates serving as summer interns in various University of Washington biomedical, computing and related research groups and programs
- further develop and make more widely available training opportunities for biomedical researchers in the principles of biocomputing and use of specific bioinformatics software tools; and
- further develop course offerings and opportunities for computational scientists to acquire depth of knowledge in the biomedical sciences.
The primary goal of Core 5 is to enhance cross-disciplinary education between the information and computing sciences and the biomedical sciences with a heavy emphasis on principles of software design, use and dissemination.
Peter Tarczy-Hornoch, PI
Core 6 will be focused on providing support for the dissemination of our software tools and databases. This will consist of a combination of web development, focused training and tutorials and professional software documentation.
The primary goal of Core 6 is to assure that the tools developed within the center are readily available to external researchers in the most usable form (e.g. easily distributed, well documented and associated with good training materials and courses) and that the software development efforts are responsive to user needs and inputs.
Roger Bumgarner, PI
Core 7 will provide the basic fiscal and secretarial support to run the center, to organize internal and external advisory board meetings and to assist in recruiting of new DBP’s and other interacting scientific proposals.
The overall goal of Core 7, simply put, is to assure that the Center meets its scientific and public objectives in a timely and cost effective manner.
The center itself will serve as a core computational biology resource to researchers worldwide by providing tools, databases, methodologies and training to allow researchers to take a “systems biology” approach to their projects. We will create a web-based toolkit that will allow biologists to integrate and visualize their data in the context of predicted and measured physical interactions, functional annotations and internally and externally generated data sources.
Hence, the overall goal of the Center is to assist biologists in the rapid generation of testable, mechanistic hypotheses that will provide a crucial bridge between their functional genomics data and biological understanding.
Why will the Center for Pathway Inference be a national resource?
As stated at the outset of this introduction, a fundamental issue facing modern biology is the development of methods and tools to convert genomic and functional genomics data into mechanistic understanding. We, and many others [2, 4, 6-12], are of the opinion that this is best accomplished by creating tools that allow one to visualize and analyze functional genomics data in the context of known and predicted networks and pathways. Indeed, a number of researchers outside of this center are taking similar approaches and are creating related open source tools for this kind of analysis (for example, Cytoscape [13], GenMAPP [14, 15], PathMAPA [16] and others). Our center approach builds upon this growing body of literature to create user-friendly tools that are both web accessible and, most importantly, linked to a hosted database of predicted structures, physical interaction networks and functional annotations for more than 50 genomes. In addition, the center will continue to create and improve upon algorithms and methods to produce these predictions and functional annotations and will provide improved analysis tools for ncRNA sequence data, eQTL data, and for mining expression data in the context of other data sets.
The advantages of UW Center for Pathway Inference methods relative to other approaches are:
- Coverage of many genomes for which genome-wide interaction networks are neither available nor likely to be measured soon. For example, our tools and resources will enable research in a number of genomes/biological systems (rice and many bacteria, including Agrobacterium, Salmonella, Pseudomonas aeruginosa, Vibrio cholerae and Yersinia pestis). The prediction algorithms in Core 1 are used to extend measured data in one or more genomes to predictions in any arbitrary genome.
- Hosted database of physical interactions - with other systems (e.g. Cytoscape), one has to create the database, populate it with interactions (only available for yeast, E. coli, C. elegans and Drosophila) and then download and install the visualization software and point it to the data. This is beyond the capabilities of the average biological lab.
- Web accessible visualization and analysis tools. In general, web accessible software (when it is well written) is easier for the end-user and easier to keep up-to-date and in-synch with associated database changes.
- Links to predicted functional annotations - In addition to predicted physical interactions, the Bioverse approach provides predicted structural and functional annotations for all genes in the processed genomes. For many genes, no other public sources of functional or structural information are available.
- Integration of CHIP-to-Chip data with eQTL data and cluster analysis - As discussed in detail in Core2 and DBP1, expression QTL’s are a rich source of information related to gene regulation. We anticipate that a great deal of eQTL data will be generated external to this center in the next several years. Core2 will develop the methods to analyze eQTL data in outbred populations and will use eQTL data in combination with other measurements (CHIP-to-Chip data and clustering) to produce improved predictions and probability estimates of protein-DNA interactions.
- The ability to visualize external data in the context of other related public data (e.g. public gene expression data) and the predicted networks.
In addition to the software and databases we will create, the center is committed to training existing and new scientists in both the principles of robust software design and the emerging field of systems biology. Core 5 is designed with the teaching mission of the University in mind and has a strong emphasis on training both internal and external researchers in the principles of good software design. Through a combination of course development, seminars and workshops, Core 5 will greatly expand the biocomputation educational opportunities available at the UW and elsewhere. In addition to these more "formal" educational efforts, the center will (of course) train graduate students and postdoctoral fellows via direct participation in the research mission.
Between all the cores, the center will fund a total of 6 graduate students and 6 postdoctoral fellows with the majority of these in Cores 1 and 2. Current and recent students of the faculty in the Center have been exceptionally productive. For example, KaYee Yeung (now a Research Assistant Professor but most recently a post-doctoral fellow in Dr. Bumgarner’s lab and prior to that, a Ph.D. student with Dr. Ruzzo), has 13 publications within the last four years including eight first author efforts (two of which were amongst the most highly cited papers in computer science in 2002 and 2004). Kai Wang, a second year graduate student in Dr. Samudrala’s lab, published six papers in 2004 alone (of which five were first author efforts). Ultimately, students and postdoctoral fellows that we train within the center (such as the two mentioned above), may make the biggest contribution from the center to the national resource as these individuals will go on to establish their own labs and train others.
Today, an ever-increasing number of researchers apply the tools of functional genomics (array and proteomic measurements) to their biological systems. As this type of data continues to pour into the literature, it is becoming increasingly apparent that, in many cases, very little new biological understanding is being developed [17]. The UW Center for Pathway Inference (UW-CFPI) is designed to bridge the gap between the collection of large-scale functional genomics data and mechanistic understanding of the underlying biology that is responsible for the observations. Our primary efforts are focused around the creation of tools that will allow researchers to mine their data in the context of predicted physical interaction networks, functional annotations and other related data sets. We feel that tools such as the ones proposed in this center are critical if we are to fully reap the rewards of the effort and resources that have been put into the development and application of functional genomics tools such as expression arrays and proteomics methodologies. By building on existing software, database and human resources at the University of Washington, the proposed center is well positioned to become an important national resource for computational biology.
References
1. Ideker, T., A systems approach to discovering signaling and regulatory pathways--or, how to digest large interaction networks into relevant pieces. Adv Exp Med Biol, 2004. 547: p. 21-30.
2. Yeang, C.H., T. Ideker, and T. Jaakkola, Physical network models. J Comput Biol, 2004. 11(2-3): p. 243-62.
3. Ideker, T., Systems biology 101--what you need to know. Nat Biotechnol, 2004. 22(4): p. 473-5.
4. Ideker, T., T. Galitski, and L. Hood, A new approach to decoding life: systems biology. Annu Rev Genomics Hum Genet, 2001. 2: p. 343-72.
5. Ideker, T., et al., Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science, 2001. 292(5518): p. 929-34.
6. Ideker, T.E., V. Thorsson, and R.M. Karp, Discovery of regulatory interactions through perturbation: inference and experimental design. Pac Symp Biocomput, 2000: p. 305-16.
7. Spivey, A., Systems biology: the big picture. Environ Health Perspect, 2004. 112(16): p. A938-43.
8. Dhar, P.K., H. Zhu, and S.K. Mishra, Computational approach to systems biology: from fraction to integration and beyond. IEEE Trans Nanobioscience, 2004. 3(3): p. 144-52.
9. Butcher, E.C., E.L. Berg, and E.J. Kunkel, Systems biology in drug discovery. Nat Biotechnol, 2004. 22(10): p. 1253-9.
10. Ishii, N., et al., Toward large-scale modeling of the microbial cell for computer simulation. J Biotechnol, 2004. 113(1-3): p. 281-94.
11. Covert, M.W., et al., Integrating high-throughput and computational data elucidates bacterial networks. Nature, 2004. 429(6987): p. 92-6.
12. Huang, S., Back to the biology in systems biology: what can we learn from biomolecular networks? Brief Funct Genomic Proteomic, 2004. 2(4): p. 279-97.
13. Shannon, P., et al., Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res, 2003. 13(11): p. 2498-504.
14. Doniger, S.W., et al., MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol, 2003. 4(1): p. R7.
15. Dahlquist, K.D., et al., GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat Genet, 2002. 31(1): p. 19-20.
16. Pan, D., et al., PathMAPA: a tool for displaying gene expression and performing statistical tests on metabolic pathways at multiple levels for Arabidopsis. BMC Bioinformatics, 2003. 4(1): p. 56.
17. Quackenbush, J., Genomics. Microarrays--guilt by association. Science, 2003. 302(5643): p. 240-1.