Integration Pays Off!

Evidence from discovery of functionally coherent gene groups in budding yeast

What is common to bread, beer and wine? They are all made thanks to one magical ingredient: Saccharomyces cerevisiae, generally known as budding yeast. But for geneticists, budding yeast is much more than that. It is a laboratory superhero: a flexible, safe and fast-growing organism. Studies of this genetic model organism help uncover gene functions and answer fundamental biological questions, and, because many genes are conserved across species, they promise to shed light on biological processes in larger eukaryotes.


Recent developments in molecular biology and in techniques for genome-wide data acquisition are releasing a flood of data that can be used to profile genes and predict their function. These data sets come from diverse sources, and it remains an open question how to treat them jointly and fuse them into a single prediction model. A prevailing technique for identifying groups of related genes that exhibit similar profiles is profile-based clustering. The function of an uncharacterized gene can then be inferred from the prevailing function of the other genes in its cluster. This “guilt by association” principle assumes that gene clusters are functionally enriched, that is, that genes with similar functions cluster together, making the clusters functionally coherent.
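The “guilt by association” idea can be illustrated with a minimal Python sketch. It assumes a simple majority vote over the annotated genes in a cluster; the gene names, the annotations and the helper `predict_function` are made up for illustration and are not taken from the study.

```python
from collections import Counter


def predict_function(cluster, annotations):
    """Return the most common known function among the genes of a cluster.

    `cluster` is a list of gene names; `annotations` maps characterized
    genes to a function label. Returns None if no gene in the cluster
    is annotated.
    """
    known = [annotations[gene] for gene in cluster if gene in annotations]
    if not known:
        return None
    # The prevailing function is the label carried by the most genes.
    return Counter(known).most_common(1)[0][0]


# Toy, hypothetical data: geneD is uncharacterized.
cluster = ["geneA", "geneB", "geneC", "geneD"]
annotations = {
    "geneA": "ribosome biogenesis",
    "geneB": "ribosome biogenesis",
    "geneC": "glycolysis",
}

# geneD inherits the prevailing function of its cluster.
predicted = predict_function(cluster, annotations)
```

In practice, a statistical enrichment test would be a more careful choice than a plain majority vote; the vote is used here only to keep the principle visible.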


Gene profile data on budding yeast represent an avenue for exploring new bioinformatics methods. In bioinformatics, integrative approaches are motivated by the desire to improve robustness, stability and accuracy. Cluster inference may benefit from consensus across different clustering models. With this in mind, BioSense researchers have proposed a technique that infers separate gene clusters from each of the available data sources and then fuses them by means of non-negative matrix factorization.


Our study drew on a collection of diverse data sources: metabolic cycle gene expression measured at different time points, data sets from the Saccharomyces Genome Database and double mutant phenotypes inferred by synthetic genetic arrays. We have demonstrated that our approach can successfully integrate heterogeneous data sets and yield high-quality clusters that could not otherwise be inferred by simply merging the gene profiles prior to clustering. The quality of the discovered clusters stems from data integration, as different data sources may provide different but complementary insights into the observed system. The integrative approach based on non-negative matrix factorization is robust, infers gene groups with high functional enrichment and compares favourably to alternative integration approaches.

The integration method fuses clusterings that stem from different data sets, different data pre-processing steps or different clustering techniques. The steps of the integrative approach are pre-processing of gene profiles, estimation of similarity scores, inference of gene networks, clustering, and integration of the results into a joint gene-cluster matrix. The final clusters are obtained by non-negative matrix factorization.
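The last two steps can be sketched in Python with NumPy. This is only an illustration under stated assumptions, not the implementation used in the study: the joint matrix is assumed to be a binary gene-by-cluster membership matrix, the factorization uses the standard multiplicative updates for the Frobenius norm, and the input clusterings, the rank and the helper names `membership_matrix` and `nmf` are hypothetical.

```python
import numpy as np


def membership_matrix(clusterings):
    """Stack several clusterings into one binary genes x clusters matrix.

    Each clustering is a list with one cluster label per gene; every
    cluster of every clustering becomes one column.
    """
    columns = []
    for labels in clusterings:
        for cluster_id in sorted(set(labels)):
            columns.append([1.0 if l == cluster_id else 0.0 for l in labels])
    return np.array(columns).T


def nmf(V, rank, n_iter=500, seed=0, eps=1e-9):
    """Factorize V ~ W @ H with non-negative W and H (multiplicative updates)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Three hypothetical clusterings of six genes, e.g. one per data source.
# Cluster labels are arbitrary and need not match across clusterings.
clusterings = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
]

V = membership_matrix(clusterings)   # joint gene-cluster matrix, 6 x 6
W, H = nmf(V, rank=2)

# Rescale so both components are comparable, then assign each gene
# to the component on which it loads most strongly.
W_scaled = W * H.sum(axis=1)
final_clusters = W_scaled.argmax(axis=1)
```

On this toy input, the first two and the last three genes agree across all three clusterings, so they would be expected to fall into separate final clusters, while the third gene, on which the clusterings disagree, is placed wherever its loading is larger.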