Gemma: A Treasure Trove for Genomic Data Gee(q)s

Since the turn of the century, the development of new methods and platforms, combined with improved computational infrastructure and reduced costs, has made it easier for labs to adopt high-throughput technologies. Consequently, massive amounts of genomics data are being generated. These data are not limited to answering a single research question; they hold immense potential for reuse. Several public repositories promote data reuse, each dedicated to a specific kind of omics data or a specific disease.

However, data abundance is a measure of quantity, not quality. Public repositories often do not enforce specific guidelines for depositing data. As a result, a great deal of time has to be spent on cleaning and pre-processing to bring the data into a usable form. One way to accelerate the discovery of novel insights from legacy data is to curate it.

A Curated Database of Genomics Datasets

Gemma was established in 2012 by the Pavlidis lab at the University of British Columbia. It is a database containing approximately 10,000 genomics datasets that have been curated to enable meta-analysis. Data have been sourced from multiple public repositories, primarily the Gene Expression Omnibus (GEO). It hosts microarray as well as RNA-sequencing data and offers support for numerous platforms such as Affymetrix, Illumina and nucleotide arrays. An important feature of Gemma is its data annotation, done by both manual and automated means, which makes the data easier to use. Apart from serving as a database, Gemma also hosts web-based tools for data exploration and discovery.

Geeq-ing Out Over Data Quality

A key difference between public repositories and Gemma is the emphasis on the quality of the data being hosted. Once expression data have been imported from the public source, curators add the corresponding annotations and sample metadata. Array and experimental designs are also subjected to quality checks. Array design undergoes sequence analysis and corresponding gene assignment. Experimental design is closely examined to ensure that datasets meet certain criteria (such as a minimum number of samples and minimal outliers or missing data). Gemma datasets are put through a quality assessment known as Geeq, which takes into consideration the quality and suitability of the data.
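
As a rough illustration of the kind of criteria mentioned above, the following Python sketch screens a small expression dataset for sample count, missing values and outlier samples. It is not Gemma's Geeq implementation: the function name `screen_dataset`, the thresholds and the correlation-based outlier rule are all assumptions made for this example.

```python
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant profile
    return cov / (var_x * var_y) ** 0.5


def screen_dataset(samples, min_samples=4, max_missing=0.1, min_median_corr=0.5):
    """Hypothetical quality screen; thresholds are illustrative assumptions.

    samples maps a sample name to its list of expression values,
    one per gene, with None marking a missing measurement.
    """
    names = list(samples)
    n_values = sum(len(v) for v in samples.values())
    n_missing = sum(x is None for v in samples.values() for x in v)
    missing_fraction = n_missing / n_values if n_values else 1.0

    # Compare samples only on genes measured in every sample.
    n_genes = min((len(v) for v in samples.values()), default=0)
    complete = [i for i in range(n_genes)
                if all(samples[s][i] is not None for s in names)]

    # Flag a sample as an outlier when its median correlation
    # with the other samples falls below the chosen threshold.
    outliers = []
    if len(complete) >= 2:
        for s in names:
            profile = [samples[s][i] for i in complete]
            corrs = [pearson(profile, [samples[t][i] for i in complete])
                     for t in names if t != s]
            if corrs and median(corrs) < min_median_corr:
                outliers.append(s)

    return {
        "enough_samples": len(names) >= min_samples,
        "missing_fraction": missing_fraction,
        "missing_ok": missing_fraction <= max_missing,
        "outliers": outliers,
    }
```

Calling `screen_dataset` on a dictionary of sample profiles returns a small report that a curator could use to decide whether a dataset needs a closer look. A real pipeline would work on full expression matrices and use more robust statistics.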

Enabling Data Reuse

Gemma aims to streamline and ease the process of genomics data reuse by going the extra mile to ensure that all the data added to the platform are curated and subjected to rigorous quality checks. By eliminating the time spent on making the data usable, Gemma helps scientists focus on the more important task of using data to answer research questions and push the boundaries of science as we know it.