The vast amounts of data generated by high-throughput sequencing have broadened our understanding of structural and functional genomics through the concepts of “omics”, ranging from basic genomics to integrated systeomics, and have provided new insight into the workings and meaning of genetic conservation and diversity in living things.
Finding relevant biomedical datasets in public resources such as GEO, TCGA, and GTEx is time-consuming and often unfruitful: many relevant datasets go unidentified because they do not follow a common standard. Biological curation standardizes and harmonizes these datasets, which eases the process of data mining.
Here, we focus on why data curation is important, and how the process of finding relevant curated datasets can be accelerated to make drug discovery faster.
Data curation is the process of maintaining and preserving data in a useful and accessible format over time. This includes organizing, documenting, and preserving data so that it remains understandable, usable, and trustworthy. The coverage and accuracy of a database therefore need to be enhanced continuously, and a primary way of accomplishing this is a critical process called biological curation, i.e., extracting biological data from scientific literature and integrating it into a biological database. In the context of biomedical data, curation is critical for ensuring that the data is accurate, complete, and usable for future research and analysis.
Dealing with this overwhelming influx of data in a responsible way means ensuring that
(1) it is of high quality,
(2) it has meaningful metadata,
(3) it is stored in such a way that it will persist over time, and
(4) it is viewed in the context of similar data, so that comparisons and new insights can be made.
If new datasets are not curated into databases for long-term sustainability and integrated with pre-existing data, they may lose their accessibility and utility over time. If new, important data sets are not used, knowledge production and discovery rates will lag behind data production rates. In other words, data must be captured, standardized, organized, and made accessible to the scientific community if it is going to have a significant and lasting impact.
These datasets are available on public platforms like GEO, but they are not ready to use and require a lot of cleaning, harmonization, and standardization.
GEO is an international public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by the research community. A user can look for datasets by entering a keyword of interest in the search bar. Because the data on GEO is not harmonized, not all relevant hits show up in the top results. The findability and reusability of these datasets can therefore be limited, since the data is neither FAIR nor accurately curated.
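To illustrate why unharmonized metadata hurts findability, here is a minimal Python sketch. It is not GEO's search logic or any platform's curation pipeline: the records, the synonym table, and the `harmonize` and `search` helpers are all hypothetical, and a real curation effort would map labels to an ontology rather than a hand-written dictionary.

```python
# Hypothetical synonym table mapping free-text labels to one standard term.
# Illustrative only; real curation would rely on an ontology.
SYNONYMS = {
    "homo sapiens": "Homo sapiens",
    "human": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
    "liver": "liver",
    "hepatic tissue": "liver",
    "liver biopsy": "liver",
}


def harmonize(label: str) -> str:
    """Return the standard term for a label, or the cleaned label if unknown."""
    cleaned = " ".join(label.strip().lower().split())
    return SYNONYMS.get(cleaned, cleaned)


def search(records, field, term):
    """Return the ids of records whose field exactly equals the search term."""
    return [r["id"] for r in records if r[field] == term]


# Made-up submitter annotations for three datasets on the same tissue.
raw_records = [
    {"id": "DS1", "organism": "Human", "tissue": "Liver biopsy"},
    {"id": "DS2", "organism": "homo sapiens", "tissue": "hepatic tissue"},
    {"id": "DS3", "organism": "H. sapiens ", "tissue": "LIVER"},
]

# Curated copy: same records, with organism and tissue harmonized.
curated_records = [
    {**r, "organism": harmonize(r["organism"]), "tissue": harmonize(r["tissue"])}
    for r in raw_records
]

raw_hits = search(raw_records, "tissue", "liver")          # misses every dataset
curated_hits = search(curated_records, "tissue", "liver")  # finds all three
print(raw_hits, curated_hits)
```

The same search term misses every raw record but finds all three once the labels share one vocabulary, which is the effect curation is meant to have on keyword search.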
Polly is a data-centric ML Ops platform for biomedical data that provides access to FAIR (Findable, Accessible, Interoperable, and Reusable) multi-omics data from both public and private sources. ML models are used to harmonize and curate data from diverse sources, making it machine-actionable and analysis-ready. By providing a toolkit of scalable, easily customizable bioinformatics pipelines, Polly's cloud infrastructure facilitates seamless data analysis, visualization, and sharing.
Let’s look at an example to see how Polly accelerates the search for relevant curated datasets.
Scientists spend 80% of their time gathering data from public sources and preparing it for study. Polly helps reverse this ratio by delivering pertinent datasets quickly and drastically reducing the time required to make the data usable. Consider the following query.
1. Query: Datasets studying human liver transcriptome from obese and normal subjects
2. Inclusion criteria (keywords): OBESITY, LIVER, CONTROL, NORMAL, HOMO SAPIENS, HUMAN, NO CELL LINE
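Once metadata is curated, the inclusion criteria above reduce to a simple filter over structured fields. The sketch below is an assumption-laden illustration, not Polly's API: the `DatasetMeta` fields, the `meets_inclusion_criteria` helper, and the example records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetMeta:
    """Hypothetical curated metadata for one dataset (field names are assumptions)."""
    dataset_id: str
    organism: str
    tissue: str
    conditions: frozenset  # harmonized condition labels present in the dataset
    is_cell_line: bool


def meets_inclusion_criteria(d: DatasetMeta) -> bool:
    """Human liver datasets with obese and control/normal subjects, no cell lines."""
    has_case = "obesity" in d.conditions
    has_control = bool(d.conditions & {"control", "normal"})
    return (
        d.organism == "Homo sapiens"
        and d.tissue == "liver"
        and has_case
        and has_control
        and not d.is_cell_line
    )


# Made-up records, for illustration only.
catalog = [
    DatasetMeta("D1", "Homo sapiens", "liver", frozenset({"obesity", "normal"}), False),
    DatasetMeta("D2", "Mus musculus", "liver", frozenset({"obesity", "control"}), False),
    DatasetMeta("D3", "Homo sapiens", "liver", frozenset({"obesity", "control"}), True),
    DatasetMeta("D4", "Homo sapiens", "liver", frozenset({"obesity"}), False),
    DatasetMeta("D5", "Homo sapiens", "adipose", frozenset({"obesity", "normal"}), False),
]

selected = [d.dataset_id for d in catalog if meets_inclusion_criteria(d)]
print(selected)
```

Only the first record passes: the others fail on organism, cell-line status, missing controls, or tissue. A filter like this works only because each field holds a harmonized value; with free-text annotations, the same criteria would require manual review of every candidate dataset.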
Discovery of relevant datasets for any disease is 83% faster on Polly than on the source (GEO). The probability of finding a relevant pool of datasets is 21% higher on Polly, because curation surfaces more of the datasets of interest than a GEO search does. Polly also provides access to third-party applications such as Phantasus and Cellxgene, which offer quick insights during data exploration. The full metadata summary is available on one consolidated page, so users do not have to navigate across multiple pages to extract the information they need.
If you spend your time scouring datasets just to find the relevant ones for downstream analysis, now is the time to reach out. Connect with us to learn how to accelerate your research.