OmixAtlas - A Data-Centric Approach to Research with ML-Ready Biomolecular Data

Jayashree
October 20, 2022

Biological multi-omics data hold tremendous potential for reuse and discovery. Academic labs and organizations worldwide generate and publish an enormous amount of data. However, that data is scattered across multiple sources and lacks standardization. In practice it is not FAIR, because making data available does not make it usable. Elucidata’s data warehouse, OmixAtlas, is a repository of FAIR (Findable, Accessible, Interoperable, Reusable) data. It is a collection of millions of datasets from public, proprietary, and licensed sources that have been curated, harmonized, and made ready for downstream machine learning and analytical applications. It offers one central location to access data across 26 data types from over 30 public repositories and licensed sources.

All datasets on Polly go through a 2-step process:

  1. Data Engineering: This includes transforming data to fit a proprietary data schema that is uniform across several datatypes. The transformation streamlines data in one consistent schema and allows users to query multiple data types on a single data infrastructure.
  2. Metadata Harmonization: This involves tagging each sample and dataset with a uniform ontology.
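
The harmonization step can be pictured with a minimal sketch. The vocabulary, field names, and the `harmonize_sample` helper below are hypothetical and only illustrate the idea of tagging each sample with one uniform term; they are not Polly's actual schema or pipeline.

```python
# Hypothetical mapping from free-text tissue labels to one controlled term.
# The vocabulary is illustrative and not taken from Polly.
TISSUE_ONTOLOGY = {
    "liver": "liver",
    "hepatic tissue": "liver",
    "whole blood": "blood",
    "peripheral blood": "blood",
}


def harmonize_sample(sample: dict) -> dict:
    """Return a copy of the sample with an added curated tissue field."""
    raw = sample.get("tissue", "").strip().lower()
    curated = dict(sample)
    # Unknown labels are left unmapped (None) rather than guessed.
    curated["curated_tissue"] = TISSUE_ONTOLOGY.get(raw)
    return curated


samples = [
    {"sample_id": "S1", "tissue": "Hepatic tissue"},
    {"sample_id": "S2", "tissue": " Peripheral blood "},
    {"sample_id": "S3", "tissue": "tumour"},
]
harmonized = [harmonize_sample(s) for s in samples]
for row in harmonized:
    print(row["sample_id"], row["curated_tissue"])
```

Leaving unmapped labels empty, instead of forcing a closest match, keeps uncurated values visible for later review.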

[Figure: Making Data FAIR]

How Is Data Structured on OmixAtlas?

Data schema: Data in OmixAtlas is organized into defined indexes according to the information it contains. These indexes are:

  1. Dataset-level metadata (index: files): Contains curated fields like drug, disease, tissue, organism, etc., for each dataset.
  2. Sample-level metadata (index: gct_metadata, h5ad_metadata, and biom_metadata): Contains curated fields like cell lines, experimental design, etc., for each sample.
  3. Feature-level metadata (index: gct_row_metadata, h5ad_data, and biom_data): Contains the gene/molecule symbol along with the feature intensity for each sample.
  4. Variant-related data (index: variant_data): Contains the schema for variant-related information present in vcf files.
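
As a rough illustration, the four index groups can be written down as a small lookup table. The index names are copied from the list above; the `INDEXES` layout and the `indexes_for` helper are hypothetical conveniences, not part of Polly.

```python
from typing import List, Optional

# Index names come from the list above; the grouping keys are illustrative.
INDEXES = {
    "dataset": ["files"],
    "sample": ["gct_metadata", "h5ad_metadata", "biom_metadata"],
    "feature": ["gct_row_metadata", "h5ad_data", "biom_data"],
    "variant": ["variant_data"],
}


def indexes_for(level: str, file_format: Optional[str] = None) -> List[str]:
    """Return index names for a metadata level, optionally for one file format."""
    names = INDEXES[level]
    if file_format is None:
        return list(names)
    # Sample and feature indexes are prefixed with the file format (gct, h5ad, biom).
    return [name for name in names if name.startswith(file_format)]


print(indexes_for("dataset"))
print(indexes_for("sample", "h5ad"))
print(indexes_for("feature", "gct"))
```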

Accessing OmixAtlas

OmixAtlas provides access to thousands of tissue-derived or disease-specific multi-omics datasets from multiple sources in one place. The data can be accessed and analyzed on the same computational infrastructure.

The datasets on Polly can be accessed through the GUI or programmatically with Polly Python.

The Polly Python library exposes the following functionality as Python functions:

  1. Creating and updating an OmixAtlas
  2. Querying data and metadata
  3. Downloading any dataset
  4. Working with Workspaces
  5. Working with the data schema
  6. Ingesting data on OmixAtlas
  7. Working with cohorts

The Polly Python library allows access to data in OmixAtlas from any computational platform, such as SageMaker or Polly itself.

Dataset details can also be visualized easily in the UI.

[Figure: OmixAtlas Landing Page]

[Figure: GEO OmixAtlas Summary]

  1. General details such as the number of tissues, diseases, organisms, etc., are displayed on the summary page of an OmixAtlas.
  2. OmixAtlas offers filtering options that let a user narrow search results by metadata fields such as dataset ID, number of samples, description, drugs, cell type, cell line, and disease.
  3. Datasets can also be downloaded through the UI.

Features of High Significance on Polly

• When handling large volumes of data across different omics datasets, do you need to group samples from multiple OmixAtlases so that data from different datasets and repositories is easier to analyze?

Look no further! We’ve got you covered with Cohorting, a feature that lets you group datasets or samples on Polly based on the metadata of interest. It enables you to study the difference between two cohorts, for example Diseased vs Normal or Cancerous vs Non-Cancerous cells.
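
Conceptually, building cohorts comes down to grouping samples by a shared metadata value, even when the samples come from different datasets. The sketch below shows that idea in plain Python; the sample records, field names, and `build_cohorts` helper are hypothetical and do not reflect the Polly Python cohort API.

```python
from collections import defaultdict


def build_cohorts(samples, field):
    """Group sample IDs by the value of one metadata field."""
    cohorts = defaultdict(list)
    for sample in samples:
        cohorts[sample[field]].append(sample["sample_id"])
    return dict(cohorts)


# Hypothetical samples drawn from two different datasets.
samples = [
    {"sample_id": "A1", "dataset_id": "DS001", "condition": "diseased"},
    {"sample_id": "A2", "dataset_id": "DS001", "condition": "normal"},
    {"sample_id": "B1", "dataset_id": "DS002", "condition": "diseased"},
]

cohorts = build_cohorts(samples, "condition")
print(cohorts)
```

Because grouping relies only on the metadata field, samples from separate datasets land in the same cohort as long as that field has been harmonized to the same term.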

• Missing out on datasets while querying just because your search term does not match the ontological term?

For instance, when querying datasets for the disease IBD, the ideal result set must include datasets annotated with the diseases ‘inflammatory bowel diseases’, ‘inflammatory bowel diseases, Crohn's disease’, and ‘inflammatory bowel diseases 8’. However, when the keyword is not expanded under the hood, the query returns fewer valid hits.

To overcome this, Polly has the ‘Ontology Recommendations’ functionality integrated into Polly Python. It aims to return more valid hits with less user effort. The keyword is expanded implicitly, which reduces manual intervention.
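
The effect of keyword expansion can be sketched with a toy lookup table. The expansion terms for IBD are the ones named above; the `EXPANSIONS` table, the dataset records, and both helper functions are hypothetical and are not how Polly Python implements the feature.

```python
# Toy expansion table; the IBD terms are taken from the example in the text.
EXPANSIONS = {
    "ibd": [
        "inflammatory bowel diseases",
        "inflammatory bowel diseases, Crohn's disease",
        "inflammatory bowel diseases 8",
    ],
}


def expand_keyword(keyword):
    """Return the keyword followed by any related ontology terms."""
    return [keyword] + EXPANSIONS.get(keyword.strip().lower(), [])


def search(datasets, keyword):
    """Return IDs of datasets whose disease matches the keyword or an expansion."""
    terms = {term.lower() for term in expand_keyword(keyword)}
    return [d["dataset_id"] for d in datasets if d["disease"].lower() in terms]


# Hypothetical dataset-level metadata.
datasets = [
    {"dataset_id": "DS001", "disease": "inflammatory bowel diseases"},
    {"dataset_id": "DS002", "disease": "IBD"},
    {"dataset_id": "DS003", "disease": "obesity"},
    {"dataset_id": "DS004", "disease": "inflammatory bowel diseases 8"},
]

hits = search(datasets, "IBD")
print(hits)
```

Without the expansion step, only the dataset annotated literally as "IBD" would match.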

For example, if the user queries datasets for the disease ‘obesity’, the ontology recommendations also add searches for related terms:

[Figure: ontology recommendations for ‘obesity’]

• With tons of data generated and published in public repositories every year, do you find it challenging to find the right resource to curate and harmonize that data to your needs?

Our Curation app is the solution to your curation woes. It helps you curate, standardize, and harmonize the clinical data you’ve generated, in a double-blinded manner, and converts it into analysis-ready formats!

Along with standard metadata curation, we also offer custom metadata curation, in which users can curate a field of their choice, for instance cancer stage or BMI. The user can define the custom column header and, if applicable, the ontology to be used.
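
A custom curation field can be thought of as a column header plus an optional controlled vocabulary. The sketch below models that idea; the `CustomField` class, its attributes, and the example vocabulary are hypothetical and are not the Curation app's data model.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class CustomField:
    """A user-defined curation field (hypothetical model)."""

    header: str
    # None means the field is free text with no ontology attached.
    allowed_terms: Optional[FrozenSet[str]] = None

    def validate(self, value: str) -> bool:
        """Accept any value for free-text fields, else only listed terms."""
        return self.allowed_terms is None or value in self.allowed_terms


cancer_stage = CustomField(
    header="cancer_stage",
    allowed_terms=frozenset({"stage I", "stage II", "stage III", "stage IV"}),
)
bmi = CustomField(header="bmi")

print(cancer_stage.validate("stage II"))
print(cancer_stage.validate("stage 5"))
print(bmi.validate("27.4"))
```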

• Visualization apps

  • Spotfire integration is readily available on Polly to help you visualize your data. It saves time because it runs on the same platform, with no separate integration to set up. Datasets on Polly are richly annotated with a variety of metadata labels, so the data can be easily analyzed, visualized, and shared in the form of dashboards.
  • Cellxgene is a widely used tool for visualizing processed single-cell datasets. It is an interactive application used primarily for preliminary analysis and exploration of single-cell datasets.

Public and Enterprise OmixAtlas

Public OmixAtlas is a repository of more than 1.5 million datasets and 4.1 million samples aggregated from 32 publicly available sources. In addition, in-house data can be managed at scale with our Enterprise OmixAtlas, where proprietary data is standardized and curated. This significantly reduces the time spent on processing datasets.

Benefits of Public OmixAtlas:

  1. Allows querying of millions of records in a single request using SQL.
  2. Analysis of data using algorithms of choice.
  3. Accessing data using any computational infrastructure.
  4. Conversion of data formats to suit analysis needs.
  5. Augmenting internally generated data with public studies.
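
To give a feel for SQL-style querying over dataset-level metadata, the sketch below uses Python's built-in `sqlite3` module as a local stand-in. The table name, column names, and rows are hypothetical; a real query would go through Polly Python against OmixAtlas rather than an in-memory database.

```python
import sqlite3

# In-memory database standing in for dataset-level metadata (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datasets (dataset_id TEXT, disease TEXT, tissue TEXT, organism TEXT)"
)
conn.executemany(
    "INSERT INTO datasets VALUES (?, ?, ?, ?)",
    [
        ("DS001", "obesity", "adipose tissue", "Homo sapiens"),
        ("DS002", "obesity", "liver", "Mus musculus"),
        ("DS003", "obesity", "liver", "Homo sapiens"),
        ("DS004", "asthma", "lung", "Homo sapiens"),
    ],
)

# One request that filters on two curated fields at once.
rows = conn.execute(
    "SELECT dataset_id FROM datasets "
    "WHERE disease = ? AND organism = ? ORDER BY dataset_id",
    ("obesity", "Homo sapiens"),
).fetchall()
matching_ids = [row[0] for row in rows]
print(matching_ids)
conn.close()
```

Filtering on curated fields works this cleanly only because the metadata values have been harmonized to consistent terms.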

[Figure: Public OmixAtlas]

Benefits of Enterprise OmixAtlas:

  1. Using Polly’s technology for storing, curating, and managing biomolecular data at scale.
  2. Access to built-in filters on Polly’s UI or other GUI-based platforms, as well as Polly's programmatic interface, to find and interpret data in Enterprise OmixAtlas.

[Figure: Proprietary data standardized and curated - enabling analysis and novel insights]

Contact us if you want to learn more about using our 1.5 million curated datasets to train your models, or about using our data-centric platform, Polly, to find and analyze relevant datasets.