Amazon Redshift vs Athena and Delta Lake for Storing Gene Variants- A Benchmarking Study

The vast amount of genetic information made available using various high throughput experiments has opened up the possibility to systematically study inter-individual differences in drug response. The gene sequence variations obtained from genotyping studies are stored in file formats known as Variant Call Format (VCF). This format was created to identify variations in a large number of related samples after the advent of large-scale genotyping projects such as the 1000 genome project.

Now, it doesn’t take a genius to guess that VCF files are very large in size owing to the vast amount of information they contain. We have close to 23TB of VCF data on Elucidata’s cloud platform, Polly, which made it important to understand which storage tools would be efficient and cost-effective for our platform. Recently, we conducted a benchmarking study to identify the optimal tool for storing and querying the VCFs. For this, we ingested 93 million variants obtained from the GNOMAD public repository with a size of 210 GB (in compressed form) into two AWS storage tools -
1. Amazon Redshift
2. Delta Lake queried using AWS Athena
This exercise helped us to estimate the cost and time required for storing and querying vcf data on these two storages.

‍

We evaluated and compared the tools on multiple parameters such as infrastructure, data distribution, disk space consumed, and response time for running queries.

About Redshift:

Amazon Redshift is an Online analytical processing (OLAP) database offered by AWS which is highly scalable and fully managed. Redshift offers different distribution styles and sort keys which makes querying the data more efficient.
For this benchmarking, we chose the Distribution Key as the distribution style with the Chromosome column as the column for the distribution key along with the Position column as the sort key.
[For quick reference:
- The distribution key style uses a specific column as a distribution key and the hashed value of the rows in the distribution key column would be used to distribute the data across the nodes, which means all rows with the same hashed values would be stored together in the same node. It is a best practice to choose the column with low cardinality as the distribution key.
-A single column can be used as a sort key which would improve the query performance on that specific column.]

About Athena and Delta Lakes:

Athena is a serverless pay-per-query service offered by AWS which is helpful to query external tables. The pricing of the query is determined based on the amount of data scanned during the query execution. It also requires the definition of all external queryable tables to be present in a Hive-compatible metastore.

Delta Lake is an open-source storage layer that brings reliability to data lakes. It enables reads and writes in open-source Apache Parquet file format, and provides capabilities like ACID transactions, scalable metadata handling, and unified streaming and batch data processing.

For this benchmarking, we loaded the data into a Delta Lake stored in an S3 bucket and used Athena to run queries on it.

Benchmarking Results- RedShift Vs Athena

To compare the query performance, we executed a few queries and recorded the time taken by Athena and Redshift to return the query results. As listed in the table below, at this scale of 93 million rows, Athena was consistently returning the results at a similar runtime. However, Redshift performed better when the query is executed primarily on the Chromosome Column which is defined as the distribution key in Redshift.

Since Athena is serverless and its cost is calculated for the amount of data scanned, it would be difficult to directly compare it with Redshift which has a live server constantly running and billed on an hourly basis. This means Redshift has a basic cost even with 0 queries running as a continuously running server is required. For Athena, the cost is linearly dependent on the number of queries.
This makes Athena highly beneficial when the frequency of queries executed highly fluctuates. Redshift on the other hand is useful where continuous high-volume of queries are executed.

However, a true comparison of cost can be made with a clear estimate of the number of instances needed in Redshift is compared to the number of queries run on Athena.

Our Benchmarking exercise shows that to decide which tool would serve a particular user better, it would be helpful to take a close look at the volume of inbound VCF data, the frequency at which queries are executed, and the consistency of query execution time. It would be highly beneficial to run a similar analysis using these parameters to guide users in making an informed choice for the storage tools which could optimize the compute and storage cost as well as the query execution time.

Disclaimer: This study evaluated the tools for the ingested 93 million variants and the results may not be comparable if the number and size of the file are much larger or smaller.

‍

‍

‍‍

Blog Categories

Data Analysis and Management

Data Quality & Compliance

Industry Features

Product & Engineering

Data Science & Machine Learning

Company & Culture

FAIR Data

Others

Thank you for reaching out!

Our team will get in touch with you over email within next 24-48hrs.

Oops! Something went wrong while submitting the form.

Other Resources

Case Studies Dataset Roundup Documentation Glossary Solution Briefs Webinars Whitepapers

Upcoming Webinar - AlphaGenome Unpacked: Promise, Progress, and What Comes Next for AI in Genomics

Join us

[Upcoming Webinar] Scaling High-Quality Data Processing: Achieve 4x Cost Reduction for Foundation ModelsRegister Now->

Reserve Your Seat

Amazon Redshift vs Athena and Delta Lake for Storing Gene Variants- A Benchmarking Study

About Redshift:

About Athena and Delta Lakes:

Benchmarking Results- RedShift Vs Athena

Blog Categories

Talk to our Data Expert

Other Resources

Related Blogs

How to Choose the Right Data Analytics Platform for Biopharma Research

Navigating the Future of Healthcare AI: Opportunities, Challenges, and Ethical Considerations

Clinical Trials Data: Best Practices for Effective Analysis and Integration

AI Agents in Healthcare: Real Use Cases, Benefits, and How to Deploy Them Effectively

Scalable Infrastructure for Biomedical Data: Best Practices and Common Pitfalls to Avoid

Understanding Knowledge Graphs: Definition, Benefits, and Best Practices

Blog Categories

Get the latest news, industry insights, and updates delivered directly to your inbox.

Latest Blogs

How to Choose the Right Data Analytics Platform for Biopharma Research

How to Choose the Right Data Analytics Platform for Biopharma Research

Navigating the Future of Healthcare AI: Opportunities, Challenges, and Ethical Considerations

Navigating the Future of Healthcare AI: Opportunities, Challenges, and Ethical Considerations

Clinical Trials Data: Best Practices for Effective Analysis and Integration

Clinical Trials Data: Best Practices for Effective Analysis and Integration

AI Agents in Healthcare: Real Use Cases, Benefits, and How to Deploy Them Effectively

AI Agents in Healthcare: Real Use Cases, Benefits, and How to Deploy Them Effectively

Scalable Infrastructure for Biomedical Data: Best Practices and Common Pitfalls to Avoid

Scalable Infrastructure for Biomedical Data: Best Practices and Common Pitfalls to Avoid

Understanding Knowledge Graphs: Definition, Benefits, and Best Practices

Understanding Knowledge Graphs: Definition, Benefits, and Best Practices

Trending Blogs

How to Choose the Right Data Analytics Platform for Biopharma Research

Clinical Trials Data: Best Practices for Effective Analysis and Integration

Scaling Data Pipelines for High-throughput Bioinformatics

Decoding Complexities: The Critical Role of Deconvolution in Spatial Transcriptomics

Challenges with Diagnostics Data Processing Pipelines

info@elucidata.io

info@elucidata.io

info@elucidata.io