The vast amount of genetic information generated by high-throughput experiments has opened up the possibility of systematically studying inter-individual differences in drug response. The gene sequence variations obtained from genotyping studies are stored in a file format known as the Variant Call Format (VCF). This format was created to record variations across large numbers of related samples after the advent of large-scale genotyping projects such as the 1000 Genomes Project.
Now, it doesn’t take a genius to guess that VCF files are very large, owing to the vast amount of information they contain. We have close to 23TB of VCF data on Elucidata’s cloud platform, Polly, which made it important to understand which storage tools would be efficient and cost-effective for our platform. Recently, we conducted a benchmarking study to identify the optimal tool for storing and querying VCFs. For this, we ingested 93 million variants obtained from the gnomAD public repository, with a size of 210 GB (in compressed form), into two storage setups on AWS -
1. Amazon Redshift
2. Delta Lake queried using AWS Athena
This exercise helped us estimate the cost and time required for storing and querying VCF data in these two systems.
We evaluated and compared the tools on multiple parameters such as infrastructure, data distribution, disk space consumed, and response time for running queries.
Amazon Redshift is an online analytical processing (OLAP) database offered by AWS that is highly scalable and fully managed. Redshift offers different distribution styles and sort keys, which make querying the data more efficient.
For this benchmarking, we chose KEY as the distribution style, with the Chromosome column as the distribution key and the Position column as the sort key.
[For quick reference:
- With the KEY distribution style, rows are assigned to compute nodes based on the hash of their value in the distribution key column, which means all rows with the same key value are stored together on the same node. A best practice is to choose a column with high cardinality and an evenly distributed set of values, so that rows spread uniformly across the nodes.
- A single column can be used as the sort key, which improves the performance of queries that filter or range-scan on that column.]
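To make the distribution-key idea concrete, here is a minimal sketch. The DDL below is a hypothetical Redshift table (the column names are illustrative, not our actual schema), and the hash function is a stand-in for Redshift's internal row hashing, used only to show how rows sharing a key value co-locate on one node:

```python
import hashlib
from collections import Counter

# Hypothetical Redshift DDL for a variants table (illustrative columns).
CREATE_VARIANTS = """
CREATE TABLE variants (
    chromosome VARCHAR(8),
    position   BIGINT,
    ref_allele VARCHAR(512),
    alt_allele VARCHAR(512)
)
DISTSTYLE KEY
DISTKEY (chromosome)
SORTKEY (position);
"""

def assign_node(dist_key: str, n_nodes: int) -> int:
    """Mimic hash-based distribution: rows with the same key co-locate.
    (MD5 is a stand-in; Redshift uses its own internal hash.)"""
    digest = hashlib.md5(dist_key.encode()).hexdigest()
    return int(digest, 16) % n_nodes

rows = [("chr1", 12345), ("chr1", 67890), ("chr2", 111), ("chrX", 222)]
placement = Counter(assign_node(chrom, n_nodes=4) for chrom, _ in rows)
# Both "chr1" rows land on the same node. Note that with only ~25
# chromosome values the spread across nodes can be uneven (skew),
# which is the trade-off of a low-cardinality distribution key.
print(placement)
```

Queries that filter on `chromosome` can then be answered largely within the node(s) holding that key, which is why Redshift did well on chromosome-based queries in our tests.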
About Athena and Delta Lake:
Athena is a serverless, pay-per-query service offered by AWS for querying external tables in place. The price of a query is determined by the amount of data scanned during its execution. Athena also requires the definitions of all external queryable tables to be registered in a Hive-compatible metastore, such as the AWS Glue Data Catalog.
Delta Lake is an open-source storage layer that brings reliability to data lakes. It enables reads and writes in open-source Apache Parquet file format, and provides capabilities like ACID transactions, scalable metadata handling, and unified streaming and batch data processing.
For this benchmarking, we loaded the data into a Delta Lake stored in an S3 bucket and used Athena to run queries on it.
To compare query performance, we executed a few queries and recorded the time taken by Athena and Redshift to return the results. As listed in the table below, at this scale of 93 million rows, Athena returned results with consistently similar runtimes across query types. Redshift, however, performed better when the query filtered primarily on the Chromosome column, which is defined as the distribution key in Redshift.
Since Athena is serverless and its cost is calculated from the amount of data scanned, it is difficult to compare directly with Redshift, which has a live server constantly running and is billed on an hourly basis. This means Redshift incurs a baseline cost even with zero queries running, since the cluster must stay up. For Athena, the cost scales with the number of queries and the data each one scans.
This makes Athena highly beneficial when the frequency of queries fluctuates widely. Redshift, on the other hand, is useful where a continuous, high volume of queries is executed.
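A back-of-the-envelope model makes this trade-off concrete. The rates below are illustrative assumptions (Athena's commonly cited $5 per TB scanned, and a small Redshift node at $0.25/hour); check current AWS pricing before relying on them:

```python
ATHENA_PER_TB = 5.00      # USD per TB scanned (assumed rate)
REDSHIFT_PER_HOUR = 0.25  # USD per hour for one small node (assumed rate)

def athena_monthly_cost(queries_per_month: int, tb_scanned_per_query: float) -> float:
    # Pay-per-query: zero queries means zero cost.
    return queries_per_month * tb_scanned_per_query * ATHENA_PER_TB

def redshift_monthly_cost(n_nodes: int, hours: float = 730.0) -> float:
    # Baseline cost accrues even with zero queries running.
    return n_nodes * hours * REDSHIFT_PER_HOUR

# Example: if each query scans ~50 GB (0.05 TB), Athena stays cheaper
# than a single always-on node until roughly this many queries/month:
cost_per_query = 0.05 * ATHENA_PER_TB
breakeven = redshift_monthly_cost(n_nodes=1) / cost_per_query
print(f"break-even ≈ {breakeven:.0f} queries/month")  # ≈ 730
```

Below the break-even point the pay-per-scan model wins; above it, the flat-rate cluster does, which matches the fluctuating-versus-continuous workload distinction above.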
However, a true cost comparison requires a clear estimate of the number and size of Redshift instances needed versus the number of queries (and the data they scan) run on Athena.
Our benchmarking exercise shows that deciding which tool would serve a particular user better requires a close look at the volume of inbound VCF data, the frequency at which queries are executed, and the need for consistent query execution times. Running a similar analysis on these parameters can guide users toward the storage tool that optimizes compute and storage cost as well as query execution time.
Disclaimer: This study evaluated the tools on the ingested 93 million variants, and the results may not generalize if the number and size of the files are much larger or smaller.