Data Science & Machine Learning

Nextflow on Polly: Streamlining Bioresearch Pipelines in the Cloud

Sahil Rai
January 24, 2024

In the fast-paced world of biomedical research, conquering mountains of data is essential for groundbreaking discoveries. However, wrangling diverse datasets, processing them efficiently, and orchestrating complex bioinformatics analyses can be a tedious task. The solution? Nextflow!

Nextflow offers a versatile workflow engine to accelerate key research milestones.

What is Nextflow?

Nextflow is an open-source workflow manager tailored for scientific workflows, particularly in bioinformatics and data analysis. Its declarative DSL simplifies workflow definition, ensuring portability across diverse computing environments, including local machines, clusters, and cloud platforms.

With built-in support for parallel and distributed execution, Nextflow optimally utilizes resources, scaling from laptops to large clusters. It seamlessly integrates with containerization technologies, promoting reproducibility and consistent execution. Nextflow simplifies the creation, sharing, and execution of complex computational workflows for researchers and scientists through data-driven dependency resolution, user-friendly design, and a supportive community.
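
The data-driven execution described above can be illustrated with a rough analogy. The sketch below is plain Python, not Nextflow code: the step names `quantify` and `summarize` and the sample names are hypothetical. It only shows the idea that per-sample tasks run in parallel and each downstream step fires as soon as its input is ready.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def quantify(sample):
    # Hypothetical first step: stands in for a per-sample tool run.
    return sample, len(sample)


def summarize(result):
    # Hypothetical second step: consumes the output of quantify().
    sample, value = result
    return f"{sample}:{value}"


def run_pipeline(samples, workers=4):
    summaries = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Every sample is submitted at once; the pool runs them in parallel.
        futures = [pool.submit(quantify, s) for s in samples]
        # The downstream step is triggered by data availability, not by a
        # fixed order: whichever sample finishes first is summarized first.
        for fut in as_completed(futures):
            summaries.append(summarize(fut.result()))
    # Sort so the returned list is deterministic regardless of finish order.
    return sorted(summaries)


results = run_pipeline(["sampleA", "sampleBB", "sampleCCC"])
print(results)
```

In Nextflow itself, this wiring is expressed through processes and channels rather than an explicit thread pool, and the same definition can be dispatched to a laptop, a cluster, or the cloud.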

What is Polly?

Polly is Elucidata’s data harmonization platform. It makes vast amounts of scattered biological multi-omics data readily usable by researchers by processing the data through pipelines and annotating datasets with rich metadata, thereby improving data quality. This makes data more findable, accessible, interoperable, and reusable (FAIR).

What’s Unique About Polly’s Hosting of Nextflow?

Compared with a stock Nextflow deployment, Nextflow on Polly features numerous infrastructure enhancements. The improvements are detailed below:

  1. Zero Infrastructure Management
    Perhaps one of the most significant advantages of using Nextflow on Polly is the elimination of infrastructure management hassles. Nextflow on Polly is carefully designed with bioinformaticians in mind. All you have to do is focus on writing beautiful pipelines and forget managing complex infrastructure!
    To provide context on the time saved by utilizing Nextflow on Polly compared to building a similar infrastructure from the ground up, consider this:
    Building such an infrastructure independently would require a skilled engineering team working for roughly 9 to 12 months to make it scalable, cost-effective, and user-friendly for bioinformaticians. The responsibility of maintaining the infrastructure would then persist for as long as it is in use. By opting for Nextflow on Polly, researchers can avoid this effort and spend their time on the science instead.
  2. Effortless Scalability
    One of the standout features of Nextflow on Polly is its ability to handle a multitude of concurrent executions of pipelines. Whether you're analyzing a single sample or a terabyte-sized dataset, Nextflow on Polly scales seamlessly. Researchers now have the flexibility to run multiple pipelines simultaneously, enhancing efficiency and reducing the time it takes to execute complex analyses.

    This capability is made possible by a custom file system and a multi-cluster architecture engineered for scalability. Our custom file system provides nearly limitless storage space for your pipelines. You do not need to specify the required storage capacity for your pipeline in advance; it scales dynamically according to demand. The multi-cluster architecture means your computational needs do not depend on a single cluster, and we have simplified the process of adding extra clusters on the fly to accommodate increased computing requirements.
    To validate the scalability, we conducted tests using an RNA-Seq pipeline (Kallisto). We processed approximately 60,000 samples, each averaging close to 2.5 GB in size, within a month on our infrastructure and still had spare capacity.
  3. Cost Optimization at the Core
    In today's resource-intensive research environment, cost optimization is a key consideration. Nextflow on Polly is optimized for cost efficiency through the utilization of AWS Spot instances and other innovative cost-saving techniques. By leveraging these cost-effective resources, researchers can maximize computational power while minimizing expenditure, making cutting-edge research more accessible and sustainable.
    Our commitment to cost efficiency extends beyond infrastructure; it is also built into our pipeline development. Certain pipelines, such as RNA-Seq, have been crafted with a leaner design than publicly available alternatives or offerings from other vendors, so researchers can achieve robust results while consuming fewer resources. By applying cost-saving measures at both the infrastructure and pipeline levels, Nextflow on Polly aligns with the financial constraints researchers face.

    In our comparative analysis against other vendors offering Nextflow pipeline execution environments, Nextflow on Polly was 30% more cost-effective than the most economical alternative on the market.
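
To put the scalability test figures quoted above into perspective, here is a back-of-the-envelope calculation. It uses only the approximate numbers from the text (about 60,000 samples at roughly 2.5 GB each) and assumes that "a month" means 30 days and that 1 TB equals 1,000 GB. The results are rough estimates, not measured throughput.

```python
SAMPLES = 60_000       # approximate sample count quoted in the text
AVG_SAMPLE_GB = 2.5    # approximate average sample size quoted in the text
DAYS = 30              # assumption: "a month" taken as 30 days


def throughput(samples, avg_gb, days):
    """Return rough totals and daily rates for a batch of samples."""
    total_tb = samples * avg_gb / 1000  # assumption: decimal TB (1 TB = 1000 GB)
    return {
        "total_tb": total_tb,
        "samples_per_day": samples / days,
        "tb_per_day": total_tb / days,
    }


stats = throughput(SAMPLES, AVG_SAMPLE_GB, DAYS)
print(f"Total data:      {stats['total_tb']:.0f} TB")
print(f"Samples per day: {stats['samples_per_day']:.0f}")
print(f"Data per day:    {stats['tb_per_day']:.0f} TB")
```

Under these assumptions, the test corresponds to about 150 TB of input in total, or around 2,000 samples and 5 TB per day.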

Added Benefits of Using Nextflow on Polly

  1. Streamlined Data Organization with Polly Workspaces Integration
    Polly Workspaces empowers researchers to declutter their datasets and organize them in a folder system of their choice. This integration enables researchers to easily import data directly from workspaces into their Nextflow pipelines and back. This feature enhances collaboration and ensures that researchers can easily locate and access the data they need, streamlining the entire research process.
  2. Exporting to Polly Atlas for Search & Query
    Polly’s Atlas functions as a curated repository for biomedical data, with built-in filtering and search capabilities that reduce the need for manual data searches. Researchers can export their datasets directly to Atlas from Nextflow pipelines, which simplifies transforming data to fit a uniform proprietary data schema across various data types. This transformation keeps data consistent within a single schema, enabling users to query multiple data types on a unified data infrastructure. Researchers can focus on analysis and entrust Atlas with the storage and management of their data.
  3. Python Library for Execution and Monitoring
    Nextflow pipelines on Polly can be executed and monitored seamlessly using the Polly Python library. This feature enables integration with other software and tools, providing researchers with the flexibility to incorporate Nextflow pipelines into their existing workflows. Automation becomes a reality, offering efficiency gains and enhanced collaboration between different tools and processes.
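
The submit-then-monitor pattern described above can be sketched as follows. This is a generic illustration only: `FakePipelineClient`, its `submit` and `get_status` methods, the status strings, and the run ID are all hypothetical stand-ins, not the Polly Python library API. Consult the library's documentation for the actual calls.

```python
import time


class FakePipelineClient:
    """Hypothetical in-memory stand-in; NOT the Polly Python library."""

    def __init__(self):
        # Canned sequence of statuses so the example runs offline.
        self._states = ["PENDING", "RUNNING", "RUNNING", "COMPLETED"]

    def submit(self, pipeline_name, params):
        # A real client would send the job to a remote service here.
        return "run-001"

    def get_status(self, run_id):
        # Step through the canned states, then stay on the last one.
        if len(self._states) > 1:
            return self._states.pop(0)
        return self._states[0]


def wait_for_run(client, run_id, poll_seconds=0.0, max_polls=100):
    """Poll until the run reaches a terminal state; return the status history."""
    terminal = {"COMPLETED", "FAILED"}
    history = []
    for _ in range(max_polls):
        status = client.get_status(run_id)
        history.append(status)
        if status in terminal:
            return history
        time.sleep(poll_seconds)
    raise TimeoutError(f"run {run_id} did not finish after {max_polls} polls")


client = FakePipelineClient()
run_id = client.submit("rnaseq", {"samples": "samplesheet.csv"})
history = wait_for_run(client, run_id)
print(run_id, history)
```

Because monitoring is just a function call, the same loop can be embedded in a scheduler, a notebook, or another tool that reacts when a run completes or fails.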

With Nextflow on Polly, the days of wrangling infrastructure are over. Forget managing clusters, scaling storage, or optimizing costs, and focus on groundbreaking discoveries. Our custom infrastructure handles the heavy lifting, scaling seamlessly and adapting to your needs. Whether you're analyzing a single sample or terabytes of data, Nextflow on Polly delivers the power you need, affordably.

Reach out to us at info@elucidata.io to learn more or to talk to our experts.
