Data pipelines are an essential tool in bioinformatics. They standardize data cleaning and complex data processing, which in turn results in better reproducibility. Most data pipelines rely on some kind of framework for executing their tasks. Modern pipeline orchestration frameworks do a good job of abstracting and separating pipeline logic from infrastructure complexity, but that complexity still has to be managed at some level.
Before we talk about the specifics of scalability, let us first consider a simplified model of the lifecycle of a bioinformatics data pipeline:
Bioinformaticians are primarily trained and skilled at developing algorithms, performing data analysis, and interpreting the biological implications of data. Their expertise is central to understanding and solving biological problems through computational methods. Therefore, they will be highly involved in the development and interpretation stages of the pipeline. However, it is not a good use of their time to be responsible for managing deployment and configuring execution environments as well.
Deployment and maintenance of pipelines in a cloud environment requires understanding of cloud services, configuration management, containerization technologies (like Docker) and CI/CD principles. These tasks are time-consuming and can divert bioinformaticians from their primary role – research and data analysis. Bioinformatics pipelines have highly dynamic storage and computational needs as well. Efficiently managing computational resources to handle variable workloads, optimizing costs, handling scaling issues, and ensuring data security involves skills generally associated with cloud solutions experts. Getting involved with this will be yet another distraction.
This is where Polly Pipelines, Elucidata’s managed pipeline hosting and execution service, can help. You write the code; we take care of deployment and provide a fully managed, secure infrastructure to run it. You can monitor your executions and, finally, download the output and reports for further analysis.
Providing such a service in a multi-tenant (or even a single-tenant) environment is not easy. We need to support highly concurrent workloads across both the development and the execution of pipelines. In this post, we describe the deployment and execution architecture that lets Polly Pipelines deliver the required throughput and concurrency.
The deployment of a pipeline is often the most tedious part of its development phase. It involves compiling or packaging the code, building and pushing execution containers, configuring environment variables, and registering or updating the pipeline’s metadata with the orchestration framework. It is a multi-step process that is automated in most production environments.
In this section we will describe how our unique deployment strategies enable teams to scale both the number of pipelines and the number of contributors without being choked by operational blockers.
Polly Pipelines gives us the ability to host all our pipelines in a central pipelines repository. This pattern can be replicated for any number of organizations, although we have not done it outside of Elucidata yet. Enterprises often adopt a “monorepo” approach to help centralize code governance as well as maintain common utilities & coding standards across pipelines. However, such repositories come with a major challenge – coupled deployment of pipelines.
At any given point in time, different pipelines in the repository can be in different stages of development. If we deploy the repository’s code because Alice’s Pipeline-1 has finished adding a new feature while Bob’s Pipeline-2 is still testing one out, then Pipeline-2 gets deployed to production with unverified behavior. This is far from ideal, and trying to manage the situation through people processes often results in deliveries of one pipeline being blocked by another.
This is why we have carefully designed our continuous-deployment workflow such that each pipeline’s code (along with its dependencies) and container image are deployed independently of the others. If a developer makes changes to a pipeline and commits them to a live branch, they are presented with a prompt for approving the deployment of that pipeline. Therefore, if Alice and Bob make changes to Pipeline-1 and Pipeline-2, respectively, and commit them to the same branch, they will be presented with two separate approval holds. Alice will simply approve only Pipeline-1’s deployment, leaving Pipeline-2 unchanged in the live environment.
If all pipeline developers practice this simple approach – approve only the pipelines that they have made changes to – then it eliminates the need for any kind of manual coordination between members of a team or between members across teams.
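To make this concrete, here is a minimal sketch of how per-pipeline approval could be wired up in CI, assuming a monorepo layout with one directory per pipeline under `pipelines/`; the script and function names are illustrative, not our actual deployment tooling.

```python
import subprocess
from pathlib import Path

def changed_pipelines(base_ref: str = "HEAD~1") -> set[str]:
    """Return the names of pipeline directories touched by the latest commit."""
    diff = subprocess.run(
        ["git", "diff", "--name-only", base_ref, "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    # Assumed monorepo layout: pipelines/<pipeline-name>/...
    return {
        Path(f).parts[1]
        for f in diff
        if f.startswith("pipelines/") and len(Path(f).parts) > 2
    }

def request_deployment_approvals(base_ref: str = "HEAD~1") -> None:
    """Create one independent approval hold per changed pipeline."""
    for name in sorted(changed_pipelines(base_ref)):
        # Hypothetical hook into the CI system: each hold can be approved
        # (or ignored) independently, so one pipeline never blocks another.
        print(f"approval required: deploy pipeline '{name}'")

if __name__ == "__main__":
    request_deployment_approvals()
```

Because every changed pipeline gets its own hold, approving one never triggers a deployment of any other.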
While we support and recommend local testing of pipeline code before pushing it to the main repository, some pipelines simply cannot be tested offline on the developer’s local machine, most often because of their resource-intensive nature. Additionally, many teams need a way to stage a new version of a pipeline in “testing” mode before moving it to production. This pre-production stage is where the pipeline may be used by internal peers or QA teams to make sure that its behavior satisfies the acceptance criteria set by the stakeholders.
To help with these requirements, Polly Pipelines' registry supports three stages for each pipeline. The repository has three live branches, one for each stage.
Merging a pipeline’s feature branch to the develop branch deploys the pipeline in “dev” mode. When executed, a pipeline in “dev” stage will be run on the development environment. The development environment is completely isolated from other environments. It supports small-scale execution of pipelines in a true cloud environment. It is ideal for testing the pipeline behavior during development.
Merging a pipeline's feature branch to the staging branch deploys the pipeline in “test” mode. When executed, these pipelines will run in a “test” environment, once again isolated from other environments. This environment is useful for providing a pre-production replica of the pipeline for verification by peers and QA team members.
Finally, once the team is confident that the pipeline is ready for production, they can merge their feature branch to the master branch, which deploys the pipeline to the main production environment. This environment, while isolated from the dev and test environments, has all the compute capacity that Polly Pipelines has to offer, which is what we will discuss next.
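The branch-to-stage mapping can be summarized with a small sketch; the environment names here are illustrative.

```python
# Illustrative mapping from live branches to deployment stages.
BRANCH_TO_STAGE = {
    "develop": "dev",    # small-scale, isolated development environment
    "staging": "test",   # pre-production replica for peers and QA
    "master":  "prod",   # full-capacity production environment
}

def stage_for_branch(branch: str) -> str:
    """Resolve which environment a merge to the given branch deploys to."""
    try:
        return BRANCH_TO_STAGE[branch]
    except KeyError:
        raise ValueError(f"{branch!r} is not a live branch; no deployment is triggered")
```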
Although bioinformatics pipelines are specialized and adapted to the unique demands of biological data analysis, they can still be considered a type of ETL workload:
We are constantly innovating at each phase of the ETL. This section describes how we are solving the most challenging issues encountered by practitioners when running their bioinformatics pipelines.
One of the bigger challenges of the data extraction phase is migrating data from different sources. Each source has its own authentication mechanism. Additionally, a pipeline that is run repeatedly for different customers may need to use a different source of the same type (e.g., a different S3 bucket) in each run. Managing the code and authentication logic for so many disparate sources in every pipeline is a recurring headache for developers.
Fortunately, Polly Pipelines comes with the ability to automatically import data from a variety of sources and make it easily accessible to your pipeline runs. In short, the user flow looks like this: the user passes a reference to the source data in place of a file or folder parameter, and Polly imports it before the run starts (internally running the equivalent of an `aws s3 cp` on the path), so the pipeline sees an ordinary local file or folder.

This approach makes writing pipelines a lot simpler for bioinformaticians. They no longer have to manage dozens of secrets and authentication mechanisms. They can focus on the transformation processes of their pipeline instead.
For pipelines where multiple (potentially large) files are needed, we enable data import on a fleet of serverless workers. This greatly reduces the amount of time it would take to import large batches of files or folders.
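A simplified sketch of this import step is shown below, with the S3 case handled via the AWS CLI and the fan-out modeled with a thread pool; in reality the workers are serverless and more sources are supported, so treat the function names and layout as assumptions.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def import_one(source_uri: str, dest_dir: str) -> Path:
    """Copy a single remote file to local storage before the run starts.

    Only the S3 case is sketched here; other sources (Polly Workspaces,
    Illumina BaseSpace) would plug in their own download mechanism.
    """
    dest = Path(dest_dir) / Path(source_uri).name
    if source_uri.startswith("s3://"):
        subprocess.run(["aws", "s3", "cp", source_uri, str(dest)], check=True)
    else:
        raise ValueError(f"unsupported source: {source_uri}")
    return dest

def import_batch(source_uris: list[str], dest_dir: str, workers: int = 8) -> list[Path]:
    """Fan imports out across a pool of workers, mimicking the
    serverless-worker pattern used for large batches of files."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda uri: import_one(uri, dest_dir), source_uris))
```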
As of this writing, we are starting with Amazon S3, Polly Workspaces and Illumina BaseSpace as our first few sources. However, the Polly Pipelines architecture ensures that we are agile in bringing support for new sources.
Most data pipelines are defined in a DSL or framework specifically crafted for that purpose. One such DSL is Nextflow. It has become increasingly popular in the scientific community due to its flexibility, scalability and ability to streamline complex computational workflows. For this reason, we also chose to go with Nextflow as our primary executor for all bioinformatics workloads.
However, since then, we have realized that not everyone will want to use Nextflow. For example, within Elucidata, we have data engineers who need to automate secondary or tertiary transformations and loading of processed data. They prefer a Python-native stack and do not need any bioinformatics-specific utilities. To cater to their use cases, we created our own pipeline executor, PWL (“Polly Workflow Language”), written in Python.
At the same time, we often talk to customers who have already written and verified their data pipelines internally. Their workflow language of choice could be something else, like Snakemake or WDL, but we still want them to benefit from all the features and scalability that Polly Pipelines has to offer.
Very soon, it became evident that supporting this diversity of skilled professionals would require a multi-framework platform. We have therefore carefully crafted programmatic and graphical interfaces without letting any one framework dictate our choices. These interfaces allow us to bring in new pipeline executors based on customer demand.
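Conceptually, the executors sit behind a common interface so that no single framework shapes the rest of the platform. A rough sketch of what such an interface might look like follows; the class and method names are hypothetical, not the actual Polly Pipelines API.

```python
from abc import ABC, abstractmethod

class PipelineExecutor(ABC):
    """Framework-agnostic interface that the orchestration layer talks to.

    Concrete executors (e.g. a Nextflow executor, the Python-native PWL
    executor, or a future Snakemake/WDL executor) implement these methods,
    so adding a new framework does not change the rest of the platform.
    """

    @abstractmethod
    def submit(self, pipeline_id: str, stage: str, params: dict) -> str:
        """Launch a run in the given stage (dev/test/prod) and return a run id."""

    @abstractmethod
    def status(self, run_id: str) -> str:
        """Report the current state of a run (e.g. running, succeeded, failed)."""

    @abstractmethod
    def outputs(self, run_id: str) -> list[str]:
        """List the durable output locations produced by a completed run."""
```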
UX patterns are a neglected aspect of scalability. We firmly believe that providing a unified interface for running pipelines in any executor is the best way to serve the diverse and rapidly evolving field of bioinformatics.
Modern bioinformatics projects often deal with massive datasets, such as genomic sequences, proteomic data, and large-scale imaging data. Scaling out distributes these large datasets across multiple computational nodes, enabling efficient processing and analysis. This scale-out can happen along any axis defined by the pipeline logic (most commonly “tasks” or “processes”). For example, transcriptomic datasets often come with multiple samples; the processing of each sample can be pushed to a separate compute node, significantly reducing the overall time required to complete the analysis. This is crucial for time-sensitive research.
Polly Pipelines implements this scale-out pattern automatically based on the definition of your workflow. Any tasks that can run in parallel will be run as such, without the pipeline developer having to specify anything beyond how compute-intensive their tasks are. A child task only moves forward when all its parent tasks have finished executing. The infrastructure thus works in a true MapReduce fashion, making the most effective use of available compute.
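As a toy illustration of this fan-out/join pattern, consider a workflow that quantifies each sample independently and aggregates only once every per-sample task has finished; the function names below are placeholders, not real pipeline code.

```python
from concurrent.futures import ThreadPoolExecutor

def quantify(sample: str) -> str:
    """Placeholder for a per-sample task (e.g. alignment + quantification)."""
    return f"{sample}.counts"

def aggregate(count_files: list[str]) -> str:
    """Placeholder for a child task that needs every parent's output."""
    return "expression_matrix.tsv"

def run_workflow(samples: list[str]) -> str:
    # Fan out: one independent task per sample, scheduled in parallel.
    with ThreadPoolExecutor() as pool:
        counts = list(pool.map(quantify, samples))
    # Join: the downstream task starts only after all parents have finished,
    # mirroring the MapReduce-style scheduling described above.
    return aggregate(counts)

print(run_workflow(["sample_1", "sample_2", "sample_3"]))
```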
And since we source compute from large cloud providers, our reserve is potentially limitless. In practice, however, we do put certain restrictions on the number of tasks that run in parallel. This is purely a cost-control mechanism, and we will raise these limits should we observe workloads approaching our maximum capacity.
To give a sense of the current scale of our infrastructure, we have estimated that Polly Pipelines can successfully process 5k GB-samples of transcriptomics data per week through our custom RNA-Seq pipeline, which uses kallisto for (pseudo-)alignment & quantification. In simple terms, this means we can process 5000 samples weekly where each sample is 1 GB in size.
Apart from raw compute power, we ensure that each executor has its own priority queues and execution clusters. Not only that, each stage (dev, test, prod) of a pipeline gets its own execution cluster as well. Pipeline developers can expect each of these clusters to exhibit similar scalability semantics.
Bioinformatics pipelines have high storage needs due to the nature of the data they handle and the computational processes involved. Each process may itself generate more data for the next process to use. Moreover, this I/O happens on the “local” storage of the nodes participating in pipeline execution. It is difficult to estimate how much storage a given pipeline run needs, and more difficult still to provision disks dynamically based on each run’s demands, so jobs used to fail frequently with out-of-disk errors. We therefore had to come up with a new design for the storage attached to compute nodes.
All compute instances in the cloud come with block storage of some kind: either a disk directly attached to the instance or a volume connected over the local network. These volumes have a fixed capacity, and if a process keeps writing data, it will eventually hit that limit. Normally, we would expect the process to fail at this point. On Polly Pipelines' infrastructure, however, the user never notices that local capacity was exhausted: as soon as the instance’s disk space runs out, we switch to a remote volume (served over the NFS protocol). This remote volume can store an enormous amount of data, so our processes effectively never run out of disk space.
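The overflow behaviour can be illustrated with a small sketch that picks a write location based on remaining disk space; the mount points and threshold below are assumptions, and the real mechanism lives in the infrastructure layer rather than in pipeline code.

```python
import shutil
from pathlib import Path

LOCAL_SCRATCH = Path("/scratch")          # instance-attached volume (assumed mount point)
NFS_SCRATCH = Path("/mnt/nfs/scratch")    # network volume used once local space runs low
MIN_FREE_BYTES = 10 * 1024**3             # keep at least 10 GiB of local headroom

def working_dir(required_bytes: int) -> Path:
    """Pick where the next chunk of output should be written.

    Falls back to the NFS-backed volume when the local disk cannot hold the
    data, so the process never fails with an out-of-disk error.
    """
    free = shutil.disk_usage(LOCAL_SCRATCH).free
    if free - required_bytes > MIN_FREE_BYTES:
        return LOCAL_SCRATCH
    return NFS_SCRATCH
```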
The caveat here is that the NFS volume has a maximum throughput limit. If too many nodes are writing data concurrently to this volume, we may see slowdowns in pipeline runtime. To counter this effect, we simply provision more NFS volumes based on the load on our execution cluster. Tasks or processes from the same pipeline run, however, do share the same NFS volumes.
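Spreading runs across volumes can be thought of as a simple least-loaded placement problem, sketched below; this is purely illustrative and not our actual provisioning logic.

```python
def assign_volume(run_id: str, volumes: list[str], load: dict[str, int]) -> str:
    """Place a new pipeline run on the least-loaded NFS volume.

    `load` maps a volume name to the number of runs currently writing to it;
    tasks within one run still share the same volume.
    """
    chosen = min(volumes, key=lambda v: load.get(v, 0))
    load[chosen] = load.get(chosen, 0) + 1
    return chosen
```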
Data on disk is not the only thing pipelines produce. Any non-trivial pipeline has multiple tasks that execute in parallel as well as in succession, and each successive task needs the output of its parent tasks. So we need a data-exchange layer for storing this intermediate data. Finally, once a run completes, its output also needs to be saved durably.
For both of these needs we use Amazon S3. Intermediate data can be written by hundreds of tasks simultaneously, totalling terabytes in size. Most traditional storage devices would not be able to handle this kind of write throughput, and the same goes for reads. Amazon S3, being a distributed object store, has no problem under this kind of load; it is built precisely to handle large numbers of concurrent reads and writes. Moreover, it provides eleven nines of durability.
Similarly, we also store the pipeline’s final output in S3. But, through the pipeline’s own instructions, the developer is free to upload that data wherever, and in whatever format, they want.
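A minimal example of this exchange pattern using boto3 is shown below; the bucket name and key layout are made up for illustration.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-pipeline-exchange"   # illustrative bucket name

def publish_output(run_id: str, task_id: str, local_path: str) -> str:
    """Upload one task's output so downstream tasks (or the user) can fetch it."""
    key = f"runs/{run_id}/{task_id}/{local_path.rsplit('/', 1)[-1]}"
    s3.upload_file(local_path, BUCKET, key)
    return key

def fetch_input(key: str, local_path: str) -> str:
    """Download an upstream task's output before the child task starts."""
    s3.download_file(BUCKET, key, local_path)
    return local_path
```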
So far we have never faced an issue while using Amazon S3 for any of our pipeline runs, in terms of concurrency as well as durability.
One final aspect of Polly Pipelines worth highlighting is security through network isolation. Bioinformatics often involves dealing with sensitive data, so it is imperative that pipelines running on the cloud, especially in multi-tenant systems, operate with strict access controls in place. Therefore, we make sure that:
We also understand that some organizations deal with PII data and have to follow stricter security policies. We recommend that they opt for a single-tenant (enterprise) deployment of Polly Pipelines. That way, their data and their compute are completely owned and governed by the organization.
There is still much work that we want to do to make sure that Polly is the best place to develop and run bioinformatics pipelines. A few ideas that we are currently entertaining are:
In this post we have shown how Polly Pipelines is a versatile, scalable, and developer-friendly platform for writing bioinformatics pipelines. Its rich feature set and attention to every stage of a pipeline’s lifecycle make it a good choice for bioinformaticians. Reach out to us at info@elucidata.io or book a demo to learn more about our pipeline solutions.