The story of Nextflow: Building a modern pipeline orchestrator

An in-depth look at the importance of data-driven computational pipelines for orchestrating scientific workflows, and the lessons learned in building one.

By Paolo Di Tommaso, creator of the Nextflow project, Co-Founder and CTO at Seqera Labs

In the early days of the Nextflow project, none of us could have imagined what it would become today. Since its humble beginnings, Nextflow has grown from a basic tool to one of the most widely used solutions for orchestrating scientific workflows. Today Nextflow has vibrant user and developer communities, is downloaded 55,000+ times monthly, and is used by over 1,000 organizations, including some of the world's largest pharmaceutical firms. I thought it would be interesting to share the project's history, the challenges we encountered, and our vision as we continue to take Nextflow forward.

I should start by explaining that Nextflow is a workflow engine and a language that enables scalable and reproducible scientific pipelines. While this probably sounds like a collection of "buzzwords" to some, it turns out that data-driven computational pipelines are critical in multiple fields, sitting at the heart of applications ranging from genomics and personalized medicine to next-generation oncology and genetic surveillance.

The early days

Before my involvement with Nextflow, I was a research engineer at the Cedric Notredame Lab for Comparative Bioinformatics. My job at the time was to help researchers run their workloads more efficiently on in-house computing clusters. While tools existed for managing bioinformatics workflows, most of our pipelines were developed in-house using Bash and other scripting languages. There were challenges with this approach:

● Scripts were complex and usually understood only by their authors, making enhancing and maintaining workflows challenging.

● Workflows were buggy and error-prone: imagine kicking off a long-running pipeline, launching thousands of jobs, only to have it fail after 10 hours of execution, forcing you to restart it from scratch.

● When workflows ran, it was hard to track progress. Without monitoring tools, we found ourselves constantly using the Linux command line, 'tailing' files and 'grepping' jobs to get a sense of where we were.

● Finally, the workflows were tightly tied to the compute environments. Even small changes to the environment could cause pipelines to break.

In other words, early pipeline processing was an utter mess. Installing a pipeline could take weeks of effort, requiring the configuration of obscure pieces of software, the use of bizarre programming languages and compilers, and troubleshooting missing libraries and components. One needed to know arcane environment variables and command line options passed among PhD students as a matter of ritual.

Like many open-source projects, Nextflow started as an effort to solve these real-world problems and avoid these frustrations. It was hard to believe that there wasn't a better way.

Some big ideas fall into place

Dealing with some early failures helped me recognize the value of critical new technologies. Above all, I was fascinated by the widespread use of frugal programming approaches, such as Linux shell and Perl scripts, and their impact on bioinformatics. These were great for quickly prototyping data analysis scripts, but at the expense of the portability and replicability of the resulting application.

Nextflow was born with the aim that researchers should be able to continue using their favorite programming languages and tools, and scale to local compute clusters or deploy in the cloud without having to change their applications.

The core idea was to keep the workflow tasks segregated from each other, or, said another way, to treat them as self-contained "black boxes" with specific inputs and outputs. This makes it possible to run tasks independently of the underlying compute environment. Moreover, this approach is inherently more parallel and scalable, because the lack of shared state avoids problems, such as race conditions on concurrent file access, that are common with other approaches. By caching the results of computations, workflows became recoverable in the case of failure, avoiding the need to re-run previously executed steps.

Consequently, the developer can view a pipeline as a collection of tasks, where the execution is orchestrated by the framework itself. This gave rise to another important idea behind Nextflow – the dataflow paradigm. Dataflow is an elegant programming model that allows the definition of tasks that execute in parallel in a declarative manner. A good way to understand this is to imagine tasks in a Nextflow workflow as cells in a spreadsheet. When a cell is modified, the change is propagated automatically to all dependent cells, which carry out new computations and so on. In other words, this approach allows the definition of a network of tasks that are activated by exchanging messages – a solution that works surprisingly well at scale.
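
To make this concrete, here is a minimal sketch of such a pipeline in Nextflow's current (DSL2) syntax. The process, channel, and file names are purely illustrative, not taken from a real pipeline: each process declares its inputs and outputs, and the channel feeding the workflow forms the dataflow network described above, with one independent task spawned for every message that arrives.

    // Each process is a self-contained "black box": it sees only its
    // declared inputs and hands over only its declared outputs.
    process COUNT_LINES {
        input:
        path sample

        output:
        path 'counts.txt'

        script:
        """
        wc -l ${sample} > counts.txt
        """
    }

    workflow {
        // One message per matching file; each message triggers an
        // independent, parallel task, much like a spreadsheet cell update.
        samples = Channel.fromPath('data/*.fastq')
        COUNT_LINES(samples)
        COUNT_LINES.out.view()
    }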

Containers were another particularly important idea that influenced the Nextflow design. Containers made it possible to encapsulate pipeline dependencies into pre-configured executables and packages that could be downloaded on demand. Today, containers are a well-established industry technology and an essential component of modern data analysis pipelines. However, at the time the Nextflow project was started, containers were an obscure technology relegated to a few Linux hackers. I knew little about this topic until I saw Solomon Hykes’ talk about Docker in 2013. Those were likely the five minutes that most impacted the future of Nextflow: I realised immediately that containers would be a critical technology for enabling the reproducibility of computational workflows.
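
In practice, only a couple of configuration lines are needed to run every task of a pipeline inside a container image that is pulled on demand. A minimal sketch, with an illustrative image name:

    // nextflow.config: run all tasks in the given container image,
    // fetched automatically the first time it is needed.
    process.container = 'quay.io/biocontainers/samtools:1.17--h00cdaf9_0'
    docker.enabled = true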

As DevOps techniques revolutionized software development, it became clear that pipelines would be collaborated on using the same source-code management (SCM) tools, so it only made sense for Nextflow to integrate with them. By doing this, a Nextflow executable pipeline became just a GitHub repository, and a pipeline revision just a Git tag. This realization allowed us to track changes in all pipeline assets, i.e. application scripts, deployment configurations, and required dependencies. It also made it possible to replicate the execution of any version of a pipeline at any point in time with a single command.
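
For example, running a specific, tagged revision of a pipeline straight from its GitHub repository is a single command (the nf-core RNA-seq pipeline and its 3.0 tag are used here purely as an illustration):

    nextflow run nf-core/rnaseq -r 3.0 -profile docker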

Nextflow allows individual workflow steps to be written in a user's language of choice. However, the workflows themselves are written in an expressive, domain-specific language (DSL) optimized to address the unique challenges of orchestrating large-scale workflows.
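
As a small sketch of this polyglot design (the process and variable names are hypothetical), a task's script block may begin with any interpreter's shebang line, so a step can be written in Python, R, Perl, or anything else available on the PATH, while the process declaration around it stays in the Nextflow DSL:

    process SUM_COUNTS {
        input:
        path counts

        output:
        stdout

        script:
        """
        #!/usr/bin/env python
        # Plain Python: Nextflow only cares about the declared
        # inputs and outputs, not the language of the script.
        total = sum(int(line.split()[0]) for line in open("${counts}"))
        print(total)
        """
    }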

Nextflow takes off

Nextflow was first released on GitHub in March 2013, where it attracted a small but loyal community of users and contributors. As we added support for additional HPC workload managers and clouds, our user base expanded. Nextflow's support for modern cloud batch services made it easier than ever to tap cloud resources, fueling further growth.

The project gained further momentum in early 2017 with the establishment of nf-core, a parallel community effort led by Phil Ewels. The nf-core project brought multiple groups and research institutes together to develop and share high-quality, curated pipelines written in Nextflow.

Nf-core is likely the project that I’m most proud of. The nf-core community embodied the core values and vision of the Nextflow project from the time it was started: enabling researchers to collaborate on writing scalable data analysis pipelines that can be tested, shared, and deployed across many different organisations and infrastructures with ease.

By 2018, we realised that beyond the pipelines themselves, users had unmet requirements related to pipeline automation, secure collaboration, infrastructure orchestration, and making pipelines accessible to non-IT specialists. It was clear that we needed to "think bigger" to meet these needs.

Enter Seqera Labs

Seqera Labs was launched in July 2018 as a spin-off from the Centre for Genomic Regulation. Seqera Labs attracted seed funding in February 2019, enabling us to scale our development efforts and improve the Nextflow ecosystem substantially.

Our first commercial product, Nextflow Tower, was launched in September 2019. Tower is built to address the critical needs of enterprise users, providing them with a seamless environment for launching, monitoring, and collaborating on workflows across multiple cloud platforms. Seqera Labs continues to enhance Tower with optimized deployments, team management, and cloud budgeting features, while maintaining an open core model.

Production proven

The ongoing COVID-19 pandemic has underlined the importance of scalable pipelines. Early in the pandemic, the capacity of sequencing centers and public health authorities was rapidly overwhelmed. These authorities quickly collaborated to develop containerized pipelines using Nextflow and related technologies to support global SARS-CoV-2 sequencing and surveillance efforts. By publishing portable Nextflow pipelines that could run on any infrastructure, they were able to rapidly scale their capacity to sequence COVID-19 samples. This dramatically accelerated surveillance capabilities, allowing the rapid evolution of variants, including Alpha and Delta, to be identified and tracked.

Conclusion

Today Nextflow has a vibrant community and has become one of the most widely used solutions for orchestrating scientific workflows. After nearly 10 years, the project development is thriving.

Recently, Nextflow gained a new “plugins” system enabling community developers to provide their own components and extend the framework’s core functionality in areas such as access to SQL databases.
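
To give a brief, hedged illustration, enabling a plugin is a one-line declaration in a pipeline's configuration; the nf-sqldb plugin shown below is the community extension for SQL database access:

    // nextflow.config: declare the plugin and Nextflow downloads it on demand.
    plugins {
        id 'nf-sqldb'
    }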

In the future, we’d like to make Nextflow even more scalable and interoperable with machine learning and cloud technologies, while keeping it firmly anchored to the project’s original vision and core values: simplicity over complexity, pragmatism, and delivering high-quality, robust technology solutions with a human touch.

The majority of the credit for Nextflow goes to Cedric Notredame for giving me the opportunity to work with freedom in a creative context. Thanks Cedric!

Learn more about Nextflow at nextflow.io or visit seqera.io.

-----

We welcome comments, questions and feedback. Please annotate publicly on the article or contact us at innovation [at] elifesciences [dot] org.

Do you have an idea or innovation to share? Send a short outline for a Labs blogpost to innovation [at] elifesciences [dot] org.

For the latest in innovation, eLife Labs and new open-source tools, sign up for our technology and innovation newsletter. You can also follow @eLifeInnovation on Twitter.