On workflow engines and where Airflow fits in

At CrunchConf 2018 there was a presentation on “Operating data pipeline using Airflow @ Slack” by Ananth Packkildurai. If you don’t know what Airflow is, it’s a workflow engine along the lines of Oozie and Azkaban. It’s based on the concept of a DAG (directed acyclic graph) which you write in Python and execute on a cluster.

As with the Kafka presentation by Tim Berglund, we asked the hard questions and they got popular pretty quickly. In the case of Airflow, and the ecosystem of workflow engines it lives in, we had quite a heavy question.

The question was:

“Airflow seems limited to Python only as a DAG expression language. Why not use a generic pipeline-as-code workflow engine, an idea started by the continuous integration & deployment community? (e.g. Jenkins Workflow, GoCD, Travis, etc.)” – c0d3 guru @ CrunchConf 2018

If you take a step back and look at the history of how these DevOps ideas came to be, the idea of automating the build and then the deployment of an application (generically, a piece of software) grew out of the continuous integration & deployment movement. Going back a little further in history, it started as a need to ensure that one developer’s commits against a repository don’t mess up another developer’s commits, and that the application integrates, deploys and behaves nicely.

From that “integration” to “deployment” it was basically the same recipe, but looking at applications as puzzle pieces of an infrastructure that need to integrate with other applications in a continuous and repeatable way. So what the likes of Jenkins, Travis, GoCD (which we use) and others did was to provide a way to configure “jobs”.

At first, these jobs were configured through a UI. But with the advent of GitOps (which one can classify as an effect of the CI/CD movement) the idea of “pipeline as code” was born, initially in ThoughtWorks’ GoCD (declarative YAML or JSON only), if I have my history right, and later in Jenkins/2 (both declarative and scripted, on a Groovy DSL).
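To make “pipeline as code” concrete, here is a minimal sketch of a declarative Jenkinsfile (the Jenkins/2 Groovy DSL); the stage names and shell commands are placeholders rather than anything from a real project:

```groovy
// Jenkinsfile: a minimal declarative pipeline, versioned alongside the code it builds
pipeline {
    agent any                          // run on any available agent
    stages {
        stage('Build') {
            steps {
                sh './gradlew build'   // placeholder build command
            }
        }
        stage('Deploy') {
            steps {
                sh './deploy.sh staging'   // placeholder deploy step
            }
        }
    }
}
```

The GoCD equivalent is a declarative YAML or JSON pipeline definition picked up from a config repository.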

I’m a fan of pedigree. I don’t have a dog or any other kind of animal, but I try to keep it real when it comes to technology, studying the history before the awesome blog articles touting the new technology’s use. So if I look at Airflow’s first release, Google tells me: “March 19, 2017. The first official Apache Airflow release (1.8.0-incubating) is out!”.

It’s 2018 as I’m writing this article and BAM! just a year and a few months ago this uber-popular workflow engine had its first official Apache release. What makes it so popular? I don’t know, even after looking through its current 2018 list of operators. The fact that you can write a DAG in Python? The dependency management? What else is new and differentiating enough to add “another tool” to the Big Data landscape?

In my humble opinion, it’s a re-invention of what tools like Jenkins, Travis, GoCD and pretty much any workflow engine out there already support. Some Airflow enthusiasts agree, arguing on the other hand that the “old” pedigree workflow engines are too static. I would disagree.

Pedigree workflow engines like Jenkins/1 were released back in 2011. Jenkins/2, with the Pipeline (as code) plugin, came in 2016, a year before Airflow’s first Apache release. GoCD came in 2014 and has sported the pipeline-as-code plugin for the two to three years we’ve been using it in production.

As a user of a generic workflow engine, I will admit I don’t feel many limitations. At most there’s the limitation that GoCD lacks the scripted pipelines Jenkins/2 has, but we got around that by installing Groovy on the agent and writing our own (sure, a Jenkinsfile looks cleaner).
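For illustration, here is a hedged sketch of the kind of home-grown “scripted pipeline” I mean: a plain Groovy script the GoCD agent runs with the locally installed groovy binary; the step names and commands are placeholders:

```groovy
// pipeline.groovy: run a few shell steps in order, failing the GoCD task on the first error
def step(String name, List<String> command) {
    println "==> ${name}"
    def proc = command.execute()
    proc.waitForProcessOutput(System.out, System.err)
    if (proc.exitValue() != 0) {
        throw new RuntimeException("Step '${name}' failed with exit code ${proc.exitValue()}")
    }
}

step('Extract',   ['./extract.sh'])     // placeholder commands
step('Transform', ['./transform.sh'])
step('Load',      ['./load.sh'])
```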

What I feel, on the other hand, is that people from the Big Data domain tend to “use their own tools”, and so they end up recreating much of the technology that already exists but isn’t adapted to their use case. And instead of contributing something back, they write a completely new tool, with operators aimed at the most common Big Data jobs but lacking other integrations, which can’t be added without considerable pain.

I’m not saying Airflow, Azkaban or Oozie are bad; they’re workflow engines aimed specifically at automating ETL flows. I would, however, like to see more integration and concentration of effort on the already existing/established workflow engines, with plugins to support the specific E* jobs, instead of recreating them and re-learning the same lessons these pedigree engines have already been through.

To come back to the conference question, Ananth Packkildurai was very frank in answering it, saying that Airflow fit their model of work, which is a valid point if you have people already versed in it. We did the same, but on the other front. If you already have a cluster of machines acting as a workflow engine (e.g. Jenkins, GoCD) that can automate different tasks in the infrastructure, I’m not really sure adding an extra layer of complexity is worth the gains it provides. We did not add one.

In our case we benefited from deploying a “crons”-oriented cluster of GoCD machines and using it to schedule scripts on a “serverless” Groovy execution engine that accepted scripts over HTTPS, exposing managed objects with the “native” APIs to different systems in the infrastructure (Hadoop, Presto, Spark, etc.).
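As a rough sketch of the pattern (the endpoint URL and payload format below are hypothetical; only the overall shape reflects the setup described above), a GoCD task could submit a job script to such an engine like this:

```groovy
// hypothetical sketch: POST a Groovy job script to the execution engine over HTTPS
// (the URL and endpoint are made up; inside the engine, managed clients for Hadoop,
// Presto, Spark, etc. would be bound into the submitted script)
def jobScript = '''
    // the client names available in here are illustrative, not a documented API
    println "running the daily ETL step"
'''

def connection = new URL('https://scripts.internal.example/api/v1/execute').openConnection()
connection.requestMethod = 'POST'
connection.doOutput = true
connection.setRequestProperty('Content-Type', 'text/plain')
connection.outputStream.withWriter('UTF-8') { it << jobScript }

println "engine replied: ${connection.inputStream.getText('UTF-8')}"
```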

How about the other features of Airflow, such as backfilling? We just kept script state in a ZooKeeper cluster, allowing each script to read its configuration from the ZK cluster on a /configuration znode that stored a simple JSON or YAML document. Why that? Because it was easy to secure on a per-znode basis to the specific “owners” of the flows, and easy to edit with any of the web UIs for ZooKeeper (e.g. zk-ui).
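A minimal sketch of the configuration side, using Apache Curator from a Groovy script (the connection string, znode path, JSON key and library version are examples, not our actual layout):

```groovy
// read a flow's configuration from its /configuration znode (example values throughout)
@Grab('org.apache.curator:curator-framework:4.0.1')
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry
import groovy.json.JsonSlurper

def client = CuratorFrameworkFactory.newClient('zk1:2181,zk2:2181',
        new ExponentialBackoffRetry(1000, 3))
client.start()

byte[] raw = client.getData().forPath('/configuration/daily-events-etl')   // example znode
def config = new JsonSlurper().parseText(new String(raw, 'UTF-8'))

println "backfill from: ${config.backfillFrom}"   // example key stored in the JSON
client.close()
```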

You don’t need a full-blown Airflow, Azkaban or Oozie to operate ETL flows in the infrastructure, whether that means processing data or moving it to/from Hadoop and other systems. You need a fresh view on architecture, reusing what the people in DevOps/systems engineering are already using and standardizing the data processing workflow a bit. We already have most of the tools. Other than that, it’s full steam ahead.

In our case, since we already had the CI/CD clusters, one for building the images and one for deploying them to the infrastructure, it was just a matter of reusing what we already knew and defining a “crons”-scoped GoCD cluster to run our periodic jobs. We kept the “dynamic” part in ZK /configuration nodes and made each script aware of its state through the ZK /script/state/sub/path it had been assigned.
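On the state side, a similarly hedged sketch of a script recording the last period it processed under its assigned state path (the paths, dates and library version are again examples):

```groovy
// persist a "last processed" marker under the script's assigned state znode (example paths)
@Grab('org.apache.curator:curator-framework:4.0.1')
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

def client = CuratorFrameworkFactory.newClient('zk1:2181', new ExponentialBackoffRetry(1000, 3))
client.start()

def statePath = '/script/state/etl/daily-events'             // example assigned path
def lastRun = client.checkExists().forPath(statePath) ?
        new String(client.getData().forPath(statePath), 'UTF-8') : null
println "last successful run: ${lastRun ?: 'never'}"

// ... process the periods between lastRun and today, then record the new high-water mark
def today = new Date().format('yyyy-MM-dd')
if (lastRun == null) {
    client.create().creatingParentsIfNeeded().forPath(statePath, today.getBytes('UTF-8'))
} else {
    client.setData().forPath(statePath, today.getBytes('UTF-8'))
}
client.close()
```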

Why that? Because in our context of few people and high demands (which seems similar to every other project out there), this flow allowed for zero learning curve (the team already knew how to do pipelines as code) and maximum flexibility (we controlled everything from script code and could script in any JSR-223 compatible language our “serverless” micro-service supported, e.g. Groovy, Scala, JavaScript, Jython).
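The JSR-223 part is just the standard javax.script API; a minimal sketch, assuming the relevant engine jars (e.g. groovy-jsr223) are on the micro-service’s classpath:

```groovy
import javax.script.ScriptEngineManager

def manager = new ScriptEngineManager()

// evaluate a snippet with the Groovy engine (provided by groovy-jsr223)
def groovyEngine = manager.getEngineByName('groovy')
println groovyEngine.eval('(1..10).sum()')        // prints 55

// the same call pattern works for any other registered JSR-223 engine,
// e.g. Nashorn (JavaScript on Java 8) or Jython, if its jar is present
def jsEngine = manager.getEngineByName('nashorn')
if (jsEngine != null) {
    println jsEngine.eval('1 + 2')
}
```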

Keep it simple and reuse software that’s already fit for the job, but mostly fit for the people you already have. In doing so you’ll lower the barrier to entry, you’ll make things simpler and easier to understand for the “plain X developer” (where X is the language), and as an architect or team lead you’ll benefit from the greater distributed execution (scatter) of the stories/tasks that need to be implemented to reach (gather) your goal. You’ll also guard against the bus factor, which is a killer for most organisations.