Orchestrating notebooks with Camel K, an integration-based approach

Once your data analytics pipelines reach a certain level of maturity, you tend to start exploring more flexible ways to orchestrate your ETL processes and to derive tables for the different access patterns the business requires downstream. However, similar to the “object-relational impedance mismatch” between the object and relational database worlds, there’s an “impedance mismatch” between data engineers and business folks in what they expect regarding speed of delivery, quality, correctness and maintainability.

In the mind of the business folks, the derived data they need is just a simple SQL query run against the “raw data” on a system that has infinite CPU power and infinite memory, probably running on GPUs anyway, so the query just needs to be written and it will return results in an instant.

In the mind of the data engineers, however, complex things like partitioning (range, hash-based, etc.), compression, storage life-cycle, cost profile and SLAs (usually freshness) tend to creep in, making that “simple SQL” the business folks envision a lot harder. Plus, it needs to be done yesterday! This drives most data engineers to look for easy ways to write snippets of code, preferably directly against production, with read-only access to the raw data and write access to a “staging” bucket where they can explore various techniques for preparing the data as required.

Then comes the complexity of orchestrating the finished ETL pipeline in a choreography with other pipelines that compete for resources, for access to data in a given bucket, for concurrent writes (some of which are made simpler by Delta and Hudi), etc. Just imagine you have DS1 and DS2 that need joining, but you should join only when both have been updated and there is a “delta” (difference) between the previous and new snapshots of both datasets. You then need to JOIN the result with another JOIN of DS3, DS4 and DS5 to form the final result, all while managing speed, cost, correctness and the other qualities your business peers ask for.
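The “join only when both datasets have a delta” condition can be sketched as a small gate. This is a minimal Java sketch under assumptions: each dataset is assumed to expose a monotonically increasing snapshot version (a commit number, for example), and the names SnapshotGate and shouldJoin are hypothetical, not part of any library.

```java
import java.util.Map;

// Hypothetical gate: run the DS1/DS2 join only when EVERY input dataset has a
// snapshot newer than the one consumed by the previous successful run.
final class SnapshotGate {

    // previous: snapshot versions used by the last successful join, keyed by dataset name.
    // current:  latest snapshot versions visible right now, keyed by dataset name.
    // Assumption: versions only ever increase; an unseen dataset counts as version -1.
    static boolean shouldJoin(Map<String, Long> previous, Map<String, Long> current) {
        if (current.isEmpty()) {
            return false; // nothing to join
        }
        for (Map.Entry<String, Long> entry : current.entrySet()) {
            long last = previous.getOrDefault(entry.getKey(), -1L);
            if (entry.getValue() <= last) {
                return false; // this dataset has no delta yet, so wait
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Long> previous = Map.of("DS1", 41L, "DS2", 7L);
        // DS1 moved forward but DS2 did not: the join is skipped.
        System.out.println(shouldJoin(previous, Map.of("DS1", 42L, "DS2", 7L)));
        // Both moved forward: the join may run.
        System.out.println(shouldJoin(previous, Map.of("DS1", 42L, "DS2", 8L)));
    }
}
```

The same gate would then be applied one level up, to the DS3/DS4/DS5 join and to the final result, which is where the choreography starts to get complicated.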

Basically, writing ETL pipelines that work under a specific SLA while guaranteeing all aspects of data quality (completeness, correctness, freshness, etc.) is not just a data-engineering challenge but also an infrastructure one. With high volumes of data it tends to turn into complex scenarios where business reporting needs must be met while respecting a cost profile and the level of maintenance and support that people are accustomed to. Doing this the traditional way (writing the code, building a JAR and taking it through the various processes on its way up to production) really gets in the way of exploratory techniques. While I do advocate a slower but more stable approach of orchestrating actual jobs as they pass through QA and security clearances, for most parts of the business those come as a 2nd if not 10th concern on the priority list.

In recent years, data engineering has leaned towards the concept of notebook-based ETL, as started, for example, at Netflix and elsewhere. The idea is that in the exploratory phase you work in a notebook (the likes of JupyterLab or Zeppelin) that gives you access to distributed processing engines such as Spark, which process data in a massively parallel (MPP) fashion, and you prepare the code there until you are happy with it. Then, without having to migrate the code anywhere else, you “orchestrate” that notebook against the production data.

Some big cloud providers, such as AWS, have also stepped up: since circa 2018 (around the same time as Netflix) AWS has launched support for EMR Notebooks and, more specifically, for the programmatic execution of notebooks, which from my perspective as a data engineer is of great interest. Funnily enough, in 2016 I was doing the same by gluing these pieces together myself using various techniques. I’m happy to have 80% of the work done for me, so that I can focus on the ETL goals themselves rather than on the infrastructure complexity of managing the processing chain, and invest the other 20% in the pipeline itself to make it more efficient or stable as needed.

While the services are OK, and can be coupled with additional kernels that you install on the cluster using EMR Bootstrap Actions, such as the BeakerX kernels (or xeus-cling, if writing C++ in a Jupyter Notebook is your thing), overall the “notebooks” service seems a little “insufficient”. By insufficient I mean that it lacks an approach to orchestrating complex chains of notebooks, so most of us have to write the glue code ourselves, tailored to our own business needs.

Well, the “insufficient” part, in my experience, is the lack of an orchestration layer (or, let’s use the terms, a routing and triggering layer) that can be set up simply, with as little code as possible. Enter Apache Camel, an EIP (Enterprise Integration Patterns) framework I’ve been happily using alongside Spring for low- to medium-traffic services, bridging the two worlds as needed. While Camel itself is good at connecting various bits of infrastructure and services, with 300 or so components to communicate with everything you could imagine, it’s still a verbose framework that requires you to build up code and pipelines before anything gets deployed to production.

So, how can we make this simple? I mean dead simple. Not long ago, around mid-2020, the folks backing Camel development released the GA version of Camel K (1.0). Camel K is literally “Camel” (without Spring, sadly, but with an embedded “bean” registry if you need one) that runs on Kubernetes. In short, you write something similar to this (directly in Groovy if you wish) and run it:

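// example.groovy: a Camel K route in the Groovy DSL. acquireEmr, triggerNotebookOnEmr,
// awaitNotebook, profit and notebookId are placeholders for your own glue code.
from('cron://tab?schedule=0/30 * * * ?')
    .process {
        EmrClusterId emrId = acquireEmr()        // get a compatible EMR cluster
        triggerNotebookOnEmr(notebookId, emrId)  // start the notebook execution on it
        awaitNotebook(notebookId)                // wait for the execution to finish
        profit(true)
    }
    .to('log:info')

kamel run example.groovy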

Now imagine you’d replace the body of the “process” method with a call to a little glue code that first triggers the creation of a compatible EMR Notebooks cluster (usually an EMR 5.18+ cluster pre-configured with anything you need) and then, once the cluster exists, starts the execution of the notebook, associating it with the newly created cluster and awaiting the output of the execution. Behind the scenes, as stated by AWS in one of their blog articles, they’re using Papermill to do the orchestration, but that’s an implementation detail you as an engineer may or may not be interested in.
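The “trigger, then await” part of that glue code boils down to a polling loop. Below is a minimal Java sketch of it under stated assumptions: NotebookApi is a hypothetical interface (not the AWS SDK) that a real implementation would back with the EMR StartNotebookExecution and DescribeNotebookExecution operations, and treating FINISHED, FAILED and STOPPED as the terminal statuses is an assumption of this sketch.

```java
import java.time.Duration;

// Sketch of the glue code: start a notebook execution on a cluster, then poll
// until it reaches a terminal state. Nothing here talks to AWS directly.
final class NotebookRun {

    // Hypothetical abstraction over the EMR notebook execution API.
    interface NotebookApi {
        String start(String notebookId, String clusterId); // returns an execution id
        String status(String executionId);                 // e.g. RUNNING, FINISHED, FAILED
    }

    // Assumption: these three statuses mean the execution will not change any more.
    static boolean isTerminal(String status) {
        return status.equals("FINISHED") || status.equals("FAILED") || status.equals("STOPPED");
    }

    // Returns the terminal status, or throws if maxPolls is exhausted first.
    static String runAndAwait(NotebookApi api, String notebookId, String clusterId,
                              Duration pollInterval, int maxPolls) {
        String executionId = api.start(notebookId, clusterId);
        for (int i = 0; i < maxPolls; i++) {
            String status = api.status(executionId);
            if (isTerminal(status)) {
                return status;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                throw new IllegalStateException("Interrupted while awaiting " + executionId, e);
            }
        }
        throw new IllegalStateException("Execution " + executionId + " did not finish in time");
    }

    public static void main(String[] args) {
        // In-memory stand-in that reports RUNNING twice and then FINISHED.
        NotebookApi fake = new NotebookApi() {
            private int calls = 0;
            public String start(String notebookId, String clusterId) { return "ex-1"; }
            public String status(String executionId) { return ++calls < 3 ? "RUNNING" : "FINISHED"; }
        };
        System.out.println(runAndAwait(fake, "my-notebook", "my-cluster", Duration.ofMillis(10), 10));
    }
}
```

Keeping the AWS calls behind a small interface like this also makes the route easy to exercise locally with a fake, before pointing it at a real cluster.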

So why not Airflow or some other type of tooling? Why not AWS Step Functions? Well, it comes down to simplicity, the people we already had and the tooling we already used. We already had a production-grade Kubernetes cluster running, and integrating the operator was simple; in fact, it was already deployed as part of some other tooling requirements. To trigger notebook execution we required cron-based triggers with custom functionality to check the state of previous notebook flows, plus a few other infrastructure state checks, before deciding whether or not to run a derivative computation. Camel already had the cron functionality as a component, and it certainly required less code and less preparation (in terms of IAM roles and deployment) compared to anything else.
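The pre-run checks mentioned above can be modelled as an ordered list of named conditions evaluated on every cron tick. This is a minimal Java sketch, assuming each check can be expressed as a boolean; RunPreconditions, require and firstFailure are hypothetical names, and the two checks in main are stand-ins for real state lookups.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.BooleanSupplier;

// Hypothetical pre-run gate: checks are evaluated in registration order and
// the first one that fails blocks the derivative computation.
final class RunPreconditions {

    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    // Registers a named check; returns this so that calls can be chained.
    RunPreconditions require(String name, BooleanSupplier check) {
        checks.put(name, check);
        return this;
    }

    // Returns the name of the first failing check, or empty if the run may proceed.
    Optional<String> firstFailure() {
        for (Map.Entry<String, BooleanSupplier> entry : checks.entrySet()) {
            if (!entry.getValue().getAsBoolean()) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<String> blocked = new RunPreconditions()
                .require("previous notebook flow finished", () -> true)  // stand-in check
                .require("staging bucket reachable", () -> false)        // stand-in check
                .firstFailure();
        System.out.println(blocked.map(name -> "skip run: " + name).orElse("run"));
    }
}
```

Returning the name of the failing check, rather than a bare boolean, gives the route something useful to log when a run is skipped.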

In short, for us as a mostly Java shop, requiring new tooling meant new stuff to manage (and learn to maintain, learn to work with, keep HA, update, adapt to updates, etc.). I’m not saying that learning is bad or that you should not learn new stuff, but it wasn’t necessary for our use case. By comparison, Airflow required Python and Step Functions required buying into the vendor’s way of handling things. Given some mileage, however, when you are faced with writing a medium to complex condition-checking mechanism, that kind of tooling falls short of a mature EIP framework, where you can leverage the existing connectors (all ~300 of them if you wish) and focus on the problem at hand instead.

In our use case, the short, low-code approach of Camel K on an existing Kubernetes cluster, coupled with the low-code API of EMR Notebooks, seemed to best suit the needs of orchestrating complex notebook execution workflows without having to deploy or build up new tooling. It also let us leverage the routing logic we already had in the infrastructure: we were already orchestrating raw data pipelines with the Camel/Spring version, with a dedicated flow and scheduler/routing logic, and had a sufficiently mature code base there to rely on.

If you’re coming from an enterprise approach to data processing and your team has an integration-based mindset, I would recommend trying out the idea above before jumping head-first into other technologies. You may find yourself better served by the framework you already use (given a little elbow grease, of course) than by generic tools that may lack the flexibility to meet your needs.