[Image: green, red, and white high-voltage circuit breaker]

Automate code-review chores with a simple Dangerfile

One of the most annoying chores in code review is having to say: “You forgot to [unwritten rule]”. It feels arcane, it annoys both the reviewer and the author, it builds frustration, and it sparks useless arguments, some verging on flame wars. That is, unless the rules are written down, themselves code-reviewed, and agreed upon by the team. One way to automate away the most common concerns, and keep reviews focused on the interesting logic, is to delegate the chores to the Danger bot.
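As a sketch of what such codified rules can look like, here is a minimal Dangerfile. The specific thresholds, the `app/` path, and the CHANGELOG rule are illustrative assumptions, not rules from any particular team:

```ruby
# Dangerfile — evaluated by the Danger bot on every pull request.

# Nudge authors to describe what they did and why.
warn("Please add a description to the pull request.") if github.pr_body.length < 10

# Big diffs are hard to review; ask for a split instead of arguing about it later.
warn("This PR is quite large — consider splitting it.") if git.lines_of_code > 500

# A codified "you forgot to…" rule: app changes should come with a CHANGELOG entry.
has_app_changes = git.modified_files.grep(%r{^app/}).any?
has_changelog   = git.modified_files.include?("CHANGELOG.md")
fail("Please update CHANGELOG.md when changing application code.") if has_app_changes && !has_changelog
```

Once these checks live in the repository, the rules are versioned and reviewable like any other code, and the bot delivers the nagging instead of a person.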

In one work situation, I had the opportunity to implement the first pipelines the team used and to provide some standardization going forward. Our Git flow revolved around three branches (integration, dev, and master), with code flowing from left to right: it passed through QA as it went from integration to the “development” branch (tied to a specific cloud account), and on to “master”, from which release branches were cut and later deployed to the production cloud account. The “integration” branch essentially exists so code can be checked in somewhere until we are ready to “cut the cord” and release a new version for QA and extensive development testing (with custom data and a sample of anonymized production data).
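The left-to-right promotion described above can be sketched on a throwaway repository (the branch layout matches the flow; the version number is illustrative):

```shell
set -e
git init -q flow-demo && cd flow-demo
git config user.email demo@example.com && git config user.name demo

# Start with master, then branch dev and integration off it.
git commit -q --allow-empty -m "initial commit"
git branch -M master
git checkout -q -b dev
git checkout -q -b integration

# Feature work lands on integration first.
git commit -q --allow-empty -m "feature work"

# Promote left to right: integration -> dev (QA) -> master.
git checkout -q dev    && git merge -q --no-ff -m "promote to dev" integration
git checkout -q master && git merge -q --no-ff -m "promote to master" dev

# Cut a release branch from master for the production deployment.
git checkout -q -b release/1.0.0
```

The `--no-ff` merges keep an explicit merge commit at each promotion, so the history records exactly when a batch of work crossed each boundary.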

Continue reading →
[Image: an open, empty notebook on a white desk next to an iPhone and a MacBook]

Orchestrating notebooks with Camel K, an integration-based approach

When you reach a certain level of maturity in your data-analytics pipelines, you also tend to start exploring flexible ways to orchestrate your ETL processes and derive tables for the different access patterns the business requires downstream. However, much like the object-relational impedance mismatch between the object-oriented and relational-database worlds, there is an “impedance mismatch” between data engineers and business folks when it comes to their expectations about speed of delivery, quality, correctness, and maintainability.

In the business folks' mindset, the derivative data they require is just a simple SQL query run on the “raw data”, which has infinite CPU power and infinite memory, is probably running on GPUs anyway, and will therefore return results in an instant once the query is written.
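To make the integration-based approach in the title concrete, here is a minimal sketch of what a Camel K integration driving a scheduled notebook run might look like. The gateway host, the endpoint path, and the daily schedule are assumptions for illustration, not a real deployment:

```java
// NotebookOrchestration.java — deployable with `kamel run NotebookOrchestration.java`.
import org.apache.camel.Exchange;
import org.apache.camel.builder.RouteBuilder;

public class NotebookOrchestration extends RouteBuilder {
    @Override
    public void configure() {
        // Fire once a day; the timer period is in milliseconds.
        from("timer:nightly?period=86400000")
            .setHeader(Exchange.HTTP_METHOD, constant("POST"))
            // Hypothetical gateway service that executes an ETL notebook.
            .to("http://jupyter-gateway:8888/api/run/etl-notebook")
            .log("Notebook run finished: ${body}");
    }
}
```

The appeal of phrasing orchestration as a Camel route is that retries, error handling, and fan-out to further derivative tables become ordinary integration concerns rather than ad-hoc scheduler scripts.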

Continue reading →