Data engineers and their potential to unlock business use cases

Nate Kupp, currently Director of Infrastructure and Data Science at Thumbtack, presented his success story this year in a talk entitled “From humble beginnings: building the data stack at Thumbtack”. It is one of the presentations I enjoyed most, because it resembled one of the pains I have experienced in my own day-to-day work.

One difference between Nate’s approach and mine is the executive sponsorship he had (plus a bit of luck: being in the right place at the right time, with the right management mentality). My own experience, on the other hand, is a failure from my perspective, though others see it as a small success against overwhelming odds.

What rang especially true in Nate’s presentation and in Thomas W. Dinsmore’s presentation, “The Path to Open Data Science” (or AI-enabled organizations), was their call for companies to stop hiring data scientists without a good ratio of data (and systems) engineers to sustain the infrastructure those scientists need.

To rephrase Thomas’s conclusion, data science is a particular problem of its own: you need a large number of cores for a short period of time. On-premise infrastructure, unless heavily virtualized or containerized, cannot scale up or down to that degree; it requires some kind of human intervention to set up and free those resources.

Put that idea alongside Nate’s argument: companies hire data scientists, who join expecting some toolbox, only to find there is no infrastructure and none of the standard tools (Hadoop, Spark, Presto, etc.), and who quit 1 or 2 years later, frustrated that they can’t do their work.

I’ve seen this happen too. My team used to develop and expand the data infrastructure for the employer I was working for. In total we were 3 data engineers in a company with 2 data scientists per product team, and we had 20+ products. The data scientists were doing “machine learning” (if you can call it that) on their workstations, but because they couldn’t fit all the data, they were most of the time aggressively filtering it down or working on a subset (of the time window).

Putting a data science or machine learning project into production was a practical impossibility, as there weren’t enough engineers to tackle it. In this situation it is funny to see companies accumulating freshly converted data analysts, now data scientists, on their payrolls without adequate engineering support.

To give a sensible ratio, 2:1 data engineers to data scientists is what I find normal in a company strategy. Having 2 data engineers allows for continuity (protecting against a low bus factor) while they provide both support to the data scientists and evolution of the toolset. The ratio can vary a bit in order to reduce costs, but to keep it simple I wouldn’t go below 1.5 engineers per scientist. I can’t base these numbers on anything other than gut feeling and experience.
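To make the gap concrete, here is a minimal sketch of the arithmetic, applied to my former employer. It assumes exactly 20 product teams (we had 20+, so this is a lower bound); the function names are mine and purely illustrative.

```python
import math


def engineers_needed(scientists: int, ratio: float) -> int:
    """Data engineers required to support `scientists` at `ratio` engineers per scientist."""
    return math.ceil(scientists * ratio)


def staffing_gap(engineers: int, scientists: int, ratio: float) -> int:
    """How many more engineers are needed to reach the target ratio (0 if already met)."""
    return max(0, engineers_needed(scientists, ratio) - engineers)


# Assumption: exactly 20 product teams (the real number was "20+").
product_teams = 20
scientists = 2 * product_teams   # 2 data scientists per product team
engineers = 3                    # the whole data engineering team

actual_ratio = engineers / scientists

print(f"actual ratio: {actual_ratio:.3f} engineers per scientist")
print(f"needed at 2:1 -> {engineers_needed(scientists, 2.0)} engineers")
print(f"needed at 1.5 -> {engineers_needed(scientists, 1.5)} engineers")
print(f"gap at 1.5    -> {staffing_gap(engineers, scientists, 1.5)} engineers")
```

Even at the lower 1.5 bound, that team of 3 would have been dozens of engineers short.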

During the conference and the lunch that followed, we got a chance to chat longer with Nate. One of the good ideas that came out of the conversation was his belief (which I strongly share) that data science is necessary when product revenue is stable but you can’t find a way to improve it. The data then needs to unlock your potential by automatically identifying patterns that are not easily visible to the naked eye of a business owner following some charts.

On the other hand, Nate said that if the product is not stable, and especially if it is trending down, that is not a data science problem but a marketing or product problem, and I totally agree. You need a stable baseline against which to measure the impact of data science, not a hugely varying revenue where no change can be attributed to an ML algorithm doing a better job. In fact, this is just a simple application of lean experimentation and measurement, with the caveat that the measurement should be made against a stable baseline revenue in order to draw a success story.
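One rough way to picture the stable-baseline idea is to compare the uplift you hope to measure with the noise already present in the revenue. The sketch below uses the coefficient of variation (standard deviation divided by mean) as a crude stability check. This is my own heuristic, not something Nate presented; the 5% threshold and the weekly revenue figures are made up for illustration.

```python
from statistics import mean, stdev


def coefficient_of_variation(revenue: list[float]) -> float:
    """Relative noise of a revenue series: sample standard deviation over mean."""
    return stdev(revenue) / mean(revenue)


def uplift_is_measurable(baseline: list[float], uplift_fraction: float,
                         max_cv: float = 0.05) -> bool:
    """Heuristic: the baseline must be stable (CV at or below `max_cv`, an assumed
    threshold) and the expected uplift must be larger than the baseline noise."""
    cv = coefficient_of_variation(baseline)
    return cv <= max_cv and uplift_fraction > cv


# Hypothetical weekly revenue figures, in arbitrary units.
stable_weeks = [100, 102, 99, 101, 100, 98]
volatile_weeks = [100, 80, 120, 70, 110, 60]

# Could a 3% uplift from an ML model be told apart from the noise?
print(uplift_is_measurable(stable_weeks, 0.03))    # stable baseline
print(uplift_is_measurable(volatile_weeks, 0.03))  # volatile baseline
```

With the stable series the 3% uplift stands out from the noise; with the volatile one it drowns in it, which is the point at which the problem stops being a data science problem.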

To conclude, at CrunchConf 2018 I was happy to see two presentations from two different people with different backgrounds supporting the same idea: data engineers are the people who can empower both the data scientists and the business. They are basically the missing role that can turn a management request into a coherent data pipeline serving that purpose, guided or advised by the data scientists tasked with solving a complex problem.