Unicorn data engineers & scientists, a guide to catch, keep and sh*t rainbows

This year at CrunchConf 2018 there was an interesting talk by Andrey Sharapov, a Data Engineer & Scientist at Lidl. Yes, Lidl. The store in your back alley or in your neighbourhood. Did you know it does Big Data? I assumed it did, given that a retailer wants both to minimize waste and to increase profits (e.g. how much of X does one store need to order to ensure it’s gone by EOD).

Andrey’s talk was centered around “Building data products: from zero to hero!” and I personally want to praise the realism of his presentation, which gives me content for more than one article on the subject. He’s one in a series of presenters at this year’s conference who have called out the strategy of companies that invest too much in data scientists, then find out they don’t have the infrastructure those scientists need, then try to find data engineers a bit too late in the game (and engineers are even scarcer than scientists).

As you can see from the featured image of this article, he also had a good sense of humour, fitting two data scientists in a unicorn costume. However, he makes a good point that I want to take and expand, not only to data scientists but also to data engineers.

The question begging to be asked is: “Why are companies expecting data engineers and scientists to be unicorns?”. For one, I tend to believe it’s because of the scarcity itself. Companies spend so much time finding “the perfect candidate” that when they get one they expect them to do miracles. Some do, some don’t. Most don’t. But those who don’t are mostly backstabbed by the context in which they operate (lack of resources, a lingering history of bad implementations, no team to work with, or being politically bogged down in useless minutiae).

Indeed, if you look at a data engineer’s profile, a perfect candidate is the one that has good knowledge of the latest & greatest Big Data-oriented software (Hadoop, Spark, Elasticsearch, Cassandra, Mongo, Couchbase and anything that has an “Apache” prefix attached to it).

This candidate also needs some Systems Engineering knowledge, at least of continuous integration (to build images of software packages like Spark with some random Python library such as numpy or scikit) but also of deployment (e.g. Spark requires the hdfs-site/core-site configuration from Hadoop, so one should know what environment variables are and how to start Spark with those to obtain “Hive” support in Spark).
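To make the deployment part a bit more concrete, here is a minimal sketch in Python of what “starting Spark with those environment variables” can look like. It only builds the environment and the command line, it does not launch anything. The helper name `spark_submit_command`, the script name `job.py` and the path `/etc/hadoop/conf` are made up for illustration; `HADOOP_CONF_DIR` and `spark.sql.catalogImplementation` are the usual Spark knobs, but treat the whole thing as an assumption about a typical setup, not a recipe for yours (Hive support normally also expects a hive-site.xml to be reachable).

```python
import os
import shlex


def spark_submit_command(app_path, hadoop_conf_dir, base_env=None):
    """Return (env, argv) for launching a Spark job that can see Hadoop's config.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    # Start from the current environment unless a base one is supplied.
    env = dict(os.environ if base_env is None else base_env)
    # HADOOP_CONF_DIR points Spark at the directory holding
    # core-site.xml and hdfs-site.xml.
    env["HADOOP_CONF_DIR"] = hadoop_conf_dir
    argv = [
        "spark-submit",
        # Ask Spark SQL to use the Hive catalog instead of the in-memory one.
        "--conf", "spark.sql.catalogImplementation=hive",
        app_path,
    ]
    return env, argv


# Example with an assumed config directory and an assumed job script.
env, argv = spark_submit_command("job.py", "/etc/hadoop/conf", base_env={})
print("HADOOP_CONF_DIR=" + env["HADOOP_CONF_DIR"], shlex.join(argv))
```

The returned pair could then be handed to something like `subprocess.run(argv, env=env)` on a machine where `spark-submit` is actually installed.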

Now if you look at data scientists, these people are 60-70% of the above plus a PhD thesis in mathematics. To pseudo-quote, from memory, another data scientist gone engineer, Nate Kupp (“Director of Infrastructure & Data Science @ Thumbtack” at the time of writing), who also presented “From humble beginnings: building the data stack at Thumbtack” and said something along these lines:

“I would consider a data science problem to be something a company sees as a lingering problem: product revenue is stable, but it finds it hard to increase sales. A problem that only scientific analysis of the data can reveal. If a company is going down or a product is not selling, that is not a data science problem; it’s rather a product or marketing problem that science can’t fix.” (quote reproduced from memory, from a conversation over the lunch break).

Are data engineers and scientists unicorns? No! They’re people, hard-working ones, as it takes a good deal of effort to understand all the diversity, quirks and patterns of the technology landscape that they need to use in order to solve these problems. But in the end, they’re people who are able to rise when assembled in a team of similar engineers & data scientists working on a product.

Not when left alone in some random corner of the office with a Jupyter notebook or IntelliJ IDE, expected to “come up with that million dollar idea” while lacking any support from the product owners or the resources to do their job easily (a.k.a. the infrastructure for data scientists, which the data engineers, who also act as systems engineers, have to build), and treated in isolation on the idea that unicorns are rarely found in nature.

To conclude, I would say this: first of all, people who look and act like unicorns are NOT unicorns. But that’s the image companies and HR departments put on them, because they’re few and far between. What has to change here is the mentality of management/HR with regard to these people.

And I would advise said management: got a unicorn (engineer/scientist) in your team? Shut up and listen wisely! I will not say blow a million dollars on this mad scientist’s or engineer’s idea, but follow a lean approach to management: do small experiments with his/her help, test the revenue impact they can bring, and learn from them how to “scale” the experiment for your business.

It may take some time, but given your unicorn and a small product and technical support team built around this fantastic wildebeest known as a “data engineer/scientist unicorn”, you may be, literally, sh*tting rainbows of cash if you let them articulate your revenue problem against the inflow of data that hogs your operational expenses budget (and in my experience, most will recommend you delete 80% of your data because only 20% is relevant to your business model).

Why am I being so crude and direct as to say shut up? Because most engineers and scientists, given the leverage to side-step a little into other domains (engineering or mathematics, systems engineering/DevOps, product ownership and people management), will transform themselves, magically, into an even rarer Pokémon known as a data architect, able to coordinate your DevOps, scientists and engineers towards the goal of reducing the expenses of data storage & processing while getting a return on investment from putting that flow of data to work.
