Unicorn data engineers & scientists, a guide to catch, keep and sh*t rainbows

This year at CrunchConf 2018 there was an interesting talk by Andrey Sharapov, a Data Engineer & Scientist at Lidl. Yes, Lidl. The store in your back alley or in your neighbourhood. Did you know it does Big Data? I assumed so, given that one wants to both minimize waste and increase profits (e.g. how much of X does one store need to order to ensure it’s gone by EOD).

Andrey’s talk was titled “Building data products: from zero to hero!” and I would personally like to praise the realism of his presentation, which gives me content for more than one article on the subject. He’s one in a series of presenters at this year’s conference who called out companies’ strategy of investing too much in data scientists, then finding out they don’t have the infrastructure those scientists need, then trying to find data engineers (who are even scarcer than scientists) a bit too late in the game.

Continue reading →

On workflow engines and where Airflow fits in

At CrunchConf 2018 there was a presentation on “Operating data pipeline using Airflow @ Slack” by Ananth Packkildurai. If you don’t know what Airflow is, it’s a workflow engine along the lines of Oozie and Azkaban. It’s based on the concept of a DAG (directed acyclic graph), which you write in Python and execute on a cluster.
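To make the DAG idea a bit more concrete, here is a minimal sketch using only Python's standard library (`graphlib`, Python 3.9+). It is not Airflow's API: the task names and the `run_order` helper are made up for illustration, and a real workflow engine adds scheduling, retries and distributed execution on top of this ordering step.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_order(dependencies):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(dag)
print(order)
```

Because no task is listed before the tasks it depends on, a scheduler can walk this order and run each step once its upstream steps are done. If the dependencies contained a cycle, `graphlib` would raise a `CycleError`, which is the "acyclic" part of the name.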

As with the Kafka presentation by Tim Berglund, we asked the hard questions and they got popular pretty quickly. In the case of Airflow, within the ecosystem of workflow engines, we had quite a heavy question.

Continue reading →

On Kafka’s place in the MQ landscape

Just got back from CrunchConf 2018. A good panel of speakers and an interesting conference. Lots of food and drinks. Good atmosphere, helpful organizers. Fun times, good memories. The conference was a blast with most of my questions hitting the top votes with a little help from the community.

During the conference I decided that I would share my thoughts on the presentations, at least on those that were intriguing and those where my questions got the top votes. All in all, I would like to praise good presentations, devoid of hype and commercialism. There seems to be some hype in today’s world around Big Data projects, with the naive jumping ship to the next cool project.

Continue reading →

Going to the Crunch Data Engineering and Analytics Conference, 29-31 October 2018 in Budapest

I remember that in 2016 my current employer gave me the opportunity to go to Cassandra Summit 2016 in San Jose. An exhausting 30-hour flight, tons of preparations for the US visa a few weeks ahead, a booking mistake that I had to pay for with my own card until it was fixed, and many more “troubles” later, I was finally there.

The thing about some conferences is that not all presentations are put online. In the case of Cassandra Summit 2016, the DataStax community provided recordings of all the conference presentations, but this is not true for most conferences. That was nice of them to do for the community, as such material can be referenced later.

Continue reading →

Re-learning to blog

A few years ago I used to blog. I believe it was 10 to 15 or so years ago. It was tempting. I was a PHP developer working my way through the inter-webs and it was interesting. Everybody blogged. Blogging was hype. Blogging was awesome. You didn’t have a blog, you didn’t exist. Kind of like Facebook or Instagram now.

In the meantime, I grew older and, I do hope, more mature. Started a family. Have a 1.6-year-old toddler. Thinking about blogging at night. But more than that, thinking about sharing experience. Of what I know, what I’ve tried, what’s interesting to follow, what to avoid.

Continue reading →