On data architects and their cost-saving role of fitting management requests in a puzzle of infrastructure and human resources

As stated earlier, Andrey Sharapov’s presentation “Building data products: from zero to hero!” has given me plenty to talk about, and I can’t seem to run out of ideas for things to write. Probably because, at my current employer at the time of writing, I’ve been through the same pains as the guys at Lidl. Those pains are centered on the management of data people, be they engineers or scientists.

The featured image of this article presents a young female manager asking “How are you?” and getting a cryptic reply: “About half a standard deviation below the mean …”, which in some languages would pass for swearing or rude behaviour. In all honesty, although both speak English, these people come from very different backgrounds, with slight variations in vocabulary and understanding. That doesn’t make one less educated than the other: management is a hard discipline too, as it involves areas of sociology, economics and more.

But back to our troubles here. What I’ve found in common between Andrey’s presentation and the one from Christoph Reininger, “From Data Science to Business Science – How data scientists @ Runtastic translate stakeholder needs”, is the lack of “translators” between the two parties. To say it out loud: management has a way of not being able to articulate what it wants, while data science has a way of guesstimating (using machine learning, most probably) what management wants.

When the two combine, a lack of articulation on one side and machine learning-backed guesstimation on the other, the result can be simply translated as wasted time and money: wasted emails, big beefy machines with lots of cores or, if you’re in the cloud these days, an empty wallet. The bottom line? Zero budget for the next product iteration, as we just lost a million bucks on a data-science problem that didn’t exist before but does now.

So where, or who, are the translators? Here it becomes scary. There isn’t a formal definition of a role in any company called the “data translator” or “management translator”. It gets even scarier: management “wants” something it can’t articulate, and data scientists ask engineers for software they guesstimate they need based on that unarticulated need. The data engineers get to work, costing dollars and time, building complex systems around a shadow of an idea, all in the effort of getting management what they think it wants.

And I think this is where the problem lies. Data engineers and scientists think that all that is needed to answer The Question The Management Put Up is Big Data and ML algorithms, while at the other end of the spectrum management thinks that all that capital and operational expense should add up to A Good Answer to whatever problem with the product they can’t quite articulate. No, it doesn’t. Neither side has the full picture.

In all honesty, what’s missing is design or architecture, a gap also familiar from the realm of requirements engineering. What is requirements engineering? Wikipedia has a definition, but I like to put it more simply: “putting the engineering process into the design of the requirements and obtaining feedback from management on whether, given the costs, the proposed solution is still what they want to pursue”.

To tell a personal story, in my day-to-day job I’ve been confronted with similar situations before. It arrived as a story titled “We need a user profile!”, and work started, pushed from behind by management, with me having little to say at the time. At some point I started asking “Who is it for?” and got a very high-level reply: “The user acquisition guys, who know little or no SQL. They need something simple, like a dashboard”.

Fast forward a few months of naive belief in my own management, watching the project get delayed and grow more complex as we tried to make it dead simple and instantly fast, I went back and asked the question again: “Who is it for?”. This time I got a different answer: “For the BI guys doing the dashboard, backing the UA tool.”

Waaaaaaaaaaaaaaaait a second. There’s one cost to building a brain-dead simple tool, which takes a budget X and a time Y that both need to be close to infinite, because making something that simple takes far more effort than required. And there’s another cost to serving the BI guys, who in our case were good enough Java & SQL developers working on custom dashboarding for our, well, custom management request: people who knew how to operate a REST or Java API given one.

So what was missing here, before I started to ‘get my grades’ as a data architect and realized that Conway’s Law was in effect (the architecture resembled the communication patterns of the organization), was the process of requirements engineering: defining the stakeholders, the needs rather than the wants, the SLAs/SLOs of the service, the DR plans and more.

When you are in a position to architect a system, it is best to take a step back and check that you have the full picture, with a good level of detail for each puzzle piece. If something’s missing, don’t start work on it with your engineering team until you have clearly defined what that piece of your architecture needs to do. This was one of the lessons I got to learn the hard way, through much pain.
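To make the “is anything missing?” check concrete, here is a minimal sketch of it as a checklist in Python. The `RequestBrief` class and its field names are purely illustrative assumptions, loosely based on the items listed above (stakeholders, needs, SLA/SLOs, DR plans) plus a budget; this is not a formal requirements-engineering template.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class RequestBrief:
    """Hypothetical checklist for one management request (field names are illustrative)."""
    stakeholders: Optional[str] = None  # who will actually use the result
    need: Optional[str] = None          # the need, not the want
    slo: Optional[str] = None           # e.g. a freshness or latency target
    dr_plan: Optional[str] = None       # what happens when it breaks
    budget: Optional[str] = None        # cost ceiling agreed with management


def missing_pieces(brief: RequestBrief) -> List[str]:
    """Return the names of the puzzle pieces that are still undefined or empty."""
    return [f.name for f in fields(brief) if not getattr(brief, f.name)]


def ready_to_build(brief: RequestBrief) -> bool:
    """Engineering work starts only when nothing is missing."""
    return not missing_pieces(brief)


# A request that only states the need is not ready for the engineering team.
brief = RequestBrief(need="a user profile")
print(missing_pieces(brief))
print(ready_to_build(brief))
```

The point is not the code but the gate: every empty field is a question to take back to management before anyone starts building.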

So basically, what I failed to do back then was to translate management’s requests into costs for the overall “user profile” project, which had to sift through some 1PB of compressed data spanning 5 years to be of any reliability. We designed for that at a huge cost in time and money, missing the following considerations:

  • management asked for a user profile built from our data. We jumped to it and scaled for 1PB of data without realizing that the longest user lifetime in our products was 90 days, after which the user would only use the product very sporadically. So the time window needed to be at least 90 days but no longer than a small multiple of that (2x/3x), which is far less than 5 years;
  • management asked for a user profile but forgot to state the level of “perfection” they wanted, so we went crazy with millisecond precision when “day” turned out to be sufficient in the end (and we realized that on our own);
  • management asked for a user profile but didn’t ask for “real-time”. Yet we didn’t ask that question either. So, thinking they would want it, we ended up designing a system that did “real-time” + “batch” updates of the profile. It turned out to be quite complex: it relied on Apache Ignite and a series of micro-services holding the business logic, with data piped through them via REST APIs for batch and MQs for real-time. Crazy if I think of it today …
  • it is worth noting that thousands of lines of code, if not tens of thousands, were eventually replaced by 4 simple SQL statements in Spark, writing the data out as ORC files to support later “edits”. All of it was done by one of our data analysts with a naive knowledge of Spark, instead of using our high-cost data engineers’ time. But this happened waaaaay too late in the process to catch the opportunity deadline.
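To give a feel for how small the “batch, bounded window, day precision” version can be, here is a minimal sketch. It is not the actual Spark job: it uses Python’s built-in `sqlite3` as a stand-in for Spark SQL, and the `events` and `user_profile` tables, their columns and the 3x multiplier are all assumptions made for illustration.

```python
import sqlite3
from datetime import date, timedelta

# Assumption: 3x the 90-day longest user lifetime mentioned above.
WINDOW_DAYS = 3 * 90


def build_profiles(conn: sqlite3.Connection, today: date) -> list:
    """Rebuild a day-granularity profile table in one batch pass."""
    # Only events inside the bounded window are scanned, not 5 years of history.
    cutoff = (today - timedelta(days=WINDOW_DAYS)).isoformat()
    conn.execute("DROP TABLE IF EXISTS user_profile")
    conn.execute(
        "CREATE TABLE user_profile ("
        "user_id INTEGER PRIMARY KEY, first_seen_day TEXT, "
        "last_seen_day TEXT, active_days INTEGER)"
    )
    conn.execute(
        """
        INSERT INTO user_profile
        SELECT user_id,
               MIN(date(event_ts)),            -- millisecond timestamps truncated to days
               MAX(date(event_ts)),
               COUNT(DISTINCT date(event_ts))
        FROM events
        WHERE date(event_ts) >= ?
        GROUP BY user_id
        """,
        (cutoff,),
    )
    return conn.execute("SELECT * FROM user_profile ORDER BY user_id").fetchall()


# Hypothetical sample data: user 2 was last seen years ago and falls outside the window.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_ts TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        (1, "2024-06-01 10:00:00.123"),
        (1, "2024-06-01 18:30:00.456"),
        (1, "2024-06-03 09:00:00.000"),
        (2, "2019-07-01 12:00:00.000"),
        (3, "2024-06-29 23:59:59.999"),
    ],
)
for row in build_profiles(conn, today=date(2024, 6, 30)):
    print(row)
```

Each of the three unasked questions shows up as one small choice here: the `WHERE` clause is the time window, `date(event_ts)` is the precision, and rebuilding the table on each run is the batch-only answer to “real-time?”.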

While it’s understandable that management cannot know these peculiarities, it is also understandable that the data guys are soooooo excited with their toys and are in the mindset of such complicated and distributed systems that they go on to build the most intricate things, that either fall short of the opportunity deadline or have little to no value. It’s the fault of both worlds and they should try to understand each other better.

So who are these ‘translators’? As stated in the previous article on the unicorn status of data engineers and scientists, they are the data architects, a role for which LinkedIn job postings are few, if not missing entirely, in your country. A new “name” and whizbang high-paying role? No. A need, I would say. Why data architects?

Because their role is to oversee the simplicity, scalability and good integration of the technology puzzle pieces that make up a company’s data infrastructure and governance. Puzzle pieces which cost money! Given that role, they’re the first line of defense against bad implementations of projects (or even bad projects altogether), asking the right questions so that the “new management request” fits in well with the existing puzzle pieces or reuses some of the pieces already deployed in the infrastructure. Because that’s why we do software in the first place: to reuse stuff without much cost.

Oddly enough, and sad to say, the community lacks sufficient training and puts too much emphasis on engineers and scientists hired into every team, rather than on a dedicated ‘Data Analytics’ department led by one or more data architects. If there are several, they should form an odd-numbered team, to avoid split-brain scenarios, analysis paralysis and dictatorship, and for bus-factor reasons.

To quote Wikipedia: “A data architect is a practitioner of data architecture, an information technology discipline concerned with designing, creating, deploying and managing an organization’s data architecture. Data architects define how the data will be stored, consumed, integrated and managed by different data entities and IT systems, as well as any applications using or processing that data in some way. It is closely allied with business architecture and is considered to be one of the four domains of enterprise architecture.”

The key words here are “closely allied”, by which I understand a close grasp of management requirements and the translation of those requirements into the technology stacks and workflows needed to reach the business goals within the budgeted or expected cost of the solution.

So what’s a data architect? If I’m allowed to give my definition: “A data engineer (or scientist, but it’s a bit harder) with sufficient knowledge of systems engineering, data science/analytics and engineering to apply the formal process of requirements engineering to each and every new management request, so that the costs of adapting to or developing these requests into a reality are well understood by all stakeholders (management included) before the first line of code is written and before the request reaches the excited engineers and scientists, whom this architect has to coordinate, through the engineering and science leads, towards the stated goals.”

I will leave you with the skills a data architect needs to have, and let you judge for yourself whether you have this role in your company today and how much money you’re losing. And as I’m finishing this article, a recently posted Slashdot article on data scientists continues to emphasize the science, rather than the architecture and process (which also includes the science) of fitting that data into a use case that serves the people better …