My All-Time Most Viewed Stories
What an amazing story written with such great clarity - a decade of work summarized! It's always exciting to know how different companies approach problems of data engineering. We must all realize that there's no one "right" way of doing data engineering.
Couldn't agree more with Snowplow's observation that measuring the accuracy and completeness of data is a process rather than a project:
"We won’t lie — measuring the accuracy and completeness of your data is not easy and is a process rather than a project. It starts at the very beginning: how you collect your data. And while it never ends, quality grows with better-defined events going into your pipeline, data validation, surfacing quality issues and testing."
Good read about data quality!
Loved this piece - very informative and full of accurate observations about data engineering teams. I have been working on a similar Data Vault / Data Warehouse setup recently. I could relate.
By the way, kudos to the analogy department!
This article talks about some of the major challenges that data engineers face while creating and maintaining data lakes.
I have dealt with five out of the six challenges that have been discussed here, not necessarily using Spark though.
Must read for data engineers!
I wasn't aware of Jonathan Mann's existence, but I was following more or less the same philosophy when creating music. During the last year, I might have completed more than a hundred compositions. I didn't write the songs myself, though. I just set poems by my favorite poets to music.
Most of the compositions were mediocre, some were really bad, but there were a few which were pretty good. For me, the ratio might have been 75-20-5, but I guess this might get better if I keep at it. Maybe I'll also start posting my stuff somewhere in a year or two.
Fantastic article! You have walked through the whole process in quite some detail. Many years ago, I did something similar, though the scale wasn't as big. The difference was that we were not just doing a database upgrade on MySQL; we were also migrating from one schema to another (as the app was moving from a monolith to a microservices-based architecture), which added another level of complexity to the process. Although those were a tense couple of weeks, we were able to do it quite well by the end of it.
A fascinating read for anyone who loves working with databases at scale!
From Toby Skyes, Global Head of Data Engineering, QuantumBlack.
“There’s certainly more discipline in how analytics solutions are being built, as many are being created with an expectation that the data pipelines eventually scale beyond the initial project scope,” explains Toby, “This means software engineering expertise has become vital, not just in ensuring applications are developed in a robust, scalable way but also in determining speed to development, deployment and faster reuse for future models/data consumption. Traditional software engineering and DevOps have been welcomed into the data science world, enabling a faster path from experimentation to production, more effective reusability of both code and data assets, and more robust, resilient solutions (e.g. DataOps).”
Testing data systems is painful. Invariably, it turns out to be an overly complicated exercise where you not only have to test the behavior of the SQL queries, ETL scripts, orchestrators, and so on, but you also have to test the data itself. Although testing the data introduces quite a lot of complexity in the testing process, it adds disproportionately to the value derived from the testing effort. That’s what makes it worth doing.
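To make "testing the data itself" a little more concrete, here is a minimal sketch in plain Python. It is only an illustration, not a recommended framework: the `check_rows` function, the `order_id` and `amount` columns, and the three rules (no missing keys, no duplicate keys, no negative amounts) are all hypothetical and chosen just to show the idea.

```python
from typing import Any


def check_rows(rows: list[dict[str, Any]]) -> list[str]:
    """Return a list of human-readable data quality problems.

    Hypothetical rules for an assumed "orders" data set:
    order_id must be present and unique, amount must be non-negative.
    """
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        order_id = row.get("order_id")
        if order_id is None:
            problems.append(f"row {i}: order_id is missing")
        elif order_id in seen_ids:
            problems.append(f"row {i}: duplicate order_id {order_id}")
        else:
            seen_ids.add(order_id)

        amount = row.get("amount")
        if amount is None or amount < 0:
            problems.append(f"row {i}: invalid amount {amount!r}")
    return problems


# Made-up sample rows containing one problem of each kind.
sample = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 1, "amount": 5.0},     # duplicate key
    {"order_id": 2, "amount": -3.0},    # negative amount
    {"order_id": None, "amount": 7.5},  # missing key
]

problems = check_rows(sample)
for problem in problems:
    print(problem)
```

Checks like these test the contents of a table rather than the behavior of the code that produced it, which is why they sit alongside, and not in place of, tests for queries and ETL scripts.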
While it seems extremely important to test data systems, not all teams invest in this direction early enough. Rather than having test coverage from inception, most…
In this tutorial, you are going to learn about QuestDB SQL extensions which prove to be very useful with time-series data. Using some sample data sets, you will learn how designated timestamps work, and how to use extended SQL syntax to write queries on time-series data.
Traditionally, SQL has been used for relational databases and data warehouses. In recent years there has been an exponential increase in the amount of data that connected systems produce, which has brought about a need for new ways to store and analyze such information. …