My All-Time Most Viewed Stories
In an earlier post in my series on automation testing, I talked about the broad categories of automation tests to write for databases and warehouses. I'll go over specific types of tests in a later post in this series. For now, you can get the gist of it here.
This post will talk about the common problems data engineers face when testing databases and warehouses. I have first-hand experience with most of the problems I'll discuss; the rest I learned about through discussions with my awesome friends and colleagues in the data engineering space. Let's get started.
There are many reasons why reacting to time series data is useful, and usually, the quicker you can respond to changes in this data, the better. The best tool for this job is easily a time series database, a type of database designed to write and read large amounts of measurements that change over time.
In this tutorial, you will learn how to read data from a REST API and stream it to QuestDB, an open-source time-series database. We will use Grafana to visualize the data and set up alerting to notify Slack about changes that interest us. …
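The polling half of that pipeline can be sketched in a few lines. This is a minimal illustration, not the tutorial's actual code: the REST endpoint URL, the `readings`/`temp`/`ts_ns` response shape, and the `sensor_data` table name are all made up for the example. It assumes QuestDB is listening for InfluxDB Line Protocol on its default TCP port 9009.

```python
import json
import socket
import urllib.request

QUESTDB_HOST = "localhost"
QUESTDB_ILP_PORT = 9009  # QuestDB's default InfluxDB Line Protocol TCP listener

def to_ilp(table: str, fields: dict, ts_ns: int) -> str:
    """Build one InfluxDB Line Protocol row: '<table> <field>=<value>,... <timestamp_ns>'."""
    body = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{table} {body} {ts_ns}\n"

def stream_once(api_url: str) -> None:
    """Poll a (hypothetical) REST endpoint once and forward each reading to QuestDB."""
    with urllib.request.urlopen(api_url) as resp:
        payload = json.load(resp)
    with socket.create_connection((QUESTDB_HOST, QUESTDB_ILP_PORT)) as sock:
        for row in payload["readings"]:  # assumed response shape
            line = to_ilp("sensor_data", {"temperature": row["temp"]}, row["ts_ns"])
            sock.sendall(line.encode())
```

In practice you would run `stream_once` on a timer (or use a streaming API if the source offers one), and point Grafana at QuestDB as a data source to chart the `sensor_data` table and drive the Slack alerts.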
What an amazing story written with such great clarity - a decade of work summarized! It's always exciting to know how different companies approach problems of data engineering. We must all realize that there's no one "right" way of doing data engineering.
Couldn't agree more with Snowplow's observation about measuring the accuracy and completeness of data being a process rather than a project —
"We won’t lie — measuring the accuracy and completeness of your data is not easy and is a process rather than a project. It starts at the very beginning: how you collect your data. And while it never ends, quality grows with better-defined events going into your pipeline, data validation, surfacing quality issues and testing."
Good read about data quality!
Loved this piece - very informative and full of accurate observations about data engineering teams. I have been working on a similar Data Vault / Data Warehouse setup recently. I could relate.
By the way, kudos to the analogy department!
This article talks about some of the major challenges that data engineers face while creating and maintaining data lakes.
I have dealt with five out of the six challenges that have been discussed here, not necessarily using Spark though.
Must read for data engineers!
I wasn't aware of Jonathan Mann's existence, but I was following kind of the same philosophy when creating music. Over the last year, I might have completed more than a hundred compositions. I didn't write the songs myself, though; I just composed music for poems by my favorite poets.
Most of the compositions were mediocre, some were really bad, but there were a few which were pretty good. For me, the ratio might have been 75-20-5, but I guess this might get better if I keep at it. Maybe I'll also start posting my stuff somewhere in a year or two.
Fantastic article! You have walked through the whole process in quite some detail. Many years ago, I did something similar, though the scale wasn't as big. The difference was that we were not just doing a database upgrade on MySQL; we were also migrating from one schema to another (as the app was moving from a monolith to a microservices-based architecture), which added another level of complexity to the process. Although those were a tense couple of weeks, we pulled it off quite well in the end.
A fascinating read for anyone who loves working with databases at scale!
From Toby Sykes, Global Head of Data Engineering, QuantumBlack.
“There’s certainly more discipline in how analytics solutions are being built, as many are being created with an expectation that the data pipelines eventually scale beyond the initial project scope,” explains Toby, “This means software engineering expertise has become vital, not just in ensuring applications are developed in a robust, scalable way but also in determining speed to development, deployment and faster reuse for future models/data consumption. Traditional software engineering and DevOps have been welcomed into the data science world, enabling a faster path from experimentation to production, more effective reusability of both code and data assets, and more robust, resilient solutions (e.g. DataOps).”