
Data Quality and Testing Frameworks

A short introduction to open-source data quality & testing tools — dbt, Deequ, and Great Expectations

Kovid Rathee · Published in Cognizant Servian · May 11, 2022

Background

Data quality has always been an important theme while building data systems. Still, data engineering teams of the past have had bigger fish to fry: making sure the data gets from one place to another (in whatever shape or form). Data quality often surfaces as a problem only when the data has already reached its destination in a data pipeline you engineered.

It soon becomes a pain because the data you so painstakingly moved between two worlds can't help you extract the value you expected. Why? Because some columns have null values, some mix a variety of date formats, and others carry a grab bag of data quality issues that couldn't have been caught by manually checking a sample of the data.

The data quality problem is then offloaded to the data analysts and analytics engineers to solve. I believe it is not their problem to solve. You should build quality checks and tests into the pipeline that moves the data around, reshapes it, and, in some cases, summarizes it. The moment you detach quality and testing from the pipeline, issues arise.

Getting Started with Data Quality

It won't be presumptuous to assume that if you have a reasonably mature deployment system in place, you already have some unit and integration tests for your code. They might not be related to the quality of data at all. Adding the capability to test data within your pipeline involves choosing the right tools. You'll be tempted to write your own SQL queries or Python functions to test data. I know because I have been that person. If you write custom tests for all of your data, it will become its own separate project and eat up quite a lot of your time. Don't do that.

Use templates and try to write the minimum number of tests to cover the highest number of use cases while still maintaining the individuality of the tests. Tools like dbt, Great Expectations, PyDeequ, a combination of Cucumber, Gherkin, and Jinja2 with pytest, etc., allow you to do that. You can go the commercial route and use something like Ataccama ONE, Informatica Data Quality (IDQ), etc. Still, in this article, I’ll take you through some of the most widely adopted open-source data quality frameworks, such as Deequ, Great Expectations, dbt, etc. At the end of the article, you’ll hopefully be more aware of the DQ space and possibly in a better position to make data quality central to your data pipelines.

Great Expectations

This tool is built on the core abstraction of an Expectation, which, essentially, describes the expected shape of the data (data type, nullability, precision, etc.). The reason for this tool's existence is that most of the automated testing tools at the time only supported tests for code; none of them offered tests for data. This guiding principle is quite apparent from the article announcing the release of Great Expectations back in 2018:

Instead of just testing code, we should be testing data. After all, that’s where the complexity lives.

If you wanted to test the data, you’d probably have to write custom SQL queries on every layer of your data pipeline. We all know how much fun that can turn out to be. Data engineers and analysts, more or less, come to terms with messy data after a while and write SQL to fix their data pipeline jobs and reports (while crying their hearts out). A lot of other less technical and business-facing roles might not be able to fix data with SQL queries, worsening the already terrible situation. From simple data pipelines to reports and dashboards, from MLOps to transactional databases, Great Expectations can fit anywhere, which is why it has seen widespread adoption in the last few years.

Some Expectations

Great Expectations allows you to define Expectations in a JSON file or inline with your code. Below are some examples of in-line Expectations for a survey data set, where you'll see a number of data quality aspects being checked.

Some types of Expectations.
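Since the original code embed isn't reproduced here, below is a minimal sketch of what such in-line Expectations could look like with the Pandas-backed Dataset API. The file name survey_data.csv, the column names (respondent_id, age, state), and the accepted values are illustrative assumptions, not taken from the original example.

```python
import pandas as pd
import great_expectations as ge

# Load the hypothetical survey data set and wrap it so Expectations can be declared on it
survey_df = ge.from_pandas(pd.read_csv("survey_data.csv"))

# Respondent IDs should never be null and should be unique
survey_df.expect_column_values_to_not_be_null("respondent_id")
survey_df.expect_column_values_to_be_unique("respondent_id")

# Ages should fall within a plausible range
survey_df.expect_column_values_to_be_between("age", min_value=18, max_value=100)

# State codes must come from a known set of values
survey_df.expect_column_values_to_be_in_set("state", ["VIC", "NSW", "ACT", "QLD"])

# Re-run everything declared so far and inspect the combined result
results = survey_df.validate()
print(results)
```

Each expect_* call evaluates immediately against the data and returns its own result, while validate() re-runs the full set of Expectations declared on the DataFrame and returns a combined result you can act on in a pipeline.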

The above example uses pre-defined Expectations for the Pandas backend. Great Expectations has backend support for Pandas, Spark, and SQLAlchemy, and you can check which Expectations are available for your backend of choice in the Great Expectations documentation.


If you don't find what you need in the list of pre-defined Expectations, you can always create your own.

Integrating Great Expectations Into Your Data Workflow

Whether you're feeding data into a SageMaker model, a feature validation engine, a set of reports on top of a Snowflake data warehouse, an ML workflow engine, a data orchestrator like Airflow, a data lake on Databricks, or even a transactional database (traditional or otherwise), the only reasonable way to use Great Expectations is to run it as part of your data or CI pipeline.

Whatever your tools and workflow, you can find a way to integrate Great Expectations into them. I was recently using Great Expectations with PySpark in a JupyterHub setup, and I found a blog post by one of my ex-colleagues and a Docker-based quick-start tutorial by Dataroots very useful.
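If you're in a similar PySpark setup, here's a minimal sketch of that kind of usage, assuming the legacy SparkDFDataset wrapper and an existing PySpark DataFrame named survey_spark_df (the DataFrame and column names are illustrative):

```python
from great_expectations.dataset import SparkDFDataset

# Wrap an existing PySpark DataFrame so Expectations can be declared directly on it
survey_ge_df = SparkDFDataset(survey_spark_df)

# Declare a couple of Expectations on the wrapped DataFrame
survey_ge_df.expect_column_values_to_not_be_null("email")
survey_ge_df.expect_column_values_to_be_in_set("state", ["VIC", "NSW", "ACT", "QLD"])

# Evaluate everything declared so far and inspect the structured result
validation_result = survey_ge_df.validate()
print(validation_result)
```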

dbt

In this blog post, dbt presents a good analogy for a full-scale data pipeline and the different checkpoints where your pipeline needs testing, data quality checks, and validation. Ideally, such a checkpoint should exist wherever data is being ingested, pushed out, or transformed. Using dbt_utils, you can perform both structural and data tests for your models and transformations. You can start with this post by my colleague, where she talks about setting up basic tests and data quality checks using dbt.

Some Data Quality & Testing Examples using dbt

Using the survey_data data set as an example, you can define more or less the same tests that you specified using Great Expectations. dbt uses YAML files to define tests. Here's how you can define four types of data quality tests using the dbt_utils package (a sample YAML file follows the list below):

  • accepted_range — To validate whether the column value lies within the expected range.
  • not_accepted_values — To check whether the column contains unacceptable values.
  • not_null_where — A simple IS NOT NULL quality check for any given column of your table represented as a dbt model.
  • unique_combination_of_columns — As dbt integrates tightly with data warehouses and lakes, it also provides data quality checks geared towards specific types of data models that you can use generically. Engineers use this check to ensure that a combination of columns is unique, for example, before creating surrogate keys in a data warehouse.
GitHub Gist for a sample YAML file for defining tests in dbt.
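That Gist isn't reproduced here, so below is a minimal sketch of what such a schema.yml could look like for the survey_data model, assuming dbt_utils is installed via packages.yml. The column names (respondent_id, survey_id, age, state, email, completed) and the specific ranges and values are illustrative assumptions.

```yaml
version: 2

models:
  - name: survey_data
    columns:
      - name: age
        tests:
          # Ages should fall within a plausible range
          - dbt_utils.accepted_range:
              min_value: 18
              max_value: 100
      - name: state
        tests:
          # Placeholder values should never make it into the model
          - dbt_utils.not_accepted_values:
              values: ['UNKNOWN', 'N/A']
      - name: email
        tests:
          # Email must be present for completed surveys (completed is an assumed column)
          - dbt_utils.not_null_where:
              where: "completed = true"
    tests:
      # The combination used to build a surrogate key must be unique
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - respondent_id
            - survey_id
```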

dbt Helps with Testing & Data Quality of Transformation Layers

dbt offers a deeper level of testing with very useful predefined tests for data warehouse modeling techniques like dimensional modeling and Data Vault 2.0. Tests and data quality checks involving aggregates, case statements, table comparisons, and summaries are extremely useful in the transformation layers of the data pipeline. A few examples of what you'd want to check in a hypothetical scenario where you have a data lake in S3, a staging layer, and a data warehouse in Snowflake are as follows:

  • When moving data from S3 to the Snowflake staging layer, you should perform basic data quality checks like checking for NULLs, column formats, data types, etc.
  • When moving data from the Snowflake staging layer to the Data Vault, you should also check for column counts and other aggregates for loose integrity checks.
  • When moving data from the Snowflake Data Vault to a dimensional schema, you should check again for column counts in the source and the target. The more complex, individual row-level tests are performed with the help of introspective macros, which allow you to iterate over the result of a query.

dbt provides several helper libraries that you can use to achieve this testing and data quality functionality.
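For the loose integrity checks mentioned above, dbt_utils ships table-comparison tests you can drop into the same YAML files. A minimal sketch, assuming a staging model named stg_survey_data that loads into a warehouse model named survey_data (both model names are illustrative):

```yaml
version: 2

models:
  - name: survey_data
    tests:
      # The warehouse model must have exactly as many rows as the staging model it was built from
      - dbt_utils.equal_rowcount:
          compare_model: ref('stg_survey_data')
```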

Deequ

If you're working with Spark in any way or form (EMR, Glue, Databricks, Jupyter notebooks, etc.), you've probably heard of Deequ, a Spark-native library for unit testing and measuring data quality. This utility comes from AWS Labs, which has also released a Python avatar of Deequ called PyDeequ, given the popularity of PySpark. Deequ takes a slightly different approach than dbt and Great Expectations.

Analyze, Suggest, Verify, and Store

Deequ lets you profile and validate data using Analyzers. In addition to the profile summaries, Deequ also helps you with field-level data validation. Based on the profiles of your data and some heuristic rules, Deequ’s Constraint Suggestion functionality suggests what constraints you should be running in your Verification suite to enforce a quality check on your data. Here are a couple of examples of the suggestions Deequ might make for your data set:

  • Completeness — Measures the presence of null values, for example isComplete("email") or isComplete("phone").
  • Consistency — Consistency of data types and value ranges, for example, isContainedIn("state", Array("VIC", "NSW", "ACT", "QLD")).
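In PyDeequ, the suggestion step looks roughly like the sketch below, assuming a SparkSession named spark and a survey DataFrame named survey_df (this is a sketch of the standard suggestion runner, not code from the original article):

```python
import json
from pydeequ.suggestions import ConstraintSuggestionRunner, DEFAULT

# Profile the data and let Deequ propose constraints based on its heuristic rules
suggestion_result = (ConstraintSuggestionRunner(spark)
                     .onData(survey_df)
                     .addConstraintRule(DEFAULT())
                     .run())

# Each suggestion names the column, describes the constraint, and includes ready-to-use check code
print(json.dumps(suggestion_result, indent=2))
```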

For a more detailed explanation of the data validation and verification process, and the philosophy behind the design choices and types of constraints, read this paper, which covers the research done at Amazon that made Deequ possible.

In addition to the suggested checks, there is a host of checks available that you can add to the Constraint Verification phase. You can write your own checks too. Here's a simple example of how you'd run checks using the VerificationSuite in PyDeequ on the survey_data dataset, with tests similar to the ones performed using dbt and Great Expectations:
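The original embedded example isn't reproduced here, so the following is a minimal sketch of such a run, again assuming a SparkSession named spark, the survey_df DataFrame, and illustrative column names and accepted values:

```python
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# A named group of constraints; failures at this level are treated as errors
check = Check(spark, CheckLevel.Error, "survey_data checks")

check_result = (VerificationSuite(spark)
                .onData(survey_df)
                .addCheck(check
                          .isComplete("respondent_id")   # no nulls allowed
                          .isUnique("respondent_id")     # values must be unique
                          .isContainedIn("state", ["VIC", "NSW", "ACT", "QLD"])
                          .isNonNegative("age"))
                .run())

# Convert the verification result into a Spark DataFrame and print it to the console
survey_df_check_result = VerificationResult.checkResultsAsDataFrame(spark, check_result)
survey_df_check_result.show(truncate=False)
```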

Once the constraint verification process completes, PyDeequ will write a summary report to a path of your choosing; in this case, we're printing the results DataFrame, survey_df_check_result, to the console. This helps you see whether the fixes you made to the data, somewhere in your data pipeline or at the source, have actually worked. PyDeequ also allows you to define a Metrics Repository, which can help you track the quality of your data over time.
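As a rough sketch of what that looks like, here's how a file-system-backed repository could be wired into the verification run from the previous sketch (the file name and tags are illustrative):

```python
from pydeequ.repository import FileSystemMetricsRepository, ResultKey

# Persist computed metrics to a JSON file so data quality can be tracked over time
metrics_file = FileSystemMetricsRepository.helper_metrics_file(spark, "survey_metrics.json")
repository = FileSystemMetricsRepository(spark, metrics_file)
result_key = ResultKey(spark, ResultKey.current_milli_time(), {"dataset": "survey_data"})

# Re-run the suite from the previous sketch, saving its metrics to the repository
check_result = (VerificationSuite(spark)
                .onData(survey_df)
                .useRepository(repository)
                .saveOrAppendResult(result_key)
                .addCheck(check)
                .run())
```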

Different Approaches to Tackling the Data Quality Problem

Deequ is a great way to work with your Spark code, whether written in Scala, Python, or Spark SQL. As long as Deequ receives data in a DataFrame (or a DynamicFrame, in the case of AWS Glue), it will be fine. Note how the three tools we've discussed approach the same problem with different techniques. That's because of two main reasons: the processes these tools support are different, and the types of tests those processes need are different. dbt, as they sell it, is for transformations; Great Expectations is a general-purpose data quality suite; and Deequ is for large-scale data quality checks when your underlying infrastructure is on Spark.

Conclusion

Most of the time, "our data engineering project timelines don't allow us to add data quality checks" passes as a valid excuse. Not reviewing code, not writing tests, and not taking care of data quality won't hurt at the time. However, it will eventually catch up with your project and diminish the return on investment that the business gets from the whole exercise.

So what do you have to do to avoid making that excuse? Push the data quality agenda from the word go. Initially, no one will listen (been there). Keep pushing until one day you get the budget to do it. And if all your efforts fail, forgive everyone, for they know not what they do, and maybe go work at some other place that values and appreciates data for what it's worth.

P.S. — Shoutout to Deepank Chopra, Marat Levit, and Adam Weisser for reviewing and helping out.
