DATA ENGINEERING, AUTOMATION TESTING

Automation Testing for Data Systems

A short introduction to the categories of automation testing for Databases, Data Warehouses, and Data Lakes

Kovid Rathee

Published in

Cognizant Servian

6 min readMar 1, 2021

Testing data systems is painful. Invariably, it turns out to be an overly complicated exercise where you not only have to test the behavior of the SQL queries, ETL scripts, orchestrators, and so on, but you also have to test the data itself. Although testing the data introduces quite a lot of complexity in the testing process, it adds disproportionately to the value derived from the testing effort. That’s what makes it worth doing.

While it seems extremely important to test data systems, not all teams invest in this direction early enough. Rather than having test coverage from inception, most teams start writing tests once all or some systems are already in place. I know it because, in a previous life, I have created systems from the ground up without any tests except the manual unit tests written at the time of writing the code.

Servian’s own David O'Keeffe has written an excellent piece about why data engineering needs automation testing as part of a series of writings about automation testing in data engineering. I’ll be talking more about the broader categories of tests you can write and why. I’ll end by talking about some of the common issues faced while testing data systems. I’ll talk about the different approaches you can take to solve database testing in the upcoming pieces.

Why Great Data Engineering Needs Automated Testing

Automated testing is a necessity in software engineering, so why not in data?

medium.com

White-box

You can see the testing effort from two different points of view — engineering and business. The engineering team building the data systems concentrates more on the system's components, how they interact with each other, the inputs, the outputs, etc. The testing effort would usually involve writing tests for testing the structure and nomenclature of database objects, constraints on tables, referential integrity, view definitions, transformation code, and so on.

It is called white-box testing or glass-box testing because of the complete visibility of the data flow and the different components of the data system.

You usually find support functionality to test things manually in data modeling tools, SQL/Python IDEs, and ETL suites. For instance, while designing the data model, you can enable model validation to check for some of the following things and more:

Unique names of model entities, attributes, relationships
Entities not without attributes
Mandatory primary keys for every entity
Valid data types for the attributes

You can fix syntax issues in your SQL IDE and write custom validation on your SQL code with extensions. That pretty much sums up the necessary structural tests that you can do for databases and data warehouses. You can do similar tests for object-based storage systems used as data lakes. For testing views, materialized views, functions, stored procedures, and triggers, you can generate test tables, SQL queries and execute them on the SQL client. You can find a detailed list of database tests here.

Modern data systems don’t just consist of relational databases and data warehouses. Every new type of data source or target comes with its own set of testing requirements. A time-series database will have different requirements than testing a graph database. In an object-based data lake like S3, you’d probably need to check for the underlying storage format, partition dates, column compression schemes, and so on. I will talk about an exhaustive list of these tests in the upcoming posts.

Integration Testing

If all these database testing resources are available and best practices established, why do teams fail to test databases properly, you may ask. Most of the data engineering teams have a lot of this where developers perform these unit tests manually and move forward. Automation is what lacks when it comes to database testing. Although manual unit testing is a must while writing the code, it is difficult to imagine a team building a scalable and reliable product dependent on manual testing alone. Integration testing usually gets overlooked. Read more about why integration tests are super important here, and for some more wisdom on this, read Martin Fowler’s blog post where he says →

The point of integration testing, as the name suggests, is to test whether many separately developed modules work together as expected.

You’d think that integration idea is necessary for an engineering team to declare that they have tested the system. Unfortunately, that is not always the case. The idea is to invest in database testing early on, not just in unit testing but also in integration testing.

Stress & Load Testing

There are two purposes of load testing and stress testing your databases. First, you want to understand how your systems behave when they are under high load for an extended time, i.e., test the systems' limits by pushing them to the breakpoint. Second, you want to test your system's maximum throughput practically; this helps you understand the internals better and fix problems with database configuration.

In addition to the built-in database tools like mysqlslap, several standard benchmarks and tools are available for load testing, the most famous Sysbench. There are other Sysbench-like tools that other companies have created. These tools help you test throughput, concurrency, contention issues, and more.

From a piece I wrote about the Impact of having (or not having) Primary Keys on insert performance in MySQL. This is just an example command. Don’t use rand() while testing. It slows down the DB. The schema is here.

Let’s now talk about literally the dark side of testing (calling dibs on this for the title of another blog post), i.e., Black-box testing.

Black-box

I talked about seeing testing from two different points of view earlier. We talked about the engineering point of view in white-box testing. When it comes to testing business rules and functionality, we rely on black-box testing. Black-box testing is not limited to functional tests. You can perform unit tests, integration tests, and user acceptance tests in black-box testing, both functional and non-functional. White-box and black-box testing processes have different stakeholders, levels of visibility into the underlying system, and different testing methods.

In black-box testing, you should not care, or be aware, of the data system processing your data. You should only care about your inputs to the system and whether they match the expected output.

A data system might work completely fine from a systems point of view, i.e., the SLAs meet, the jobs run on time, the table definitions are correct, and no tests fail. From an engineering standpoint, everything may seem to be working fine, but you will often find that issues still exist during business functionality tests.

Handing down business requirements is a lossy process. Business and engineering teams speak different languages. To ensure that the minimum amount of information is lost while handing down requirements, it’s good to have superb documentation with decision logs, the scope of the testing exercise, etc.

Some of the examples of black-box testing are:

Testing the output of a stored procedure when you have the privileges to call the procedure but not to view the definition
Testing the metrics generated by a report using a materialized view without having access to the underlying view query
Testing speed of inserts in a database without access to the database or warehouse configuration

I have done a lot of black-box testing as a consumer of the data from systems engineered by other teams. It is a frustrating but necessary undertaking.

Testing data systems is hugely crucial to the success of a product. It is better to invest time in testing sooner rather than later. Like you choose the technology stack, the programming paradigm, you must decide how to go about testing and what tools to use — TDD or BDD, JMeter or Locust, Cucumber or Selenium, etc. I’ll talk more about this with examples in the upcoming posts.

What’s Coming Next?

In this multi-part series about database testing, I’ll talk more about getting started and discussing solutions to specific testing problems with examples. For instance, how to effectively test routines — stored procedures and functions in databases and data warehouses. I’ll also talk about the technology landscape database testing in an upcoming post. For more of my writings, teleport here. If you have any questions or want to discuss database testing, reach out to me on LinkedIn.

We, at Servian, have helped several companies build their data systems from scratch and fix issues with existing infrastructures. You can find more about us here.