The Data Quality Pyramid: A Path to Getting Started with Data Observability
Data quality can be complex but it doesn’t have to be.
What framework do you use?
What use cases do you solve?
How do you manage costs?
If you’re starting your data quality journey, here’s what I’ve learned:
Data Quality is Trust
The whole point of data quality is that consumers can trust the data before acting on it.
To build trust, users need to know the data meets their expectations.
Follow this three step framework:
1. Collect metadata & metrics about your datasets.
2. Validate these against some definition of “healthy”.
3. Make it all visible.
Put a strong emphasis on the visibility step. Users need to know what’s going on with the data they are about to consume. If all they see is data in a BI tool, the best they can do is make assumptions.
Provide visibility to build that trust.
Rethinking the 5 Pillars
Barr Moses broke the Data Engineering internet a few years ago with the 5 Pillars of Data Observability.
It’s a good framework to help users trust their data. Here are the five pillars:
Freshness — is my data up-to-date?
Volume — is my data complete?
Schema — does the schema match expectations?
Lineage — what produced this data upstream?
Quality — is my data within expected ranges?
Since the goal of this article is to get a data quality system implemented fast, let’s rethink the 5 pillars.
Remove Schema
I’d drop schema from this list when starting out.
Failures caused by a bad schema will be noisy (e.g. a KeyError or a failed insert). You don’t need extra observability here. The other pillars need explicit monitoring because they often fail silently.
Of course, if you’re not in a hurry, go ahead and validate schema.
Prioritizing the Remaining Pillars
Quality requires you to read the actual data. This is expensive. It shouldn’t happen for all your datasets, only the most critical.
Freshness, volume, and lineage are all available in the metadata. For data warehouses like Snowflake and Redshift — query their info schemas. For lakehouses, look at what’s available in the Iceberg catalog.
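For example, if you run Spark on Iceberg tables, each table exposes a snapshots metadata table with commit times and record counts. A rough sketch (the table name here is hypothetical):

-- Spark SQL over an Iceberg table: freshness and volume from commit metadata
SELECT
  committed_at,                                -- when the snapshot was committed (freshness)
  operation,                                   -- append, overwrite, delete, ...
  summary['added-records'] AS added_records,
  summary['total-records'] AS total_records    -- volume after the commit
FROM
  analytics.events.snapshots                   -- hypothetical <db>.<table>.snapshots
ORDER BY
  committed_at DESC;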
Starting Pillars (Metadata Only)
Freshness
Volume
Lineage
Advanced Pillar (Read the actual data)
Quality
With this framework in mind, we have a clearer place to focus when starting with data quality.
Start with the Metadata
Make the metadata collection seamless for your users.
Metadata allows you to collect 3 of the 5 data observability pillars. If you have no data quality monitoring across your ecosystem, start with the metadata available to you. It covers a lot of use cases (e.g. everyone wants to know freshness) and costs the least to implement.
Once the metadata is covered, you can add the more complex and expensive DQ metrics for business critical datasets.
Freshness and Volume
In Snowflake, you can query the TABLES view in the Information Schema.
This query retrieves the row count and the last time each table was altered (i.e. a write was attempted) for every table in the given database:
SELECT
  TABLE_CATALOG AS database_name,
  TABLE_SCHEMA AS schema_name,
  TABLE_NAME AS table_name,
  LAST_ALTERED AS last_altered,
  ROW_COUNT AS row_count
FROM
  INFORMATION_SCHEMA.TABLES
WHERE
  TABLE_CATALOG = '<db>';
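Step two is validating this metadata against a definition of healthy. As a minimal sketch, assuming a 24-hour freshness threshold and an expectation that tables are never empty (both thresholds you would tune per dataset), something like this flags unhealthy tables:

SELECT
  TABLE_CATALOG AS database_name,
  TABLE_SCHEMA AS schema_name,
  TABLE_NAME AS table_name,
  LAST_ALTERED AS last_altered,
  ROW_COUNT AS row_count
FROM
  INFORMATION_SCHEMA.TABLES
WHERE
  TABLE_CATALOG = '<db>'
  -- freshness: no write in the last 24 hours
  AND (LAST_ALTERED < DATEADD('hour', -24, CURRENT_TIMESTAMP())
  -- volume: the table is unexpectedly empty
       OR ROW_COUNT = 0);

Schedule it, and anything it returns feeds the visibility layer.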
Lineage
Your workflow orchestrator or SQL history can provide this.
If you use a tool like dbt or Airflow, the DAG is your lineage. Both systems provide ways of visualizing and extracting this information.
If your ecosystem is more complex you can try using the SQL history on your data warehouse to track table write statements. This requires some logic to parse SQL history and build a dependency tree. Or check out a tool like OpenLineage.
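As a rough sketch of the SQL-history approach on Snowflake (the set of QUERY_TYPE values worth including is an assumption to verify against your own history), pull the write statements and parse the target and source tables out of QUERY_TEXT:

SELECT
  START_TIME,
  QUERY_TYPE,
  DATABASE_NAME,
  SCHEMA_NAME,
  QUERY_TEXT          -- parse the target and source tables out of this
FROM
  SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE
  -- statement types that write to tables (assumed list; adjust as needed)
  QUERY_TYPE IN ('INSERT', 'MERGE', 'CREATE_TABLE_AS_SELECT', 'COPY')
  AND START_TIME > DATEADD('day', -7, CURRENT_TIMESTAMP());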
Lineage should be shared with end users too, not just engineers.
Internally Open Source the Quality Metrics
At the quality level of the DQ pyramid, there will be too many use cases for one team to satisfy everyone.
Make it easy for teams to add the custom checks they want. This can be as simple as accepting custom SQL statements and executing them on a schedule for teams. The key is to remove yourself from being a blocker.
The teams producing data know their data better than you. They understand what checks matter, so let them decide what’s worth implementing.
Provide a standard interface to write and execute DQ checks, and let the users do the rest.
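As a sketch of what that interface could accept: a check is just a SQL statement that returns the number of violating rows, which your scheduler runs and compares to zero. The table and rule below are hypothetical examples of what a producing team might register:

-- hypothetical check registered by the orders team:
-- "order_total must never be negative"
SELECT
  COUNT(*) AS violations
FROM
  ANALYTICS.PUBLIC.ORDERS
WHERE
  ORDER_TOTAL < 0;

A result of 0 means the check passes; anything else gets surfaced alongside the freshness and volume metadata.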
Every Journey is Different
Data quality means different things in various contexts.
Start with the metadata and make that seamless for your users.
Then expand to more complex and expensive quality metrics.
And remember, all this is in service of building trust in your data.