Tuesday, January 25, 2022

Cascading Data Quality

Sometimes data flows publish datasets with enough discrepancies to cause downstream impact that is often silent and non-obvious.

Imagine a ramp of your new feature appeared to improve engagement by +25%.  If you're lucky enough to catch the silent issue, you later realize it's actually -10%.  How long have your users been ramped on that feature?  What's the (opportunity) cost?  What if you included that bad metric in your quarterly earnings report or shared it with your investors?

Let's call it what it really is: a failure.  If a data flow is responsible for producing datasets, those datasets must be correct and match expectations.  Propagating bad data downstream is a failure of the flow's responsibility, regardless of what the naive enum status of the flow says: "SUCCEEDED".


Solution 1

A common band-aid is to introduce data quality (DQ) checks on the dataset; if the DQ check fails, fail the flow that produced the dataset.

The problem lies in where the DQ check job typically sits in the flow.

Consider 2 data flows:

  1. Flow 1 writes output dataset D (with bad DQ), then checks DQ on D and fails.
  2. Flow 2 reads input D.

Even though flow 1 failed, D was read by downstreams 😔

Now flow 2 will propagate it further.  The DQ check in flow 1 only lets us know that there's an issue, which is better than having no DQ check at all (a silent cascade), but it's not good enough.
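
To make the ordering concrete, here's a minimal sketch of flows 1 and 2 in plain Python.  The dict-based "public" namespace, the stub jobs, and the engagement check are illustrative stand-ins, not any particular orchestrator, table format, or DQ framework.

    # Toy "public" namespace: anything written here is readable by downstreams.
    public = {}

    def compute_d():
        # Pretend this job produced rows with a discrepancy.
        return [{"user": "a", "engagement": -0.10}]

    def dq_check(rows):
        # Hypothetical check: engagement deltas should be non-negative.
        return all(row["engagement"] >= 0 for row in rows)

    def flow_1():
        # Job 1: write D straight to the public location.  From this moment on,
        # downstreams can read it.
        public["D"] = compute_d()

        # Job 2: the DQ check runs *after* the write.  Failing here marks the
        # flow FAILED, but the bad rows are already exposed.
        if not dq_check(public["D"]):
            raise RuntimeError("DQ check failed on D -- too late, D is public")

    def flow_2():
        # Flow 2 never sees flow 1's status; it reads whatever is in D and
        # propagates the discrepancy into E.
        public["E"] = [dict(row, score=row["engagement"] * 2) for row in public["D"]]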

The impact is massive if D is a "source-of-truth dataset" (near the root of the lineage in your data lake and read by tons of downstreams).  How is this typically resolved?  Fix one of the upstreams, then backfill (retry) each flow in the lineage, layer by layer, like a topological sort.  You never expected to use your LeetCode skills on the job, right?  Thankfully this is a well-defined problem and can be solved with clever automation, saving tons of operational hours.
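
If you do automate that backfill, the heart of it is just a topological order over the lineage graph.  A rough sketch, assuming you can export lineage as a dataset-to-downstream-datasets mapping (the graph below is made up):

    from graphlib import TopologicalSorter  # standard library, Python 3.9+

    # Edges point from an upstream dataset to the datasets that read it.
    lineage = {
        "D": ["E", "F"],
        "E": ["G"],
        "F": ["G"],
        "G": [],
    }

    # TopologicalSorter expects predecessors, so invert the edge direction.
    predecessors = {node: set() for node in lineage}
    for upstream, downstreams in lineage.items():
        for downstream in downstreams:
            predecessors[downstream].add(upstream)

    # Backfill D first, then each layer of its readers.
    backfill_order = list(TopologicalSorter(predecessors).static_order())
    print(backfill_order)  # e.g. ['D', 'E', 'F', 'G']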

It shouldn't be this difficult.  You can prevent downstreams from reading D in the first place.


Solution 1 + 1, one step further

We have to split the "write outputs" job into two:

  1. Privately write outputs, e.g. write the table to a private namespace with restricted permissions.  Think of this as a waiting area until the DQ check comes back good.
  2. Publish outputs to the public, e.g. move the table from the private namespace to a public namespace with permissions readable by downstreams.

The DQ check is moved between the two jobs instead of at the end of the flow.
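
Here's a rough sketch of the split, reusing the same toy dicts to stand in for the private and public namespaces; the real mechanics (namespace permissions, renames, pointer swaps) depend on your storage layer.

    private = {}  # staging namespace only the producing flow can read
    public = {}   # namespace readable by downstream flows

    def dq_check(rows):
        # Hypothetical check, as before.
        return all(row["engagement"] >= 0 for row in rows)

    def flow_1(rows):
        # Job 1: write D privately -- the "waiting area".  Downstreams can't see it.
        private["D"] = rows

        # The DQ check now sits between the private write and the publish.
        if not dq_check(private["D"]):
            raise RuntimeError("DQ check failed on D; D was never published")

        # Job 2: publish.  Ideally this is a cheap metadata move
        # (rename / pointer swap), not a second copy of the data.
        public["D"] = private.pop("D")

This staged write-then-publish approach is often described elsewhere as write-audit-publish.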


Reconsider flows 1 and 2


If the DQ check fails, flow 1 will not publish D, preventing downstream reads of a bad dataset.  If downstreams want to read stale (older) published versions of D, they are free to do so.


If the DQ check succeeds, flow 1 finalizes by publishing D, enabling downstream reads.
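
One way the stale-read behavior falls out naturally: if every publish creates a new immutable version and downstreams resolve the latest published one, a failed DQ check simply means no new version appears.  A small sketch (the date-based versioning is illustrative):

    # Published, immutable versions of D, keyed by partition date.
    published_versions = {
        "D": {
            "2022-01-23": [{"user": "a", "engagement": 0.02}],
            "2022-01-24": [{"user": "a", "engagement": 0.03}],
            # 2022-01-25 failed its DQ check, so it was never published.
        }
    }

    def read_latest(dataset):
        versions = published_versions[dataset]
        latest = max(versions)  # lexicographic max works for ISO dates
        return versions[latest]

    # Flow 2 keeps reading the last good publish (2022-01-24) until flow 1's
    # fixed run publishes 2022-01-25.
    rows = read_latest("D")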

