Thursday, January 20, 2022

The Properties Problem

Owning offline data pipelines is not trivial.  As pipelines get more complex, more layers of abstraction are introduced and layered upon each other.  The knowledge required grows broader, and the knowledge base cannot keep up.  Operating these pipelines complicates things even further.

Offline data jobs can take arbitrary key-value input properties ("input property set"), and can output arbitrary properties ("output property set").  e.g. the date to run a job for: {"date": "2022-01-20"}.

  • J = jobs in a flow, I = input property sets, O = output property sets
  • |I| = |O| = |J|
  • |I| + |O| = 2|J|
  • Each property set has many properties
  • For any job j, its input property set j.i does not need to equal its output property set j.o
    • j.i does not need to equal j.o
  • For any job j, for any property key k in the input property set j.i, the value can differ in the output property set j.o
    • if k in j.i and k in j.o: j.i[k] does not need to equal j.o[k]

That's a lot of properties!  Understanding and operating flows become much more complex than they need to be.  You can pass a job some property and get a completely different property out, or not get the property out at all.  The only way to understand what's happening is to check your expectations in code for each property on each job, or to understand the implementation of each job in full.  Neither is practical at scale.
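
For instance, here is a minimal sketch of the problem in Python.  backfill_job is a hypothetical name; nothing about its signature tells you which keys go in or come out:

  def backfill_job(props: dict[str, str]) -> dict[str, str]:
      # Deep inside the implementation, "date" is consumed and silently
      # replaced by "partition", and "retries" never makes it out at all.
      return {"partition": props["date"].replace("-", "/")}

  out = backfill_job({"date": "2022-01-20", "retries": "3"})
  print(out)  # {'partition': '2022/01/20'} -- "date" and "retries" are gone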

The properties problem is not unique to offline data jobs.  Python, JavaScript, and other dynamically-typed languages suffer from the same problem, and I think offline data as a whole can learn from them.  Python tries to solve it with mypy, JavaScript with TypeScript, and both with libraries built with strong contracts in mind.  The stronger a contract is, the easier it is to understand systematically.  These are steps in the right direction.  Some libraries even go a few steps further with immutability, validations, and transformations, like dataclasses.
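
As a concrete sketch of that direction, Python's standard-library dataclasses give immutability with frozen=True, and __post_init__ is the conventional hook for validation.  RunConfig is an illustrative name:

  from dataclasses import dataclass

  @dataclass(frozen=True)
  class RunConfig:
      date: str
      count: int

      def __post_init__(self) -> None:
          # Validate at construction time, not deep inside a job.
          if self.count < 0:
              raise ValueError(f"count must be non-negative, got {self.count}")

  cfg = RunConfig(date="2022-01-20", count=1)
  # cfg.count = 2  # raises FrozenInstanceError: the instance is immutable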

Offline data can learn from programming languages.  Imagine a dataclass-like job contract for the input and output property sets:
  • Each job outlines its input and output property key-sets.
  • Each property set is a str -> str mapping.
  • Each property can define transformations.
    • e.g. type conversion
      • {"count": "1"} -> {"count": 1}
      • {"date": "2022-01-20"} -> {"date": Date(year=2022, month=01, day=20)}
    • Sometimes, property values are so long that a common workaround is to encode them with something like base64.  e.g.:
      • {"json_config": "eyJkYXRlIjogIjIwMjItMDEtMjAifQ=="} -> {"json_config": {"date": Date(year=2022, month=1, day=20)}}
  • j.i does not need to equal j.o
  • if k in j.i and k in j.o: j.i[k] == j.o[k]  (the value must not change)

Formalities aside, each job can define immutable, typed input & output properties.  Not every pipeline needs such strict contracts, but for complex cases, they can certainly ease understanding and operability.
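
Here is a minimal, hypothetical sketch of such a contract in Python.  BackfillInput, BackfillOutput, read_input, and the transformation helpers are illustrative names, not an existing API; each raw string property is transformed into a typed, immutable value at the boundary:

  import base64
  import json
  from dataclasses import dataclass
  from datetime import date

  def parse_date(raw: str) -> date:
      # {"date": "2022-01-20"} -> date(2022, 1, 20)
      return date.fromisoformat(raw)

  def decode_json_config(raw: str) -> dict:
      # base64-encoded JSON -> a plain dict
      return json.loads(base64.b64decode(raw))

  @dataclass(frozen=True)
  class BackfillInput:
      date: date
      json_config: dict

  @dataclass(frozen=True)
  class BackfillOutput:
      date: date       # "date" appears in both sets; per the contract its value must match
      row_count: int

  def read_input(props: dict[str, str]) -> BackfillInput:
      # The contract pins down exactly which keys are consumed and how each
      # raw string is converted; missing keys fail loudly here (KeyError).
      return BackfillInput(
          date=parse_date(props["date"]),
          json_config=decode_json_config(props["json_config"]),
      )

  inp = read_input({
      "date": "2022-01-20",
      "json_config": "eyJkYXRlIjogIjIwMjItMDEtMjAifQ==",
  })
  print(inp.date)         # 2022-01-20
  print(inp.json_config)  # {'date': '2022-01-20'}

Property mismatches now surface at the job boundary, at construction time, instead of silently somewhere downstream.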

Cascading Data Quality

Sometimes data flows publish datasets with enough discrepancies to cause downstream impact that is often silent and non-obvious. Imagine a ...