Tuesday, January 25, 2022

Cascading Data Quality

Sometimes data flows publish datasets with enough discrepancies to cause downstream impact that is often silent and non-obvious.

Imagine a ramp of your new feature appeared to improve engagement by +25%.  If you're lucky enough to catch the silent issue, you later realize it's actually -10%.  How long have your users been ramped on that feature?  What's the (opportunity) cost?  What if you included that bad metric in your quarterly earnings report or presented it to your investors?

Let's call it what it really is: a failure.  If a data flow is responsible for producing datasets, those datasets must be correct and expected.  Propagating bad data downstream is a failure of the flow's responsibility, regardless of what the flow's naive enum status says: "SUCCEEDED".

Thursday, January 20, 2022

The Properties Problem

Owning offline data pipelines is not trivial.  As pipelines grow more complex, more layers of abstraction are introduced and stacked upon each other.  The required knowledge gets wider, and the knowledge base cannot keep up.  Operating these pipelines complicates things even further.

Offline data jobs can take arbitrary key-value input properties ("input property set") and can output arbitrary properties ("output property set"), e.g. the date to run a job for: {"date": "2022-01-20"}.

  • J = jobs in a flow, I = input property sets, O = output property sets
  • |I| = |O| = |J|
  • |I| + |O| = 2|J|
  • Each property set has many properties
  • For any job j, its input property set j.i does not need to equal its output property set j.o
    • j.i does not need to equal j.o
  • For any job j, for any property key k in the input property set j.i, the value can differ in the output property set j.o
    • if k in j.i and k in j.o: j.i[k] does not need to equal j.o[k]
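The rules above can be made concrete with a couple of toy jobs (the job names and behaviors here are invented purely for illustration):

```python
# A minimal sketch of the properties problem: two hypothetical jobs
# that each take and emit an arbitrary str -> str property set.

def job_a(i: dict) -> dict:
    """Silently rewrites "date" and adds a key the caller never asked for."""
    return {"date": i["date"].replace("-", ""), "partition": "hourly"}

def job_b(i: dict) -> dict:
    """Drops every input property and emits something unrelated."""
    return {"retries": "3"}

i = {"date": "2022-01-20"}
o_a = job_a(i)
o_b = job_b(i)

# j.i does not need to equal j.o:
assert o_a != i and o_b != i
# a shared key k can change value across the job:
assert i["date"] != o_a["date"]
# or a key can vanish entirely:
assert "date" not in o_b
```

Nothing in either signature warns you about any of this; you only find out by reading each body.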

That's a lot of properties!  Understanding and operating flows becomes much more complex than it needs to be.  You can pass a job some property and get a completely different property out, or not get the property out at all.  The only way to understand what's happening is to check your expectations in code for each property on each job, or to understand the implementation of each job in full.  Neither is practical at scale.

The properties problem is not unique to offline data jobs.  Python, JavaScript, and other dynamically-typed languages suffer the same, and I think offline as a whole can learn from them.  Python tries to solve it with mypy, JavaScript with TypeScript, and both with libraries built with strong contracts in mind.  The stronger a contract is, the easier it is to understand systematically.  These are steps in the right direction.  Some libraries even go a few steps further with immutability, validations, and transformations, like dataclasses.
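As a sketch of what a stronger contract buys you, here is the same "property passing" written with and without types (the function names and keys are illustrative):

```python
# Untyped vs. typed property passing.  mypy can verify the typed version;
# with the untyped one, a reader must inspect the body to know anything.

from typing import TypedDict

def run_untyped(props: dict) -> dict:
    # Anything in, anything out.
    return {"date": props["date"], "rows": "100"}

class RunProps(TypedDict):
    date: str
    rows: int

def run_typed(props: RunProps) -> RunProps:
    # The contract lives in the signature; mypy flags a wrong key or type.
    return {"date": props["date"], "rows": props["rows"] + 1}

out = run_typed({"date": "2022-01-20", "rows": 100})
assert out == {"date": "2022-01-20", "rows": 101}
```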

Offline data can learn from programming languages.  Imagine a job contract like dataclasses as the input and output property sets:
  • Each job outlines its input and output property key-sets.
  • Each property set is a str -> str mapping.
  • Each property can define transformations.
    • e.g. type conversion
      • {"count": "1"} -> {"count": 1}
      • {"date": "2022-01-20"} -> {"date": Date(year=2022, month=1, day=20)}
    • Sometimes, properties are so long that a common workaround is to encode them with something like base64.  e.g.:
      • {"json_config": "eyJkYXRlIjogIjIwMjItMDEtMjAifQ=="} -> {"json_config": {"date": Date(year=2022, month=1, day=20)}}
  • j.i does not need to equal j.o
  • if k in j.i and k in j.o: j.i[k] = j.o[k]

Formalities aside, each job can define immutable typed input & output properties.  Not every pipeline needs such strict contracts, but certainly for complex cases, it can ease understanding and operability.
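In Python, such a contract could be sketched as a frozen dataclass that declares the expected keys and applies the transformations above to the raw str -> str property set (the job name, field names, and helper are invented for illustration):

```python
import base64
import json
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)  # frozen=True makes the parsed properties immutable
class AggregateInputs:
    """Hypothetical input contract for an "aggregate" job."""
    run_date: date
    count: int
    json_config: dict

    @classmethod
    def from_properties(cls, props: dict) -> "AggregateInputs":
        return cls(
            # {"date": "2022-01-20"} -> date(2022, 1, 20)
            run_date=date.fromisoformat(props["date"]),
            # {"count": "1"} -> 1
            count=int(props["count"]),
            # base64-encoded JSON -> dict
            json_config=json.loads(base64.b64decode(props["json_config"])),
        )

raw = {
    "date": "2022-01-20",
    "count": "1",
    "json_config": base64.b64encode(b'{"date": "2022-01-20"}').decode(),
}
inputs = AggregateInputs.from_properties(raw)
assert inputs.run_date == date(2022, 1, 20)
assert inputs.count == 1
assert inputs.json_config == {"date": "2022-01-20"}
# Mutation like `inputs.count = 2` raises dataclasses.FrozenInstanceError.
```

A missing key fails loudly at parse time (KeyError) instead of silently somewhere downstream, which is exactly the property the contract is meant to buy.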

Tuesday, January 11, 2022

SRE is not scalable.

SRE is not scalable. Hiring dedicated engineers to fix scalability concerns is not scalable.

The constant tug-of-war between SWE & SRE is tiresome.  It drains energy.  It demotivates masses of people on both ends of the rope.  It kills "the organization".

It does not have to be this way.  SWE & SRE are ideally partners in crime.  They help each other in dire situations, in long-term visions, and in the missions to accomplish it all.  They learn and grow from each other; their dimensions are unioned to manifest something much greater.

So why is there this perpetual game?

  • SWE pleases the needs of X with software.
  • SRE pleases the needs of X with reliable software.
  • X can be users, customers, the market, whatever.

This differentiation is the root of the conflict.  Once you, yes you the reader (or perhaps your organization), intend on hiring a dedicated engineer to fix someone else's problem, you have lost sight of the forest for the trees.

You are losing accountability.  You are robbing growth opportunities.  You are making it amply clear that unreliable software can be fixed by hiring more people: a management solution to an engineering problem.  The core responsibility of SWEs is to write (and perhaps they already wrote) the software in the first place.  The software to help your customers.  The same software to reliably help your customers.

How do you fix this?

Leadership, across all levels, needs to agree that short-sighted feature delivery is not the priority.  Long-term customer value is the priority.  What produces long-term customer value?  Beneficial features, performance, security, privacy, and much more, consistently ... reliably.  Furthermore, agreement is not enough.  It needs to be encouraged and rewarded through incentive structures like more ownership, more compensation, and more career growth (e.g. promotions).

The opposite, however, like writing unreliable software, should not be punished.  It is neutral at worst, and slightly positive if taken as a growth opportunity to write reliable software in the future.  The last environment you want to accidentally foster is one of fear, killing all creativity and trust in "the organization".

Hiring an SRE is not scalable.  Teach your engineers to write reliable software instead, and teach your leadership to inspire the same.  SREs are supposed to evangelize such practices and help develop people's expertise in reliability.  SREs are supposed to work on core reliability fundamentals.  They are not supposed to solely fix other people's problems.
