
It is time to shift data left

Thursday, July 17, 2025, 11:00, by InfoWorld
When I started in this industry, there was a bad but common practice. Each new project started with a database schema. The DBAs, or whoever was in charge of Oracle, would have a long discussion and many meetings, and then you, the developer, would be “blessed” with a schema. The schema was usually a bit wrong and inefficient, and didn’t match what you were doing, so you wrote inefficient queries to work around it until you got yelled at and everyone agreed to fix things. This changed with object-relational mapping tools like Hibernate in Java and Entity Framework in .NET. It changed more seriously when we moved to “schema on read,” first with Hadoop and later with Amazon S3 and Parquet files.

The old system was slow and painful, but it did protect against unexpected change. The modern system empowers data producers to change but disempowers the people whose job it is to provide stability. Most organizations have some data platform team whose job is to provide omniscience despite being woefully outnumbered. That might sound like a good deal for developers: all of the power, none of the responsibility, and a team of people who are there to take the whipping. However, it doesn’t work out that way. As a developer, you’re either breaking downstream data systems, including the fancy new AI system, or you’re so afraid of breaking things that you move too slowly.

When data ownership moves upstream

Consider this story. Jez, a senior engineer on the Support Platform team, spots this payload:

{
  "ticket_id": "zendesk:004123",
  …
}

Ten years ago, the company migrated from FogBugz to Zendesk and prefixed every legacy ticket ID with the system name to avoid collisions. FogBugz is long archived, but every row is still written with the non-numeric prefix zendesk:.

Commit comment: Why keep eight redundant characters?

Jez revises one line:

- ticket.ticket_id = 'zendesk:' + source_system_ticket_id;
+ ticket.ticket_id = source_system_ticket_id;

Unit tests pass, the code builds locally, but the check-in fails instantly:

❌ CONTRACT-CHECK FAILURE

Field “ticket_id” no longer matches ^zendesk:\d+$

Breaks:

 finance.dashboard.ticket_volume

 ml.model.ticket_attribution

A quick revert, another push, and the build turns green. Total time lost: 30 seconds.

You see, Jez couldn’t check the code in because it violated a data contract. A Git action automatically detected that the code would violate the contract and break a downstream data system.
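
What does such a gate look like mechanically? Here is a minimal sketch in Python, assuming a hypothetical hand-written contract table that pairs a field with its regex and its downstream consumers; a real platform derives this from the code base and lineage metadata, and every name here is illustrative only.

import re
import sys

# Hypothetical, hand-written contract. A real platform would derive this
# from source repositories and lineage metadata, not from a dict in a script.
CONTRACTS = {
    "ticket_id": {
        "pattern": r"^zendesk:\d+$",
        "consumers": [
            "finance.dashboard.ticket_volume",
            "ml.model.ticket_attribution",
        ],
    },
}

def check_record(record: dict) -> list[str]:
    """Return one violation message per field that breaks its contract."""
    violations = []
    for fld, contract in CONTRACTS.items():
        value = str(record.get(fld, ""))
        if not re.match(contract["pattern"], value):
            violations.append(
                f'Field "{fld}" no longer matches {contract["pattern"]} '
                f'(breaks: {", ".join(contract["consumers"])})'
            )
    return violations

if __name__ == "__main__":
    # A sample record produced by the changed code path: prefix stripped.
    problems = check_record({"ticket_id": "004123"})
    for problem in problems:
        print("❌ CONTRACT-CHECK FAILURE:", problem)
    sys.exit(1 if problems else 0)  # a nonzero exit blocks the merge in CI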

What would have happened without the data contract?

An ETL job writes the trimmed IDs to a Parquet file.

The attribution model silently drops 40% of new records (the sketch after this list shows how).

Compliance rules forbid the deletion of any data, so engineers craft shims for both formats.

A 30-second opportunistic optimization balloons into a week-long painful campaign.

Everyone remembers Jez as the one who broke everything with one line.
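
Why silently? A minimal Python sketch of the downstream failure mode, with hypothetical names: the consumer filters on the contracted format and quietly skips anything that no longer matches.

import re

# Hypothetical downstream consumer: the attribution job keeps only rows
# whose ticket_id matches the contracted format and skips the rest.
TICKET_ID = re.compile(r"^zendesk:(\d+)$")

def extract_ticket_numbers(rows: list[dict]) -> list[str]:
    numbers = []
    for row in rows:
        match = TICKET_ID.match(row.get("ticket_id", ""))
        if match:
            numbers.append(match.group(1))
        # no else branch: non-matching rows vanish without an error
    return numbers

rows = [
    {"ticket_id": "zendesk:004123"},  # legacy format: kept
    {"ticket_id": "004124"},          # post-change format: silently dropped
]
print(extract_ticket_numbers(rows))  # ['004123'] -- one record lost, no error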

This is a real story, but the names and details have been changed to protect the guilty. If you want a public story, consider the Mars Climate Orbiter, which was lost in the Martian atmosphere, along with hundreds of millions of dollars, because one piece of software reported thruster data in US customary units while another expected metric units.

A shift-left tool kit for data

A growing movement reframes the problem: data is code. Every record begins life in application logic: a TypeScript event, a Java entity, a Python variable. If code produces the data, the correct place to assert expectations is inside the code base, not downstream. Chad Sanderson highlights this principle in the “Shift Left Data Manifesto,” arguing that the only sustainable path to trustworthy AI and analytics is to treat data changes in the same manner as software changes.

Once you accept that data is code, the familiar shift-left tool kit applies:

Static analysis parses application code to identify data-producing structures before execution.

Data contracts define shape, semantics, lineage, and ownership, which are checked automatically in continuous integration (CI).

Change-impact analysis warns developers when an innocuous refactor will break a machine learning feature downstream.

Policy as code for governance evaluates compliance rules (PII handling, retention) at build time rather than audit time; a minimal sketch follows this list.
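
As an illustration of that last item, here is a minimal policy-as-code sketch in Python. The dataclasses, field names, and the single rule are hypothetical; the point is that the build fails when a declared schema violates a compliance rule, long before an auditor sees it.

from dataclasses import dataclass, field

# Hypothetical schema declarations, versioned alongside the application code.
@dataclass
class FieldSpec:
    name: str
    pii: bool = False
    retention_days: int | None = None

@dataclass
class TableSpec:
    name: str
    owner: str
    fields: list[FieldSpec] = field(default_factory=list)

def evaluate_policies(table: TableSpec) -> list[str]:
    """Flag compliance violations at build time, not audit time."""
    findings = []
    for f in table.fields:
        if f.pii and f.retention_days is None:
            findings.append(f"{table.name}.{f.name}: PII field has no retention policy")
    if not table.owner:
        findings.append(f"{table.name}: table has no declared owner")
    return findings

tickets = TableSpec(
    name="support.tickets",
    owner="support-platform-team",
    fields=[
        FieldSpec("ticket_id"),
        FieldSpec("requester_email", pii=True),  # missing retention_days
    ],
)

for finding in evaluate_policies(tickets):
    print("POLICY VIOLATION:", finding)  # any finding fails the build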

New software platforms are emerging that make this change possible, such as Sanderson’s Gable. (Full disclosure: I’ve done some consulting for Gable.) By scanning source repositories, the engine identifies tables, events, or documents that the code will create. It then drafts contracts, maps downstream dependencies, and blocks merges that violate declared expectations. Crucially, notifications target the same developer who opened the pull request, aligning accountability with control and moving “change management into the hands of application developers.”

This upstream enforcement mirrors how static application security testing (SAST) tools pushed security fixes earlier. Now, a minor change to some data that a critical system depends on is caught before the feature branch merges.

Jez wasn’t reckless; Jez trimmed eight bytes. Without a data contract, that micro-optimization would have cascaded into a multi-team outage. With shift-left checks, it became a harmless 30-second blip, just as unit tests and SAST transformed quality and security.

Quality shifted left. Security shifted left. Data is next. When contracts join the CI gate, teams finally ship faster and sleep better, no eight-byte surprises required.
https://www.infoworld.com/article/4023622/it-is-time-to-shift-data-left.html

