The Case for Risk-Based Data Quality

Gleb Mezhanskiy spent years building tools to make business data clean. In March 2026, the Datafold CEO told his audience the effort never paid off the way software monitoring did for companies like Datadog. His argument is now reshaping how data leaders define quality going into the back half of 2026.

A Vendor CEO Calls the Industry a Disappointment

Mezhanskiy laid out the case in a March 5 post titled “Data Engineering in 2026: 12 Predictions.” Prediction eleven argues data teams will stop chasing data quality because AI agents care about context instead. Years of funding and engineering effort, he wrote, never produced a breakout success comparable to Datadog’s rise in software monitoring. Data quality, in his telling, moved from a line item on annual goals to something teams handle on a best-effort basis.

The claim deserves a caveat before it travels any further. Mezhanskiy runs one company in a crowded field, and his view reflects a single vendor’s vantage point, not an industry consensus. Monte Carlo reported raising $236 million, while Bigeye reported $73.5 million in total funding. Datafold separately announced a $20 million Series A. Together, the three companies disclosed at least $329.5 million in funding, spread across quality, reliability, and observability rather than one tidy category. The label of failure depends on which yardstick gets used, and Mezhanskiy picked a yardstick favoring his prediction.

Why Data Resists the Software Playbook

Mezhanskiy’s strongest point has nothing to do with funding rounds. He argues data is harder to test than software because ground truth keeps shifting. A login either succeeds or it doesn’t. An “active user” can mean three different things depending on whether marketing, product, or finance is asking, and no amount of column-level testing settles a disagreement over definitions. Add more alerts to a noisy pipeline and the value of each new alert drops fast.

The comparison holds up better as a spectrum than a hard line. Security teams chase ambiguous signals every day, and plenty of AI-driven software ships with results nobody can verify with certainty. Software ground truth is not always as clean as Mezhanskiy’s framing suggests. What sets data apart is the scale of the ambiguity: a single warehouse can hold dozens of conflicting definitions for the same business concept, and a software team rarely faces so many forks in the same afternoon.

What AI Agents Need

Here is the part of Mezhanskiy’s argument worth taking seriously even with the caveats attached. An agent pulling from a warehouse needs more than a validated column. It needs lineage showing where a number came from, the transformation logic behind it, documentation explaining why a fallback table exists, and an ontology connecting business entities like customer, order, and product. Mezhanskiy calls the combination a context graph, and Datafold now sells one alongside its standard quality tools.

Worth saying directly: Mezhanskiy is not a neutral narrator. His company profits if buyers shift spending from quality monitoring toward the context layer his prediction describes, and the financial stake doesn’t vanish just because the underlying argument is reasonable. The argument still has a limit: context helps an agent interpret a number correctly, but it doesn’t turn a corrupt, stale, or biased number into a safe one. Lineage tells an agent where data came from, not whether the data deserved trust in the first place.

What Risk-Based Data Quality Looks Like in Practice

The most useful evidence in this debate is not a prediction. It’s a pattern already showing up in how teams build data contracts. The Open Data Contract Standard, published by Bitol under Apache 2.0 through the LF AI & Data Foundation, defines a vendor-neutral YAML format covering schemas, quality rules, ownership, support channels, and service levels. Teams use the standard to formalize what a dataset promises, without locking into one company’s platform.

A handful of working habits separate teams getting value from contracts from teams adding paperwork:

Put accountability on the team producing a dataset, not the team consuming it three pipelines downstream.
Store contracts as code in version control rather than as a slide deck nobody opens again.
Run checks in CI or in the pipeline itself, so a violation gets caught before it reaches a dashboard or an agent.

Monte Carlo’s guidance tells customers to keep contracts lightweight and aimed at pipelines carrying real business weight, rather than every table in the warehouse. Soda and Atlan support the same pattern through YAML files, Git workflows, and rule enforcement, and neither positions its tooling as a requirement: each treats automation as a convenience layered on top of a discipline a team could run with a text editor and a CI pipeline.

The Counterevidence

Market researchers don’t support a collapse story for data-quality spending. Mordor Intelligence estimates the data-quality tools market will grow from $3.27 billion in 2026 to $7.39 billion by 2031, a 17.7 percent compound annual growth rate. Treat the figure as a commercial estimate rather than an audited total. Different research firms define the category differently and land on numbers disagreeing with one another, which is normal for a market this fragmented and says more about inconsistent definitions than about the underlying trend.
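The quoted growth rate is easy to sanity-check. The sketch below applies the usual compound annual growth formula to the two market figures in the estimate, assuming the five-year span from 2026 to 2031; the function name is illustrative only.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10 percent)."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures from the estimate: $3.27 billion in 2026, $7.39 billion by 2031.
rate = cagr(3.27, 7.39, 2031 - 2026)
print(f"{rate:.1%}")  # prints 17.7%
```

The arithmetic is internally consistent with the stated 17.7 percent, which says nothing about whether the underlying market sizing is right.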

The safer read: spending keeps rising while the definition of quality gets broader. No evidence reviewed for this piece shows budgets moving from quality monitoring into context graphs. Joe Reis’s 2026 State of Data Engineering Survey, drawn from 1,101 practitioners over two weeks in late 2025, makes a simple point: quality has not slipped down anyone’s list of worries. Thirty-four percent of respondents named data quality or reliability as a major drain on team time, and just over ten percent called it their single biggest organizational bottleneck. Respondents skew senior and concentrated in North America and Europe, and Reis describes the percentages as indicative rather than definitive. Even with caveats, the survey points toward teams stretched thin by quality work, not teams walking away from it.

A Framework for Tiering Data Quality

The practical move is not to pick a side between Mezhanskiy’s prediction and the survey data. It’s to stop treating every dataset like it deserves the same level of scrutiny. A four-tier model gives data leaders a starting point for deciding where strict contracts belong and where lighter documentation will do.

Tier 0 covers revenue and regulatory critical data: billing systems, financial reporting feeds, and compliance submissions. Each dataset here gets a formal contract, automated checks running in CI, a named owner, and an on-call page when something fails.

Tier 1 covers customer and product critical data: dashboards customers see directly, metrics executives report externally, and machine learning features feeding customer-facing predictions. Each dataset still gets a formal contract, with scheduled checks and an alert routed to an owner, though without paging anyone at 2 a.m.

Tier 2 covers internal and operational data: ad hoc reporting, internal analytics, and experimentation tables. Lightweight documentation and preserved lineage matter more than a formal contract here, since a mistake stays contained inside one team.

Tier 3 covers exploratory data: one-off exports, scratch tables, and prototype datasets. No contract applies, no quality guarantee exists, and each dataset carries a clear label saying so.

Three questions place most datasets correctly.

Would a wrong number trigger financial loss, legal exposure, or a regulatory filing problem? Tier 0.
Does the dataset feed a customer-facing surface or a metric reported outside the company? Tier 1, unless the financial or regulatory exposure already pushed it to Tier 0.
Does more than one team rely on the dataset for decisions, without any external or regulatory stakes attached? Tier 2.
Anything left over, one-off exports and prototypes included, defaults to Tier 3.

Once a dataset earns a contract, the document needs six fields, regardless of format:

Schema and data types for every field a consumer might touch, with nullable fields and expected ranges spelled out.
Freshness and availability targets stated as a number, not a description: updated within four hours, available 99.5 percent of business days.
Quality thresholds and the checks enforcing them: completeness, uniqueness, and any business rule specific to the dataset.
A named producer team, a named consumer team, and an escalation path for when the two disagree.
A change management process describing how schema changes get announced and how long consumers get to adapt.
A support channel, stated by name, where a consumer reports a problem and gets a response time commitment.

For an illustrative example, picture a subscription company assigning its monthly recurring revenue table to Tier 0. The six fields might read:

Schema: customer_id (string, not null), mrr_amount (decimal, zero or greater), billing_period (date).
Freshness: updated within four hours of each billing run.
Quality checks: completeness at 99.9 percent or higher, uniqueness enforced on customer_id plus billing_period.
Ownership: the Billing Platform team produces the table, Finance Reporting consumes it, and disputes escalate to the on-call data engineer within 15 minutes.
Change management: schema changes get announced two weeks ahead in the #data-contracts channel.
Support: a named inbox commits to a response within one business day.
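The machine-checkable parts of this illustrative contract (freshness, completeness, uniqueness, and the non-negative amount rule) could be enforced with something as small as the sketch below. It assumes rows arrive as plain dictionaries and that the check runs once the four-hour window has closed; the function and constant names are hypothetical, and a real team would more likely express these rules in its contract tooling.

```python
from datetime import date, datetime, timedelta
from decimal import Decimal

# Thresholds taken from the illustrative contract; names are hypothetical.
MIN_COMPLETENESS = 0.999
MAX_STALENESS = timedelta(hours=4)
REQUIRED_FIELDS = ("customer_id", "mrr_amount", "billing_period")


def check_mrr_contract(rows, last_updated, billing_run_finished):
    """Return a list of violations; an empty list means every check passed."""
    violations = []

    # Freshness: updated after the billing run and within four hours of it.
    deadline = billing_run_finished + MAX_STALENESS
    if not (billing_run_finished <= last_updated <= deadline):
        violations.append("freshness: not updated within 4 hours of billing run")

    # Completeness: share of rows with every required field populated.
    # An empty table counts as 0.0 and therefore fails.
    complete = sum(all(r.get(f) is not None for f in REQUIRED_FIELDS) for r in rows)
    completeness = complete / len(rows) if rows else 0.0
    if completeness < MIN_COMPLETENESS:
        violations.append(f"completeness: {completeness:.4f} below {MIN_COMPLETENESS}")

    # Uniqueness on customer_id plus billing_period.
    keys = [(r.get("customer_id"), r.get("billing_period")) for r in rows]
    if len(keys) != len(set(keys)):
        violations.append("uniqueness: duplicate (customer_id, billing_period)")

    # Schema rule: mrr_amount must be zero or greater.
    if any(r.get("mrr_amount") is not None and r["mrr_amount"] < 0 for r in rows):
        violations.append("schema: negative mrr_amount")

    return violations


# Usage: two identical rows trip the uniqueness check and nothing else.
run = datetime(2026, 1, 31, 2, 0)
sample = [
    {"customer_id": "c-1", "mrr_amount": Decimal("49.00"), "billing_period": date(2026, 1, 1)},
    {"customer_id": "c-1", "mrr_amount": Decimal("49.00"), "billing_period": date(2026, 1, 1)},
]
print(check_mrr_contract(sample, run + timedelta(hours=1), run))
```

Ownership, change management, and support stay as prose in the contract; only the first three fields lend themselves to automated checks in CI.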

A scratch table feeding a one-off cohort analysis needs none of this. The cost of writing six fields for every table in the warehouse is exactly why most contract programs stall, and tiering exists to keep the cost pointed at the data where it pays for itself.

Data leaders tracking this model should watch incident impact, detection time, false-alert volume, and contract violations by tier, rather than a single company-wide quality score hiding where the real damage happens. A Tier 0 violation and a Tier 3 violation aren’t the same event, and a dashboard treating them equally will bury the signal leaders need most.
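Reporting by tier can be as simple as grouping before counting. The sketch below uses a made-up incident log, with hypothetical table names and event kinds, to show the shape of a per-tier summary.

```python
from collections import Counter

# Hypothetical incident log: (dataset name, tier, kind of event).
events = [
    ("billing.mrr", 0, "contract_violation"),
    ("growth.scratch_cohort", 3, "contract_violation"),
    ("growth.scratch_cohort", 3, "false_alert"),
    ("product.dashboard_usage", 1, "false_alert"),
]


def counts_by_tier(events):
    """Count events per (tier, kind) so Tier 0 problems are not averaged away."""
    return Counter((tier, kind) for _name, tier, kind in events)


summary = counts_by_tier(events)
for (tier, kind), n in sorted(summary.items()):
    print(f"Tier {tier} {kind}: {n}")
```

A single company-wide count would report two contract violations here; the per-tier view shows that only one of them touches revenue data.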

Where Tiering Breaks Down

Two failure modes show up almost immediately once a team adopts a tier model, and neither involves the framework itself.

The first is tier creep. Every team believes its data matters most, and a model with no enforcement mechanism drifts toward labeling everything Tier 0 within a year. The fix is procedural rather than technical: route Tier 0 nominations through finance, legal, or whichever function owns the regulatory exposure, and require a stated dollar figure or compliance citation before a dataset earns the label.

The second failure mode connects directly back to Mezhanskiy’s argument. A tier assignment lives in a person’s head or a wiki page unless someone writes it into metadata an agent or a query engine can read. An AI agent pulling from a warehouse has no way to know a table is a Tier 3 scratch dataset unless the assignment travels with the table itself, through tags, a catalog entry, or the context graph Mezhanskiy’s company sells. Skip the tagging step, and an agent can pull from an unreviewed prototype table to answer a question belonging to Tier 0. The result turns a low-stakes dataset into a high-stakes decision, and nobody notices until something breaks.
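One way to picture the tagging step is a guard an agent runs before reading a table. The sketch below assumes a simple in-memory mapping from table name to tier; the catalog contents, the exception class, and the function name are all hypothetical stand-ins for whatever tags or catalog a team already has.

```python
# Hypothetical catalog: table name -> tier tag written by the data platform team.
CATALOG_TIERS = {
    "billing.mrr": 0,
    "product.dashboard_usage": 1,
    "growth.scratch_cohort": 3,
}


class TierPolicyError(Exception):
    """Raised when a table is not trusted enough for the question being asked."""


def require_tier(table: str, question_tier: int, catalog=CATALOG_TIERS) -> int:
    """Allow a read only if the table's tier is at least as strict as the question's."""
    table_tier = catalog.get(table)
    if table_tier is None:
        # Untagged tables are treated as unreviewed, not silently trusted.
        raise TierPolicyError(f"{table} has no tier tag")
    if table_tier > question_tier:
        raise TierPolicyError(
            f"{table} is Tier {table_tier}, too weak for a Tier {question_tier} question"
        )
    return table_tier
```

Lower numbers mean stricter guarantees, so a Tier 0 table can answer a Tier 1 question but a Tier 3 scratch table cannot answer a Tier 0 one. Refusing untagged tables is a design choice in this sketch, made so that a skipped tagging step fails loudly.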

Tiers also need a review cadence, since a dataset’s risk profile rarely stays fixed. A cohort analysis built for one board meeting can turn into a recurring metric a CFO quotes externally within two quarters, at which point the dataset has quietly moved from Tier 3 to Tier 1 without a contract ever attached to it. A quarterly re-tiering review, owned by whoever runs the data platform, catches the drift before a metric goes external.

Data teams aren’t giving up on quality. They’re admitting universal coverage was always a fiction, and the fiction grew more expensive once AI agents started running on the same warehouses humans used to babysit by hand. Strong programs in 2026 will decide, in writing, where bad data does real damage, defend that ground hard, and leave enough context behind for people and machines to handle everything else with open eyes.
