Materialized Lake Views in Microsoft Material: When Your Medallion Matches in a SELECT Assertion

Making a PDF’s Pictures Searchable for RAG, With out Paying to Learn Them All

Python 3.14 and its New JIT Compiler

time, constructing a medallion structure in Microsoft Material meant stitching collectively a small orchestra of transferring elements: notebooks for the transformations, pipelines for orchestration, schedules for refresh, customized code for information high quality checks, and the Monitor Hub for keeping track of whether or not something really labored. Each layer labored – till one thing didn’t, and then you definately had to determine which layer broke, why, and which downstream layers received affected alongside the way in which.

In case you’ve ever tried to debug a silver layer that didn’t replace as a result of the bronze pocket book failed three hours in the past, precisely what I’m speaking about.

Then, at FabCon Atlanta in March 2026, materialized lake views (MLVs) went usually accessible. And the story they’re telling is easy: what in case your whole medallion pipeline might be a number of SELECT statements?

Let me stroll you thru the entire thing – what they’re, how they work, what modified between preview and GA, and the place they match (and the place they don’t) in your structure.

Materialized Lake View – WHAT?

A materialized lake view is a continued, mechanically refreshed view outlined in Spark SQL or PySpark. You write a SELECT question that describes the transformation you need, and Material takes care of execution, storage, refresh, dependency monitoring, and information high quality enforcement.

The result’s saved as a Delta desk in your lakehouse. So downstream customers, equivalent to Energy BI Direct Lake, Spark notebooks, SQL endpoints, can question it similar to another Delta desk. No particular dealing with, no completely different syntax.

To place it in plain English: an MLV is nothing else however a SELECT assertion that realized to materialize itself, handle its personal dependencies, schedule its personal refresh, and examine its personal information high quality.

OK, that’s good. However what does that truly change?

That’s a good query. Earlier than MLVs, constructing a single bronze-to-silver-to-gold circulate appeared roughly like this: you’d write a pocket book for every transformation, arrange a Knowledge Manufacturing facility pipeline to name them in the precise order, configure schedules, construct customized validation logic, after which wire up the Monitor Hub to observe for failures. 5 completely different surfaces, 5 various things to debug when one thing breaks.

With MLVs, all of that collapses into declarative SQL. You describe what you need. Material figures out the remaining.

The 4 levels of an MLV’s life

Each MLV strikes via 4 levels. In accordance with the Microsoft documentation, understanding them is the muse for every little thing else:

Create – You write the Spark SQL (or PySpark) that defines the transformation. Material shops the definition and materializes the preliminary end result as a Delta desk.
Refresh – When supply information modifications, Material chooses the optimum technique: incremental (course of solely modifications), full (rebuild), or skip (no modifications detected).
Question – Any software or software reads the materialized end result. They don’t know – and don’t must know – that it’s an MLV.
Monitor – Refresh historical past, execution standing, information high quality metrics, and lineage are all tracked and visualised natively in Material.

Now let’s dive into every bit.

Create: the syntax

Right here’s the total Spark SQL pseudo-code syntax for creating an MLV, straight from the Microsoft Be taught reference:

CREATE [OR REPLACE] MATERIALIZED LAKE VIEW [IF NOT EXISTS]

[workspace.lakehouse.schema].MLV_Identifier

[(CONSTRAINT constraint_name CHECK (condition) [ON MISMATCH DROP | FAIL], ...)]

[PARTITIONED BY (col1, col2, ...)]

[COMMENT “description”]

[TBLPROPERTIES (”key1”=”val1”, ...)]

AS select_statement

An actual instance – cleansing order information joined from merchandise and orders, with an information high quality constraint and partitioning:

CREATE OR REPLACE MATERIALIZED LAKE VIEW silver.cleaned_order_data

(

CONSTRAINT valid_quantity CHECK (amount > 0) ON MISMATCH DROP

)

PARTITIONED BY (class)

COMMENT “Cleaned order information joined from merchandise and orders”

AS

SELECT

p.productID, p.productName, p.class,

o.orderDate, o.amount, o.totalAmount

FROM bronze.merchandise p

INNER JOIN bronze.orders o ON p.productID = o.productID

Two issues value flagging instantly. First, MLV names are case-insensitive (MyView turns into myview). Second, all-uppercase schema names (like MYSCHEMA) aren’t supported, so use both blended or lowercase.

You additionally want a schema-enabled lakehouse and Material Runtime 1.3 or increased. In case your lakehouse doesn’t have schemas turned on, MLVs aren’t accessible, that’s the very first prerequisite.

Refresh: the mind of MLVs

Right here’s the place MLVs cease being intelligent and begin being sensible.

When supply information modifications, Material’s optimum refresh engine appears at each MLV within the lineage and asks a sequence of questions: Did something really change? Can I course of simply the modifications? Or do I must rebuild from scratch?

Three doable outcomes:

Skip refresh – supply information hasn’t modified. Don’t waste compute. Transfer on.
Incremental refresh – course of solely the brand new or modified rows. Quick, low cost, excellent.
Full refresh – rebuild the entire thing. Slowest path, used when incremental isn’t protected or doable.

However, and that is vital, incremental refresh isn’t free. It has conditions:

The Delta change information feed (CDF) should be enabled on each supply desk referenced by the MLV (delta.enableChangeDataFeed=true).
The supply should be a Delta desk. Non-Delta sources all the time get a full refresh.
The information should be append-only. In case your supply has updates or deletes, Material falls again to a full refresh.
The question should use solely supported SQL constructs (extra on this in a second).

With out CDF enabled, optimum refresh can solely select between skip and full. With CDF on, the total incremental path opens up. Enabling CDF in your supply tables has no measurable storage or efficiency impression for append-only workloads, so there’s little or no cause to not flip it on:

ALTER TABLE bronze.orders SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

ALTER TABLE bronze.merchandise SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

Can it get higher than this? Really, sure! And, that is the place the GA story actually begins.

What’s new within the Normal Availability section?

MLVs had been launched in preview at Construct 2025. Between then and GA in March 2026, Microsoft closed a very powerful gaps. 5 main modifications turned MLVs from “attention-grabbing” into “production-ready”:

Multi-schedule assist
Broader incremental refresh protection
PySpark authoring (preview)
In-place updates with Change
Stronger information qc

Let me take them separately.

1. Multi-schedule assist

In preview, you can solely refresh all MLVs in a lakehouse on a single schedule. Want finance to replace hourly, however analytics to replace each six hours? You needed to work round it with notebooks, which broke dependency consciousness, error reporting, and retry logic. Pocket book-triggered refreshes don’t floor MLV error particulars. Failures seem solely in cell output, and dependent views haven’t any consciousness of them. Errors can persist week after week with out anybody understanding the pipeline is damaged.

Now you may outline named schedules inside a single lakehouse, every concentrating on a selected subset of views. Finance pipeline hourly. Analytics each six hours. Advertising each quarter-hour. All in the identical lakehouse, no customized code.

When a named schedule runs, Material nonetheless refreshes all upstream dependencies within the right order, runs impartial views in parallel, and surfaces errors centrally. If a run is already in progress when a schedule fires, the brand new run is skipped, and the subsequent window proceeds as anticipated – so that you don’t have to fret about overlapping runs stomping on one another.

2. Broader incremental refresh

Incremental refresh used to fall again to full very often, as a result of the record of “supported” SQL constructs was slim. At GA, that record expanded considerably. MLVs now refresh incrementally when the definition consists of:

Aggregations like COUNT and SUM with GROUP BY
Left outer joins and left semi joins
Frequent desk expressions (CTEs)

That’s a significant change. Most real-world medallion pipelines I’ve labored on use precisely these patterns, and now they qualify for incremental processing with out being rewritten. With optimum refresh, a built-in resolution engine examines every refresh, evaluates the amount of modified information in opposition to the price of a full recomputation, and mechanically chooses the quicker path.

I hear you, I hear you: Nikola, what occurs if my question makes use of one thing the engine can’t deal with incrementally? Don’t fear, it’s a lot simpler than it sounds:) Utilizing unsupported constructs doesn’t forestall you from creating the MLV. It solely implies that Material makes use of a full refresh as an alternative of an incremental one. Optimum refresh mechanically falls again to full when wanted, so that you don’t usually must drive it. In case you do wish to drive one (for instance, to reprocess information after a correction), there’s a one-liner for that:

REFRESH MATERIALIZED LAKE VIEW silver.cleaned_order_data FULL;

3. PySpark authoring (preview)

This one is big! SQL is nice till your transformation logic entails a customized Python library, an ML inference name, or a UDF that wraps advanced enterprise guidelines. You then’d hit a wall as MLVs had been SQL-only.

With PySpark authoring, now you can create, refresh, and change MLVs from Material notebooks utilizing PySpark and the acquainted DataFrameWriter API. The fmlv module exposes a decorator-based sample, documented within the official PySpark MLV reference:

import fmlv

from pyspark.sql import features as F

@fmlv.materialized_lake_view(

identify=”LH1.silver.customer_silver”,

remark=”Cleaned & enriched buyer silver MLV”,

partition_cols=[”year”, “city”],

table_properties={”delta.enableChangeDataFeed”: “true”},

change=True

)

@fmlv.examine(identify=”non_null_sales”, situation=”gross sales IS NOT NULL”, motion=”DROP”)

def customer_silver():

df = spark.learn.desk(”bronze.customer_bronze”)

cleaned_df = df.filter(F.col(”gross sales”).isNotNull())

enriched_df = cleaned_df.withColumn(”sales_in_usd”, F.col(”gross sales”) * 1.0)

return enriched_df

A couple of PySpark gotchas value understanding about:

PySpark MLVs are nonetheless in preview on the time of writing.
At this time, PySpark-authored MLVs all the time carry out a full refresh. Optimum refresh for PySpark is on the roadmap, however not right here but.
The @fmlv decorator doesn’t assist dynamic parameters or variables. All parameters should be hardcoded.
You’ll be able to’t create an MLV from a PySpark short-term view (createOrReplaceTempView) – the engine can’t see session-scoped views. Use bodily Delta tables or different MLVs as sources.
Don’t delete the pocket book the place the MLV is outlined. Scheduled refresh fails with out it.

So in case your transformation will be expressed cleanly in SQL, SQL remains to be the higher alternative for efficiency. PySpark MLVs unlock the instances the place SQL alone received’t do.

4. In-place updates (Change)

Enterprise logic modifications. A filter shifts. A be part of good points a column. An aggregation provides a metric. In preview, updating an MLV definition required dropping and recreating it, which misplaced refresh historical past and compelled downstream customers to reconnect.

Now, with the Change functionality, you replace an MLV’s definition in place. Material validates the brand new logic, swaps it in, and preserves the view’s id, metadata, and lineage. Downstream dependencies stay intact. Works for each SQL (CREATE OR REPLACE) and PySpark (change=True).

That is a kind of “below the radar” GA options that doesn’t get headlines however issues enormously day-to-day. In case you’ve ever needed to coordinate dropping and recreating a closely consumed desk whereas manufacturing is operating, the ache. That goes away with this.

5. Stronger information high quality

Knowledge high quality constraints are nothing new in MLVs, however at GA, they received a critical improve. Now you can:

Use expression-based logic that mixes a number of columns
Apply arithmetic and built-in features inside a single rule
Invoke session-scoped user-defined features for validation logic that lives in Python slightly than SQL

Mix that with the auto-generated information high quality studies, and also you get one thing near a built-in information observability layer. You’ll be able to shortly spot which guidelines fail most frequently, which views they have an effect on, and the way traits shift over time, with out constructing a separate monitoring pipeline.

The lineage view – Dependencies free of charge

When one MLV references one other (or a desk), Material infers the connection mechanically. No guide configuration, no exterior orchestration software. The dependencies are found out of your SQL.

That dependency graph turns into a visible lineage in your lakehouse. Every node represents a metamorphosis. Arrows present execution order. Material makes positive that when bronze information lands, the bronze-to-silver MLV runs first, then the silver-to-gold MLV runs in opposition to the freshly up to date silver.

That is the place the declarative strategy actually pays off. You’re not writing pipelines. You’re not defining orchestration. You’re writing what every layer ought to appear like, and Material figures out the order. That is the great thing about a declarative strategy:)

A couple of helpful behaviors to learn about:

Impartial views run in parallel
Errors floor centrally as an alternative of getting misplaced in pocket book cell output
The lineage view auto-refreshes each two minutes when a run is in progress
All shortcuts are handled as supply entities within the lineage view
You’ll be able to connect a customized Spark surroundings to materialized lake views lineage to optimise efficiency and useful resource utilization throughout refresh

Knowledge high quality – Declared!

I touched on this above, nevertheless it deserves its personal part as a result of it’s one of many issues that makes MLVs really feel completely different from a hand-built pipeline.

Each MLV can have a number of information high quality constraints hooked up:

CREATE OR REPLACE MATERIALIZED LAKE VIEW silver.valid_orders

(

CONSTRAINT positive_quantity CHECK (amount > 0) ON MISMATCH DROP,

CONSTRAINT valid_date CHECK (orderDate >= ‘2020-01-01’) ON MISMATCH FAIL

)

AS

SELECT * FROM bronze.orders

Two motion sorts:

DROP – violating rows are eliminated, the depend is logged within the lineage view, and the pipeline retains going
FAIL – the refresh stops on the first violation. That is additionally the default in case you don’t specify

If a number of constraints are current and each behaviors are configured, FAIL takes priority.

Violations floor within the lineage view and run particulars. Fantastic, however what does that truly appear like in follow? Nicely, within the information high quality report, you’ll see counts by constraint, by view, over time. So if a constraint that usually drops 0.1% of rows abruptly drops 15%, you’ll see the spike and know precisely which rule failed and which view it belongs to. That’s a high quality sign you’d in any other case need to construct by hand.

The Microsoft docs additionally notice that the brand new expression-based constraints assist built-in Spark/SQL features like UPPER(), LOWER(), TRIM(), COALESCE(), INITCAP(), and DATE_FORMAT(), so your CHECK situations will be richer than simply easy comparisons.

When MLVs shine and once they don’t

MLVs are usually not a hammer for each nail. The Microsoft documentation is unusually direct about the place they match and the place they don’t.

Use MLVs when you might have:

Often accessed aggregations (day by day totals, month-to-month metrics) the place precomputed outcomes beat re-running costly queries
Complicated joins throughout a number of giant tables that should be constant for all customers
Knowledge high quality guidelines you wish to apply uniformly, declaratively
Reporting datasets that mix information from a number of sources and profit from computerized refresh
Medallion design sample – bronze to silver to gold outlined as SQL transformations

Don’t use MLVs when:

The question runs as soon as or not often – precomputing received’t assist
Transformations are easy and quick already
You want non-SQL logic like ML inference, API calls, or advanced Python processing – notebooks are nonetheless higher (although PySpark MLVs are beginning to bridge this hole)
You want sub-second latency for streaming – that’s Actual-Time Intelligence territory

I’ll add a private notice right here. I’m presently deep in a Microsoft Material engagement the place the silver layer design alternative – Warehouse vs. Lakehouse with MLVs – is actively on the desk. And what I preserve coming again to is that this: MLVs aren’t competing with the Warehouse the way in which some individuals body it. They’re competing with the spaghetti of notebooks-and-pipelines that you simply’d in any other case construct contained in the Lakehouse. In case your staff is already SQL-fluent and your transformations stay naturally in SELECT statements, the case for MLVs because the silver layer in a Lakehouse-based structure is genuinely robust.

The nice print – What to know earlier than you bounce in

I’d be doing you a disservice if I painted MLVs as a silver bullet. They’ve significant limitations, a few of which can matter to your structure:

No cross-lakehouse lineage and execution – all sources, MLVs, and dependencies should stay in the identical lakehouse. In case you’re utilizing a Material Knowledge Warehouse desk as a supply, you need to create a shortcut to it in your lakehouse first.
No DML statements – you may’t INSERT, UPDATE, or DELETE into an MLV. The information is regardless of the SELECT produces.
No time-travel queries within the definition – VERSION AS OF and TIMESTAMP AS OF aren’t allowed.
No UDFs within the SQL definition – although PySpark authoring fills this hole with session-scoped UDFs.
No short-term views as sources – the SELECT can reference bodily tables and different MLVs, however not temp views. This is applicable to PySpark too: createOrReplaceTempView() outputs aren’t seen to the MLV engine.
Session-level Spark properties don’t apply throughout scheduled refresh – set them on the lakehouse or workspace stage as an alternative.
Schema identify casing issues – all-uppercase schema names aren’t supported. Use blended case or lowercase.
Area availability – on the time of writing, MLVs aren’t accessible within the South Central US area.

None of those are showstoppers for many pipelines. However they’re value understanding earlier than you commit an structure to MLVs and uncover the limitation midway via.

Wrapping up

In case you’ve been constructing medallion design patterns in Material utilizing notebooks and pipelines, MLVs are value a critical look. They collapse 5 surfaces into one declarative layer. The dependency administration is computerized. The information high quality is in-built. The lineage is seen. And as of FabCon Atlanta, they’re production-ready.

The roadmap from Microsoft is evident: optimum refresh for PySpark-authored MLVs is coming, extra SQL operators will develop into incremental-refresh-eligible, and deeper integration with different Material workloads is on the way in which. This can be a milestone, not the end line – and I’m curious to see how MLVs evolve over the subsequent few quarters, particularly round PySpark incremental refresh and any cross-lakehouse story Microsoft would possibly inform.

Two takeaways I’d maintain onto:

The “T” in your ELT simply received so much simpler to write down, schedule, and belief – if that “T” is SQL.
MLVs don’t change each pocket book, each pipeline, or each Warehouse. However for declarative transformations that want lineage, refresh, and information high quality baked in, they’re now a legit default in Microsoft Material.

Thanks for studying!

Materialized Lake Views in Microsoft Material: When Your Medallion Matches in a SELECT Assertion

Making a PDF’s Pictures Searchable for RAG, With out Paying to Learn Them All

Python 3.14 and its New JIT Compiler

Related Posts

Making a PDF’s Pictures Searchable for RAG, With out Paying to Learn Them All

Python 3.14 and its New JIT Compiler

GPU-Resident Prime-Okay for Agentic RAG: I Constructed a CUDA Kernel So My Retrieval Step Would Cease Bouncing Off the GPU

Structured Outputs with LLMs: JSON Mode, Perform Calling, and When to Use Every

Your Churn Threshold Is a Pricing Determination

LLM Fallbacks Break Agent Pipelines — I Constructed the Lacking Restoration Layer

Leave a Reply Cancel reply

POPULAR NEWS

Gemini 2.0 Flash vs GPT 4o: Which is Higher?

Chainlink’s Run to $20 Beneficial properties Steam Amid LINK Taking the Helm because the High Creating DeFi Challenge ⋆ ZyCrypto

Easy methods to Use LLMs for Highly effective Computerized Evaluations

XMN is accessible for buying and selling!

College endowments be a part of crypto rush, boosting meme cash like Meme Index

EDITOR'S PICK

Cease Constructing AI Platforms | In the direction of Information Science

Can AI-Pushed search engine marketing Instruments to Supercharge Your Advertising and marketing

Rocket Physician AI: Because the Healthcare Battle Rages, AI Can Assist Bridge the Divide

Implementing the Hangman Recreation in Python

About Us

Categories

Recent Posts

Are you sure want to unlock this post?

Are you sure want to cancel subscription?

Materialized Lake Views in Microsoft Material: When Your Medallion Matches in a SELECT Assertion

READ ALSO

Materialized Lake View – WHAT?

The 4 levels of an MLV’s life

Create: the syntax

Refresh: the mind of MLVs

What’s new within the Normal Availability section?

1. Multi-schedule assist

2. Broader incremental refresh

3. PySpark authoring (preview)

4. In-place updates (Change)

5. Stronger information high quality

The lineage view – Dependencies free of charge

Knowledge high quality – Declared!

When MLVs shine and once they don’t

The nice print – What to know earlier than you bounce in

Wrapping up

Related Posts

Leave a Reply Cancel reply

POPULAR NEWS

EDITOR'S PICK

About Us

Categories

Recent Posts

Are you sure want to unlock this post?

Are you sure want to cancel subscription?