Take a Deep Dive into Filtering in DAX

We always use filters when developing DAX expressions, such as DAX measures, or when writing DAX queries.

But what exactly happens when we apply filters?

This piece is about exactly this question.

I’ll start with simple queries and add variants to explore what happens under the hood.

I use DAX Studio with the Server Timings option enabled for each query.

If you want to learn more about this feature and how to interpret the results, read the first article in the References section at the end of this piece.

Let’s start with the base query:

EVALUATE
    CALCULATETABLE(
        SUMMARIZECOLUMNS('Product'[BrandName]
            ,"Online Sales", [Sum Online Sales]
            )
        )

Figure 1 – Executing the query and the results. The results themselves are not important, but the execution statistics and the SE queries are (Figure by the Author)

When we activate the server timings and execute the query, we get the execution statistics and the Storage Engine (SE) query or queries needed to get the data:

Figure 2 – The execution statistics for the query. We see the Storage Engine query to retrieve the results (Figure by the Author)

As you can see, we need only one Storage Engine (SE) query to retrieve the results.

The query completes in only 47 ms and is served almost entirely by the SE (95.7%).

The larger the share of the query time spent in the SE, the better, because it’s the component that retrieves data from the data stores and tables.
Moreover, the SE can use multiple CPU cores, whereas the Formula Engine (FE) can use only one. We also can’t examine what happens in the FE as easily as we can with SE queries.

You can learn more about the difference between these two engines in the article mentioned above.

A short note:

A few months ago, I wrote an article here with a very similar title. But while that one was only about date filters with Time Intelligence functions, this one goes one step deeper into the rabbit hole.

This one is much more generic.

In case you missed it, I added the link to that article and more resources on the current topic to the References section below.

Add simple filters

Next, we add a simple filter for the product color red to the query:

EVALUATE
    CALCULATETABLE(
        SUMMARIZECOLUMNS('Product'[BrandName]
            ,"Online Sales", [Sum Online Sales]
            )
        ,'Product'[ColorName] = "Red"
        )

Here is the query and the results restricted to the product color red:

Figure 3 – The query and the results for the data restricted to the product color red (Figure by the Author)

When we look at the query statistics, we see this:

Figure 4 – The query statistics and the SE query with the filter for red products (Figure by the Author)

As you can see, the entire query is executed in a single SE query.

The filter is in the SE query’s WHERE clause. Therefore, only the restricted data is retrieved.

This is visible in the “Rows” column, as only 14 rows are returned from this query.

But what happens when we use the FILTER() function to filter the products:

EVALUATE
    CALCULATETABLE(
        SUMMARIZECOLUMNS('Product'[BrandName]
            ,"Online Sales", [Sum Online Sales]
            )
        ,FILTER('Product'
            ,'Product'[ColorName] = "Red")
        )

As you might know, using the FILTER() function isn’t recommended because of how it works.

You can learn more about this topic in the second article linked in the References section below.

The result doesn’t change:

Figure 5 – The query and the results for the data restricted by the FILTER() function to the product color red (Figure by the Author)

But how does it affect the execution plan and the SE queries?

Figure 6 – The query statistics and the SE query with the filter for red products, but with the FILTER() function (Figure by the Author)

As you can see, in this case, the SE optimizes the query, yielding the same execution plan as before.

But as we change our code, we’ll see that using FILTER() isn’t always a good idea.

Add multiple filters

Now, what happens when we add multiple filters to a query?

EVALUATE
    CALCULATETABLE(
        SUMMARIZECOLUMNS('Product'[BrandName]
            ,"Online Sales", [Sum Online Sales]
            )
        ,'Product'[ColorName] = "Red"
        ,'Geography'[ContinentName] = "Europe"
        )

While the result isn’t that interesting to us, let’s look at the query statistics:

Figure 7 – Query statistics when applying multiple filters to an expression (Figure by the Author)

Again, the query can be served by a single SE query that contains both filters.

The query executes so quickly that the FE time share is comparatively high, yet it still takes only 6 ms.

When changing the query to use the FILTER() function, the SE query doesn’t change either:

Figure 8 – Query statistics when applying multiple filters with FILTER (Figure by the Author)
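The FILTER() variant of this query is not printed in the text. A minimal sketch, assuming each simple filter is wrapped in its own FILTER() over the corresponding table, analogous to the single-filter example earlier:

```dax
EVALUATE
    CALCULATETABLE(
        SUMMARIZECOLUMNS('Product'[BrandName]
            ,"Online Sales", [Sum Online Sales]
            )
        -- Assumption: one FILTER() per table, mirroring the two simple filters
        ,FILTER('Product'
            ,'Product'[ColorName] = "Red")
        ,FILTER('Geography'
            ,'Geography'[ContinentName] = "Europe")
        )
```

Running both variants in DAX Studio with server timings enabled makes it easy to compare the SE queries side by side.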

This shows that, with this kind of query, the engine can optimize execution to find the most efficient way to fulfill the DAX query.

Anyway, the result doesn’t change. It’s identical in both cases, as it should be, because we don’t change the filter per se. But please bear with me; I’ll get back to the FILTER() function and why it’s important to understand its effects in a moment.

Moving filters into measures

Next, let’s see what happens when the filter is moved into the measure.

Until now, the query was built so that the measure [Sum Online Sales] got its filter from outside.

Let’s try this:

DEFINE
MEASURE 'All Measures'[Online Sales A. Datum] =
    CALCULATE(
        SUMX('Online Sales', ( 'Online Sales'[UnitPrice] * 'Online Sales'[SalesQuantity]) - 'Online Sales'[DiscountAmount] )
        ,'Product'[BrandName] = "A. Datum"
        )

EVALUATE
    CALCULATETABLE(
        SUMMARIZECOLUMNS('Product'[BrandName]
            ,"Online Sales A. Datum", [Online Sales A. Datum]
            )
        )

As you can see, the filter is applied inside the measure [Online Sales A. Datum].

Of course, the resulting number is the same in every row of the result, as the Brand is set to “A. Datum”:

Figure 9 – The same result for each brand with the measure containing a filter (Figure by the Author)

But the execution is slightly different:

Figure 10 – Query statistics for the measure containing the filter (Figure by the Author)

This time, we have two SE queries.

The first query gets the sales for the Brand “A. Datum”. This query contains the filter for that brand.
The second query gets the list of all brands in the result set.

The first query is the most important to us, because it still shows the filter for the brand set within the measure.

This query can be fully served by the SE with a simple filter in a very efficient way.

But usually, we want to add multiple measures to a query (or a visual in a report).

What happens when we add the [Sum Online Sales] measure to the query?
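The query for this step is not shown. A sketch, assuming it simply adds [Sum Online Sales] as a second column to the previous query (the column label "Online Sales" is my own choice):

```dax
DEFINE
MEASURE 'All Measures'[Online Sales A. Datum] =
    CALCULATE(
        SUMX('Online Sales', ( 'Online Sales'[UnitPrice] * 'Online Sales'[SalesQuantity]) - 'Online Sales'[DiscountAmount] )
        ,'Product'[BrandName] = "A. Datum"
        )

EVALUATE
    CALCULATETABLE(
        SUMMARIZECOLUMNS('Product'[BrandName]
            -- Unfiltered measure: evaluated per brand
            ,"Online Sales", [Sum Online Sales]
            -- Measure with the brand filter set inside CALCULATE
            ,"Online Sales A. Datum", [Online Sales A. Datum]
            )
        )
```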

The result isn’t particularly important, as it shows one column with sales for each brand and another with sales for the filtered brand.

But the query statistics are interesting:

Figure 11 – Query statistics for the query with one measure containing a filter and another without. Where is the filter set in the measure? (Figure by the Author)

As you can see in the red-marked line in the SE query, the Brand filter is no longer present.

Because the engine recognizes that the filter in the measure is applied to the same column as the one in the query, it moves the filter to the FE and returns the result.

Now, what happens when we filter another column in the measure, for example, the color:

DEFINE
MEASURE 'All Measures'[Online Sales Red] =
    CALCULATE(
        SUMX('Online Sales', ( 'Online Sales'[UnitPrice] * 'Online Sales'[SalesQuantity]) - 'Online Sales'[DiscountAmount] )
        ,'Product'[ColorName] = "Red"
        )

EVALUATE
    CALCULATETABLE(
        SUMMARIZECOLUMNS('Product'[BrandName]
            ,"Online Sales", [Sum Online Sales]
            ,"Online Sales Red", [Online Sales Red]
            )
        )

Again, the result isn’t particularly interesting. We are interested in the query statistics:

Figure 12 – Query statistics for two measures with one filtering a column not contained in the result set (Figure by the Author)

As you can see, this time we have two queries by BrandName: one without and one with the filter for the color.

Both queries return the same number of rows (14), one for each Brand.

The FE handles combining the two results into a single table.

The whole query is still served mainly by the SE, which is very good.

But now, let’s add the FILTER() function to the filter:

For this example, I change the measure to filter for two values with the IN operator:

,'Product'[BrandName] IN { "A. Datum", "Adventure Works" }

In this variant, the SE query is like the ones before.

The filter is passed directly into the SE query’s WHERE clause.

But what happens when I change it to this:

,FILTER('Product'
       ,'Product'[BrandName] IN { "A. Datum", "Adventure Works" }
       )

First of all, the result changes:

Figure 13 – Results of the measures with and without FILTER(). In red, the results with the simple filter, and in blue, the results for the measure with FILTER() (Figure by the Author)

The reason is that FILTER() works completely differently.

It keeps the current filter context and adds a new filter on top of it.

I explained this behavior in another article that I added as the second link in the References section below.

Moreover, the SE can’t handle this in a single query anymore:

Figure 14 – The query statistics for using the FILTER() function in the measure. Now, multiple queries are needed. (Figure by the Author)

The first two queries retrieve the values for the brand to filter (see the queries marked in red).

Notice the large number of rows (324 and 2,560) returned by the first two queries. This is the materialization of intermediate results needed to perform the calculation.

The third query uses these intermediate results to filter the data (marked in red).

The result of the third query is only two rows: the two rows we see in the overall result.

As described in my other article, FILTER() must be used with care.

Not only is it considerably slower, but it also works completely differently from a simple filter.

Anyway, I can restore the previous behavior by adding an ALL() in the FILTER() call:

Figure 15 – Adding an ALL() within the FILTER() call restores the semantics of the simple filter. But why would you do this? (Figure by the Author)
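The code behind Figure 15 is only visible in the screenshot. A sketch of what such a measure could look like, assuming ALL() is applied to the 'Product'[BrandName] column; the measure name [Online Sales 2 Brands] is hypothetical:

```dax
DEFINE
MEASURE 'All Measures'[Online Sales 2 Brands] =
    CALCULATE(
        SUMX('Online Sales', ( 'Online Sales'[UnitPrice] * 'Online Sales'[SalesQuantity]) - 'Online Sales'[DiscountAmount] )
        -- ALL() ignores the brand filter coming from the query,
        -- so FILTER() iterates over all brands instead of only the current one
        ,FILTER(ALL('Product'[BrandName])
            ,'Product'[BrandName] IN { "A. Datum", "Adventure Works" }
            )
        )

EVALUATE
    SUMMARIZECOLUMNS('Product'[BrandName]
        ,"Online Sales 2 Brands", [Online Sales 2 Brands]
        )
```

A simple column predicate in CALCULATE is shorthand for exactly this FILTER(ALL(column), ...) pattern, which is why the plain form gives the same result with less typing.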

I don’t want to hide the fact that this example is special, as the filter applied affects the same column as the one used in the query.

When changing the query to filter the country, the engine can optimize the execution and use the simple form again:

Figure 16 – Here we see that when filtering columns different from the columns used in the DAX query, the engine can optimize the execution and fall back to a simple filter. In the blue inset, you see the results (Figure by the Author)

As you can see, the engine optimizes the execution of the query and falls back to a simple filter when filtering columns that differ from those used in the DAX query.

I see this kind of filtering very often when less experienced developers write DAX measures.

Using the FILTER() function looks intuitive, but it can yield incorrect or confusing results and is slower than a simple filter. I strongly recommend reading my article linked below about this function, as well as the dax.guide documentation and the articles linked on SQLBI.com.

Moreover, I have to type much more than when using a simple filter.

As a lazy man, this is an important reason for me not to use FILTER() when it’s unnecessary.

Add a complex filter

Finally, I want to show what happens when applying a filter using a DAX function, such as CONTAINSSTRING().

EVALUATE
    CALCULATETABLE(
        SUMMARIZECOLUMNS('Product'[BrandName]
            ,"Online Sales", [Sum Online Sales]
            )
        ,CONTAINSSTRING('Online Sales'[SalesOrderNumber], "202402252C")
        )

Such a query is executed when you use a slicer in your report to filter for a specific order and retrieve the brands of the purchased products.

As the result isn’t important at this point, let’s look directly at the query statistics:

Figure 17 – Query statistics for the query using a DAX function to filter the result. You can see that almost the entire execution is performed in the FE (Figure by the Author)

The query took more than 6 seconds to complete, and 99.6% of that time was spent by the FE executing the CONTAINSSTRING() function to find matching rows in the data. This operation is very CPU-intensive, and the FE can use only one core. When I execute this query on my laptop, it takes more than 2 seconds longer.

I deliberately chose a slow function to demonstrate its effects.

The SE was still able to serve the request with a single SE query. However, the positive effect of this fact is negligible in this case.

Conclusion

While it’s not my intention to give you advice on what to do and what not to do, I wanted to show you the consequences of the different ways to write DAX code and apply filters in your measures or queries.

The DAX engines are very efficient at optimizing queries, but they have limitations.

Therefore, we must always take care when writing our DAX code.

If the performance is poor or the code written by someone else looks strange, we should analyze it to determine how to improve it.

I wanted to show you how to do it and what to look for when analyzing your DAX code.

Remember:

The Storage Engine (SE) can use multiple CPU cores.
The more work is done by the SE, the better.
The SE can execute only simple aggregations and simple arithmetic operations (like +, -, *, and /).
Try to reduce the workload on the Formula Engine (FE).
The FE can use only one CPU core.
Try to reduce the materialization of data (the Rows column in the query statistics).
Try to reduce the number of SE queries.

I know that requirements will sometimes force us to write DAX code that isn’t optimal.

Even worse, report designers might add logic to the report that causes poor performance.

In such cases, eliminate that logic and check the response time again. It might be worth exploring creating a dedicated measure for such cases. Remember that it’s possible to create local measures in a report that’s connected to a semantic model via a live connection.

But most importantly: Take your time when writing DAX code. You might save time by avoiding the need to optimize DAX code that was written in a hurry. I speak from experience. It’s a very bad feeling.

I hope you learned something new.

References

To learn the details about how to interpret the results of the Server Timings in DAX Studio, read this piece:

Are you interested in how to use the FILTER() function correctly? Read this:

Another DAX function that can harm performance is KEEPFILTERS(). To learn more about the KEEPFILTERS() function, read this piece:

Here is the mentioned piece about date filters:

An interesting blog post by Data Mozart about the Storage Engine:

Like in my earlier articles, I use the Contoso sample dataset. You can download the ContosoRetailDW Dataset for free from Microsoft here.

The Contoso Data can be used freely under the MIT License, as described in this document. I modified the dataset to shift the data to contemporary dates.