What makes Power BI so fast and powerful when it comes to performance? So powerful that it performs complex calculations over millions of rows in the blink of an eye.
In this article, we will dig deep to discover what's "under the hood" of Power BI: how your data is being stored, compressed, queried, and finally, brought back to your report. Once you finish reading, I hope you will get a better understanding of the hard work happening in the background, and appreciate the importance of creating an optimal data model to get maximum performance from the Power BI engine.
First look under the hood — Formula Engine and Storage Engine
First, I would like you to meet the VertiPaq engine, the "brain & muscles" of the system behind not only Power BI, but also Analysis Services Tabular and Excel Power Pivot. Truth be told, VertiPaq represents just one part of the storage engine within the Tabular model, the other part being DirectQuery, which we will discuss separately in one of the next articles.
When you send a query to get data for your Power BI report, here is what happens:
- The Formula Engine (FE) accepts the request, processes it, generates the query plan, and finally executes it
- The Storage Engine (SE) pulls the data out of the Tabular model to satisfy the request issued within the query generated by the Formula Engine
The Storage Engine works in two different ways to retrieve the requested data: VertiPaq keeps a snapshot of the data in memory. This snapshot can be refreshed from time to time from the original data source.
On the contrary, DirectQuery doesn't store any data. It simply forwards the query straight to the data source for every single request.

Data in the Tabular model is usually stored either as an in-memory snapshot (VertiPaq) or in DirectQuery mode. However, there is also the possibility of implementing a hybrid Composite model, which relies on both architectures in parallel.
Formula Engine — the "Brain" of Power BI
As I already stressed, the Formula Engine accepts the query, and since it is able to "understand" DAX (and MDX as well, but that is out of the scope of this series), it "translates" DAX into a specific query plan, consisting of the physical operations that need to be executed to get the results back.
These physical operations can be joins between multiple tables, filtering, or aggregations. It's important to understand that the Formula Engine works in a single-threaded way, which means that requests to the Storage Engine are always sent sequentially.
Storage Engine — the "Muscles" of Power BI
Once the query has been generated and executed by the Formula Engine, the Storage Engine comes into play. It physically goes through the data stored within the Tabular model (VertiPaq), or goes directly to a different data source (SQL Server, for example, if DirectQuery storage mode is in place).
When it comes to specifying the storage mode for a table, there are three possible options to choose from:
- Import mode — based on VertiPaq. Table data is stored in memory as a snapshot. Data can be refreshed periodically
- DirectQuery mode — data is retrieved from the data source at query time. Data resides in its original source before, during, and after the query execution
- Dual mode — a combination of the first two options. Data from the table is loaded into memory, but at query time it can also be retrieved directly from the source
As opposed to the Formula Engine, which doesn't support parallelism, the Storage Engine can work asynchronously.
Meet the VertiPaq Storage Engine
Now that we've drawn the big picture, let me explain in more detail what VertiPaq does in the background to boost the performance of our Power BI reports.
When we choose Import mode for our Power BI tables, VertiPaq performs the following actions:
- Reads the data source, transforms the data into a columnar structure, and encodes and compresses the data within each of the columns
- Creates a dictionary and index for each of the columns
- Prepares and creates relationships
- Computes all calculated columns and calculated tables, and compresses them
The two main characteristics of VertiPaq are:
- VertiPaq is a columnar database
- VertiPaq is an in-memory database

As you can see in the illustration above, columnar databases store and compress data differently from traditional row-store databases. Columnar databases are optimized for vertical data scanning, which means that every column is structured in its own way and physically separated from the other columns!
Without going into a deep analysis of the advantages and disadvantages of row-store vs. column-store databases, since that would require a separate series of articles, let me just point out a few key differences in terms of performance.
With columnar databases, single-column access is fast and effective. Once a computation starts to involve multiple columns, things become more complex, as the intermediate results need to be temporarily stored in some way.
Simply put, columnar databases are more CPU-intensive, while row-store databases increase I/O because of the many scans of useless data.
So far, we have painted the big picture of the architecture that enables Power BI to fully shine as an ultimate BI tool. Now, we're ready to dive deeper into specific architectural features and leverage this knowledge to get the most out of our Power BI reports, by tuning our data model to extract the maximum from the underlying engine.
Inside VertiPaq in Power BI — Compress for success!

As you may recall from the earlier part of this article, we only scratched the surface of VertiPaq, the powerful storage engine that is "responsible" for the blazing-fast performance of most of your Power BI reports (whenever you are using Import mode or a Composite model).
3, 2, 1…Fasten your seatbelts!
One of the key characteristics of VertiPaq is that it is a columnar database. We learned that columnar databases store data optimized for vertical scanning, which means that every column has its own structure and is physically separated from the other columns.
That fact allows VertiPaq to apply different types of compression to each of the columns independently, choosing the optimal compression algorithm based on the values in that specific column.
Compression is achieved by encoding the values within the column. But, before we dive deeper into a detailed overview of the encoding techniques, just keep in mind that this architecture is not exclusive to Power BI — in the background is a Tabular model, which is also "under the hood" of Analysis Services Tabular and Excel Power Pivot.
Value Encoding
This is the most desirable encoding type, since it works exclusively with integers and, therefore, requires less memory than, for example, working with text values.
How does this look in reality? Let's say we have a column containing the number of phone calls per day, and the values in this column vary from 4,000 to 5,000. What VertiPaq does is find the minimum value in this range (which is 4,000) as a starting point, then calculate the difference between this value and all the other values in the column, storing that difference as the new value.

At first glance, 3 bits per value might not look like a significant saving, but multiply this by millions or even billions of rows and you will appreciate the amount of memory saved.
As I already stressed, Value Encoding is applied exclusively to columns of integer data type (the currency data type is also stored as an integer).
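To make the saving concrete: raw values between 4,000 and 5,000 need 13 bits each, while offsets between 0 and 1,000 fit in 10 bits, which is where the 3 bits per value come from. Here is a purely conceptual T-SQL sketch of the idea, assuming a hypothetical dailyCalls table with an integer callsPerDay column (VertiPaq does this internally, of course; you would never write it yourself):
-- Conceptual only: the offsets 0-1,000 fit in 10 bits, while the raw values 4,000-5,000 need 13
SELECT callsPerDay
      ,callsPerDay - MIN(callsPerDay) OVER () AS storedOffset
FROM dailyCalls;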
Hash Encoding (Dictionary Encoding)
This is probably the most frequently used compression type by VertiPaq. Using Hash encoding, VertiPaq creates a dictionary of the distinct values within one column and afterward replaces the "real" values with index values from the dictionary.
Here is an example to make things clearer:

As you may notice, VertiPaq identified the distinct values within the Subjects column, built a dictionary by assigning indexes to those values, and finally stored the index values as pointers to the "real" values. I assume you are aware that integer values require far less memory space than text, so that's the logic behind this type of data compression.
Additionally, by being able to build a dictionary for any data type, VertiPaq is practically data type independent!
This brings us to another key takeaway: no matter if your column is of text, bigint, or float data type — from the VertiPaq perspective it's all the same — it needs to create a dictionary for each of those columns, which implies that all of these columns will provide the same performance, both in terms of speed and memory space allocated! Of course, this assumes that there are no significant differences in dictionary sizes between the columns.
So, it's a myth that the data type of a column affects its size within the data model. On the contrary, the number of distinct values within the column, known as cardinality, mostly influences column memory consumption.
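Purely as an illustration of what such a dictionary looks like (VertiPaq builds it internally), here is a sketch in T-SQL, assuming a hypothetical Chats table with a text column Subject:
-- Assign a small integer index to each distinct Subject; the column then stores the index instead of the text
SELECT DISTINCT Subject
      ,DENSE_RANK() OVER (ORDER BY Subject) - 1 AS dictionaryIndex
FROM Chats;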
RLE (Run-Length Encoding)
The third algorithm (RLE) creates a kind of mapping table containing ranges of repeating values, thus avoiding storing every single (repeated) value separately.
Again, looking at an example will help you better understand this concept:

In real life, VertiPaq doesn't store the Start values, because it can quickly calculate where the next node starts by summing the previous Count values.
As powerful as it might look at first glance, the RLE algorithm is highly dependent on the ordering within the column. If the data is stored the way you see in the example above, RLE will perform great. However, if your data buckets are smaller and rotate more frequently, then RLE would not be an optimal solution.
One more thing to keep in mind regarding RLE: in reality, VertiPaq doesn't store the data the way it's shown in the illustration above. First, it performs Hash encoding and creates a dictionary of the subjects, and then applies the RLE algorithm, so the final logic, in its most simplified form, would be something like this:

So, RLE occurs after Value or Hash Encoding, in those scenarios when VertiPaq "thinks" that it makes sense to compress the data additionally (when the data is ordered in such a way that RLE would achieve better compression).
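If you are curious what the run-length structure would contain for a sorted column, a rough T-SQL equivalent (illustration only, again assuming the hypothetical Chats table with a Subject column) is simply a grouped count per value:
-- Each row corresponds to one "run": the value and how many times it repeats (Count);
-- the Start positions are not stored, as they can be derived by summing the previous counts
SELECT Subject
      ,COUNT(*) AS RunLength
FROM Chats
GROUP BY Subject
ORDER BY Subject;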
Re-encoding considerations
No matter how "smart" VertiPaq is, it can also make some bad decisions based on incorrect assumptions. Before I explain how re-encoding works, let me briefly go through the process of data compression for a specific column:
- VertiPaq scans a sample of rows from the column
- If the column data type is not an integer, it will look no further and use Hash encoding
- If the column is of integer data type, some additional parameters are evaluated: if the numbers in the sample increase linearly, VertiPaq assumes that it is probably a primary key and chooses Value encoding
- If the numbers in the column are reasonably close to each other (the value range is not very wide, like in our example above with 4,000–5,000 phone calls per day), VertiPaq will use Value encoding. On the contrary, when the values vary significantly within the range (for example, between 1,000 and 1,000,000), then Value encoding doesn't make sense, and VertiPaq will apply the Hash algorithm
However, it can sometimes happen that VertiPaq makes a decision about which algorithm to use based on the sample data, but then some outlier pops up and the column needs to be re-encoded from scratch.
Let's use our previous example with the number of phone calls: VertiPaq scans the sample and chooses to apply Value encoding. Then, after processing 10 million rows, it suddenly finds a value of 500,000 (it could be an error, or whatever). Now, VertiPaq re-evaluates the choice, and it may decide to re-encode the column using the Hash algorithm instead. Surely, that would impact the whole process in terms of the time needed for reprocessing.
Finally, here is the list of parameters (in order of importance) that VertiPaq considers when choosing which algorithm to use:
- Number of distinct values in the column (cardinality)
- Data distribution in the column — a column with many repeating values can be compressed better than one containing frequently changing values (RLE can be applied)
- Number of rows in the table
- Column data type — affects only the dictionary size
Reducing the data model size by 90% — a true story!
Now that we have laid the theoretical ground for understanding the architecture behind the VertiPaq storage engine, and the types of compression it uses to optimize your Power BI data model, it's the right moment to get our hands dirty and apply our knowledge to a real-life case!
Starting point = 776 MB
Our data model is quite simple, yet memory-intensive. We have one fact table (factChat), which contains data about live support chats, and one dimension table (dimProduct), which relates to the fact table. Our fact table has around 9 million rows, which shouldn't be a big deal for Power BI, but the table was imported as-is, without any additional optimization or transformation.

Now, this pbix file consumes a whopping 777 MB! You can't believe it? Just take a look:

Just remember this picture! Of course, I don't need to tell you how much time this report needs to load or refresh, or how slow our calculations are because of the file size.
…and it's even worse!
Moreover, it's not just the 776 MB that takes up our memory, since memory consumption is calculated taking the following components into account:
- PBIX file
- Dictionary (you've learned about the dictionary in the earlier sections of this article)
- Column hierarchies
- User-defined hierarchies
- Relationships
Now, if I open Task Manager, go to the Details tab, and find the msmdsrv.exe process, I can see that it burns more than 1 GB of memory!
Oh, man, that really hurts! And we haven't even interacted with the report yet! So, let's see what we can do to optimize our model…
Rule #1 — Import only those columns you really need
The first and most important rule is: keep in your data model only those columns you really need for the report!
That being said, do I really need both the chatID column, which is a surrogate key, and the sourceID column, which is a primary key from the source system? Both of these values are unique, so even if I need to count the total number of chats, I would still be fine with only one of them.
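Just to illustrate that point on the source side (a hypothetical check, not something the report needs), counting either key returns the same total, since both are unique per chat:
-- Both columns identify a chat uniquely, so either one gives the same total
SELECT COUNT(chatID)   AS totalChats_bySurrogateKey
      ,COUNT(sourceID) AS totalChats_bySourceKey
FROM factChat;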

So, I'll remove the sourceID column and check how the file looks now:

By removing just one unnecessary column, we saved more than 100 MB! Let's examine further what can be removed without even taking a deeper look (we will get to that later, I promise).
Do we really need both the original start time of the chat and the UTC time, one stored as a Date/Time/Timezone type, the other as Date/Time, and both at a second level of precision?!
Let me get rid of the original start time column and keep only the UTC values.

Another 100 MB of wasted space gone! By removing just two columns we don't need, we reduced the size of our file by 30%!
Now, that was without even looking into the details of memory consumption. Let's now fire up DAX Studio, my favorite tool for troubleshooting Power BI reports. As I've already stressed several times, this tool is a MUST if you plan to work seriously with Power BI — and it's completely free!
One of the features in DAX Studio is VertiPaq Analyzer, a very useful tool built by Marco Russo and Alberto Ferrari from sqlbi.com. When I connect to my pbix file with DAX Studio, here are the numbers related to my data model size:

Here I can see which are the most expensive columns in my data model and decide whether I can discard some of them, or whether I need to keep them all.
At first glance, I have a few candidates for removal — the sessionReferrer and referrer columns have high cardinality and therefore can't be optimally compressed. Moreover, as these are text columns that need to be encoded using the Hash algorithm, you can see that their dictionary size is extremely high! If you take a closer look, you will notice that these two columns take up almost 40% of my table size!
After checking with my report users whether they need any of these columns, or maybe only one of them, I got confirmation that they don't perform any analysis on them. So, why on Earth should we bloat our data model with them?!
Another strong candidate for removal is the LastEditDate column. This column just shows the date and time when the record was last edited in the data warehouse. Again, I checked with the report users, and they didn't even know that this column exists!
I removed these three columns, and the result is:

Oh my, we halved the size of our data model just by removing a few unnecessary columns!
Truth be told, there are a few more columns that could be dismissed from the data model, but let's now focus on other techniques for data model optimization.
Rule #2 — Reduce the column cardinality!
As you may recall from the earlier part of this article, the rule of thumb is: the higher the cardinality of a column, the harder it is for VertiPaq to optimally compress the data. Especially if we are not working with integer values.
Let's take a deeper look at the VertiPaq Analyzer results:

As you can see, even though the chatID column has a higher cardinality than the datetmStartUTC column, it takes almost 7 times less memory! Since it is a surrogate key with integer values, VertiPaq applies Value encoding, and the size of the dictionary is irrelevant. On the other hand, Hash encoding has to be applied to the date/time column with high cardinality, so the dictionary size is enormously higher.
There are multiple techniques for reducing column cardinality, such as splitting columns. Here are a few examples of this technique.
For integer columns, you can split them into two evenly sized columns using division and modulo operations. In our case, it would be:
SELECT chatID / 1000 AS chatID_div
      ,chatID % 1000 AS chatID_mod
      ...
This optimization technique must be performed on the source side (in this case, by writing a T-SQL statement). If we used calculated columns instead, there would be no benefit at all, since the original column still has to be stored in the data model first.
A similar technique can bring significant savings when you have decimal values in a column: you can simply split the values before and after the decimal point, as explained in this article.
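A minimal sketch of that idea, assuming a hypothetical factSales table with a decimal amount column (not part of our model), could look like this:
-- Split one high-cardinality decimal column into two low-cardinality integer columns;
-- a measure in the model would then recombine them as amount_whole + amount_cents / 100.0
SELECT CAST(amount AS int) AS amount_whole
      ,CAST(ROUND((amount - CAST(amount AS int)) * 100, 0) AS int) AS amount_cents
FROM factSales;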
Since we don't have any decimal values, let's focus on our problem — optimizing the datetmStartUTC column. There are multiple valid options for optimizing this column. The first is to check whether your users need granularity finer than the day level (in other words, can you remove hours, minutes, and seconds from your data).
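If day-level grain is acceptable, the change on the source side is trivial. Something along these lines would do (a sketch, not the exact statement I ran):
-- Casting to date removes hours, minutes, and seconds, drastically lowering the cardinality
SELECT chatID
      ,CAST(datetmStartUTC AS date) AS dateStartUTC
      -- ...remaining columns unchanged
FROM factChat;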
Let's check what savings this solution would bring:

The first thing we notice is that our file is now 271 MB, roughly 1/3 of what we started with. The VertiPaq Analyzer results show that this column is now almost perfectly optimized, going from taking up 62% of our data model to just slightly over 2.5%! That's huge!

However, it turned out that day-level grain was not fine enough, and my users needed to analyze figures at the hour level. OK, so we can at least get rid of minutes and seconds, and that will also lower the cardinality of the column.
So, I imported the values rounded to the hour:
SELECT chatID
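-- the next line truncates datetmStartUTC to the hour: datediff counts whole hours since 1900-01-01, dateadd adds them back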
,dateadd(hour, datediff(hour, 0, datetmStartUTC), 0) AS datetmStartUTC
,customerID
,userID
,ipAddressID
,productID
,countryID
,userStatus
,isUnansweredChat
,totalMsgsOp
,totalMsgsUser
,userTimezone
,waitTimeSec
,waitTimeoutSec
,chatDurationSec
,sourceSystem
,topic
,usaccept
,transferUserID
,languageID
,waitFirstClick
FROM factChat
It turned out that my users also didn't need the chatVariables column for analysis, so I removed it from the data model as well.
Finally, after disabling the Auto Date/Time option under Options for Data Load, my data model size was around 220 MB! However, one thing still bothered me: the chatID column was still occupying almost 1/3 of my table. And this is just a surrogate key, which isn't used in any of the relationships within my data model.

So, here I was weighing two different solutions: the first was to simply remove this column and aggregate the number of chats, counting them with a GROUP BY clause (a sketch of that approach follows the general rules at the end of this article). In the end, there would be no benefit in keeping the chatID column at all, since it's not being used anywhere in our data model. Once I removed it from the model, one last time, let's check the pbix file size:

Please recall the number we started at: 776 MB! So, I managed to reduce my data model size by almost 90%, applying a few simple techniques that enabled the VertiPaq storage engine to perform more optimal compression of the data.
And this was a real use case that I faced during the past year!
General rules for reducing the data model size
To conclude, here is the list of general rules you should keep in mind when trying to reduce the data model size:
- Keep only those columns your users need in the report! Just sticking to this one single rule will save you an incredible amount of space, I assure you…
- Try to optimize column cardinality whenever possible. The golden rule here is: test, test, test… and if there is a significant benefit from, for example, splitting one column into two, or substituting a decimal column with two whole-number columns, then do it! But also keep in mind that your measures will need to be rewritten to handle those structural changes in order to display the expected results. So, if your table is not large, or if you would have to rewrite hundreds of measures, maybe it's not worth splitting the column. As I said, it depends on your specific scenario, and you should carefully evaluate which solution makes more sense
- The same goes for rows: keep only those rows you need. For example, maybe you don't need to import data from the last 10 years, but only 5! That will also reduce your data model size. Talk to your users and ask them what they really need before blindly putting everything into your data model
- Aggregate your data whenever possible! That means fewer rows and lower cardinality, so all the good things you are aiming for! If you don't need hour, minute, or second level granularity, don't import it! Aggregations in Power BI (and the Tabular model in general) are an important and broad topic, which is out of the scope of this series, but I strongly recommend you check Phil Seamark's blog and his series of posts on creative aggregation usage (a simple aggregation sketch also follows right after this list)
- Avoid using DAX calculated columns whenever possible, since they are not optimally compressed. Instead, try to push all calculations to the data source (a SQL database, for example) or perform them in the Power Query Editor
- Use proper data types (for example, if your data granularity is at the day level, there is no need to use the Date/Time data type; the Date data type will suffice)
- Disable the Auto Date/Time option for data loading (this will remove a bunch of automatically created date tables in the background)
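To make the aggregation idea (and the GROUP BY option mentioned for the chatID column) more tangible, here is a rough sketch of what such a pre-aggregated load query could look like. It is only a sketch, since the exact grain depends on what your users actually need:
-- One row per hour and product instead of one row per chat: fewer rows and lower cardinality
SELECT dateadd(hour, datediff(hour, 0, datetmStartUTC), 0) AS datetmStartUTC
      ,productID
      ,COUNT(*) AS totalChats
FROM factChat
GROUP BY dateadd(hour, datediff(hour, 0, datetmStartUTC), 0)
        ,productID;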
Conclusion
Now that you've learned the basics of the VertiPaq storage engine and the different techniques it uses for data compression, I wanted to wrap up this article by showing you a real-life example of how we can "help" VertiPaq (and, consequently, Power BI) get the best out of report performance and optimal resource consumption.
Thanks for reading, I hope you enjoyed the article!