The Importance of Sort Order
This past week I taught the SSAS Masters class which is one of the virtual training classes offered by Pragmatic Works. One of the things we discuss in the class is the importance of sorting the fact data in your data warehouse in preparation for Analysis Services. Simply by sorting your fact data, you can see much improved compression which can also improve your query response time, as well. But how much improvement in compression and query response could you see? Well that’s what I set out to discover by running a couple little tests.
Better Compression? Yes, please.
I started with my beloved Fact Sales measure group in the Contoso Retail Operations cube. The Fact Sales measure group utilizes a named query in the DSV that is a basic select statement from the Fact Sales table in the Contoso Retail database. I checked the size of the single partition that made up the measure group and saw that is was just over 129 MB in size. Not big but I thought we could improve that.
So I set out to sort my data. The trick to sorting your data is to start by selecting your top three fields with the least amount of cardinality (or uniqueness). Try experimenting with different sorting to see what kind of results you can get. For the FactSales table, I started with PromotionKey, CurrencyKey, and ChannelKey then went from there. I simply set my partition to Query Bound and utilized the following query:
SELECT TOP 2147483647 CONVERT (INT, CONVERT (CHAR (8), DateKey, 112)) AS DateKey,
channelKey,
StoreKey,
ProductKey,
PromotionKey,
CurrencyKey,
UnitCost,
UnitPrice,
SalesQuantity,
ReturnQuantity,
ReturnAmount,
DiscountQuantity,
DiscountAmount,
TotalCost,
SalesAmount,
ETLLoadID,
LoadDate,
UpdateDate
FROM dbo.FactSales
ORDER BY PromotionKey, CurrencyKey, ChannelKey, StoreKey, ProductKey, DateKey;
Arguably there are better ways to sort the data for SSAS but that’s not the point of this blog post so I’ll leave that for you to decide.
I did a quick redeployment of the cube and processed the Fact Sales measure group.
The partition size dropped down to 93.11 MB in size! That’s a whopping 28% decrease in size! Awesome!
28% is a pretty big storage savings, especially when we could potentially be dealing with a lot more data in an enterprise scenario. Personally, I’ve seen storage savings up to 45% simply by sorting the data in the relational engine.
Better Query Performance? Sign me up!
With small fact .data files, we should see better query performance, right? I mean, theoretically it makes sense, but I was curious about how much better query performance we would see. So I set out with another little experiment.
First, I used Excel to generate a nasty little MDX query for my testing, which I captured with profiler:
SELECT
{
[Measures].[Sales Amount]
,[Measures].[Sales Quantity]
,[Measures].[Sales Unit Cost]
}
DIMENSION PROPERTIES
PARENT_UNIQUE_NAME
,HIERARCHY_UNIQUE_NAME
ON COLUMNS
,NON EMPTY
CrossJoin
(
Hierarchize
(
{
DrillDownLevel
(
{[Date].[Calendar Week].[All Date]}
,,,INCLUDE_CALC_MEMBERS
)
}
)
,Hierarchize
(
{
DrillDownLevel
(
{[Product].[Product Name].[All Products]}
,,,INCLUDE_CALC_MEMBERS
)
}
)
)
DIMENSION PROPERTIES
PARENT_UNIQUE_NAME
,HIERARCHY_UNIQUE_NAME
,[Product].[Product Name].[Product Name].[Product Available For Sale Date]
,[Product].[Product Name].[Product Name].[Product Brand Name]
,[Product].[Product Name].[Product Name].[Product Category Description]
,[Product].[Product Name].[Product Name].[Product Category Label]
,[Product].[Product Name].[Product Name].[Product Class]
,[Product].[Product Name].[Product Name].[Product Color]
,[Product].[Product Name].[Product Name].[Product Description]
,[Product].[Product Name].[Product Name].[Product Image URL]
,[Product].[Product Name].[Product Name].[Product Label]
,[Product].[Product Name].[Product Name].[Product Manufacturer]
,[Product].[Product Name].[Product Name].[Product Size Range]
,[Product].[Product Name].[Product Name].[Product Size Unit Measure]
,[Product].[Product Name].[Product Name].[Product Status]
,[Product].[Product Name].[Product Name].[Product Stock Type]
,[Product].[Product Name].[Product Name].[Product Stop Sale Date]
,[Product].[Product Name].[Product Name].[Product Style]
,[Product].[Product Name].[Product Name].[Product Subcategory Description]
,[Product].[Product Name].[Product Name].[Product Subcategory Label]
,[Product].[Product Name].[Product Name].[Product Subcategory Name]
,[Product].[Product Name].[Product Name].[Product Unit Of Measure]
,[Product].[Product Name].[Product Name].[Product URL]
,[Product].[Product Name].[Product Name].[Product Weight Unit Measure]
,[Date].[Date].[Date].[Asia Season]
,[Date].[Date].[Date].[Calendar Month]
,[Date].[Date].[Date].[Calendar Week Day]
,[Date].[Date].[Date].[Date Description]
,[Date].[Date].[Date].[Europe Season]
,[Date].[Date].[Date].[Fiscal Month]
,[Date].[Date].[Date].[Is Work Day]
,[Date].[Date].[Date].[North America Season]
ON ROWS
FROM
[Operation]
CELL PROPERTIES
VALUE
,FORMAT_STRING
,LANGUAGE
,BACK_COLOR
,FORE_COLOR
,FONT_FLAGS;
I then modified my partition to use the unsorted data, redeployed, and reprocessed. Executing the query against the Contoso database with a cold cache returned the following execution time, which I captured with Profiler again:
The query finished in just over 56 seconds. Against a warm cache, the query finished in about 50 seconds.
I once again altered my partition to be query bound the TSQL query previously mentioned, redeployed, reprocessed, cleared the cache, and ran my query. This time my query finished executing 49 seconds!
So simply by sorting the data for loading into my partitions, I saved 28% storage space and improved my query’s performance by 11%! Not bad for about 10 minutes worth of work, huh? I conducted my tests several times and each time the results were about the same.
The Tradeoff
It’s not all sunshine and roses. There is a slight drawback that you should be aware of and it has to do with additional time spent processing. By adding the Order By clause to your queries for the partitions, the queries will probably take longer to execute. This is going to add time to processing. This means you have to decide if you can live with the additional time processing in order to gain improved compression and query performance. Depending on many factors, the additional time spent processing could be minimal… or not. But you’ll have to decide if the additional processing time is worth the improvements.
The Conclusion
The lesson to be learned here is the importance of sorting your data for loading into your partitions. The performance improvements seen by simply improving compression of your partitions by sorting your data is an easy way to improve storage of your data as well as query performance.
I’d be interested to see if any of my readers could conduct their own tests and see what kind of performance benefits they see. So if you have a few minutes of your own, try this out and then leave a comment with your results. Good luck!
Hi Dustin, I love this trick! I just did this with a client. On two partitions we saw 78% and 81% size reduction, and faster processing time. The percentages were a pleasant surprise! Does the higher compression of the details make it simpler to process aggregations?
You may see some compression increase and some processing improvements with the aggregations but probably not as much because the data is already at a less granular level. But that would be something to test. And it’s just going to depend on the data. If you try that let me know what you learn.