There’s a load of new features included in the release of SQL Server Analysis Services 2016 CTP2. I’m pretty excited about these changes, and while they’ve been public for a while now, I’d like to share my thoughts. Keep in mind that these features are part of the SSAS 2016 CTP2 release; this release does not include all the enhancements coming to SSAS 2016, and these enhancements are subject to change. You can read about the enhancements here. Continue reading What’s New in SQL Server Analysis Services 2016?
Three SSAS Best Practices to Follow
There is a lot of work that goes into performance tuning a SQL Server Analysis Services solution for a client. And even though there may be many steps involved in identifying performance and management issues with a large SSAS solution, there are a few things we can quickly check and implement to get quick wins for improving the performance of our cube. In this post, I’d like to quickly point out three best practices we can follow to improve performance and create a more positive experience for our users. These aren’t necessarily the top three best practices to follow, but they are three (among many) very important ones. I believe following these three best practices will make a difference in your solution.
Create Hierarchies with Attribute Relationships
In my opinion, creating natural hierarchies is the single most beneficial thing an SSAS developer can do to improve the performance and usability of a cube. There are several reasons correctly defined user hierarchies are beneficial, but here are a couple of the top reasons.
Increased Query Performance
Creating attribute relationships between attributes that are included in a user defined hierarchy improves the performance of queries using these attributes for a couple of different reasons. In this case, we’ll look at the Calendar hierarchy in the Adventure Works Date dimension. The Calendar hierarchy has five levels with the Calendar Year attribute at the top level and the Date attribute at the bottom level of the hierarchy. We can also see the following attribute relationships created to give SSAS an understanding of how the members of these attributes relate to one another. Once created, these relationships give SSAS the ability to understand that a given date member is related to a single month member, a given month member relates to a single quarter, and so on.
This also means that during processing, special indexes are created that map the relationships between each member of each level in the hierarchy. This means that before a query is written, SSAS also knows to which month, quarter, semester, and year each date aggregates. These indexes are only created if you correctly define the attribute relationships between the attributes in the hierarchy.
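To make the benefit concrete, here is a minimal sketch against the Adventure Works sample cube (the cube, hierarchy, and measure names are assumptions based on that sample). With the Calendar attribute relationships in place, a quarter-level query like this can take advantage of the hierarchy’s materialized indexes and of any aggregations stored at related attributes such as Month, rather than rolling up every individual date.
-- Hedged sketch: assumes the Adventure Works sample cube and its Date.Calendar user hierarchy.
SELECT
  [Measures].[Internet Sales Amount] ON COLUMNS,
  [Date].[Calendar].[Calendar Quarter].Members ON ROWS
FROM [Adventure Works];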
Improved User Experience
User defined hierarchies also improve the user’s experience with the dimension because the hierarchy provides a very natural way for the user to navigate through the data in the dimension. A user defined hierarchy does two things for a user: 1) Using the hierarchy obviously presents the data to the user in an understandable and clear way. And 2) the hierarchy organizes the data based on your business requirements. This means that the user doesn’t have to think about how these attributes should be organized or labeled because you’ve done that for them. The user can focus on reading their reports, understanding the data, and making business decisions.
Partition Measure Groups to Separate Volatile and Static Data
Measure groups that are larger than (about) one million records should be separated into multiple partitions. There are several advantages to separating larger measure groups into multiple partitions.
Increased Processing Performance
Partitions in a measure group are processed in parallel. This means that a measure group containing three years’ worth of data separated into one partition for each month will process faster than a measure group with all three years’ worth of data in a single partition. Your processing strategy for each measure group may vary depending on the amount of data. For instance, a large telecommunication company may collect hundreds of millions of records per day, dictating a more complex and granular partitioning strategy.
Also, we should consider which partitions contain data that is changing vs. data that is now static. If our business will continually log transactions in the current month, we can partition our data by month, effectively separating our volatile data from the static data. This means we only need to process one month’s worth of data in our cube to pick up the latest changes instead of having to process the whole measure group.
Increased Query Performance
By partitioning our data at the month level, certain queries will also perform better. For instance, imagine a user queries our measure group searching for last month’s sales figures. Because we have partitioned our measure group with each month being loaded into a separate partition, this means that the entire measure group does not have to be scanned. Only the partition containing the data for the month in question needs to be queried.
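As a rough illustration, here is a hedged sketch (the member key format is an assumption and will depend on how your Date dimension keys are built). A query constrained to a single month like this one only needs to touch the partition that holds that month’s data.
-- Hedged sketch: only the partition containing November 2007 data needs to be scanned.
SELECT
  [Measures].[Internet Sales Amount] ON COLUMNS
FROM [Adventure Works]
WHERE ( [Date].[Calendar].[Month].&[2007]&[11] );  -- November 2007 (key format assumed)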
Create Aggregations for Large Measure Groups
You can think of Aggregations as indexes for SSAS. Aggregations are used to help SSAS find the answer to a user’s query faster. Aggregations are typically used for measure groups that are large and take a considerable amount of time to query.
Improved Query Performance
Aggregations contain the data of a measure group at a summary level, typically higher than the lowest level of data included in the measure group. Aggregations are populated during the processing phase of the partition. You can think of an Aggregation as containing the same data as the measure group, just precalculated at a summary level. This means that when SSAS receives a query that can be answered using an Aggregation, SSAS does not have to spend the additional time required to retrieve the measure group data from the lowest level and roll the data up to the requested level, because the Aggregation design already contains the data at the desired level.
Beware Too Many Aggregations
Because Aggregations are built during the processing phase, every aggregation you define adds time to processing. This means it is especially important to build only useful aggregations that are necessary to improve performance. By creating aggregations that are not often used, you can degrade query performance and increase processing time with little to no benefit. Like indexes on a SQL Server table, too many aggregations or the wrong aggregations can actually hurt performance, so make sure you test, test, and then test again to ensure your aggregations are helping your query performance.
I hope you have found this information useful. By following these three best practices, I really believe you’ll see a generous improvement in the performance of your SSAS cubes.
If you found this information helpful, I’d love to hear from you! Please leave me a comment and let me know what you think. Do you have any ideas on something I left out or should have included? Please let me know! Thank you!
Navigating Hierarchies with MDX webinar recording is now available!
Thanks to everyone that attended my webinar on Navigating Hierarchies with MDX! We looked at a bunch of different ways we can navigate up, down, and side to side in our hierarchies in order to do some really neat things with calculations. If you would like to view the recording, you can do that here completely for free! Also, if you’d like to view my PowerPoint slide deck and scripts I used for the webinar, you can download those from here. Just download the Navigating Hierarchies with MDX .zip file. 🙂
Now on to the questions!
Q: Is there anything like a stored proc / pre-stored MDX query in SSAS which can be called from a .NET application?
A: There is a concept of SSAS stored procedures, which you can read more about here: http://msdn.microsoft.com/en-us/library/ms176113.aspx and here: http://asstoredprocedures.codeplex.com/
Q: How would ParallelPeriod handle a leap year?
A: ParallelPeriod returns the member at the same position in the specified period. So if the 29th day of February does not exist in the previous year, then no value will be returned.
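For example, here is a minimal sketch against the Adventure Works sample (the date member key format is an assumption). Asking for the parallel period one year back from February 29, 2008 looks for a 29th day under February 2007, which doesn’t exist, so the cell comes back empty.
-- Hedged sketch: Feb 29, 2008 has no matching position under February 2007,
-- so ParallelPeriod returns an empty member and the cell is null.
SELECT
  [Measures].[Internet Sales Amount] ON COLUMNS,
  ParallelPeriod(
    [Date].[Calendar].[Calendar Year],      -- step back at the year level
    1,                                      -- one year
    [Date].[Calendar].[Date].&[20080229]    -- leap day (key format assumed)
  ) ON ROWS
FROM [Adventure Works];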
Q: Can you use PeriodsToDate() on a ‘custom’ period like an Academic Term?
A: PeriodsToDate can be used on any user defined hierarchy.
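To illustrate, here is a hedged sketch that assumes a hypothetical [Date].[Academic] user hierarchy with an [Academic Term] level (these names don’t exist in Adventure Works; substitute your own hierarchy and measure).
-- Hypothetical hierarchy and level names; adjust to match your own cube.
WITH MEMBER [Measures].[Sales Term To Date] AS
  Aggregate(
    PeriodsToDate(
      [Date].[Academic].[Academic Term],    -- the "custom" period level
      [Date].[Academic].CurrentMember
    ),
    [Measures].[Internet Sales Amount]
  )
SELECT
  [Measures].[Sales Term To Date] ON COLUMNS,
  [Date].[Academic].[Date].Members ON ROWS
FROM [Adventure Works];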
Q: Setting date property for MTD, QTD seems straightforward. What about WTD (week-to-date)? It seems it might take some careful work each year.
A: If you have a Week attribute in your Date dimension, its attribute Type should be set to the appropriate Week type, as well. That’s all that is required. 🙂
Q: What is the name of the zoom tool and highlighter used? Just curious.
A: ZoomIt. It’s free, too! I get asked that question every time I present.
Thanks for all the great questions, everyone! If you have any further questions, please feel free to post them here or to send me a tweet!
The MDX Rank Function
We can create a numeric ranking of the Customers by leveraging the MDX Rank function. The Rank function Continue reading The MDX Rank Function
Creating Calculated Members with MDX
This week I got an email from a reader named Brad asking a specific question regarding MDX. Here is his question:
I need an MDX query to return me the top 5 values and then default all others to an “other” group and ranks based on the resulting measure value. For instance, if we are talking about a measure by month for a full year, I need the output to look like this:
Dim    Measure   Rank
Nov    $500      1
Mar    $400      2
Feb    $300      3
Jan    $200      4
Dec    $100      5
Other  $350      6

Any input will be most appreciated!
Thanks,
Brad
I thought this was a good question, so I decided to take the time to blog it out. In this example I’ll be using the Adventure Works cube and the Adventure Works Customer dimension.
The first step is to determine our Top 5 Customers. To do this, we can use a simple TopCount function to build a set of the Top 5 Customers.
with dynamic set [Top 5 Customers] as
Topcount(
[Customer].[Customer].[Customer].members,
5,
[Measures].[Internet Sales Amount]
)
Once we have our Top 5 Customers, we can easily determine everyone else. To create our custom “Others” group, we need to create a Calculated Member (Custom Member). To create the Calculated Member, you need to specify the dimension, attribute hierarchy, and member name for your custom member (i.e. [Customer].[Customer].[Others]).
member [Customer].[Customer].[Others] as
Aggregate(
Except(
[Customer].[Customer].[Customer].members,
[Top 5 Customers]
)
)
To create our Others custom member, we need to use the Except function to basically specify that we want our custom member to include all our customers except the Top 5 Customers set we previously created. Also, don’t forget to wrap the set in the Aggregate function.
Lastly, we create our set that will actually be used in our query. This set will include our first set containing our Top 5 Customers as well as our Others custom member. We can use the Top 5 Customers set with our custom member because all members in the set are from the same Customers hierarchy.
dynamic set [Top 5 & Others] as
{
[Top 5 Customers],
[Customer].[Customer].[Others]
}
with dynamic set [Top 5 Customers] as
Topcount(
[Customer].[Customer].[Customer].members,
5,
[Measures].[Internet Sales Amount]
)
member [Customer].[Customer].[Others] as
Aggregate(
Except(
[Customer].[Customer].[Customer].members,
[Top 5 Customers]
)
)
dynamic set [Top 5 & Others] as
{
[Top 5 Customers],
[Customer].[Customer].[Others]
}
Select [Measures].[Internet Sales Amount] on 0,
[Top 5 & Others] on 1
From [Adventure Works]
I hope that was pretty straightforward and not too confusing. If you thought this was helpful, leave me a comment and let me know! And if you have any questions or feedback, feel free to leave those in the comments as well.
Thanks again, Brad, for the great question!
Check out my next blog post to see how to return the numeric rank of the customers!
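As a quick preview, here is one possible approach sketched out (just a sketch for convenience; the follow-up post may do this differently). The Rank function can number the customers once they have been ordered by the measure.
-- One possible approach, sketched here; names assume the Adventure Works sample.
WITH SET [Ordered Customers] AS
  Order(
    [Customer].[Customer].[Customer].Members,
    [Measures].[Internet Sales Amount],
    BDESC                                    -- break the hierarchy, sort descending
  )
MEMBER [Measures].[Customer Rank] AS
  Rank( [Customer].[Customer].CurrentMember, [Ordered Customers] )
SELECT
  { [Measures].[Internet Sales Amount], [Measures].[Customer Rank] } ON COLUMNS,
  Head( [Ordered Customers], 10 ) ON ROWS
FROM [Adventure Works];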
Do You Know Why Your MDX Query Is Slow?
Performance tuning MDX queries can often be a daunting and challenging task. But the first step in deciding where to begin your efforts to improve the performance of your query is to diagnose the source of the problem. There are two areas that could be causing our performance issue: 1. The design of our SSAS solution or 2. The design of our query. We need to figure out if we’ve written a bad query or designed a bad cube :).
How Do We Test The Query?
Usually an issue is discovered when a user comes to the BI team with a report that appears to be running slowly. So for our example, I have a pivot table in an Excel workbook that is running a little slow. Let’s walk through this together and diagnose what the problem could be. Below you can see the pivot table.
The first thing you’ll need to do is test the query, and the best way to do this is to execute the query in isolation so that we can eliminate outside factors as part of the problem. So in this example I would execute the report outside of business hours. I want to make sure I do this outside of peak use time because one of the things I’ll need to do is clear the cache. To prevent SSAS from satisfying our query by leveraging the formula engine cache and the storage engine cache, which would skew our results, we need to execute a ClearCache command. To do this, I’ll execute the following XMLA script in SSMS.
<ClearCache xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>AdventureWorksDW2012MutliDimensional-EE</DatabaseID>
    <CubeID>Adventure Works</CubeID>
  </Object>
</ClearCache>
Also, I’ll execute the following MDX to initialize my cube’s MDX script.
Select {} on 0 From [Adventure Works]
For us to figure out how long our query is taking to execute, we’re going to fire up SQL Server Profiler and run a trace against SSAS while we execute the query so we can gather all the nitty gritty details of our query execution. By running a trace, we will be able to see all kinds of really helpful details like the total duration of the query, the partitions being queried, whether aggregations are being used to answer the query, which attributes are being used, and much more.
To begin our trace, go to Tools > SQL Server Profiler in SSMS.
Flip the Server type to Analysis Services and set the Server name to your SSAS instance that is home to the SSAS database your report is pointed at.
Next you’ll see the Trace Properties window. Go to the Events Selection tab, and check the check box near the bottom right of the window to Show all events. Scroll down a little more than half way and find the events “Get Data From Aggregation” and “Query Subcube Verbose.” The “Get Data From Aggregation” event is fired when an aggregation is used to satisfy a query. This event is also especially helpful when trying to determine if the aggregations you have designed are actually useful. The “Query Subcube Verbose” event will give you very detailed information on which members from which dimension attributes are being queried to satisfy the query. Click Run when you’re done.
Now that the trace is running, it’s time to conduct our test.
1. The first thing I’ll do is execute the Clear Cache command.
<ClearCache xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>AdventureWorksDW2012MutliDimensional-EE</DatabaseID>
    <CubeID>Adventure Works</CubeID>
  </Object>
</ClearCache>
2. Initialize the Calculation script in your cube.
Select {} on 0 From [Adventure Works]
3. Run the report. The report may take a few seconds or minutes to run depending on the query, but it will probably take longer than you’re used to simply because the query is running against an empty cache and SSAS will have to retrieve all of the data from storage.
When the query is finished executing, pause your trace in SQL Server Profiler by clicking the pause button at the top of the Trace window. Now it’s time to take a look at the results. Find the Query End event in the Trace results and then find the Duration column. This number displays the total query duration in milliseconds. In my example, my query took just over 13 seconds to execute. So it’s not unbelievably slow, but certainly slower than we’d like.
In order to take a more in-depth look at our query’s performance, let’s save the Trace results to a SQL Server table so we can query it. Go to File > Save As > Trace Table.
Specify where you’d like to save the results and click OK.
Below you’ll see a very useful query that will break down where your query is experiencing a slowdown. If you’re using this query for your testing, don’t forget to alter the From clause to query your trace table.
SELECT x.ConnectionID,
       x.QueryID,
       x.QueryDuration,
       p.SEDuration,
       CASE WHEN p.SEDuration > x.QueryDuration THEN NULL
            ELSE x.QueryDuration - p.SEDuration END AS FEDuration,
       y.[Number of SE Queries],
       y.[Thread Duration of SE Queries],
       w.[Aggregations Read],
       x.TextData
FROM (SELECT a.ConnectionID,
             a.Duration AS QueryDuration,
             a.TextData,
             CAST (HashBytes('SHA1', CAST (reverse(CAST (TextData AS VARCHAR (MAX))) AS NVARCHAR (4000))) AS INT) AS QueryID
      FROM MyTraceTable AS a
      WHERE a.EventClass = 10) AS x -- Query End Event
LEFT OUTER JOIN
     (/* Determine Query Subcube Verbose of Non-cache data */
      SELECT ConnectionID,
             COUNT(*) AS [Number of SE Queries],
             SUM(Duration) AS [Thread Duration of SE Queries]
      FROM MyTraceTable
      WHERE EventClass = 12 -- Query Subcube Verbose
        AND EventSubclass = 22 -- Non-cache data
      GROUP BY ConnectionID) AS y
     ON y.ConnectionID = x.ConnectionID
LEFT OUTER JOIN
     (/* Determine Aggregations that are read from */
      SELECT ConnectionID,
             COUNT(*) AS [Aggregations Read]
      FROM MyTraceTable
      WHERE EventClass = 60 -- Read from Aggregations
      GROUP BY ConnectionID) AS w
     ON w.ConnectionID = x.ConnectionID
LEFT OUTER JOIN
     (/* Determine SE time */
      SELECT ConnectionID,
             SUM(Duration) AS SEDuration
      FROM MyTraceTable
      WHERE EventClass = 11 -- Query SubCube
      GROUP BY ConnectionID) AS p
     ON p.ConnectionID = x.ConnectionID;
The results here are very telling. The column “QueryDuration” shows us the total execution time of the query. The column “SEDuration” shows us the amount of time SSAS spent pulling the data from storage (Storage Engine). The column “FEDuration” shows how long SSAS spent calculating our queries results (Formula Engine).
In this particular example, the vast majority of our query’s execution time is spent in the Formula Engine. Of the 13+ seconds spent executing the query, more than 12 seconds are spent in the Formula Engine and only 297 milliseconds are spent pulling the data from storage. This tells us that the problem is probably not with the design of our SSAS solution, but rather with a poorly written query. Unfortunately, this being an MDX query generated by Excel, there’s not a lot we can do about altering the query.
How Do We Fix The Query?
Typically when deciding where to spend your performance tuning efforts, you want to start in the area where your query spends more than 30% of its time (if it’s a 50/50 split, make an educated decision). In the previous example, we’ve determined our problem is with the query.
What can I do to improve my MDX query?
If you determine your query’s problem is the query itself, ensure SSAS is utilizing subspace computation instead of cell-by-cell computation. SSAS will usually evaluate groups of cells in your cube at a time, but in certain situations SSAS will evaluate cube space one cell at a time. We want to avoid that. You can get a hint that SSAS is calculating your results one cell at a time if the query on our trace table shows a large number of Storage Engine queries. Certain SSAS/MDX functions can disable subspace computation (see the sketch after this list):
1. Late-binding functions (e.g. the StrToMember and StrToSet functions)
2. Set aliases
3. LookupCube function
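Here is a minimal hedged sketch of the first point against the Adventure Works sample (member and measure names are assumptions). Both calculations return the same number, but the string-based version forces late binding, which can push the engine toward cell-by-cell evaluation, while the direct member reference keeps the expression friendly to block (subspace) computation.
-- Hedged sketch: same result, different evaluation characteristics.
WITH MEMBER [Measures].[CY2007 Sales Late Bound] AS
  ( StrToMember("[Date].[Calendar Year].&[2007]"),   -- string resolved at run time
    [Measures].[Internet Sales Amount] )
MEMBER [Measures].[CY2007 Sales Direct] AS
  ( [Date].[Calendar Year].&[2007],                  -- direct reference, resolved up front
    [Measures].[Internet Sales Amount] )
SELECT
  { [Measures].[CY2007 Sales Late Bound], [Measures].[CY2007 Sales Direct] } ON COLUMNS,
  [Product].[Category].Members ON ROWS
FROM [Adventure Works];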
Also, check out this blog for more info on ways to improve your MDX query.
But what if the problem is my SSAS solution’s design?
If you conduct your test and determine the majority of the query duration is spent pulling the data from storage, there’s a lot to consider when discussing cube design best practices. But here are some brief highlights of things to consider.
1. Can we design aggregations to help our query? Look at your test results and see if aggregations are being used to satisfy the query.
2. Can we implement a partition design strategy that keeps SSAS from having to query larger partitions?
3. Are the right partitions being queried? For example, if your query is asking for data for 2010 and you notice in your trace that the partitions for all the other years are being queried, this could indicate that SSAS is having a hard time figuring out which partition has the correct data. You may need to set the Slice property on your partition.
4. Create Attribute relationships
5. Leverage Natural Hierarchies
There are many more best practices for cube design, but this is probably a good starting point.
If you’ve found this helpful, share it with a colleague or a friend and leave me a comment. Feel free to leave me a question or feedback in the comments here or send me a note on Twitter! I love discussing new ideas and learning so don’t hesitate!
SSAS Lessons Learned: 29% Better Compression and 11% Better Query Performance
The Importance of Sort Order
This past week I taught the SSAS Masters class, which is one of the virtual training classes offered by Pragmatic Works. One of the things we discuss in the class is the importance of sorting the fact data in your data warehouse in preparation for Analysis Services. Simply by sorting your fact data, you can see much improved compression, which can also improve your query response time. But how much improvement in compression and query response could you see? Well, that’s what I set out to discover by running a couple of little tests.
Better Compression? Yes, please.
I started with my beloved Fact Sales measure group in the Contoso Retail Operations cube. The Fact Sales measure group utilizes a named query in the DSV that is a basic select statement from the Fact Sales table in the Contoso Retail database. I checked the size of the single partition that made up the measure group and saw that it was just over 129 MB in size. Not big, but I thought we could improve that.
So I set out to sort my data. The trick to sorting your data is to start by selecting your top three fields with the lowest cardinality (the fewest unique values). Try experimenting with different sort orders to see what kind of results you can get. For the FactSales table, I started with PromotionKey, CurrencyKey, and ChannelKey then went from there. I simply set my partition to Query Bound and utilized the following query:
SELECT TOP 2147483647 CONVERT (INT, CONVERT (CHAR (8), DateKey, 112)) AS DateKey,
channelKey,
StoreKey,
ProductKey,
PromotionKey,
CurrencyKey,
UnitCost,
UnitPrice,
SalesQuantity,
ReturnQuantity,
ReturnAmount,
DiscountQuantity,
DiscountAmount,
TotalCost,
SalesAmount,
ETLLoadID,
LoadDate,
UpdateDate
FROM dbo.FactSales
ORDER BY PromotionKey, CurrencyKey, ChannelKey, StoreKey, ProductKey, DateKey;
Arguably there are better ways to sort the data for SSAS but that’s not the point of this blog post so I’ll leave that for you to decide.
I did a quick redeployment of the cube and processed the Fact Sales measure group.
The partition size dropped down to 93.11 MB! That’s a whopping 28% decrease in size! Awesome!
28% is a pretty big storage savings, especially when we could potentially be dealing with a lot more data in an enterprise scenario. Personally, I’ve seen storage savings up to 45% simply by sorting the data in the relational engine.
Better Query Performance? Sign me up!
With smaller fact .data files, we should see better query performance, right? I mean, theoretically it makes sense, but I was curious about how much better query performance we would see. So I set out with another little experiment.
First, I used Excel to generate a nasty little MDX query for my testing, which I captured with profiler:
SELECT
{
[Measures].[Sales Amount]
,[Measures].[Sales Quantity]
,[Measures].[Sales Unit Cost]
}
DIMENSION PROPERTIES
PARENT_UNIQUE_NAME
,HIERARCHY_UNIQUE_NAME
ON COLUMNS
,NON EMPTY
CrossJoin
(
Hierarchize
(
{
DrillDownLevel
(
{[Date].[Calendar Week].[All Date]}
,,,INCLUDE_CALC_MEMBERS
)
}
)
,Hierarchize
(
{
DrillDownLevel
(
{[Product].[Product Name].[All Products]}
,,,INCLUDE_CALC_MEMBERS
)
}
)
)
DIMENSION PROPERTIES
PARENT_UNIQUE_NAME
,HIERARCHY_UNIQUE_NAME
,[Product].[Product Name].[Product Name].[Product Available For Sale Date]
,[Product].[Product Name].[Product Name].[Product Brand Name]
,[Product].[Product Name].[Product Name].[Product Category Description]
,[Product].[Product Name].[Product Name].[Product Category Label]
,[Product].[Product Name].[Product Name].[Product Class]
,[Product].[Product Name].[Product Name].[Product Color]
,[Product].[Product Name].[Product Name].[Product Description]
,[Product].[Product Name].[Product Name].[Product Image URL]
,[Product].[Product Name].[Product Name].[Product Label]
,[Product].[Product Name].[Product Name].[Product Manufacturer]
,[Product].[Product Name].[Product Name].[Product Size Range]
,[Product].[Product Name].[Product Name].[Product Size Unit Measure]
,[Product].[Product Name].[Product Name].[Product Status]
,[Product].[Product Name].[Product Name].[Product Stock Type]
,[Product].[Product Name].[Product Name].[Product Stop Sale Date]
,[Product].[Product Name].[Product Name].[Product Style]
,[Product].[Product Name].[Product Name].[Product Subcategory Description]
,[Product].[Product Name].[Product Name].[Product Subcategory Label]
,[Product].[Product Name].[Product Name].[Product Subcategory Name]
,[Product].[Product Name].[Product Name].[Product Unit Of Measure]
,[Product].[Product Name].[Product Name].[Product URL]
,[Product].[Product Name].[Product Name].[Product Weight Unit Measure]
,[Date].[Date].[Date].[Asia Season]
,[Date].[Date].[Date].[Calendar Month]
,[Date].[Date].[Date].[Calendar Week Day]
,[Date].[Date].[Date].[Date Description]
,[Date].[Date].[Date].[Europe Season]
,[Date].[Date].[Date].[Fiscal Month]
,[Date].[Date].[Date].[Is Work Day]
,[Date].[Date].[Date].[North America Season]
ON ROWS
FROM
[Operation]
CELL PROPERTIES
VALUE
,FORMAT_STRING
,LANGUAGE
,BACK_COLOR
,FORE_COLOR
,FONT_FLAGS;
I then modified my partition to use the unsorted data, redeployed, and reprocessed. Executing the query against the Contoso database with a cold cache returned the following execution time, which I captured with Profiler again:
The query finished in just over 56 seconds. Against a warm cache, the query finished in about 50 seconds.
I once again altered my partition to be query bound to the TSQL query previously mentioned, redeployed, reprocessed, cleared the cache, and ran my query. This time my query finished executing in 49 seconds!
So simply by sorting the data for loading into my partitions, I saved 28% storage space and improved my query’s performance by 11%! Not bad for about 10 minutes’ worth of work, huh? I conducted my tests several times and each time the results were about the same.
The Tradeoff
It’s not all sunshine and roses. There is a slight drawback that you should be aware of and it has to do with additional time spent processing. By adding the Order By clause to your queries for the partitions, the queries will probably take longer to execute. This is going to add time to processing. This means you have to decide if you can live with the additional time processing in order to gain improved compression and query performance. Depending on many factors, the additional time spent processing could be minimal… or not. But you’ll have to decide if the additional processing time is worth the improvements.
The Conclusion
The lesson to be learned here is the importance of sorting your data before loading it into your partitions. Improving the compression of your partitions simply by sorting the data is an easy way to reduce the storage footprint of your data as well as improve query performance.
I’d be interested to see if any of my readers could conduct their own tests and see what kind of performance benefits they see. So if you have a few minutes of your own, try this out and then leave a comment with your results. Good luck!
The Whaty-What of Data Mining Algorithms in SSAS
SQL Server Analysis Services features nine different data mining algorithms that look for specific types of patterns and trends in order to make predictions about your data. This is a potentially very powerful tool, and since I’ve been learning more about data mining recently, I figured I’d put together a little bit of the information and research I’ve done on these algorithms for my own reference as well as for the benefit of others.
Below you will see a list of the Data Mining Algorithms included in SSAS 2008 to 2012. I’ve included the type of the algorithm, what it does, and an example or two of when one might decide it’s an appropriate algorithm for your data and requirements. I’ve also included links to TechNet for more information.
Decision Tree
Type: Classification
What: This is probably the most popular data mining algorithm simply because the results are very easy to understand. Decision Tree tries to predict the value of a column or columns based on the relationships between the columns you have identified. Decision Tree also determines which input columns are most highly correlated to your prediction column(s).
When: Use the Decision Tree algorithm when you want to try to predict if a customer will buy your product or what characteristics may make a person a potentially good customer. For example, if you want to determine which customers you should send coupons to, use Decision Tree to determine if a person has the qualities and characteristics of past customers.
Linear Regression
Type: Regression
What: This algorithm is used to predict continuous variables using continuous input variables. Linear Regression is a simpler version of Decision Tree without the splits in the tree.
When: Use the Linear Regression algorithm when you want to compute a trend for your sales data.
Naïve Bayes
Type: Classification
What: The Naïve Bayes algorithm is based on Bayes’ theorem and is good for performing predictive analytics. This algorithm calculates a probability based on input columns you define.
When: An example of when you might use the Naïve Bayes algorithm would be when you want to predict how likely a customer is to respond to an email blast or how likely a patient is to get sick or contract a disease based on their demographic information.
Neural Network
Type: Classification/Regression
What: The Neural Network algorithm may be one of the least used data mining algorithms because it is the most difficult to interpret. Basically Neural Network combines each possible input attribute value with each predictable attribute value to determine probabilities. These probabilities can be used for classification or regression, but you may have a difficult time determining how the algorithm reached a particular conclusion.
When: Use this when you want to analyze the relationships between many complex inputs to determine a certain output, such as predicting stock movements or currency fluctuations.
Logistic Regression
Type: Regression
What: The Logistic Regression algorithm is a statistical method for determining the contribution of specified inputs to a particular set of outcomes. This algorithm is similar to Neural Network in the way that it models the relationships between the various inputs.
When: An example of when you might use this algorithm would be to predict what characteristics make a customer a repeat customer or if a convicted criminal is likely to become a repeat offender.
Clustering
Type: Segmentation
What: The Clustering algorithm is probably second only to Decision Tree among the most frequently used data mining algorithms, simply because, like Decision Tree, it is very easy to understand. The Clustering algorithm uses the input columns to group cases in a data set into clusters of cases with similar characteristics.
When: This algorithm is great for detecting fraud or anomalies in your data because it is very easy to see which data does not fit into a cluster.
Sequence Clustering
Type: Sequence Analysis
What: Sequence Clustering is similar to Clustering, except that instead of building clusters based on the similarity of characteristics, the clusters are based on sequences of events. The algorithm groups cases whose sequences of events are similar.
When: An example of when you might use this algorithm would be when you want to determine which sequence of events are likely to lead to hardware failure in your environment.
Association Rules
Type: Association
What: The Association Rules algorithm defines combinations of items in a set by scanning the data in the specified inputs and identifying frequent combinations of items.
When: Use this algorithm when you want to identify opportunities for cross selling. For example, the Association Rules may determine that if a customer purchases milk there may be an opportunity to market cookies to the customer.
Time Series
Type: Regression
What: The Time Series algorithm is useful in predicting or forecasting continuous variables over time, such as sales metrics. Time Series does not require additional input data to make predictions about the future. Time Series makes predictions a specified number of units into the future (for example, 3 or 5 periods).
When: Use the Time Series algorithm when you want to predict sales for the next 3 or 5 months.
Gotchas With Referenced Dimensions
A referenced dimension relationship allows us to create a relationship between a measure group and a dimension that is indirectly related to the measure group using an intermediate dimension. This is great, because we can now create relationships between our facts and dimensions that may not be directly related. Unfortunately, there are a couple of gotchas that you may not be aware of, which I read about here and here. I’m just offering my take on these issues, so thanks to Alberto and Jesse for the great content.
You can see several examples of referenced relationship types in the Adventure Works cube. One of those examples is the relationship between the Sales Target measure group and the Sales Territory dimension.
If you’ve ever created a referenced dimension relationship, you know that the Materialize option is checked by default. So one might assume this is the best option to go with.
To Materialize?
But before we can decide if the option to materialize the relationship is best, we must first understand what the materialize option really does. When we check the materialize option, the link between the referenced dimension and the facts is materialized. During processing of the measure group, a join is made to the dimension to look up the referenced dimension’s key value. This leads to slower cube processing but better cube query performance.
There is, however, an important thing to realize: The materialized relationship link between the facts and the correct attribute members is created during the processing of the partition, not the dimension! Why is this important? Imagine we need to update the relationships between the facts and the referenced dimension. We can no longer simply process the dimension. The link to the older referenced dimension attribute member still exists in the partition! The partition must be processed! This could be a huge issue if your cube is very large and you’re only planning to process the most recent partition. If the relationships changed in the referenced dimension, you could find yourself forced to process more than just the most recent partition in order to ensure the relationships between the facts and the referenced dimension are correct.
To Not Materialize?
So what’s the other option? We could leave the option to materialize the relationship unchecked. The downside to this is that query performance will suffer, since the member will need to be looked up on the fly. The upside is that processing will be faster, and we no longer need to process the partition to update the relationship since it’s looked up at query time.
But there is also a gotcha with using un-materialized referenced dimension relationships. Aggregations cannot be created for a referenced dimension when the relationship is not materialized. If you run the Aggregation wizard, you’ll notice that any referenced dimensions with un-materialized relationships will be left out. This is by design. An aggregation should not be created (which is possible to do manually) for an un-materialized referenced dimension relationship because the relationship is resolved at query time.
What About the TSQL Query?
But what happens with the TSQL query during processing of the measure group? Here’s an example in the Adventure Works cube that we can look at:
This referenced relationship between the Reseller Sales measure group and the Geography dimension is materialized. When the Reseller Sales measure group is processed, a join is made to the intermediate dimension to look up the Geography Key, which you can see here:
This join allows SSAS to add the referenced key to the measure group providing better MDX query performance for the users. But if we’re dealing with a significant amount of data in the fact table and a large intermediate dimension, the processing performance impact could be very significant.
But what if I de-materialize the relationship? The good news is that the TSQL query used to load the measure group will perform better because we’re no longer making a join to the intermediate dimension:
The bad news is that during executions of MDX queries, SSAS will need to look up the key value for the referenced dimension leading to slower MDX query performance.
The Conclusion
My personal recommendation would be to add the referenced key to the fact table so that you can eliminate the dilemma altogether. I’ve personally seen referenced relationships in a multi-TB cube destroy processing performance. The resolution for this particular client’s issue was to add the referenced key to the fact tables and create regular relationships between the measure groups and the referenced dimensions. After implementing the fix, processing for the partitions was very fast.
If for some reason you can’t add the referenced key to the fact table and you must decide between materializing the relationship or not, of course the answer is “It depends”. If your cube is extremely large featuring an incremental processing strategy, I would recommend to not materialize the relationship. This will prevent you from having to process the entire cube should something in the referenced dimension change. You’ll just have to live with the somewhat slower MDX query performance.
If, however, your cube is relatively small or your data warehouse is completely reloaded each night, I would probably recommend materializing the relationships. The increased time to process won’t be that big of a deal since the cube is small and your query times will benefit because of the materialized relationship. You’ll also be able to create aggregations since the relationship is materialized. Of course, the “it depends” applies differently to every situation.
Feedback?
I hope you found this useful! If you did, please share it!
Also, if you have any questions or feedback, please leave a comment below! Thanks for reading!
Create Date Dimension Table Script
On a recent project I created a Date dimension script that I thought I’d share (mostly so I would have access to it from anywhere). I based this script on one written by Mike Davis, but I included a few changes such as a first and last day of month flag, an open flag, relative time period attributes, and a few others. Enjoy! Continue reading Create Date Dimension Table Script