Category Archives: Data Mining

The Whaty-What of Data Mining Algorithms in SSAS

SQL Server Analysis Services features nine different data mining algorithms that looks for specific types of patterns in trends in order to make predictions about your data. This is a potentially very powerful tool and since I’ve been learning more about data mining recently I figured I’d put together a little bit of information and research I’ve done on these algorithms for my own reference as well as for the benefit of others. 

Below you will see a list of the Data Mining Algorithms included in SSAS 2008 to 2012. I’ve included the type of the algorithm, what it does, and an example or two of when one might decide its an appropriate algorithm for your data and requirements. I’ve also included links to TechNet for more information.

Decision Tree
Type: Classification

What: This is probably the most popular data mining algorithm simply because the results are very easy to understand. Decision Tree tries to predict the value of a column or columns based on the relationships between the columns you have identified. Decision Tree also determines which input columns are most highly correlated to your prediction column(s).
When: Use the Decision Tree algorithm when you want to try to predict if a customer will buy your product or what characteristics may make a person a potentially good customer. For example, if you want to determine which customers you should send coupons, use Decision Tree to determine if a person has the qualities and characteristics of past customers.

Linear Regression
Type: Regression

What: This algorithm is used to predict continuous variables using continuous input variables. Linear Regression is a simpler version of Decision Tree without the splits in the tree.
When: Use the Linear Regression algorithm when you want to compute a trend for your sales data.

Naïve Bayes
Type: Classification

What: The Naïve Bayes algorithm is based on Bayes’ theorems and good for performing predictive analytics. This algorithm calculates a probability based on input columns you define.
When: An example of when you might use the Naïve Bayes algorithm would be when you want to predict how likely a customer is likely to respond to an email blast of how likely a patient is to get sick or contract a disease based on their demographic information.

Neural Network
Type: Classification/Regression

What: The Neural Network algorithm may be one of the least used data mining algorithms because it is the most difficult to interpret. Basically Neural Network combines each possible input attribute value with each predictable attribute value to determine probabilities. These probabilities can be used for classification or regression, but you may have a difficult time determining how the algorithm reached a particular conclusion.
When: Use this when you want to analyze the relationships between many complex inputs to determine a certain output, such as predicting stock movements or currency fluctuations.

Logistic Regression
Type: Regression

What: The Logistic Regression algorithm is a statistical method for determining the contribution of specified inputs to a particular set of outcomes. This algorithm is similar to Neural Network in the way that it models the relationships between the various inputs.
When: An example of when you might use this algorithm would be to predict what characteristics make a customer a repeat customer or if a convicted criminal is likely to become a repeat offender.

Clustering
Type: Segmentation

What: The Clustering algorithm is probably very close to Decision Tree as far as data mining algorithms that are used most frequently simply because, like Decision Tree, it is also very easy to understand. The Clustering algorithm groups cases in a data set using the input columns into groups or clusters of cases with similar characteristics.
When: This algorithm is great for detecting fraud or anomalies in your data because it is very easy to see which data does not fit into a cluster.

Sequence Clustering
Type: Sequence Analysis

What: Sequence Clustering is similar to clustering except instead of looking for clusters based on the similarity of characteristics, the clusters are based on a model. The algorithm groups sequences of events that are identical.
When: An example of when you might use this algorithm would be when you want to determine which sequence of events are likely to lead to hardware failure in your environment.

Association Rules
Type: Association

What: The Association Rules algorithm defines combinations of items in a set by scanning the data in the specified inputs and identifying frequent combinations of items.
When: Use this algorithm when you want to identify opportunities for cross selling. For example, the Association Rules may determine that if a customer purchases milk there may be an opportunity to market cookies to the customer.

Time Series
Type: Regression

What: The Time Series algorithm is useful in predicting or forecasting continuous variables over time, such as sales metrics. Time Series does not require additional data to make predictions about the future. Time Series makes predictions 3 or 5 units into the future.
When: Use the Time Series algorithm when you want to predict sales for the next 3 or 5 months.

SQL Rally 2012 Recap In Yo’ Face!

This past Thursday and Friday I attended my first PASS SQL Rally event in Dallas, Texas. The week was full of some pretty amazing sessions presented by some even more amazing speakers. If you were a DBA looking for performance tuning tips or other ways to perfect your profession, there were tons of great sessions for you. If you were a developer looking some great development tips and tricks to make your life easier, there were some really informative sessions given. And if you were a business intelligence developer, like myself, hoping to take your skills to the next level, there was definitely some amazing material to take advantage of.

I focused heavily on the business intelligence space and spent most of my time attending those sessions. I learned some great tips, increased my knowledge in several areas, and also networked a little with some really bright and great all-around people.

So What Sessions Did I Attend And What Did I Learn?

Performance Tuning SSAS Processing with John Welch

The first sessions I attended was Performance Tuning SSAS Processing given by John Welch (blog | twitter). There’s no doubt about, John knows his stuff. I learned a ton from his session. Here’s a few of the tips I picked up from John’s great sessions:

    • Instead of bringing entire tables into your DSV, use views. Views will allow you to tweak the SQL and use query hints to improve the performance of the Select statement.
    • When Analysis Services processes data, it does so one buffer at a time. If AS detects duplicate rows in a buffer, it will aggregate the data. Because AS aggregates duplicate records, you can save on space and increase performance if you order the data in your Select statement in your partitions.
    • Tune SSAS processing in 3 steps:
      • Dimension Processing
      • Partition ProcessData
      • Partition ProcessIndexes
    • Set the AttributeHierarchyOptimize property to false for attributes that are not used in queries often or if the attribute has a high cardinality and a near 1:1 relationship with the key attribute.

The Report Part Library with Jessica Moss

The next session I attended was The Report Part Library with Jessica Moss (blog | twitter). Jessica is one lady who is an amazing speaker and an SSRS guru without a doubt. Jessica’s wonderful sense of humor and bubbly personality made the session enjoyable and exciting. We covered how to create report parts, share those report parts across multiple reports, and then update those same report parts. Jessica also covered how to organize and manage a large Report Part Library. It was really great to finally meet Jessica in person.

Stop! Consolidate and Listen! with Jorge Segarra

After lunch, I attend Jorge Segarra’s (blog | twitter) session called, “Stop! Consolidate and Listen!”. Jorge covered some of the different methods of consolidation and dug into virtualization. Between Jorge’s sad rap skills and Patrick Leblanc’s even weaker beat boxing, I managed to pick up a few pointers.

Some of the Pro’s to virtualization:

    • Great isolation!
    • Flexibility
      • Snapshotting virtual machines (so if something goes horribly wrong, you can just roll back to the last snapshot)
      • Creating and dropping virtual machines is easy as pie
      • VM’s can be migrated across servers without having to be shut down!

Some of the Gotcha’s:

    • Behind the scenes resources could actually be shared, so make sure you understand the systems that your VM’s are running on.

Also, check out the free MAP toolkit from Microsoft. Jorge gave a great intro into the tool and how to use it to determine which machines are prime candidates for consolidation and virtualization.

Data Mining with Devin Knight & Adam Jorgensen

The last session I attended was Data Mining with Devin Knight (blog | twitter) and Adam Jorgensen (blog | twitter). Being far less than a master of data mining, I really looked forward to this session.

Data mining does three things:

  • Explores the data
  • Identifies patterns in the data
  • Performs predictions based on those identified patterns

There are several algorithms available for us to use:

  • Decision tree
    • Allows us to see how someone could come to a decision?
  • Clustering
    • Measures the distances between the cluster of points on a graph and an outlier
    • This method is often used for fraud detection and data validation
  • Time Series
    • Predict 3 units (days, months, years, etc) into the future
    • Based on historical values
  • Sequence Clustering
    • Used to determine somebodies next step
    • Where will a person click next on a web site?
  • Association Rules
    • Market basket analysis
    • Amazon uses this: People who bought this also bought this.
  • Naïve Bayes
    • A data mining algorithm that is literally hundreds of years old
    • A classification algorithm that could be used to determine the differences between customer who buy a product and customers who do not buy a product.
  • Neural Networks
    • Typically used with FICO scores
    • Takes multiple factors into account but can’t really point to one or two things to tell you how it came to a decision

Zero To Cube: Fast Track To Analytics with Adam Jorgensen & Dustin Ryan

I also had the wonderful pleasure of presenting on SSAS with Adam Jorgensen (blog | twitter). Our session was titled, “Zero To Cube: Fast Track To Analytics”. We had volunteers (some were volunteered by others) drive through creating a cube from the ground up. Within an hour and 15 minutes, we created a connection to our data source, created the data source view, made some enhancements to the data source view, created our cube, built a couple hierarchies and fleshed out the dimensions, and even built a report in Excel with a pivot table and a chart. We had a great, very participative audience and we had a blast! Thank you so much to all those that came to our session!

All in all, this year’s SQL Rally was an amazing event. I learned a lot of new things, made some great friends, and had a great time. If you weren’t able to make the event, you missed out big time. Definitely put next years SQL Rally on your calendar and make attending that event a priority. You won’t be sorry.