Data Mining
Upon successful completion of this unit the student should be able to:
Have a general overview of what data mining is
Know the foundations of data mining
Know how data mining works
Have an understanding of data mining architecture
Sub-topics
An overview of data mining techniques
Classification and Bayes rule
Discriminant Analysis
Logistic Regression
Neural Nets
Multiple regression
Cluster Analysis
Association Rules
At the end of this unit the student should be able to:
Have an understanding of the existing data mining techniques
Comfortably use the data mining techniques for various activities
Sub-topics
Illustrating data mining using customer relations
Illustrating data mining using customer acquisition
Illustrating data mining using campaign optimization
At the end of this unit the student should be able to:
Illustrate data mining using customer relations
Illustrate data mining using customer acquisition
Illustrate data mining using campaign optimization
Sub-topics
Trusting the model
Understanding the model
Comparing different models using visualization
At the end of this unit the student should be able to:
Have an understanding of why visualization is important
Know how to assess trust in a data mining model
Understand data mining models
Know how to compare different models using visualization
Data mining overview
Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. In this topic we will explore what data mining is, how it works, and the architecture of a data mining system.
Resources
Data mining techniques
An overview of data mining techniques
This overview provides a description of some of the most common data mining algorithms in use today.
Resource
http://www.thearling.com/text/dmtechniques/dmtechniques.htm
Classification and Bayes Rule
In this topic we will examine how to judge the usefulness of a classifier and how to compare different classifiers. Not only do we have a wide choice of classifier types, but within each type we also have many options: how many nearest neighbors to use in a k-nearest neighbors classifier, the minimum number of cases to require in a leaf node of a tree classifier, which subsets of predictors to use in a logistic regression model, and how many hidden-layer neurons to use in a neural net.
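One common way to compare such options is to measure the misclassification rate on cases held back from training. The sketch below is only an illustration under assumed conditions: the toy data, the function names and the candidate values of k are all hypothetical and are not taken from the course material.

```python
import math
from collections import Counter

def knn_predict(train, point, k):
    """Classify `point` by majority vote among its k nearest training cases."""
    nearest = sorted(train, key=lambda case: math.dist(case[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def error_rate(train, validation, k):
    """Fraction of validation cases that the k-NN classifier gets wrong."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in validation)
    return wrong / len(validation)

# Hypothetical toy data: (features, class label). The last training case is
# a deliberately "noisy" B sitting inside the A group.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.8, 1.1), "A"),
    ((3.0, 3.0), "B"), ((3.2, 2.9), "B"), ((2.9, 3.3), "B"),
    ((1.1, 1.0), "B"),
]
validation = [((1.1, 0.95), "A"), ((3.1, 3.1), "B"), ((0.9, 1.0), "A")]

# Compare several choices of k on the same held-out cases.
errors = {k: error_rate(train, validation, k) for k in (1, 3, 5, 7)}
for k, err in errors.items():
    print(f"k={k}: validation error {err:.2f}")
```

On this toy data a very small k is misled by the noisy case, while a k as large as the whole training set always predicts the majority class; intermediate values do better. Real comparisons would use far more data and, typically, a separate test set for the final estimate.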
Resource
Discriminant Analysis
Discriminant analysis uses continuous variable measurements on different groups of items to highlight the aspects that distinguish the groups, and then uses these measurements to classify new items. Common uses of the method include: biological classification into species and sub-species; classifying applications for loans, credit cards and insurance into low-risk and high-risk categories; classifying customers of new products into early adopters, early majority, late majority and laggards; classifying bonds into bond rating categories; research studies involving disputed authorship; college admissions; medical studies involving alcoholics and non-alcoholics; anthropological studies such as classifying skulls of human fossils; and methods to identify human fingerprints.
Resource
Classification Trees
If one had to choose a classification technique that performs well across a wide range of situations without requiring much effort from the application developer, while being readily understandable by the end user, a strong contender would be the tree methodology developed by Breiman, Friedman, Olshen and Stone (1984). We will discuss this classification procedure first; in later sections we will show how the procedure can be extended to prediction of a continuous dependent variable. The program that Breiman et al. created to implement these procedures was called CART, for Classification And Regression Trees.
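The core step in growing a classification tree is choosing the split that makes the resulting groups as pure as possible. The sketch below illustrates that step for a single numeric predictor, using the Gini impurity as the purity measure. The function names and the toy data are hypothetical, and this is a simplified illustration rather than the CART program itself.

```python
def gini(labels):
    """Gini impurity: 0 for a pure group, larger for a more mixed group."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best split `x <= threshold`."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = None
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: income (in thousands) and whether the customer responded.
income = [20, 25, 30, 60, 70, 80]
responded = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_split(income, responded)
print(f"split at income <= {threshold}, weighted Gini impurity {impurity}")
```

A full tree procedure would repeat this search over every predictor, apply the best split, and recurse on each resulting group until a stopping rule (such as a minimum leaf size) is met.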
Resource
Logistic Regression
Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values as 0 and 1). As with multiple linear regression, the independent variables x1, x2, ..., xk may be categorical or continuous variables, or a mixture of these two types.
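For reference, the standard form of the logistic regression model can be written as below. The coefficient symbols \beta_0, ..., \beta_k are notation assumed here for illustration; they do not appear in the text above.

```latex
\[
P(y = 1 \mid x_1, \dots, x_k)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}},
\qquad
\log\frac{P(y = 1 \mid x_1, \dots, x_k)}{1 - P(y = 1 \mid x_1, \dots, x_k)}
  = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k .
\]
```

The second expression shows the link to multiple linear regression: the log-odds of y = 1, rather than y itself, is modeled as a linear function of the independent variables.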
Resources
Neural Nets
In this note we provide an overview of the key concepts that have led to the emergence of Artificial Neural Networks as a major paradigm for Data Mining applications. Neural nets have gone through two major development periods: the early 1960s and the mid-1980s. They were a key development in the field of machine learning. Artificial Neural Networks were inspired by biological findings relating to the behavior of the brain as a network of units called neurons.
Resource
Assignment: Problem one
http://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/assignments/
Multiple Regression
Perhaps the most popular mathematical model for making predictions is the multiple linear regression model. In this topic we will build on a basic knowledge of this model to examine its use in data mining applications.
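As a reminder, the multiple linear regression model and its usual least-squares estimate can be written as follows. The notation is assumed here for illustration: X is the matrix of observed predictor values with a leading column of ones, y on the right-hand side is the vector of observed responses, and X^T X is assumed to be invertible.

```latex
\[
y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon,
\qquad
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y .
\]
```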
Resources
Assignment: Problem 2
http://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/assignments/
Cluster analysis
Cluster analysis is concerned with forming groups of similar objects based on several measurements of different kinds made on the objects. The key idea is to identify classifications of the objects that would be useful for the aims of the analysis. This idea has been applied in many areas including astronomy, archeology, medicine, chemistry, education, psychology, linguistics and sociology. For example, the biological sciences have made extensive use of classes and sub-classes to organize species.
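One widely used way to form such groups is k-means clustering, which repeatedly assigns each object to its nearest group center and then moves each center to the mean of its group. The text above does not name a specific method, so the choice of k-means, the function name, the starting centers and the data below are all assumptions made for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means with Euclidean distance and caller-supplied starting centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical objects, each described by two measurements.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])

for center, members in zip(centroids, clusters):
    print(center, members)
```

In practice the measurements are usually standardized first so that no single variable dominates the distance, and the number of clusters is itself a modeling choice.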
Resource
Association Rules
The availability of detailed information on customer transactions has led to the development of techniques that automatically look for associations between items that are stored in the database. An example is data collected using bar-code scanners in supermarkets. Such ‘market basket’ databases consist of a large number of transaction records. Each record lists all items bought by a customer on a single purchase transaction. Managers would be interested to know if certain groups of items are consistently purchased together.
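Two standard measures of whether items are "consistently purchased together" are support (how often the items appear together) and confidence (how often the second item appears given the first). The sketch below computes both on a handful of made-up transactions; the data and function names are hypothetical.

```python
# Hypothetical market-basket data: one set of items per purchase transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing `antecedent`, the fraction that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Rule under examination: "customers who buy diapers also buy beer".
rule_support = support({"diapers", "beer"}, transactions)          # 3 of 5 baskets
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)  # 3 of the 4 diaper baskets

print(f"support {rule_support:.2f}, confidence {rule_confidence:.2f}")
```

Association rule algorithms search for all rules whose support and confidence exceed thresholds chosen by the analyst, rather than checking one rule at a time as done here.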
Resource
Illustrating data mining
Illustrating data mining using customer relations
Most marketers understand the value of collecting customer data, but also realize the challenges of leveraging this knowledge to create intelligent, proactive pathways back to the customer. Data mining - technologies and techniques for recognizing and tracking patterns within data - helps businesses sift through layers of seemingly unrelated data for meaningful relationships, where they can anticipate, rather than simply react to, customer needs. In this accessible introduction, we provide a business and technological overview of data mining and outline how, along with sound business processes and complementary technologies, data mining can reinforce and redefine customer relationships.
Resource
http://www.thearling.com/text/whexcerpt/whexcerpt.htm
Illustrating data mining using customer acquisition
For most businesses, the primary means of growth involves the acquisition of new customers. This could involve finding customers who previously were not aware of your product, were not candidates for purchasing your product (for example, baby diapers for new parents), or customers who in the past have bought from your competitors. Some of these customers might have been your customers previously, which could be an advantage (more data might be available about them) or a disadvantage (they might have switched as a result of poor service). In any case, data mining can often help segment these prospective customers and increase the response rates that an acquisition marketing campaign can achieve.
Resource
http://www.thearling.com/text/chapter10/chapter10.htm
Illustrating data mining using campaign optimization
In most marketing organizations, there is a wide variety of ways to interact with customers and prospects. Besides the many possible offers that can be made, there are now multiple communication channels (direct mail, telemarketing, email, the web) that can be used. The process of marketing campaign optimization takes a set of offers and a set of customers, along with the characteristics and constraints of the campaign, and determines which offers should go to which customers over which channels at what time.
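To make the idea concrete, the sketch below assigns at most one offer to each customer while respecting a contact limit per channel. It uses a simple greedy heuristic, which is not guaranteed to find the best overall assignment and ignores timing; the expected values, capacity figures and names are all hypothetical.

```python
def assign_offers(scores, capacity):
    """Greedy assignment of one (offer, channel) per customer.

    scores:   {(customer, offer, channel): expected value of making that contact}
    capacity: {channel: maximum number of contacts allowed on that channel}
    """
    remaining = dict(capacity)
    assigned = {}
    # Consider the most valuable contacts first.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (customer, offer, channel), value in ranked:
        if customer in assigned or remaining.get(channel, 0) == 0:
            continue  # customer already has an offer, or the channel is full
        assigned[customer] = (offer, channel, value)
        remaining[channel] -= 1
    return assigned

# Hypothetical expected values, e.g. produced by a response model.
scores = {
    ("ann", "card", "email"): 5.0,
    ("ann", "loan", "mail"): 8.0,
    ("bob", "loan", "mail"): 9.0,
    ("bob", "card", "email"): 4.0,
}
capacity = {"mail": 1, "email": 2}

plan = assign_offers(scores, capacity)
for customer, (offer, channel, value) in plan.items():
    print(customer, offer, channel, value)
```

Production campaign optimizers typically formulate this as a constrained optimization problem over all customers, offers, channels and time periods at once.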
Resource
http://www.thearling.com/text/optimization/optimization.htm
Data Mining and Visualization
Introduction
The point of data visualization is to let the user understand what is going on. Since data mining usually involves extracting "hidden" information from a database, this understanding process can get somewhat complicated. In most standard database operations nearly everything the user sees is something that they knew existed in the database already. A report showing the breakdown of sales by product and region is straightforward for the user to understand because they intuitively know that this kind of information already exists in the database. If the company sells different products in different regions of the country, there is no problem translating a display of this information into a relevant understanding of the business process.
Trusting the Model
Attributing the appropriate amount of trust to data mining models is essential to using them wisely. Good quantitative measures of "trust" must ultimately reflect the probability that the model’s predictions would match future test targets. However, due to the exploratory and large-scale nature of most data-mining tasks, fully articulating all of the probabilistic factors needed to do so would seem to be generally intractable. Thus, instead of trying to boil "trust" down to one probabilistic quantity, it is typically most useful to visualize along many dimensions some of the key factors that contribute to trust (and distrust) in one’s models. Furthermore, since, as with any scientific model, one ultimately can only disprove a model, visualizing the limitations of the model is of prime importance. Indeed, one might best view the overall goal of "visualizing trust" as that of understanding the limitations of the model, as opposed to understanding the model itself.
Understanding the Model
A model that can be understood is a model that can be trusted. While statistical methods build some trust in a model by assessing its accuracy, they cannot assess the model’s semantic validity — its applicability to the real world.
Comparing Different Models using Visualization
Model comparison requires the creation of an appropriate metric for the space of models under consideration. To visualize the model comparison, these metrics must be interpretable by a human observer through his or her visual system. The first step is to create a mapping from input to output of the modeling process. The second step is to map this process to the human visual space.
Resource
http://www.thearling.com/text/dmviz/modelviz.htm
Reference(s)
Massachusetts Institute of Technology, The mining beaver
Multimedia Educational Resource for Learning and Online Teaching