Data Mining

Created April 25, 2024 by C4DLab University of Nairobi

Data mining Overview

Upon successful completion of this unit the student should be able to:

Have a general overview of what data mining is
Know the foundations of data mining
Know how data mining works
Have an understanding of data mining architecture

Data mining techniques

Sub-topics

An overview of data mining techniques
Classification and Bayes rule
Discriminant Analysis
Logistic Regression
Neural Nets
Multiple regression
Cluster Analysis
Association Rules

At the end of this unit the student should be able to:

Have an understanding of the existing data mining techniques
Comfortably use the data mining techniques for various activities

Illustrating data mining

Sub-topics

Illustrating data mining using customer relations
Illustrating data mining using customer acquisition
Illustrating data mining using campaign optimization

At the end of this unit the student should be able to:

Illustrate data mining using customer relations
Illustrate data mining using customer acquisition
Illustrate data mining using customer campaign optimization

Data mining and visualization

Sub-topics

Trusting the model
Understanding the model
Comparing different models using visualization

At the end of this unit the student should be able to:

Have an understanding of why visualization is important
Know how to assess trust in a data mining model
Understand data mining models
Know how to compare different models using visualization

Data mining overview

Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. In this topic we will explore what data mining is, how it works, and how a data mining architecture is organized.

Resources

Data mining techniques

An overview of data mining techniques

This overview provides a description of some of the most common data mining algorithms in use today.

Resource

http://www.thearling.com/text/dmtechniques/dmtechniques.htm

Classification and Bayes Rule

In this topic we will examine the question of how to judge the usefulness of a classifier and how to compare different classifiers. Not only do we have a wide choice of different types of classifiers to choose from but within each type of classifier we have many options such as how many nearest neighbors to use in a k-nearest neighbors classifier, the minimum number of cases we should require in a leaf node in a tree classifier, which subsets of predictors to use in a logistic regression model, and how many hidden layer neurons to use in a neural net.
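
One common way to compare classifier options is to measure each one's error rate on validation cases that were held out of training. The sketch below applies that idea to the choice of k in a k-nearest neighbors classifier. The data, the function names and the candidate values of k are all made up for illustration, and the held-out comparison is an assumed approach, not necessarily the procedure used in the linked lecture notes.

```python
from collections import Counter

def knn_predict(train, point, k):
    """Classify `point` by majority vote among its k nearest training cases."""
    # Sort training cases by squared Euclidean distance to the query point.
    by_distance = sorted(
        train,
        key=lambda case: sum((a - b) ** 2 for a, b in zip(case[0], point)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

def error_rate(train, validation, k):
    """Fraction of validation cases that the k-NN classifier gets wrong."""
    wrong = sum(1 for x, y in validation if knn_predict(train, x, k) != y)
    return wrong / len(validation)

# Hypothetical data: two clusters plus one deliberately mislabeled training case.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.8, 1.1), "A"),
    ((3.0, 3.0), "B"), ((3.2, 2.9), "B"), ((2.9, 3.3), "B"),
    ((3.1, 3.05), "A"),  # noisy label inside the "B" cluster
]
validation = [
    ((1.1, 1.0), "A"), ((0.9, 0.9), "A"),
    ((3.1, 3.0), "B"), ((3.0, 3.2), "B"),
]

# Odd values of k avoid tied votes when there are two classes.
errors = {k: error_rate(train, validation, k) for k in (1, 3, 5)}
best_k = min(errors, key=errors.get)  # first (smallest) k with the lowest error
print(errors, best_k)
```

With k = 1 the mislabeled training case captures one validation point, while larger k outvotes it. On real data the comparison would normally use many more cases, and a separate test set for the final estimate.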

Resource

http://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/lecture-notes/Lecture2notes.pdf

Discriminant Analysis

Discriminant analysis uses continuous variable measurements on different groups of items to highlight aspects that distinguish the groups and to use these measurements to classify new items. Common uses of the method have been in biological classification into species and sub-species, classifying applications for loans, credit cards and insurance into low risk and high risk categories, classifying customers of new products into early adopters, early majority, late majority and laggards, classification of bonds into bond rating categories, research studies involving disputed authorship, college admissions, medical studies involving alcoholics and non-alcoholics, anthropological studies such as classifying skulls of human fossils and methods to identify human fingerprints.
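
As a minimal sketch of the idea, the code below fits a two-group linear discriminant on two continuous measurements and uses it to classify new items. It assumes Fisher's linear discriminant with a pooled covariance matrix, equal prior probabilities and equal misclassification costs. The loan-applicant data (income in thousands, debt ratio) and all function names are hypothetical.

```python
def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

def pooled_covariance(group_a, group_b):
    """Pooled within-group 2x2 covariance matrix."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for group in (group_a, group_b):
        m = mean(group)
        for p in group:
            d = [p[0] - m[0], p[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(group_a) + len(group_b) - 2
    return [[s[i][j] / dof for j in range(2)] for i in range(2)]

def fit_discriminant(group_a, group_b):
    """Return (w, c); an item x is assigned to group A when w.x > c."""
    ma, mb = mean(group_a), mean(group_b)
    s = pooled_covariance(group_a, group_b)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    # w = S^-1 (mean_a - mean_b); the cutoff is w applied to the midpoint.
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    c = w[0] * (ma[0] + mb[0]) / 2 + w[1] * (ma[1] + mb[1]) / 2
    return w, c

def classify(x, w, c):
    return "low risk" if w[0] * x[0] + w[1] * x[1] > c else "high risk"

# Hypothetical training data: (income in thousands, debt ratio).
low_risk = [(60, 0.2), (70, 0.3), (65, 0.1), (80, 0.25)]
high_risk = [(30, 0.6), (40, 0.5), (35, 0.7), (25, 0.55)]

w, c = fit_discriminant(low_risk, high_risk)
print(classify((75, 0.2), w, c))   # a new applicant resembling the first group
print(classify((28, 0.65), w, c))  # a new applicant resembling the second group
```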

Resource

http://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/lecture-notes/lecture4.pdf

Classification Trees

If one had to choose a classification technique that performs well across a wide range of situations without requiring much effort from the application developer, while being readily understandable by the end user, a strong contender would be the tree methodology developed by Breiman, Friedman, Olshen and Stone (1984). We will discuss this classification procedure first; in later sections we will show how the procedure can be extended to prediction of a continuous dependent variable. The program that Breiman et al. created to implement these procedures was called CART, for Classification And Regression Trees.
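
The core step in growing a classification tree is choosing where to split a predictor. The sketch below searches for the threshold on one numeric predictor that minimizes the weighted Gini impurity of the two resulting groups. This is a single split only, not a full tree with pruning, and the income data and function names are made up for illustration.

```python
def gini(labels):
    """Gini impurity: 0 for a pure group, larger for mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split on one predictor."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: household income (thousands) and ownership status.
income = [20, 25, 30, 35, 60, 65, 70, 80]
status = ["non-owner"] * 4 + ["owner"] * 4

threshold, impurity = best_split(income, status)
print(threshold, impurity)
```

A full tree would apply the same search recursively to each resulting group and over every predictor, stopping according to a rule such as a minimum number of cases per leaf.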

Resource

http://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/lecture-notes/L3ClassTrees.pdf

Logistic Regression

Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values as 0 and 1). As with multiple linear regression, the independent variables x1, x2, ..., xk may be categorical or continuous variables, or a mixture of these two types.
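
To make the model concrete, the sketch below fits a logistic regression with a binary y coded as 0 and 1 and a single continuous predictor. It uses plain gradient ascent on the log-likelihood, which is an assumed fitting method chosen for brevity; statistical packages typically use other numerical procedures. The data, learning rate and number of passes are illustrative.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(beta, x):
    """P(y = 1 | x) for coefficients beta, where beta[0] is the intercept."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit coefficients by gradient ascent on the average log-likelihood."""
    beta = [0.0] * (len(xs[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for x, y in zip(xs, ys):
            err = y - predict_probability(beta, x)
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        beta = [b + learning_rate * g / len(xs) for b, g in zip(beta, grad)]
    return beta

# Hypothetical data: hours of study (x1) and pass/fail outcome (y).
xs = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

beta = fit_logistic(xs, ys)
print(beta)
print(predict_probability(beta, [0.5]), predict_probability(beta, [4.0]))
```

The fitted slope is positive here, so the predicted probability of y = 1 rises with x1. A categorical predictor would first be coded as one or more 0/1 indicator variables.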

Resources

http://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/lecture-notes/logreg.pdf

http://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/lecture-notes/handloomsnew.pdf

Neural Nets

In this note we provide an overview of the key concepts that have led to the emergence of Artificial Neural Networks as a major paradigm for data mining applications. Neural nets have gone through two major development periods: the early 1960s and the mid-1980s. They were a key development in the field of machine learning. Artificial Neural Networks were inspired by biological findings relating to the behavior of the brain as a network of units called neurons.

Resource

http://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/lecture-notes/NeuralNet2002.pdf

Assignment: Problem one

http://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/assignments/

Multiple Regression

Perhaps the most popular mathematical model for making predictions is the multiple linear regression model. In this topic we will examine the use of multiple linear regression models in data mining applications.
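
As a minimal sketch, a multiple linear regression can be fitted by ordinary least squares, solving the normal equations (X'X)b = X'y for the coefficient vector b. The example below does this with a small Gauss-Jordan solver. The data are made up and deliberately noise-free (y = 1 + 2*x1 + 3*x2) so that the recovered coefficients can be checked by eye; the function names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a.x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_linear(xs, ys):
    """Least-squares coefficients [b0, b1, ..., bk] via the normal equations."""
    rows = [[1.0] + list(x) for x in xs]  # leading 1 for the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2.
xs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [9, 8, 19, 18, 26]

coefficients = fit_linear(xs, ys)
print(coefficients)  # approximately [1, 2, 3], up to rounding error
```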

Resources

http://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/lecture-notes/lecture8.pdf

http://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/lecture-notes/lecture9.pdf

Assignment: Problem 2

http://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/assignments/

Cluster analysis

Cluster analysis is concerned with forming groups of similar objects based on several measurements of different kinds made on the objects. The key idea is to identify classifications of the objects that would be useful for the aims of the analysis. This idea has been applied in many areas including astronomy, archeology, medicine, chemistry, education, psychology, linguistics and sociology. For example, the biological sciences have made extensive use of classes and sub-classes to organize species.
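
One widely used way to form such groups is k-means clustering, sketched below for objects described by two measurements. k-means is only one of several clustering methods and is chosen here as an assumption for illustration. The data points, the starting centroids and the fixed number of iterations are all made up.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical objects, each described by two measurements.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]

# Start from two of the points as initial centroids (an arbitrary choice).
centroids, clusters = kmeans(points, [points[0], points[3]])
print(centroids)
print(clusters)
```

In practice the measurements are usually standardized first so that no single variable dominates the distance, and the result can depend on the starting centroids.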

Resource

http://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/lecture-notes/lec11.pdf

Association Rules

The availability of detailed information on customer transactions has led to the development of techniques that automatically look for associations between items that are stored in the database. An example is data collected using bar-code scanners in supermarkets. Such ‘market basket’ databases consist of a large number of transaction records. Each record lists all items bought by a customer on a single purchase transaction. Managers would be interested to know if certain groups of items are consistently purchased together.
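
Whether items are "consistently purchased together" is usually quantified with two measures: the support of an item set (the share of transactions containing all of its items) and the confidence of a rule "if A then B" (the share of transactions containing A that also contain B). The sketch below computes both on a made-up market basket database; the item names and function names are illustrative.

```python
def count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)

def support(transactions, items):
    """Share of all transactions that contain every item in `items`."""
    return count(transactions, items) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Share of transactions with the antecedent that also have the consequent."""
    both = count(transactions, set(antecedent) | set(consequent))
    return both / count(transactions, antecedent)

# Hypothetical market basket database: one set of items per purchase.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
print(rule_support, rule_confidence)
```

Here bread and butter appear together in 3 of 5 baskets, and 3 of the 4 baskets containing bread also contain butter. Rule-mining algorithms search for all rules whose support and confidence exceed chosen minimum levels.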

Resource

http://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/lecture-notes/Lecture_16.pdf

Illustrating data mining

Illustrating data mining using customer relations

Most marketers understand the value of collecting customer data, but also realize the challenges of leveraging this knowledge to create intelligent, proactive pathways back to the customer. Data mining - technologies and techniques for recognizing and tracking patterns within data - helps businesses sift through layers of seemingly unrelated data for meaningful relationships, where they can anticipate, rather than simply react to, customer needs. In this accessible introduction, we provide a business and technological overview of data mining and outline how, along with sound business processes and complementary technologies, data mining can reinforce and redefine customer relationships.

Resource

http://www.thearling.com/text/whexcerpt/whexcerpt.htm

Illustrating data mining using customer acquisition

For most businesses, the primary means of growth involves the acquisition of new customers. This could involve finding customers who previously were not aware of your product, were not candidates for purchasing your product (for example, baby diapers for new parents), or customers who in the past have bought from your competitors. Some of these customers might have been your customers previously, which could be an advantage (more data might be available about them) or a disadvantage (they might have switched as a result of poor service). In any case, data mining can often help segment these prospective customers and increase the response rates that an acquisition marketing campaign can achieve.

Resource

http://www.thearling.com/text/chapter10/chapter10.htm

Illustrating data mining using campaign optimization

In most marketing organizations, there is a wide variety of ways to interact with customers and prospects. Besides the many possible offers that can be made, there are now multiple communication channels (direct mail, telemarketing, email, the web) that can be used. The process of marketing campaign optimization takes a set of offers and a set of customers, along with the characteristics and constraints of the campaign, and determines which offers should go to which customers over which channels at what time.

Resource

http://www.thearling.com/text/optimization/optimization.htm

Data Mining and Visualization

Introduction

The point of data visualization is to let the user understand what is going on. Since data mining usually involves extracting "hidden" information from a database, this understanding process can get somewhat complicated. In most standard database operations nearly everything the user sees is something that they knew existed in the database already. A report showing the breakdown of sales by product and region is straightforward for the user to understand because they intuitively know that this kind of information already exists in the database. If the company sells different products in different regions of the country, there is no problem translating a display of this information into a relevant understanding of the business process.

Trusting the Model

Attributing the appropriate amount of trust to data mining models is essential to using them wisely. Good quantitative measures of "trust" must ultimately reflect the probability that the model's predictions would match future test targets. However, due to the exploratory and large-scale nature of most data mining tasks, fully articulating all of the probabilistic factors to do so would seem to be generally intractable. Thus, instead of trying to boil "trust" down to one probabilistic quantity, it is typically most useful to visualize along many dimensions some of the key factors that contribute to trust (and distrust) in one's models. Furthermore, since, as with any scientific model, one ultimately can only disprove a model, visualizing the limitations of the model is of prime importance. Indeed, one might best view the overall goal of "visualizing trust" as that of understanding the limitations of the model, as opposed to understanding the model itself.

Understanding the Model

A model that can be understood is a model that can be trusted. While statistical methods build some trust in a model by assessing its accuracy, they cannot assess the model’s semantic validity — its applicability to the real world.

Comparing Different Models using Visualization

Model comparison requires the creation of an appropriate metric for the space of models under consideration. To visualize the model comparison, these metrics must be interpretable by a human observer through his or her visual system. The first step is to create a mapping from input to output of the modeling process. The second step is to map this process to the human visual space.

Resource

http://www.thearling.com/text/dmviz/modelviz.htm

Reference(s)

Massachusetts Institute of Technology, The mining beaver

Multimedia Educational Resource for Learning and Online Teaching

Data mining and analytic technologies
