Statistics/Data Analytics
Overview
Basic Statistics for Non-Mathematical Background
Introduction
In recent decades we have witnessed tremendous development in technology. As a result, the Indian Government is opting for digitization of many systems, bringing transparency to paying taxes, crediting subsidies, and much more. Most banks have computerised their traditional processes. Across the countries of the world, organizations and individuals are adopting digitized processes, which enables them to store large volumes of information. Thus, a huge amount of data is generated by computers, and information technology experts have developed new and efficient software to analyse it.
Governments, corporates and NGOs now have plenty of data that can be used for decision making. The question is how to use the data. This need has created new jobs for developing software to analyse data for different purposes. Before we think of advanced analysis, basic knowledge of statistics is needed: the software is easy to use, but one must know which statistical tool suits the required solution.
The purpose of this book is to explain basic statistics using data available in the public domain. In recent years the Government of India has released data on its open data portal, and the RBI publishes data on its website.
A list of websites for obtaining data is provided at the end of the book. The book is organised into 9 chapters. The data used for analysis is provided in an Excel sheet for each topic, and the analysis of the data using Excel is shown in the same sheet. Interpretation and the procedure for the analysis are explained in the text.
Prerequisites of Data Analytics
Data analytics is about organising and analysing data in business or the social sciences and presenting the information in a form that is easy to understand, which can then be used to develop insights. The question here is how data is collected. Data is obtained by measuring objects, subjects, or processes. Measurement means assigning numbers or other symbols to characteristics of objects according to certain pre-specified rules.
In routine life, the size of a room is measured by its length, width and height. The size of a class or theatre is measured by the number of people it can accommodate. Fruits and vegetables are measured in kilograms and grams. The volume of a container is measured in cubic metres or litres. In the business world, the performance of products and services is measured by sales or demand. Similarly, the performance of governments is measured by the satisfaction of citizens. Thus, in daily life measurement is everywhere, and we record this information as data. All these measurements are of different types, or at different scales.
We will briefly discuss the different types of measurement scales. There are four: nominal, ordinal, interval and ratio.
Nominal Level
On a nominal scale, the numbers do not reflect the amount of the characteristic possessed by the objects or subjects. The numbers serve only as labels or tags for identifying and classifying objects, with one-to-one correspondence.
For example, employees can be classified into male and female, or according to job position. People can be classified by region, language, or religion. In these examples one cannot measure the characteristic, only classify objects into classes.
Ordinal Level
Ordinal data include classification plus an indicator of order: any series of numbers can be assigned that preserves the ordered relationships between the objects. Such data is one-dimensional. Rank is the most common example of an ordinal scale. Students or employees can be ranked according to their performance; cities can be ranked according to the safety of their residents. Often, people's liking is measured on a 1-to-5 or 1-to-10-point scale, where each number indicates the degree of liking in order: 5 or 10 indicates the highest liking and 1 the lowest. This is known as a Likert scale.
Rate your mobile service on a 1-7 point scale:
Poor ……………………………………………… Excellent
1    2    3    4    5    6    7
Rate your brand according to preference:
Most preferred    Somewhat preferred    No preference
       1                  2                   3
Interval Level
On an interval scale, numerically equal distances represent equal differences in the characteristic being measured, which permits comparison of the differences between objects. The location of the zero point is not fixed: both the zero point and the units of measurement are arbitrary.
For example, temperature in Celsius or Fahrenheit: a reading of 0° does not mean an absence of temperature, but the difference between 20° and 30° is the same amount as the difference between 30° and 40°.
Ratio Level
The ratio level possesses all the properties of the nominal, ordinal, and interval scales, plus an absolute zero or origin point. A zero value in the data represents absence of the characteristic being studied, and it is meaningful to compute ratios of scale values.
For example, height, weight, time and volume.
Data Summarization
It is essential to understand the data we want to analyse. The question is how to do it. There are two aspects to understanding data: first, develop clarity on the definition of the variables in the data; second, get a grip on the data when it is large, which can be done visually as well as numerically.
Understanding of the data can be developed by summarising it, so that a few summary figures represent the entire data. There are different ways of summarising data; the most common are graphical (visual), tabular and numerical. The choice of method depends upon the type of data under consideration and the purpose of the analyst, who may summarise the data by one or more methods according to need.
Graphical Summary
- Consider the data on estimated Gross Value Added at constant prices with base year 2011-12, from the RBI. This is quarterly data from the year 2011-12 to the first two quarters of the year 2021-22.
Table 2. Estimated Gross Value Added at Constant Price (Base Year 2011-12).
(The accompanying worksheet includes the data and a graphical summary.)
This is time series data. The analyst would like to know the variation over the period under consideration. In this case the best way is to plot a line chart, as exhibited in Figure 1.
Figure 1. Quarterly Gross Value Added at Basic Prices.
How to read the graph? It shows that Gross Value Added increases gradually until quarter 4 of the financial year 2019-20, drops suddenly in quarter 1 of the financial year 2020-21 as an effect of the COVID pandemic, and increases again from the next quarter.
Thus, a simple line graph summarises the entire data and reveals its characteristics. This insight can be helpful for developing a statistical model.
- Let us consider the data set for the number of educated persons in 23 districts of Andhra Pradesh, available on the Census of India 2011 website. A researcher wants to know the education status in the various districts: which district has the maximum number of educated people, and which has the lowest. The data is also given separately for males and females. Understanding this data can help the government or NGOs decide where to promote education or build more schools, and industrialists can use it to decide what kind of industry to set up based on the availability of educated persons.
A graphical summary of this data is obtained by plotting simple bar charts, as shown in the figures; the data and graphs are provided in the accompanying worksheet.
Consider first the total number of educated persons. The bar chart for this data is shown in Figure 2.
The graph shows that Hyderabad and Rangareddy have the maximum number of educated persons, while there are quite a few districts where the number is very low.
The data also gives the number of educated males and females separately. If the analyst wants to compare all three series, a bar chart is again very useful, as given in Figure 3.
Figure 3 reveals that uniformly across the districts, the number of graduate females is much lower than that of males.
Thus, a simple graphical display makes it easy to develop insights from the data.
Alternative to the bar chart: An alternative to the bar chart is the pie chart. Let us consider only the number of female graduates, shown in Figure 4.
Figure 4
Conclusion from the graphical summary of data: it is very easy to understand data and develop insights by plotting simple graphs.
How to plot graphs in Excel?
- Select the data array to graph.
- Click on Insert.
- Choose from the chart options offered.
- Select the type of graph.
- The output is the desired graph.
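For readers who prefer a programming environment to Excel, the same charts can be drawn in Python. This is a minimal sketch using the matplotlib library; the numbers and district names in it are illustrative placeholders, not the actual RBI or census values.

    import matplotlib.pyplot as plt

    # Placeholder quarterly GVA values (not the actual RBI series)
    quarters = ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2"]
    gva = [32.1, 33.0, 33.8, 34.5, 26.9, 31.2]

    plt.figure()
    plt.plot(range(len(gva)), gva, marker="o")   # line chart for time series data
    plt.xticks(range(len(quarters)), quarters)
    plt.title("Gross Value Added (illustrative data)")
    plt.ylabel("GVA")
    plt.show()

    # Placeholder district counts (not the actual census values)
    districts = ["Hyderabad", "Rangareddy", "Guntur", "Chittoor"]
    educated = [2500, 2100, 900, 700]

    plt.figure()
    plt.bar(districts, educated)                 # bar chart for categorical data
    plt.title("Educated persons by district (illustrative data)")
    plt.show()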
Data for practice
Tabular Summary of Data:
A sample survey was conducted to find the favourite shampoo colour of consumers. 66 persons participated in the survey. The data can be tabulated as follows.
Favourite Shampoo Colour | Number of Respondents |
Black | 13 |
Blue | 13 |
Green | 4 |
Pink | 5 |
Red | 6 |
White | 24 |
Yellow | 1 |
Grand Total | 66 |
This is a very common way of preparing tables.
Secondly, consider the data on industrial connections in the towns of Gujarat, from the Census of India 2011. Gujarat has 348 towns, and the census records the number of industrial connections in each town. The figure for each individual town does not reveal much; we need to summarise the data in such a way that it is easy to understand. The data is provided in the following table.
The most appropriate way of summarising this data is to develop a table that indicates the pattern in the data. Here the analyst's or the government's interest is to understand the distribution of towns by number of connections. Towns with similar numbers of connections are grouped together, and the number of towns in each group is counted, as shown in the table.
Number of Industrial Connections | Number of Towns |
0-150 | 210 |
151-300 | 56 |
301-450 | 14 |
451-600 | 16 |
601-750 | 10 |
751-900 | 8 |
901-1050 | 8 |
1051-1200 | 3 |
Above 1200 | 23 |
Total | 348 |
The table shows that there are 210 towns with between 0 and 150 connections, only 3 towns in Gujarat with between 1051 and 1200 connections, and 23 towns with more than 1200 connections.
The table can also be presented graphically. The graph shows a very uneven pattern. Policy makers can use this to increase the number of connections, and industrialists can decide where to set up a new industrial unit.
Figure 6
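The grouped frequency table above can also be built programmatically. Below is a minimal sketch using pandas; the per-town connection counts in it are made-up placeholders, since only the grouped totals appear in the text.

    import pandas as pd

    # Made-up per-town connection counts (placeholders for the census data)
    connections = pd.Series([12, 140, 95, 310, 520, 1300, 47, 880, 1150, 60])

    # Bin edges matching the groups used in the table above
    bins = [0, 150, 300, 450, 600, 750, 900, 1050, 1200, float("inf")]
    labels = ["0-150", "151-300", "301-450", "451-600",
              "601-750", "751-900", "901-1050", "1051-1200", "Above 1200"]

    groups = pd.cut(connections, bins=bins, labels=labels, include_lowest=True)
    print(groups.value_counts().sort_index())   # number of towns per group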
Cross tab: Consider data from a consumer survey that includes each consumer's name, education, and family income. The researcher wants to understand the association between a person's education and the income of the family. The data is as follows.
To understand the association between the two characteristics, one should develop a cross tab.
Numerical summary:
Numerical summary is the most common method of summarising data. In a numerical summary we represent the characteristics of the entire data with one or two numbers. A numerical summary can tell us the central point of the data, meaning the value to which the majority of data points are close or equal. To measure this central tendency, there are three measures: mean, median and mode.
For example, the scores of 10 students are as follows:
{35, 34, 37, 35, 34, 35, 37, 38, 38, 37}. All the score values are between 34 and 38. The line plot is as follows.
We can say that overall the students have scored around 36 points, or that the centre of this data is 36.
In statistical language this is called the mean of the data. Thus the 10 figures of the data set are represented by a single number, 36, usually known as the mean or average of the data.
Mean = Sum of all the numbers / Number of data points = 360/10 = 36
Statistical Notations:
In statistics, we use notations and formulas to express this. The score values of the 10 students vary, so we call the score a variable. Variables are usually denoted by small letters of the alphabet: the score values are x1, x2, x3, ..., x10. Instead of writing them out, we write xi, where i = 1, 2, ..., 10.
Mean = (x1 + x2 + x3 + … + x10)/10, or in general,
Mean (x̄) = Σ xi / n
Median: The mean is not acceptable as the average of the data when extreme values are present. For example, consider the data points:
3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22.
We can observe that the data points are not close together: some are small and some relatively big. The numbers are not approximately uniform, and in such cases using the mean as the average or representative of the data is not right.
In such cases we prefer the median. The median is the middle value in the ordered data; if the data is not ordered, it should first be sorted in ascending or descending order, and the middle value chosen as the median. The advantage of the median is that it is not affected by extreme values.
In the above dataset, the middle (9th) value is 15, so Median = 15.
Mode: The mode is the most frequent number appearing in the data set. For example, in the data set 12, 15, 25, 16, 18, 20, 21, 22, 18, 18, the number 18 occurs most frequently, so Mode = 18.
All three averages can be found using Excel.
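Outside Excel, all three averages can be computed with Python's built-in statistics module. A minimal sketch using the small data sets from this chapter:

    import statistics

    scores = [35, 34, 37, 35, 34, 35, 37, 38, 38, 37]
    print(statistics.mean(scores))    # 36

    ordered = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22]
    print(statistics.median(ordered)) # 15 (middle value of the ordered data)

    values = [12, 15, 25, 16, 18, 20, 21, 22, 18, 18]
    print(statistics.mode(values))    # 18 (most frequent value)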
Consider the data used earlier for percentages of slum population.
Now the question is: is it fair to say that overall the data points are equivalent to the mean?
When we say the data points are equal to the average, we miss the information that some of the points lie at a distance from the average. Thus, we need to know the spread of the data.
In statistics, the spread of data is measured mainly in two ways:
- Range = Maximum value − Minimum value
- Variance or standard deviation
The second method of measuring the spread or dispersion of the data is more scientific: it takes into account the distance between each data point and the mean. Let us first understand the formulas and calculation of variance and standard deviation.
Variance = Σ (xi − x̄)² / n
Standard Deviation = √Variance
The variance is the average squared distance between each value and the mean of the data. The standard deviation is the positive square root of the variance, and indicates the typical absolute distance between a data point and the mean. Let us understand this through the calculations shown in the following table.
Score (xi) | Mean | Difference | Square of Difference |
35 | 36 | -1 | 1 |
34 | 36 | -2 | 4 |
37 | 36 | 1 | 1 |
35 | 36 | -1 | 1 |
34 | 36 | -2 | 4 |
35 | 36 | -1 | 1 |
37 | 36 | 1 | 1 |
38 | 36 | 2 | 4 |
38 | 36 | 2 | 4 |
37 | 36 | 1 | 1 |
| | Total | 22 |
Detailed calculations are shown in the table. The sum of the squared differences is 22, so the variance = 22/10 = 2.2.
The standard deviation is the square root of the variance, √2.2 ≈ 1.48. (With the sample divisor n − 1, the variance is 22/9 ≈ 2.44 and the standard deviation ≈ 1.56.)
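As a cross-check of the table, a minimal Python sketch; note that statistics.pvariance divides by n, matching the "average squared distance" formula above, while statistics.variance divides by n − 1:

    import statistics

    scores = [35, 34, 37, 35, 34, 35, 37, 38, 38, 37]

    print(statistics.pvariance(scores))  # 2.2   (population variance, divisor n)
    print(statistics.pstdev(scores))     # ~1.483
    print(statistics.variance(scores))   # ~2.444 (sample variance, divisor n-1)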
Percentiles of the data: Percentiles are the values which divide the data into 100 parts. At least n% of the data lie at or below the nth percentile, and at most (100 − n)% of the data lie above it.
Example: 90th percentile indicates that at least 90% of the data lie below it, and at most 10% of the data lie above it. The median and the 50th percentile have the same value.
Quartiles of the data: Quartiles are the values which divide the data into four parts.
Q1: 25% of the data set is below the first quartile
Q2: 50% of the data set is below the second quartile
Q3: 75% of the data set is below the third quartile
Q1 is equal to the 25th percentile
Q2 is located at 50th percentile and equals median
Q3 is equal to the 75th percentile
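Percentiles and quartiles are one-liners in Python. A minimal sketch with numpy, using the ordered data set from the median example:

    import numpy as np

    data = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22]

    q1, q2, q3 = np.percentile(data, [25, 50, 75])
    print(q1, q2, q3)   # Q2 equals the median, 15
    p90 = np.percentile(data, 90)
    print(p90)          # 90th percentile: at least 90% of the data lie below it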
Introduction to Probability
Everyone wants to predict the future based on what happened in the past or what is happening presently. In some instances, as a lay person in general conversation, a prediction is a casual statement based on experience or intuition.
For example, doctors and researchers made predictions about when the third wave of COVID-19 would peak and when it would subside, based on past data from India and other countries. No one can predict with 100% accuracy; there is always a degree of uncertainty. This is where the statistician comes into the picture.
Statisticians cannot predict with 100% accuracy either, but they can tell us the degree of certainty or uncertainty in a prediction, which helps us take decisions with known risk.
When statisticians and mathematicians make a statement about predictions, it is based on actual past data, with a degree of certainty attached. They have developed a measure of this degree of certainty or uncertainty, known as probability.
Definition of Probability
There are two approaches to defining probability: the classical approach and the empirical approach.
Classical approach: In the classical approach, we know all the possible events that can occur. For example, if we toss a coin, we know that a head or a tail will occur without actually tossing it. Similarly, if we throw a die, we know that six possible events can occur.
Each of these events is mutually exclusive: if a head occurs, a tail does not. There are many such cases in which the total number of possible events is known in advance.
If we want to find the probability of occurrence of a particular event, then the probability of the desirable event is
P(E)=p= Number of desirable events / Total possible number of events
For example, the probability that a head occurs in a toss of a coin is
= Number of ways a head can occur / Total number of possible events = 1/2
Practical Applicability of Classical Probability
In the real world, there are many situations where the possible outcomes are known and we want to find the probability of a desirable event.
For example, while checking the quality of a product, one may want to know the probability that the product is defective.
A marketer may be interested in the probability that a customer prefers a particular product out of a known set of products. In such cases the total number of possible events is known, and the probability of the favourable event is of interest.
Empirical approach: To find a probability by this approach, one needs to conduct an experiment of interest; the data collected through the experiment is used to calculate the probability.
Consider an example similar to the one in the data summarisation section.
A sample survey was conducted to find the favourite shampoo colour of consumers. 200 persons participated in the survey. The data can be tabulated as follows.
Colour of Shampoo | No of Persons |
White | 50 |
Black | 40 |
Red | 18 |
Yellow | 15 |
Green | 20 |
Blue | 12 |
Pink | 45 |
Total | 200 |
The probability that a consumer will prefer a green shampoo = number of consumers who chose green / total number of consumers in the survey
= 20/200 = 0.1
Consider another example, where the data is summarised in groups according to the following table indicating the number of towns with industrial connections.
Number of industrial connections in a town | No of Towns in a Group with Industrial Connections |
0-150 | 210 |
151-300 | 56 |
301-450 | 14 |
451-600 | 16 |
601-750 | 10 |
751-900 | 8 |
901-1050 | 8 |
1051-1200 | 3 |
Above 1200 | 23 |
Total | 348 |
The probability that a town has between 601 and 750 connections is
= 10/348
≈ 0.03
Statistical probability is usually denoted by p, and its value always satisfies 0 ≤ p ≤ 1.
p = 1 means there is a 100% chance that the event will occur, and p = 0 means the event will not occur.
Types of Probability and General Probability Rules
In real life, we rarely need only the simple probability of a single event. Often events are independent, or one event depends upon another. Sometimes the probability of two events together is of interest, and it may also be of interest to find the probability that all the events occur simultaneously.
There are rules for calculating these various types of probabilities. We shall discuss the different types with examples and show how to calculate them.
Marginal Probability is the probability of an event irrespective of the outcome of another variable.
Consider the example of a perfume manufacturing company. The company sells three bottle sizes and supplies four cities. Its data for the last year is summarised in the following table.
Note: The demand of the cities is independent of each other; demand in Delhi does not depend on or affect demand in Bombay or any other city.
Capacity of Bottles | Ahmedabad | Bombay | Kolkata | Delhi | Total |
50 ml | 2400 | 1000 | 800 | 1400 | 5600 |
100 ml | 3000 | 600 | 2200 | 1200 | 7000 |
150 ml | 2800 | 1800 | 1200 | 1600 | 7400 |
Total | 8200 | 3400 | 4200 | 4200 | 20000 |
The company wants to plan its production for the coming year. It needs to predict the chances that demand from these cities will be the same in the coming year as last year.
Question: What is the probability that the demand from Kolkata will be the same this year also?
Answer: The total demand from Kolkata is 4200 bottles, and the total supply to all the cities is 20000. Thus the probability that the demand from Kolkata is repeated = 4200/20000 = 0.21.
This is the marginal probability for Kolkata: there is a 21% chance that Kolkata will demand the same number of bottles. Similarly, the probability of demand for 100 ml bottles across all 4 cities = 7000/20000 = 0.35.
Addition rule: This gives the probability that one or both of two events occur:
P(A∪B) = P(A) + P(B) − P(A∩B)
In the above example, suppose we want the probability that the repeated demand comes from Kolkata or Delhi. The two events are mutually exclusive, so the probability is
(4200/20000) + (4200/20000) = 0.42
Conditional probability: This is the probability of an event given that another event has occurred. P(B|A) denotes the probability that event B occurs given that event A has occurred; likewise, P(A|B) is the probability that A occurs given that B has occurred.
Let us understand this with the above example. Delhi has conveyed that next year its demand for 50 ml bottles must be met. The company is short of resources, so given that it meets the demand from Delhi, what is the probability that the demand from Ahmedabad will also be met?
The total number of 50 ml bottles is 5600 and the requirement of Ahmedabad is 2400, so the probability that the Ahmedabad demand is met = 2400/5600 ≈ 0.43.
This can be written as P(A|B): given that the demand for Delhi is met, the probability that the Ahmedabad demand is met.
Multiplication rule: This gives the probability that both events occur.
- If A and B are independent, then P(A and B) = P(A) × P(B). This rule extends to more than two independent events: P(A and B and C) = P(A) × P(B) × P(C).
- If the events are not independent, then P(A and B) = P(A) × P(B|A) = P(B) × P(A|B).
Let us understand this through the example.
- Because the demands for Delhi and Kolkata are independent, the joint probability that both the Delhi and Kolkata demands are met is
P(D∩K) = P(D) × P(K) = 0.21 × 0.21 = 0.0441
- Consider another example to understand multiplication when events are dependent.
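Because these rules are simple arithmetic on the demand table, they are easy to script. A minimal Python sketch reproducing the perfume-company numbers from the text:

    # Demand table: bottle size -> city -> bottles supplied last year
    demand = {
        "50ml":  {"Ahmedabad": 2400, "Bombay": 1000, "Kolkata": 800,  "Delhi": 1400},
        "100ml": {"Ahmedabad": 3000, "Bombay": 600,  "Kolkata": 2200, "Delhi": 1200},
        "150ml": {"Ahmedabad": 2800, "Bombay": 1800, "Kolkata": 1200, "Delhi": 1600},
    }
    total = sum(sum(row.values()) for row in demand.values())           # 20000

    p_kolkata = sum(row["Kolkata"] for row in demand.values()) / total  # 0.21 (marginal)
    p_delhi   = sum(row["Delhi"] for row in demand.values()) / total    # 0.21 (marginal)

    print(p_kolkata + p_delhi)   # 0.42   addition rule, mutually exclusive events
    print(p_delhi * p_kolkata)   # 0.0441 multiplication rule, independent events
    print(demand["50ml"]["Ahmedabad"] / sum(demand["50ml"].values()))   # ~0.43 conditional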
Approaches for Calculation of Probability
In a lay person's language, probability is a proportion. Probabilities are calculated by conducting experiments or by using past data.
There are a few classic experiments whose outcomes are known without conducting them. For example, in an experiment of tossing a coin there are two outcomes: either a head or a tail occurs. In rolling a die, the total number of outcomes is six. In drawing a card from a deck of 52 cards, the total number of outcomes is 52.
The outcomes of such an experiment are known in advance and follow a very specific mathematical pattern. In all these experiments we observe that: (1) every trial of the experiment is independent of the others; (2) the set of possible outcomes is identical on every trial; (3) the outcomes are mutually exclusive: if a head occurs, a tail does not; (4) if the experiment is conducted n times, the probability of a head or tail remains the same on every trial.
Secondly, we can use past data and calculate the desired probability from the relative frequency distribution of the data.
Calculation of probability: The general formula for calculating a probability is
p = (Number of desirable outcomes) / (Total possible outcomes)
The probability that a head occurs when you toss a coin = 1/2.
If you toss a coin 25 times, what is the probability that heads occur exactly 10 times out of 25?
This probability can be calculated without tossing a coin. In general, if we toss a coin n times, the probability that heads (or tails) occur x times can be found for each x = 0, 1, 2, …, n, and these probabilities form a pattern known as a probability distribution. Notice that x is a variable whose values are integers, i.e., discrete values.
Probability Distributions
Probability distributions are divided into two types: theoretical and empirical. Theoretical probability distributions are mathematically modelled, so probabilities are easy to calculate from a probability function; the beauty of these distribution functions is that one can find the probability of a desirable event without conducting the experiment.
Empirical probability distributions are derived by conducting an experiment. The experimental data is collected and organised in tabular form as discussed earlier: the values of the variable are tabulated with their corresponding frequencies, and the relative frequency approach is used to calculate probabilities.
Example: A sample survey was conducted to find the favourite shampoo colour of consumers. 200 persons participated in the survey. The data can be tabulated as follows.
Color of Shampoo | No of Persons |
White | 50 |
Black | 40 |
Red | 18 |
Yellow | 15 |
Green | 20 |
Blue | 12 |
Pink | 45 |
Total | 200 |
The probability that a randomly selected person will like white shampoo is p = 50/200 = 0.25.
Color of Shampoo | No of Persons | Probability |
White | 50 | 0.25 |
Black | 40 | 0.2 |
Red | 18 | 0.09 |
Yellow | 15 | 0.075 |
Green | 20 | 0.1 |
Blue | 12 | 0.06 |
Pink | 45 | 0.225 |
Total | 200 | 1 |
Expected mean and variance: The probabilities in the table indicate how likely a randomly selected individual is to prefer each of the colours listed. Such probabilities can be used to compute the expected number of individuals in a population.
Consider another example where the random variable is a student's score in an exam. Suppose 5 questions are asked, so a student can have 0 to 5 correct answers, and the probabilities of each number of correct answers are known from past data, as provided in the table below.
x | p(x) | x·p(x) |
0 | 0.17 | 0 |
1 | 0.33 | 0.33 |
2 | 0.21 | 0.42 |
3 | 0.22 | 0.66 |
4 | 0.04 | 0.16 |
5 | 0.03 | 0.15 |
Total | 1 | 1.72 |
The last column gives x·p(x) for each value. The expected (average) number of correct answers is calculated using the formula
E(x) = Σ x·p(x) = 1.72
Expected variance = Σ x²·p(x) − [E(x)]² = 4.54 − 2.96 ≈ 1.58
Expected standard deviation = √1.58 ≈ 1.26
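A minimal Python sketch checking these expected-value calculations against the table:

    # x = number of correct answers, p = probability of each x (from the table)
    xs = [0, 1, 2, 3, 4, 5]
    ps = [0.17, 0.33, 0.21, 0.22, 0.04, 0.03]

    mean = sum(x * p for x, p in zip(xs, ps))               # E(x) = 1.72
    var = sum(x**2 * p for x, p in zip(xs, ps)) - mean**2   # ~1.58
    print(mean, var, var ** 0.5)                            # std dev ~1.26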
Theoretical Probability Distributions:
The type of variable decides the type of probability distribution: if the variable is discrete, its probability distribution is discrete, and if the variable is continuous, its probability distribution is continuous.
The most basic discrete probability distributions are the Binomial and the Poisson. Widely used continuous probability distributions are the Normal, t, Chi-square and F distributions.
Discrete Probability Functions
As mentioned before, if a variable is discrete, its probability distribution function is discrete. In this book we discuss two basic discrete probability distributions.
Binomial Probability Distribution
Let us consider the coin-tossing example discussed before. We know that the probability that a head (or a tail) occurs in a single toss is 0.5. If the coin is tossed 10 times, what is the probability that heads occur 6 times? Every additional toss multiplies the possibilities. For example, if a coin is tossed 3 times, the possible outcomes are:
- (1) H,H,H (2) H,H,T (3) H,T,H (4) T,H,H (5) H,T,T (6) T,H,T (7) T,T,H (8) T,T,T
Notice that 3 tosses give 2³ = 8 outcomes, and only the last, T,T,T, contains no head at all. The probabilities form the pattern listed in the table.
No of heads = x | Outcomes | Number of ways | P(x) |
0 | T,T,T | 3C0 = 1 | (1/2)^3 = 0.125 |
1 | H,T,T; T,H,T; T,T,H | 3C1 = 3 | 3×(1/2)^3 = 0.375 |
2 | H,H,T; H,T,H; T,H,H | 3C2 = 3 | 3×(1/2)^3 = 0.375 |
3 | H,H,H | 3C3 = 1 | (1/2)^3 = 0.125 |
It can be seen that the distribution of probabilities is symmetric: when p = 0.5, the probabilities are symmetrically distributed.
These probabilities are mathematically modelled as
P(x) = 3Cx p^x q^(3−x)
where p is the probability of a head, q = 1 − p is the probability of a tail, and x is the number of heads in the experiment of 3 tosses.
The model for n tosses or trials is
P(x) = nCx p^x q^(n−x), x = 0, 1, 2, …, n, with p + q = 1.
Note: the total probability over all x is 1.
Mean of the Binomial distribution = np
Variance = npq
Standard deviation = √(npq)
The parameters of the distribution are n and p, and the notation used is B(n, p). Probabilities for the binomial distribution are available through Excel's built-in function.
Probability distribution for n=9 and p=0.5
x | P(x) |
0 | 0.001953 |
1 | 0.017578 |
2 | 0.070313 |
3 | 0.164063 |
4 | 0.246094 |
5 | 0.246094 |
6 | 0.164063 |
7 | 0.070313 |
8 | 0.017578 |
9 | 0.001953 |
The following figures show the distribution pattern for different values of p and n.
(Figure: binomial distributions for several values of p; one panel shows n = 5.)
It can be noted that as n increases, the distribution becomes symmetric even when p is not 0.5.
Key conditions of Binomial Distribution
- The random variable is discrete.
- Each trial has only two possible outcomes, and they are mutually exclusive.
- The outcome of one trial is independent of the other trials, and the probability of the desirable event (p) remains the same throughout the experiment.
- The total probability is equal to one.
- Because the distribution is mathematically modelled, probabilities for different values of n and p are tabulated; Excel provides them as a built-in function.
- For large values of n, the distribution becomes symmetric.
Examples
The Gujarat government believes that 60% of the villages in a district have a vocational training school. In any single village there either is a vocational training school or there is not: only two outcomes, with probability of the desirable event p = 0.6. The existence of a vocational training school in one village does not depend upon another village, so the trials are independent of each other, and the probability remains 0.6 for every village. Thus, the conditions for the Binomial distribution are satisfied.
Consider the Kachchh district of Gujarat. Researchers take a sample of 25 villages. What is the probability that exactly 15 villages have a vocational training school?
Solution: In this example, p = 0.6, q = 0.4, n = 25, x = 15.
Probability function: P(x = 15) = 25C15 (0.6)^15 (0.4)^10
We can find this probability from the binomial function provided in Excel. Follow the steps below.
- Click on f(x) and choose the binomial distribution (BINOM.DIST). A dialogue box will appear.
- Four boxes are to be filled:
1. Number_s: the number of successes or favourable events; in our case, 15.
2. Trials: the total number of trials in the experiment = 25.
3. Probability_s: the probability of success in a single trial = 0.6.
4. Cumulative: choose one of the two options. For the exact probability of x = 15, enter FALSE: P(x = 15) = 0.1611.
For the cumulative probability, enter TRUE: P(x ≤ 15) = 0.5753.
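The same binomial probabilities can be reproduced in Python. A minimal sketch using scipy.stats:

    from scipy.stats import binom

    n, p = 25, 0.6
    print(binom.pmf(15, n, p))   # P(x = 15) ~ 0.1612, the exact probability
    print(binom.cdf(15, n, p))   # P(x <= 15) ~ 0.5754, the cumulative probability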
Poisson Distribution
This is another basic probability distribution of a discrete variable. A Poisson process describes discrete occurrences of an event of interest over an interval: for example, the number of printing mistakes on a page, the number of vehicles passing a specific point in a short time interval, or the number of patients suffering side effects of a COVID vaccine. In short, the number of objects or subjects occurring in a fixed interval is a Poisson variable, taking values 0, 1, 2, … without an upper limit.
The probability of occurrence of the event of interest is very small, each occurrence is independent of the others, and the number of potential occurrences is unlimited. The expected number of occurrences remains the same throughout the experiment.
The Poisson distribution has only one parameter, lambda (λ). The probability function is
P(x) = e^(−λ) λ^x / x!, x = 0, 1, 2, …
Mean of the distribution = λ
Variance of the distribution = λ
Standard deviation = √λ
Because this is a theoretical distribution, the probabilities are provided by Excel's built-in function (POISSON.DIST). To use Excel for Poisson probabilities:
- Click on f(x) and choose the Poisson distribution.
- A dialog box will pop up. Enter the value of x, the number of events.
- Enter the mean: the expected number of events, i.e., the value of lambda.
- Lastly, for the point probability P(x = a) choose FALSE; for the cumulative probability P(x ≤ a) choose TRUE.
For example, if lambda = 5 and we want the probability that x = 8:
P(x = 8) = 0.065
P(x ≤ 8) = 0.9319
The shape of the Poisson distribution for different values of λ is shown in the following figure.
Conditions for using Poisson Distribution
- Variable is discrete.
- Probability of occurrence of favourable event in a specific interval is very small.
- The average number of occurrences remains the same throughout the experiment.
Example: On average, 2 riders are issued a memo for violating traffic rules at a particular traffic signal in any half-hour interval. What is the probability that exactly 5 riders violate the rules and get a memo in such an interval?
Because the probability that any individual rider violates the rules is small and the number of riders on the road is practically unlimited, we can apply the Poisson distribution.
Solution: The random variable x is the number of riders getting a memo, with lambda = 2. We need P(x = 5). From the Poisson table (or Excel), P(x = 5) = 0.036.
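As with the binomial case, the Poisson probabilities quoted above are easy to verify in Python. A minimal sketch using scipy.stats:

    from scipy.stats import poisson

    print(poisson.pmf(5, 2))   # P(x = 5) with lambda = 2: ~0.036 (traffic-memo example)
    print(poisson.pmf(8, 5))   # P(x = 8) with lambda = 5: ~0.065
    print(poisson.cdf(8, 5))   # P(x <= 8) with lambda = 5: ~0.932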
Sampling Methods
In most studies the population size is very large, and it is not possible to enumerate every subject or object in it. At the same time, we want to know the population characteristics. The option is to draw a sample of appropriate size; the sample should be representative of the population and describe all its characteristics.
There are a few methods of drawing a sample from a population, and the choice of method depends on the type of population. We shall discuss the basic methods of sampling.
Before we discuss the sampling methods, let us understand the reasons for a sample study instead of a full population enumeration.
- Feasibility: It is feasible to study sampling units within constraints such as time and cost, and feasible to approach all types of people in the defined population.
- Time: Any study has a time context; it is not relevant if its findings come too late, because by then the population characteristics have changed and the findings are meaningless. Thus, any study of the target population has to be completed within a time frame.
- Cost: A full population study is expensive. A sample is a fraction of the population, so the cost of studying the population through a sample is much less.
- Scope: The scope for studying sample units is much greater than in a population study. Because the sample is much smaller than the population, a detailed study of each unit is possible; if needed, expensive equipment and qualified personnel can be used in the survey.
- Non-sampling errors: Errors not due to sampling, such as data entry mistakes, wrongly filled responses, or an inappropriate instrument, are known as non-sampling errors. In a smaller, well-controlled sample study these errors are easier to keep in check.
- Better accuracy: The general belief is that a population enumeration gives accurate results, but in reality it does not, especially when the population is large. In a sample study, the savings in cost and time allow more qualified personnel and better technology to be used, so a sample survey can provide better accuracy in its findings.
Sampling Methods
Having realised the need for sampling, we need to know how to draw a sample from the population. The foremost step is to prepare a sampling frame, which makes it easy to draw the sample. A sampling frame is simply a list of all units in the population: a list of the employees in an organization, a list of the customers of a service, or a list of all the students in a university are a few examples. Having prepared the sampling frame, it is easy to draw a random sample. We shall discuss here the most basic sampling methods.
The method used depends upon the type of population and the objective of the study. The sampling methods are as shown in the figure.
There are two possibilities while drawing a sample. If units drawn into the sample are replaced back into the population, it is known as sampling with replacement; in this method there is a possibility of drawing the same unit more than once.
If a unit drawn into the sample is not placed back into the population, it is known as sampling without replacement; here each unit in the sample is unique, and no unit is repeated.
Simple Random Sampling
As the name indicates, the simple random sampling method is very simple. The condition for using it is that the population is homogeneous.
For example, while designing the vaccination plan, the risk of COVID infection was believed to be the same among all people of age 60 and above. A researcher who wants to study the immunisation effect of the vaccine can therefore use simple random sampling.
Whenever the population is perceived to be uniform with respect to the characteristic being studied, a simple random sample is appropriate. How is the sample drawn? Use an available sampling frame, or prepare one for the population under consideration. Having listed and numbered the units in the population, use the random number generator provided in Excel to draw a sample of the appropriate size. (What the size of the sample should be is discussed later.) Let us understand this through the following example.
In one particular society in Ahmedabad, 500 persons are residing, of whom 275 are of age 60 and above. The Ahmedabad Municipal Corporation wants to take a sample of size 50. The steps to be followed are:
- Prepare a list of the 275 persons in the target population, numbered 1 to 275.
- Click on f(x) and choose the option RANDBETWEEN.
- Enter the smallest number as BOTTOM = 1 and the highest number as TOP = 275.
- A random number will be generated between 1 and 275.
- Drag this cell down until the sample size is 50.
- Copy these numbers and Paste Special (values only) into the next column to fix the sample, because the numbers will change on recalculation if they are not fixed.
A random sample of size 50 is thus drawn. If the first number chosen is 255, the person numbered 255 is chosen in the sample; the other numbers are interpreted similarly.
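The same draw can be made with Python's standard library. A minimal sketch; random.sample draws without replacement, so no person is selected twice:

    import random

    population = list(range(1, 276))        # persons numbered 1..275
    sample = random.sample(population, 50)  # simple random sample of size 50
    print(sorted(sample))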
The notations used in simple random sampling are:
 | Population | Sample |
Size | N | n |
Mean | μ | x̄ |
Variance | σ² | s² |
Proportion | P | p |
What should be the size of the sample?
In general, the sample should be such that it describes all the characteristics of the population. At the same time, there are criteria for deciding the size of the sample.
Criteria for choosing sample size
- It is within budget.
- The sample survey and the collection and analysis of data can be done within the time frame.
- The size should be such that non-sampling errors are minimal.
- The sample should cover all segments of the population.
Stratified Random Sample
When the population under consideration is not homogeneous, it is known as heterogeneous. In that case, to draw a sample, the heterogeneous population is divided into homogeneous groups, or strata. Each stratum is treated as a population in itself, and a simple random sample is drawn from it as discussed before. The combined sample is known as a stratified sample.
Let us consider the population of mobile users. This population is heterogeneous, as all age groups and all professions use mobile phones. Users can be divided according to profession or age group. Say the population is divided by profession into academicians, business people, service workers and labourers, with group sizes 200, 500, 800 and 1000 respectively, so the population size is 2500. Suppose a sample of 400 is to be drawn from this population.
There are two ways to do this. (1) Irrespective of the size of each group, a sample of 100 from each group can be drawn randomly using the random number generator. (2) The random sample from each group can be made proportionate to the group size; this is known as a proportional stratified sample. For the proportional version, let us first calculate each group's proportion of the population:
G1: 200 : 200/2500 = 0.08 of the population
G2: 500 : 500/2500 = 0.20 of the population
G3: 800 : 800/2500 = 0.32 of the population
G4: 1000 : 1000/2500 = 0.40 of the population
Now draw the sample in these proportions from each group, which gives every group proportional representation in the sample:
Sample from group G1: 32, G2: 80, G3: 128, G4: 160
The total sample size is 400.
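A minimal Python sketch of the proportional allocation and the within-stratum draws, using the group sizes from this example:

    import random

    groups = {"academicians": 200, "business": 500, "service": 800, "labourers": 1000}
    population_size = sum(groups.values())   # 2500
    sample_size = 400

    for name, size in groups.items():
        n_stratum = round(sample_size * size / population_size)  # proportional allocation
        units = random.sample(range(1, size + 1), n_stratum)     # SRS within the stratum
        print(name, n_stratum)   # 32, 80, 128, 160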
Systematic Random Sample
As the name suggests, this is a very systematic method of drawing a sample, convenient and relatively simple to administer. Decide the size of the sample. If the size of the sample is n and the size of the population is N, compute the sampling interval k = N/n. Choose a number randomly between 1 and k; this is the first unit of the sample. Thereafter, sample elements are selected at the constant interval k from the ordered frame.
Consider a population of size 100 from which a sample of size 10 is to be drawn, so k = 10. Suppose the number chosen randomly between 1 and 10 is 4. Then the next unit in the sample is 14, and subsequent units are chosen at intervals of 10. The entire sample is:
4, 14, 24, 34, 44, 54, 64, 74, 84, 94.
Systematic sampling in practice: If the population size is known, the sample size determines the value of k as discussed before. If the population size is not known, one chooses a number, day, or hour at random and draws the systematic sample from there. For example, a restaurant can consider the number of customers at 8 o'clock every evening for planning resources; a production unit may draw a sample unit every half hour; a service unit may examine every 100th bill to study the consumption of services. There are ample examples where systematic sampling is possible.
Caution: While using the systematic sampling procedure, one should be aware that if there is natural periodicity in the list of units, systematic sampling is not an appropriate method of drawing a random sample.
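A minimal Python sketch of the systematic draw for the N = 100, n = 10 example:

    import random

    N, n = 100, 10
    k = N // n                          # sampling interval, k = N/n = 10
    start = random.randint(1, k)        # random start between 1 and k, e.g. 4
    sample = list(range(start, N + 1, k))
    print(sample)                       # e.g. [4, 14, 24, ..., 94]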
Cluster Sampling
This fourth sampling procedure is used when the population is large and heterogeneous and the purpose of sampling is to capture all types of population units and their characteristics. In this case a sampling unit can also be a geographic area.
To draw a cluster sample, the population is divided into non-overlapping clusters or areas, each a miniature of the population. A subset of the clusters is selected randomly for the sample. Clusters are formed so that they resemble one another, while each cluster is internally heterogeneous, with a pattern identical to the population. Thus, a random sample of clusters exhibits both the demographic and socioeconomic characteristics of the population.
Cluster sampling is convenient for geographically dispersed populations. Where travel is needed for the survey, it is less expensive and easier to contact the sample units, and it simplifies the administration of the survey.
On the other hand, cluster sampling is statistically less efficient when the cluster elements are similar, and the costs and problems of statistical analysis are greater than with simple random sampling.
Two-stage cluster sampling: When clusters are large, they are further divided into sub-clusters before the basic units are selected. For example, a sample of cities in a country can be divided further into blocks, and a sample of blocks used for the selection of individuals.
Sampling in practice: Quite often it may not be possible to employ a single method for drawing the sample, and a combination of two or more methods is used.
For example, if HSBC bank wants to understand customer satisfaction in India, SRS can be used first to select cities across India, and then a stratified random sample of customers can be drawn covering all categories.
Non-Random Sampling Methods
So far we have discussed random sampling methods, which depend on the type of population and try to ensure that sample units are selected without any prejudice. Non-random sampling methods select population units for a specific reason, using one's own judgement or policy. This is much simpler, as the process of selecting units for the sample is predefined. There are four major methods: convenience sampling, judgmental sampling, quota sampling and snowball sampling.
Convenience Sampling: Sample units are selected according to the investigator's convenience. It is the least expensive and least time consuming method, but the sample could be biased and may not be truly representative of the population. For example, for consumer research an investigator may go to shopping malls and take responses from whoever is available.
Judgmental Sampling: Units in the sample are selected based on the researcher's judgment that particular individuals represent the population. This method is mostly used for pilot research in marketing. It is a different form of convenience sampling, and the sample drawn could likewise be biased.
For example, market research on value-added services (VAS) such as entertainment may select youths from one's neighbourhood or organization for testing the design.
Quota Sampling: As the name suggests, quotas are decided by considering various characteristics of the target population, such that the composition of the sample is similar to the composition of the population. This method is a combination of judgmental and convenience sampling.
Snowball Sampling: Initial respondents are selected at random and asked to identify other respondents who belong to the target population. Subsequent respondents are selected based on referrals, leading to a snowballing effect. This sample is biased, since referrals depend upon acquaintance.
Sampling and Non-Sampling errors
Because a sample is a fraction of the population, errors are likely to occur, both sampling and non-sampling. Sampling errors are errors due to the sampling itself. The reasons could be: (1) the sample is not representative of the population; (2) the sample is drawn by an inappropriate selection procedure; (3) the sampling procedure is not suited to the population; (4) the sample is too small or too large.
Non-sampling errors are errors not related to sampling. The reasons for these errors could be: (1) missing data, recording, data entry and analysis errors; (2) poorly conceived concepts, unclear definitions and defective questionnaires; (3) response errors, which occur when people do not know, will not respond, or overstate in their answers.
Statistical Inference
We have discussed probability distributions and how to draw a sample from the population. The next question is how to use the sample findings. Sample findings are used to understand population characteristics: we infer the population parameters based on sample findings. This entire process is known as statistical inference.
Inferential statistics is about finding the best statistics with known error. It includes statistical estimation and statistical testing of hypotheses. Estimation methods provide point estimates and interval estimates with the desired accuracy, while testing of hypotheses confirms the accuracy level of the estimates.
Statistical Estimation: As mentioned before, there are two types of estimates: point and interval estimates. A point estimate is a single value calculated from sample data and used for estimating the population parameter. For example, the sample mean x-bar is used for estimating the population mean, and the sample variance for estimating the population variance. The question here is: what is the best estimator of a parameter?
A statistical estimator should be unbiased, consistent, efficient and sufficient. We will try to understand these properties of an estimator without getting into detailed statistical derivations.
Unbiased estimator means that, on average, the estimator is expected to equal the population parameter: the expected value of the sample mean equals the population mean, and the expected value of the sample variance equals the population variance. The unbiased estimator of the population variance is s² = Σ(xi − x̄)²/(n − 1).
Consistent estimator means that the accuracy of estimation increases as the sample size increases; that is, the estimator approaches the value of the parameter as the sample size grows. Both x̄ and s² are consistent because, as n increases, they move towards the true values of the parameters.
Efficient estimator means the estimator with minimum variance among the class of estimators. Again, x̄ and s² are efficient estimators for the population mean and population variance.
Sufficient estimator is an estimator that uses all the information about the parameter contained in the sample. Both x̄ and s² are theoretically proved to be sufficient estimators for the population mean and population variance.
So far we have discussed estimators for the population mean and variance only, because these are the ones used most of the time.
Interval Estimation
Whenever we estimate a population parameter, we need to allow some margin of error, since a point estimate is based on the findings of one sample and may or may not be exactly equal or close to the population parameter. The question is what the margin of error for the point estimate should be. This margin of error, applied on both sides, gives us an interval estimate for the population parameter.
The margin of error can be an affordable or desirable value. How do we decide it? Based on experience, one can decide what is feasible and financially affordable in production or services, but this may become a debatable issue among peers.
Statisticians provide a very objective solution. Using our knowledge of sampling distributions, we can determine the margin of error for a given accuracy level, that is, the desired probability that the population parameter falls within the interval.
In interval estimation, an interval is constructed around the point estimate, and it is stated that this interval is likely to contain the corresponding population parameter. The point estimate for the population mean µ is the sample mean x̄, which follows a normal distribution with mean µ and variance σ²/n. The interval estimate for the population mean µ is
x̄ ± zα/2 σ/√n.
Questions we need to address here are;
How reliable these intervals are for estimating population mean? What is the probability that population mean will fall into this interval? What is confidence level for the estimated interval?
The value of z will depend upon the answers to these questions.
The probability that the population mean falls into the interval is 1 − α.
This is interpreted as: the probability that the population mean falls into the interval
[x̄ − zα/2 σ/√n, x̄ + zα/2 σ/√n] is 1 − α; this is known as the confidence interval with confidence coefficient 1 − α.
If σ is not known, the sample standard deviation s is used. The confidence interval to estimate µ when n is large and σ is unknown is
x̄ ± zα/2 s/√n.
The confidence level associated with a confidence interval states how much confidence we have that this interval contains the true population parameter. The confidence level is denoted by (1 − α)100%, where α is the desired proportion of observations to be rejected; α is known as the significance level.
For example, α = 0.05 means that 5% of the observations are rejected and 95% are accepted. For a given α, the value dividing the data into the 5% and 95% regions is the critical value.
Let us understand how to find the confidence interval through a few examples.
- Gujarat has 33 districts. To estimate the number of vaccinated persons, a survey is conducted among 400 persons in each district. The average percentage of vaccinated persons in a district is 80% and the standard deviation is 15%. What is the confidence interval that assures a 95% chance that the population average falls within it?
Solution: Sample mean x̄ = 80%, population standard deviation σ = 15%, confidence coefficient 1 − α = 0.95.
Use the formula Prob(x̄ − zα/2 σ/√n ≤ µ ≤ x̄ + zα/2 σ/√n) = 1 − α.
The margin of error zα/2 σ/√n is found via the normal tables, or in Excel as follows:
- Click on f(x) and choose CONFIDENCE.NORM
- Enter the values: α = 0.05, std σ = 15, sample size n = 400
- CONFIDENCE.NORM returns the margin of error = 1.4699
- The 95% confidence interval is [80 − 1.4699, 80 + 1.4699] = [78.53, 81.47]
If the sample size is small and the population σ is unknown, the t statistic is used to determine the confidence interval. In this case the sample mean has a t distribution, as discussed before.
The confidence interval for µ when the sample size is ≤ 30 and σ is unknown is
x̄ ± tα/2,n−1 s/√n.
To find tα/2,n−1, the t table is used as discussed in the earlier section.
Suppose the number of mistakes per day made by a typist working in a firm is recorded for 14 days. Given the data, calculate the 99% confidence interval for the population mean.
Data : 3,1,3,2,5,1,2,1,4,2,1,3,1,1
Solution: Sample size n = 14, α = 0.01, 1 − α = 0.99, sample mean x̄ = 2.14, s = 1.29.
d.f. = n − 1 = 13; the two-tailed value of t for 13 d.f. is 3.012.
The 99% confidence interval is 2.14 ± 3.012 × 1.29/√14 = [1.10, 3.18].
With 99% confidence, the typist's average number of mistakes per day lies between 1.10 and 3.18.
Sampling Distribution of the Proportion
The sample proportion p̂ = x/n is the point estimate of the population proportion p, where x is the number of elements of interest in a sample of size n.
Expected value of p̂ = p
Standard deviation of p̂ = √(p(1 − p)/n)
Testing Hypothesis
In this section, we hypothesise a value of the population parameter based on past data or a desirable value for the population, and draw a sample of appropriate size. The process of testing a hypothesis involves: deciding the point estimate of the population parameter, identifying the distribution of the estimate, and deriving the interval with the required confidence level. If the sample statistic falls within this interval, we accept the hypothesised value; otherwise we reject it.
Let me explain this with a simple example. An umbrella seller has to stock umbrellas for the coming monsoon. The question for him is how many he should stock. He looks at the figures for the last 10 years and finds that the average sale of umbrellas was 1500 per year and the average variation (standard deviation) was 15. He decides to order 1600 umbrellas. Do you think he was right in his decision? Seeking the answer is the process of testing a hypothesis. The seller hypothesises a sale of 1500 umbrellas but orders 1600. We want to confirm whether the decision was right or wrong.
The statistical procedure of testing a hypothesis helps decide whether the difference between two population parameters, between two sample means, or between a hypothetical value of a parameter and the sample mean is acceptable.
We have two types of hypotheses;
Null Hypothesis H0: A tentative assumption about population parameter
Alternative Hypothesis Ha: This is an alternative statement to null hypothesis
Forms of Hypothesis:
Hypothesis for population mean
- H0: µ = 20 v/s Ha: µ ≠ 20 (two tailed)
- H0: µ = 20 v/s Ha: µ > 20 (one tailed)
- H0: µ = 20 v/s Ha: µ < 20 (one tailed)
Hypothesis for population proportion
- H0: p = 0.4 v/s Ha: p ≠ 0.4 (two tailed)
- H0: p = 0.4 v/s Ha: p > 0.4 (one tailed)
- H0: p = 0.4 v/s Ha: p < 0.4 (one tailed)
Three steps are involved in testing of hypothesis;
- Establishing the hypothesis
- Conducting the test
- Determining the business implications
Let us try to understand how to conduct the test through a few examples.
Example 1: A consumer of Vodafone Idea, a mobile service provider, has to plan his monthly expenses. He allots Rs. 800 with a variation of 50. When actual data usage is examined, it is found that the average monthly bill for data use was 900 with a standard deviation of 25 over the past 12 months. Was his assumption while planning right?
Answer: It is natural to assume that the bill amount follows a normal distribution with mean 800 and standard deviation 50. The null and alternative hypotheses can be formulated as H0: µ = 800 v/s Ha: µ ≠ 800.
The rejection region for rejecting the null hypothesis lies in the tails of the normal curve, and the acceptance region for accepting the null hypothesis lies in the centre.
How to decide Rejection or Acceptance Region?
We take the decision of accepting or rejecting a hypothesis based on sample findings, so it is likely that we might commit a mistake in decision making. There are four possibilities: (1) the null hypothesis is accepted when it is true; (2) the null hypothesis is rejected when it is true; (3) the null hypothesis is accepted when it is not true; (4) the null hypothesis is rejected when it is not true.
This can be displayed in the following table.
Decision | H0 is true | H0 is not True |
Accept H0 | Correct | Error Type II |
Reject H0 | Error Type I | Correct (Power) |
As seen in the table, we may commit errors in two situations. These errors are classified as Type I error and Type II error.
Type I Error
- Rejecting a true null hypothesis
- The probability of committing a Type I error is called α, the level of significance.
Type II Error
- Accepting false null hypothesis
- The probability of committing a Type II error is called β.
In two situations we are likely to make mistakes, and in two situations we are likely to make the correct decision. When the null hypothesis is not true and is rejected, this correct decision is known as the power of the test.
The objective in testing a hypothesis is to minimize the errors. It is not possible to reduce both errors simultaneously. The general practice is to fix the value of α and minimize the value of β. The normal practice is to fix α = 0.1 or 0.05, i.e., a significance level of 10% or 5%.
Steps involved in Testing of Hypothesis
1. Establish hypotheses: state the null and alternative hypotheses.
2. Determine the appropriate statistical test and sampling distribution.
3. Specify the Type I error rate (α).
4. State the decision rule.
5. Gather sample data.
6. Calculate the value of the test statistic.
7. State the statistical conclusion.
8. Interpret your conclusions with reference to the problem.
Testing Hypothesis for the Population Mean when σ is Known:
- When the population has a normal probability distribution with known σ, the sample mean is also normally distributed, irrespective of the sample size.
- If the population is not normal and the sample size is large, the sample mean is still approximately normal.
These results are theoretically proved in statistics. We do not discuss the statistical derivations in this text but do use these results.
Example 1: The average income of farmers in Dang district in Gujarat was believed to be 74914 with standard deviation σ = 14530. A researcher wants to confirm whether this is true. A random sample of 112 farmers was drawn, and the sample mean was found to be 78695. Test the hypothetical value of the average income of the farmers.
Answer: It can be assumed that the income of the farmers follows a normal distribution, as some will earn more than the average and some less.
H0: µ = 74914 v/s Ha: µ ≠ 74914. The population standard deviation is known, σ = 14530.
The appropriate statistic is z = (x̄ − µ)/(σ/√n).
P(Type I error) = α = 0.05 and α/2 = 0.025.
Because it is a two-tailed test, we need to consider α/2 = 0.025.
The acceptance and rejection regions for this two-tailed test lie in the centre and the two tails of the standard normal curve, respectively.
Acceptance region is 0.95. We need to find the critical value.
Zcritical = 1.96 for two tailed test when a=.05
We need to calculate the value of the z statistic using the above formula. If the value of z falls into the acceptance region, the hypothesis is accepted; if it falls into the rejection region, the null hypothesis is rejected.
z = (78695 − 74914)/(14530/√112) = 2.75 > 1.96, so it falls into the rejection region. The null hypothesis H0: µ = 74914 is rejected because there is a significant difference between the hypothetical mean and the sample mean.
Average income of the farmers is more than 74914.
Example 2: In the above example, suppose the researcher wants to test whether the average income of the farmers is greater than the hypothetical average of 74914. Test the hypothesis at the 0.05 significance level.
Answer: In this case,
H0: µ = 74914 v/s Ha: µ > 74914. The population standard deviation is known, σ = 14530.
P(Type I error) = α = 0.05
Zα =1.645
Rejection region is only on the right tail.
z = 2.75 > 1.645, which means the sample mean falls in the rejection region. The hypothesis is rejected.
The sample evidence indicates that the hypothetical value is not true: there is a significant difference between the hypothetical average and the sample mean.
Example 3: In the above example, suppose the researcher wants to test whether the average income of the farmers is less than the hypothetical average. Test the hypothesis at the 0.05 significance level.
Answer: In this case,
H0: µ = 74914 v/s Ha: µ < 74914. The population standard deviation is known, σ = 14530.
P(Type I error) = α = 0.05. Sample mean = 73914.
Zα = - 1.645
In this case rejection region is on the left side.
z = (73914 − 74914)/(14530/√112) = −0.73, which is greater than −1.645 and falls into the acceptance region. There is no significant evidence that the average income is below the hypothetical value, so the null hypothesis is accepted.
Example 4: A department store wants to introduce festival discounts for its big spenders. The previous year's bills for the same period indicated that the average bill amount per day was Rs 1400 with a standard deviation of 980. Based on this, the store manager decides to give a 20% discount to all those who spend more than Rs 4000. After the current season was over, he asked Mr. Rajat from the IT department to test whether he was right to hypothesise the same mean and standard deviation as last year. How will the IT department proceed?
Answer: The random variable is the amount of the bill.
Null hypothesis H0: Average bill amount = Rs 1400, with standard deviation 980.
Alternative hypothesis Ha: Average bill amount ≠ Rs 1400.
A random sample of 100 bills was drawn.
Sample mean = 1181.98
The z statistic is appropriate here: z = (1181.98 − 1400)/(980/√100) = −2.2246.
Sample Mean | 1181.98 |
sample size | 100 |
Hypothesized Mean | 1400 |
Known standard deviation | 980 |
z statistic | -2.2246 |
Conclusion: The z statistic is negative, which means the sample mean is smaller than the hypothesised mean.
Since −2.2246 < −1.96, it falls into the rejection region, and the hypothesis that the average bill amount is Rs 1400 is rejected.
The average bill amount is significantly lower than Rs 1400, so the manager's assumption did not hold.
Using the p value method: An alternative way to reach the conclusion of rejecting or accepting the hypothesis is to compare the p value with the desired significance level α. The p value is the observed significance level: the probability in the right or left tail beyond the actual sample statistic. The observed significance level is compared with the theoretical (desired) significance level to decide whether to accept or reject the hypothesis. Most software, including Excel, provides the p value. Let us understand the use of the p value with an example.
Let us consider again the same example1 we have discussed before.
H0: µ = 74914 v/s Ha: µ ≠ 74914. The population standard deviation is known, σ = 14530.
The appropriate statistic is z = (x̄ − µ)/(σ/√n).
P(Type I error) = α = 0.05 and α/2 = 0.025.
Because it is a two-tailed test, we need to consider α/2 = 0.025.
As we have calculated earlier, z=2.75.
p value = p(z>2.75) = 1-0.99702=0.00298
(Figure: standard normal curve showing the p value area p(z > 2.75) = 0.00298 beyond z = 2.75, and the desired tail area α/2 = 0.025 beyond the critical value 1.96.)
Since 0.00298 < 0.025, the p value is less than the desired value. The conclusion is that the z statistic falls in the rejection region and the null hypothesis is rejected: the average annual income of the farmers is more than what was hypothesised.
t test for comparing the means
For comparing means, the t test is very common. When the population standard deviation is not known and is estimated from the sample, the t test is used, because the t statistic is a function of the sample variance. Use of the t test arises in the following two situations:
- comparing a hypothesised population mean with a sample mean
- comparing two population means
We will understand the procedure of the t test through a few examples of each situation.
Example 1: Comparing a population mean with a sample mean.
A local vendor who provides AC repair services claims that a technician will attend to the customer within 6 hours during the day. A sample of 12 customers was asked how long it took for the technician to attend to them. The data are given in the following table.
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
3.03 | 6.33 | 6.5 | 5.22 | 3.56 | 6.76 | 7.98 | 4.82 | 7.96 | 4.54 | 5.09 | 6.46 |
Is there sufficient evidence to support the claim made by the vendor?
In this example, the time taken by the technician is the random variable. The sample size is small and the population standard deviation is not known. If we consider 6 hours as the population mean, then we are comparing the population mean with the sample mean to check whether the claim is true. Degrees of freedom = 11 and α = 0.05; for 11 d.f. and α = 0.05, the critical value of t is 1.796.
H0 : µ<=6 hours
Ha : µ>6 hours
Sample findings are as given in the following table.
Mean | 5.6875 |
Std Deviation | 1.580 |
t = (5.6875 − 6)/(1.580/√12) = −0.685. Since t actual < t critical, the hypothesis is accepted.
The claim made by the vendor is supported: the service time is at most 6 hours.
Considering the p value: P(t > −0.685) = 0.746.
Since the p value > 0.05, the claim made by the vendor stands.
Example 2: Independent Samples t test (unequal variance)
Profit per employee in public sector and private sector banks is to be compared. Because the population standard deviations are unknown, the t test is the appropriate statistic. A sample of 21 banks from the public sector and 22 banks from the private sector is used (see the output below). The data are shown in the excel sheet.
H0: Average Profit per employee in public sector banks = Average Profit per employee in Private sector banks
Ha: Average Profit per employee in public sector banks ≠ Average Profit per employee in Private sector banks
This can be done with the Data Analysis feature of Excel. Steps to follow:
- Add Data Analysis from Add-ins
- Click on Data Analysis
- Choose t-Test: Two-Sample Assuming Unequal Variances
- Select the arrays from the data sheet
- Tick the Labels option
Output is shown in the following table.
t-Test: Two-Sample Assuming Unequal Variances
| Public Sector Profit Per Employee (Rs lakh) | Private Profit Per Employee (Rs lakh) |
Mean | 6.876666667 | 5.851363636 |
Variance | 8.623343333 | 6.033631385 |
Observations | 21 | 22 |
Hypothesized Mean Difference | 0 |
df | 39 |
t Stat | 1.238914667 |
P(T<=t) one-tail | 0.111390942 |
t Critical one-tail | 1.684875122 |
P(T<=t) two-tail | 0.222781883 |
t Critical two-tail | 2.02269092 |
How to interpret the output?
The statistic t = 1.2389 is the calculated value. t critical is given for both one-tailed and two-tailed tests. Because the t distribution is symmetric, the critical value on one side is enough for interpretation.
- For the one-tailed test, t critical = 1.6848 and the corresponding p value = 0.11139. The p value is larger than the significance level 0.05, and t = 1.2389 < 1.6848, so the null hypothesis is accepted.
- For the two-tailed test, p value = 0.2227 and t critical = 2.02269. Again the p value is larger than 0.05 and |t| is below the critical value, so the null hypothesis is accepted.
Conclusion: Both the one-tailed and two-tailed tests reveal that the null hypothesis cannot be rejected. There is no significant difference between the mean profit per employee in public sector banks and in private sector banks.
Example 3. Independent Samples t test (equal variance) :
Let us consider road accidents in 34 states in 2010 and 2011. We want to compare these two data sets, and we do not know the population variances. For comparing the two years by comparing their means, we use the t test assuming equal variances: we assume that the variance of accidents is equal in both years. The data are in the accompanying excel sheet.
The analysis can be done with the Data Analysis feature of Excel. Steps to follow:
- Add data analysis from add in
- Click on data analysis
- Choose t test independent samples and equal variance
- Select the arrays from the data sheet
- Click on label.
Output of the analysis:
t-Test: Two-Sample Assuming Equal Variances
| Year 2010 | Year 2011 |
Mean | 14694.82353 | 14637.82353 |
Variance | 396775432.4 | 384203820.1 |
Observations | 34 | 34 |
Pooled Variance | 390489626.2 |
Hypothesized Mean Difference | 0 |
df | 66 |
t Stat | 0.011893086 |
P(T<=t) one-tail | 0.495273396 |
t Critical one-tail | 1.668270514 |
P(T<=t) two-tail | 0.990546791 |
t Critical two-tail | 1.996564419 |
Both the one-tailed and two-tailed tests indicate that the p value is much greater than the significance level 0.05. Thus, the null hypothesis is accepted: at the 5% significance level, governments can claim that the number of road accidents across the states is the same in the two years.
Example 4: Paired samples t test.
This test is used when the samples can be paired and the researcher wants to compare the means of the two samples.
Consider the data on GDP for the years 1951-52 to 2011-12, given separately for the manufacturing and service sectors. The researcher is interested in whether manufacturing and services contribute equally to gross GDP. Since the two values in each year belong to the same period, this can be treated as a paired sample.
Output of the analysis:
t-Test: Paired Two Sample for Means
| Manufacturing (in Rs. Cr.) at 2004-05 Prices | Services (in Rs. Cr.) at 2004-05 Prices |
Mean | 217444.5669 | 701209.1751 |
Variance | 47826222871 | 6.47348E+11 |
Observations | 62 | 62 |
Pearson Correlation | 0.996239103 |
Hypothesized Mean Difference | 0 |
df | 61 |
t Stat | -6.48903806 |
P(T<=t) one-tail | 8.68236E-09 |
t Critical one-tail | 1.670219484 |
P(T<=t) two-tail | 1.73647E-08 |
t Critical two-tail | 1.999623585 |
For both the one-tailed and two-tailed tests, the p value is far smaller than the 0.05 significance level. Thus, the null hypothesis that the contributions are the same is not true: there is a significant difference between the contributions of the manufacturing and service sectors, with the service sector contributing significantly more.
Chi-Square Test
Chi-square is about testing a hypothesis about a variance and testing the independence of attributes. As discussed earlier, whenever we want to test the population variance against the sample variance, the chi-square statistic is used, because chi-square is a ratio of sums of squares:
χ² = (n − 1)s²/σ² > 0
We have discussed the chi-square distribution. In this section we discuss three different applications of the chi-square statistic.
- Chi-square test for testing a hypothesis about the population variance
Example 1: The Government of Andhra Pradesh wanted insight into the population of its urban agglomerations. Town planners thought the variance of population across urban agglomerations and outgrowths should be 50000000000.
A random sample of 25 agglomerations and outgrowths was drawn and their populations were recorded, as shown in the accompanying sheet.
Given the data, confirm whether the planners' assumed variance is right.
In this case, the null hypothesis is H0: σ² = 50000000000 v/s Ha: σ² ≠ 50000000000
Sample size n = 25
Sample variance s² = 47423218481.2267
χ² = (n − 1)s²/σ² = (24 × 47423218481.2267)/50000000000 = 22.76
χ²(0.05, 24) = 13.84 and χ²(0.95, 24) = 36.415
The chi-square value calculated from the sample falls between these two critical values, i.e., in the acceptance region, so the hypothesis is accepted. The planners' assumption about the variance of population across the urban agglomerations is consistent with the data.
- Chi-square test for Goodness of fit
This test is done for discrete data. The chi-square test is used for testing the equality of observed and expected frequencies. The formula for calculating chi-square is
χ² = Σ (fi − ei)²/ei
where fi is the observed frequency and ei the expected frequency.
In this case the null hypothesis is H0: observed frequency = expected frequency,
Ha: observed frequency ≠ expected frequency.
Let us understand how to perform this test. Consider the major online vegetable and fruit stores. In a survey conducted by a market research company, the store each individual prefers to buy from was recorded, as given in the table.
Store | Number of buyers (fi ) |
Amazon Fresh | 200 |
Big Basket | 200 |
Jio Mart | 300 |
More | 250 |
Reliance fresh | 150 |
Total | 1100 |
The actual number of buyers is the observed frequency; we then calculate the expected frequency. In total, 1100 individuals responded to the survey. The expected frequencies are obtained by distributing the 1100 respondents according to the hypothesised proportions (here taken as the observed proportions rounded to two decimals).
Store | Number of buyers - (fi ) | Proportion number of buyers /1100=p | ei=Expected Number of buyers =1100*p | (fi-ei)^2/ei =Chi-square |
Amazon Fresh | 200 | 0.18 | 198 | 0.02020202 |
Big Basket | 200 | 0.18 | 198 | 0.02020202 |
Jio Mart | 300 | 0.27 | 297 | 0.03030303 |
More | 250 | 0.23 | 253 | 0.03557312 |
Reliance fresh | 150 | 0.14 | 154 | 0.1038961 |
Total | 1100 | 1 | 1100 | 0.2101763 |
Calculated chi-square = 0.21
The critical chi-square is to be found from the table with d.f. = 4 at significance level 0.05.
Chi-square critical = 9.49
Comparing the two: chi-square calculated < chi-square critical. The conclusion is to accept the null hypothesis that the observed and expected numbers of buyers are equal; buyers are distributed across the five stores in the hypothesised proportions.
Steps for finding the expected frequency:
1. Calculate the proportion of buyers for each store.
2. Calculate the expected number of buyers by applying these proportions to the total number of buyers.
3. Apply the chi-square formula given above.
- Chi-square test for independence of two attributes
The chi-square test is also used for testing the association between two attributes. Whenever attributes are measured in terms of numbers of individuals or objects and the researcher or analyst wants to understand the association between the two attributes, the chi-square test is used. It is therefore also classified as a non-parametric test.
Here too, observed and expected frequencies are compared, and the chi-square statistic is calculated with the same formula used earlier.
Example: A placement agency conducted a survey across 601 board members in industries of different types, wanting to know the discipline of board members in each organization. The degree of each board member was recorded for the Auto, Banking, FMCG, IT and Pharma industries, as in the following table.
Observed Frequencies, by Type of Industry
Degree of the board member | Auto | Banking | FMCG | IT | Pharma | Total |
Masters | 28 | 26 | 30 | 18 | 29 | 131 |
CA/PCA | 10 | 23 | 5 | 5 | 14 | 57 |
Engg | 30 | 27 | 23 | 20 | 20 | 120 |
MBA | 28 | 26 | 34 | 25 | 22 | 135 |
Economics | 16 | 11 | 13 | 13 | 14 | 67 |
Ph.D | 10 | 9 | 7 | 13 | 10 | 49 |
Law | 13 | 8 | 7 | 8 | 6 | 42 |
Total | 135 | 130 | 119 | 102 | 115 | 601 |
Each cell indicates the number of board members with the given degree. Of the 601 members, the maximum number hold an MBA degree and the minimum number hold a law degree.
The research question is whether industries look for board members from a specific discipline; in other words, is there an association between type of industry and degree? Let us do the chi-square test.
Null hypothesis, H0: There is no association between the two attributes; degree and industry type.
Ha: There is association between the two attributes; degree and industry type.
We follow the same steps as mentioned earlier. The expected frequency for each cell is calculated using the formula: expected frequency = (row total × column total)/grand total.
Expected Frequencies, by Type of Industry
Degree of the board member | Auto | Banking | FMCG | IT | Pharma | Total |
Masters | 29.4 | 28.4 | 26 | 22.2 | 25 | 131 |
CA/PCA | 12.9 | 12.4 | 11 | 9.7 | 11 | 57 |
Engg | 27 | 26 | 23.8 | 20.2 | 23 | 120 |
MBA | 30 | 29.3 | 26.8 | 23 | 25.9 | 135 |
Economics | 15 | 14.5 | 13.2 | 11.4 | 12.9 | 67 |
Ph.D | 11 | 10.6 | 9.8 | 8.2 | 9.4 | 49 |
Law | 9.5 | 9.1 | 8.4 | 7 | 8 | 42 |
Total | 134.8 | 130.3 | 119 | 101.7 | 115.2 | 601 |
The chi-square contribution for each cell is calculated using the same formula; the values are given in the table.
Chi-Square Values, by Type of Industry
Degree of the board member | Auto | Banking | FMCG | IT | Pharma | Total |
Masters | 0.07 | 0.22 | 0.53 | 0.98 | 0.55 | 2.36 |
CA/PCA | 0.84 | 4.89 | 7.2 | 4.42 | 0.64 | 17.99 |
Engg | 0.3 | 0.04 | 0.03 | 0 | 0.45 | 0.82 |
MBA | 0.14 | 0.42 | 1.52 | 0.16 | 0.69 | 2.94 |
Economics | 0.06 | 1.11 | 0 | 0.2 | 0.09 | 1.46 |
Ph.D | 0.1 | 0.28 | 1.12 | 1.77 | 0.04 | 3.31 |
Law | 0.94 | 0.15 | 0.28 | 0.13 | 0.67 | 2.17 |
Total | 2.46 | 7.11 | 10.69 | 7.65 | 3.13 | 31.04 |
The total, 31.04, is the value of chi-square for the entire data and is to be compared with the critical value of chi-square.
In this table there are 7 rows and 5 columns, so the degrees of freedom = (number of rows − 1) × (number of columns − 1) = 6 × 4 = 24.
The critical chi-square for 24 degrees of freedom and α = 0.05 is obtained from the table. The test is carried out two-sided here, so the probability area in each tail is 0.025 when the significance level is 0.05.
Chi-square on the left side = 12.40 and chi-square on the right side = 39.36.
The chi-square value obtained from the sample, 31.04, lies between 12.40 and 39.36, i.e., in the acceptance region. We therefore accept the null hypothesis: there is no significant association between type of industry and discipline of the board member.
Placement agencies can use this information when recommending board members: no industry shows a significantly different mix of disciplines, so candidates need not be matched to an industry by degree.
Alternative method for Chi-square test.
Organize the observed and expected frequencies in two columns:
Observed Frequency | Expected Frequency |
28 | 29.4 |
10 | 12.9 |
30 | 27 |
28 | 30 |
16 | 15 |
10 | 11 |
13 | 9.5 |
26 | 28.4 |
23 | 12.4 |
27 | 26 |
26 | 29.3 |
11 | 14.5 |
9 | 10.6 |
8 | 9.1 |
30 | 26 |
5 | 11 |
23 | 23.8 |
34 | 26.8 |
13 | 13.2 |
7 | 9.8 |
7 | 8.4 |
18 | 22.2 |
5 | 9.7 |
20 | 20.2 |
25 | 23 |
13 | 11.4 |
13 | 8.2 |
8 | 7 |
29 | 25 |
14 | 11 |
20 | 23 |
22 | 25.9 |
14 | 12.9 |
10 | 9.4 |
6 | 8 |
Now choose the CHISQ.TEST function from f(x). Select the two arrays of observed and expected frequencies as instructed in the pop-up dialogue box. The result is the p value for the test. The p value in this example is 0.71 > 0.05, so we conclude that the null hypothesis holds: there is no significant association between degree of the board member and type of industry.
F Test for Testing Variances
For comparing the variances of two populations based on sample evidence, the F test is used, as the F statistic is a function of the two sample variances: F = s1²/s2².
Let us consider the profit-per-employee data used earlier for the independent samples t test (Example 2). Now we want to compare the variances of the two populations based on the sample data.
H0 : variances of two populations are equal and Ha: variances of two populations are not equal
In the Data Analysis module in Excel there is an option F-Test Two-Sample for Variances. Click on it, select the required data, and the output is produced.
F-Test Two-Sample for Variances
| Public sector (profit per employee) | Private sector (profit per employee) |
Mean | 6.8025 | 5.871904762 |
Variance | 8.955609211 | 6.32556619 |
Observations | 20 | 21 |
df | 19 | 20 |
F | 1.415779859 |
P(F<=f) one-tail | 0.223396401 |
F Critical one-tail | 2.137008959 |
In the output table, the p value = 0.2233 > 0.05. Thus F falls in the acceptance region, and the null hypothesis that the population variances are equal is accepted.
Analysis of Variance
In practice, we very often need to compare the performance of groups of individuals or machines: for example, crop production in different districts of a state, sale of a product in different regions, or unemployment in different regions. In short, we very often need to compare two or more groups.
When we want to compare two groups, we use t test for testing the equality of means. When we want to compare more than two groups, analysis of variance (ANOVA) is the method of testing equality of the means.
Let us try to understand the method through the following examples. There are two methods of analysis of variance: one-way analysis of variance (one-way ANOVA) and two-way analysis of variance (two-way ANOVA).
One Way Analysis of Variance
A market researcher believes that the younger generation prefers the most advanced mobile phones, which are more expensive, and is ready to spend more. So the researcher collects data on mobile phone spend by different age groups. The age groups under consideration are 16-20, 21-30, 31-40 and 40-50. Data is collected from 40 respondents in each age group on the cost of the mobile phone each individual is using, as given in the table below.
Sr.no/ Age group | 16-20 | 21-30 | 31-40 | 40-50 | Sr.no/ Age group | 16-20 | 21-30 | 31-40 | 40-50 |
1 | 10029 | 21061 | 34131 | 25231 | 21 | 8289 | 15823 | 43343 | 39419 |
2 | 14871 | 24226 | 30049 | 39144 | 22 | 7805 | 44274 | 26178 | 7792 |
3 | 5552 | 59768 | 30159 | 35104 | 23 | 9596 | 17014 | 48815 | 16942 |
4 | 14668 | 36978 | 58888 | 39823 | 24 | 10484 | 51891 | 14578 | 23130 |
5 | 13192 | 46594 | 18015 | 37369 | 25 | 10350 | 43778 | 60123 | 28788 |
6 | 10790 | 38861 | 36512 | 8766 | 26 | 8407 | 15759 | 52377 | 38987 |
7 | 9076 | 31702 | 61608 | 36317 | 27 | 14242 | 45618 | 44623 | 19174 |
8 | 10359 | 43651 | 28413 | 27826 | 28 | 13089 | 33616 | 58875 | 28255 |
9 | 13114 | 31676 | 20480 | 13686 | 29 | 10197 | 39769 | 49616 | 13417 |
10 | 5758 | 8765 | 58454 | 6808 | 30 | 13724 | 44874 | 54397 | 26905 |
11 | 12073 | 36594 | 58887 | 31737 | 31 | 11014 | 15827 | 12797 | 6746 |
12 | 12854 | 19554 | 52357 | 11203 | 32 | 6385 | 50806 | 59277 | 24836 |
13 | 10586 | 45275 | 25330 | 31474 | 33 | 12739 | 12703 | 32472 | 17083 |
14 | 13800 | 28757 | 37847 | 14007 | 34 | 14056 | 18818 | 28285 | 29356 |
15 | 8830 | 16548 | 42494 | 21521 | 35 | 11233 | 29416 | 29284 | 26204 |
16 | 12333 | 52554 | 18490 | 31930 | 36 | 10547 | 9083 | 30513 | 10386 |
17 | 11466 | 44530 | 63743 | 30982 | 37 | 10278 | 14810 | 35463 | 16124 |
18 | 13536 | 32244 | 19195 | 25005 | 38 | 11630 | 46518 | 49377 | 14262 |
19 | 5235 | 31936 | 37077 | 14349 | 39 | 12433 | 17761 | 51157 | 19174 |
20 | 12283 | 24408 | 36143 | 39374 | 40 | 11697 | 25351 | 44936 | 15885 |
The first respondent in the age group 16-20 spends INR 10029; other values are interpreted similarly. This data can also be viewed as the cost of mobile phones for the respondents classified according to age group, which is why it is called a one-way classification of data.
The researcher's objective is to know whether the cost of the mobile phone a person uses depends on their age group. The research hypothesis is that all age groups spend an equal amount on mobile phones. The hypothesis to be tested should always be unbiased, meaning that the researcher does not favour any age group. This research hypothesis is converted into a statistical hypothesis; the hypothesis to be tested is known as the null hypothesis.
Null hypothesis in this case is;
H0: Means spend on mobile phones for all the age groups is equal.
If we denote mean of age group 16-20 as µ1,
mean of age group 21-30 as µ2,
mean of age group 31-40 as µ3,
and mean of age group 41-50 as µ4,
H0: µ1= µ2= µ3= µ4
If the analysis reveals that the null hypothesis is not true, we need an alternative which is true. We call this the alternative hypothesis, denoted by H1. The alternative to equality of means is that at least two means are not equal.
H1: µi ≠ µj for some i ≠ j, where i, j = 1, 2, 3, 4
Statistical problem is to test the hypothesis;
H0: µ1 = µ2 = µ3 = µ4 v/s H1: µi ≠ µj for some i ≠ j, where i, j = 1, 2, 3, 4
Testing the null hypothesis:
Let us denote each value in the data by Xik, where i = 1, 2, ..., c indexes the groups and k indexes the observations within a group. Here c is the number of groups (c = 4) and N is the total number of observations (N = 160). We assume that each value carries the impact of its age group in addition to the overall average.
In order to test this hypothesis, we consider the mathematical model
Xik = µ + αi + eik
The model implies that each value is the sum of the overall mean µ, the impact αi of age group i, and an error term eik. This is a regression-type equation in which the impact of a categorical variable is measured.
Deriving the estimates for µ and αi:
Based on the sample data, least square estimates are derived. The objective of the least squares method is to estimate the parameters so that the sum of squares of the errors is minimum. Errors are measured as the difference between actual and estimated values; estimation of the parameters µ and αi is based on the sample data.
Error: eik = Xik − µ − αi
Sum of squares of errors: ∑∑(eik)² = ∑∑(Xik − µ − αi)², to be minimized with respect to µ and αi.
The least square estimates are derived by setting the partial derivatives of ∑∑(eik)² with respect to µ and αi equal to zero.
The estimates are: µ is estimated by the grand mean X̄ (the mean of all N values), and αi by X̄i − X̄, where X̄i is the mean of group i.
Substituting these estimates, eik = Xik − X̄i, and the total variation decomposes as
∑∑(Xik − X̄)² = ∑ ni (X̄i − X̄)² + ∑∑(Xik − X̄i)²
This implies that
Total sum of squares = sum of squares between groups + error sum of squares
Using these formulas, the total sum of squares (TSS) and the sum of squares between groups (SSG) are calculated, and the error sum of squares is the difference TSS − SSG.
Computation in excel:
Anova: Single Factor
SUMMARY | ||||
Groups | Count | Sum | Average | Variance |
16-20 | 40 | 438600 | 10965 | 6353782 |
21-30 | 40 | 1269191 | 31729.78 | 1.91E+08 |
31-40 | 40 | 1594758 | 39868.95 | 2.19E+08 |
40-50 | 40 | 944521 | 23613.03 | 1.07E+08 |
These are the basic statistics; mean and variance for the groups.
Second table exhibits the results for analysis of variance.
ANOVA
Source of Variation | SS | df | MS | F | P-value | F crit |
Between Groups | 18229696406.525 | 3 | 6076565468.84 | 46.42223 | 0.00000 | 2.66257 |
Within Groups | 20420047227.850 | 156 | 130897738.64 |
Total | 38649743634 | 159 |
Understanding the table: The total sum of squares (SS) and the between-groups SS are calculated according to the formulas discussed above; the within-groups SS is the error SS and is the difference of the two. Degrees of freedom equal the number of terms minus 1: there are four groups, so df for between-groups SS = 4 − 1 = 3; there are 160 = 4 × 40 values in the data, so df for total SS = 159 and df for error SS = 159 − 3 = 156.
Mean sum of squares (MS) for groups =SS Between groups /3
Mean sum of squares (MS) for within groups =SS within groups /156
F= MS Between Groups / MS within groups
The tabulated value of F for df (3, 156) at the 5% level is F critical = 2.66257.
The P value is the actual probability that F > 46.42223, which is effectively 0.
Conclusion:
- Based on F values: 46.42223 > 2.66257 means that the hypothesis of equality of means is not true. There is a significant difference between the means of the four age groups: spend on mobile phones differs significantly by age group.
Based on P value: We fixed the significance level at 0.05 (5%), our desired probability of rejection. The P value for this data is 0.000 < 0.05, which implies that the hypothesis of equality is not true. (If the P value were > 0.05, the hypothesis of equality of means would be accepted.)
Having realized that there is a significant difference in average spend on mobile phones, we need to know which group spends the most and which the least.
- We examine the averages for four groups.
Graphical presentation of the averages indicates that the age group 31-40 spends the most on mobile phones, whereas the age group 16-20 spends the least. The averages of age groups 21-30 and 31-40 are close to each other, so to conclude that the 31-40 average is significantly higher than the 21-30 average, we check with a t test. The p value for the t test is 0.018 < 0.05, so there is a significant difference between the two averages. The conclusion is that average spend is highest for the age group 31-40.
Note: Detailed ANOVA results can be obtained in advanced statistical software such as SPSS, SAS, R or Tableau.
Practice Problem: Crimes against scheduled castes committed in various districts of six major states are given in the following table. Examine whether there is a significant difference in the average number of crimes across the states. Data is provided in the accompanying sheet.
Two Way Analysis of variance
In some data sets, values of the variable are organized, or can be organized, with respect to two criteria, and the values could carry the impact of both. One needs to check the impact of both criteria on the values. For example, revenue generated by a company may be compared according to product and region: the two test criteria are the product and the region. As a second example, salaries of employees can be classified according to employee category and type of organization: the test criteria are category of employees and type of organization. Two-way ANOVA is an extension of one-way ANOVA.
In two-way ANOVA, the total variance is decomposed into three components. For the sample data,
Total SS = SS due to criterion 1 + SS due to criterion 2 + error SS.
To understand the method of two-way ANOVA, consider the number of crimes across seven states: Bihar, Delhi, Gujarat, Maharashtra, Tamilnadu, West Bengal and Uttar Pradesh. We consider four types of crimes: Rape, Attempt to Commit Rape, Kidnapping & Abduction, and Dowry Deaths. Examine whether there is a significant difference in the average number of crimes committed across the seven states; it is also of interest to check whether there is a significant difference between the types of crimes. The data is given in the table.
| Rape | Attempt to commit Rape | Kidnapping & Abduction | Dowry Deaths |
Bihar | 1041 | 403 | 5158 | 1154 |
Delhi | 2199 | 46 | 4301 | 122 |
Gujarat | 503 | 3 | 1569 | 12 |
Maharashtra | 4144 | 13 | 5096 | 268 |
Tamilnadu | 421 | 29 | 1335 | 65 |
West Bengal | 1199 | 1551 | 3938 | 498 |
Uttar Pradesh | 3025 | 422 | 10135 | 2335 |
The hypothesis to be tested is that the number of crimes is equal in all the states and across all the types.
Statistical Hypothesis:
For testing this hypothesis statistically, we need to frame two statistical hypotheses. Let µi be the average number of crimes in state i, i = 1, 2, ..., 7, and βj the average number of crimes of type j, j = 1, 2, 3, 4.
H01: µ1= µ2= µ3= µ4= µ5= µ6= µ7
H11: At least for two states, average number of crimes is not equal
H02: β1=β2=β3=β4
H12: At least for two types of crimes, the average numbers are not equal
Testing the null hypothesis:
Let us denote each value in the data by Xij, where i = 1, 2, ..., 7 indexes the states and j = 1, ..., 4 the types of crime. We assume that each value carries the impact of its state and of the type of crime in addition to the overall average. The total number of observations is N = 28.
In order to test these hypotheses, we consider the mathematical model
Xij = µ + αi + βj + eij
The model implies that each value is the sum of the overall mean, the impact of the state, the impact of the type of crime, and an error term. This is an extension of the one-way ANOVA model.
Deriving the estimates for µ, αi and βj:
Based on the sample data, least square estimates are derived. The objective of the least squares method is to estimate the parameters so that the sum of squares of the errors is minimum. Errors are measured as the difference between actual and estimated values; estimation of µ, αi and βj is based on the sample data.
Error: eij = Xij − µ − αi − βj
Sum of squares of errors: ∑∑(eij)² = ∑∑(Xij − µ − αi − βj)², to be minimized with respect to µ, αi and βj.
Computations in Excel
SUMMARY | Count | Sum | Average | Variance |
Bihar | 4 | 7756 | 1939 | 4714629 |
Delhi | 4 | 6668 | 1667 | 4078549 |
Gujarat | 4 | 2087 | 521.75 | 542010.3 |
Maharashtra | 4 | 9521 | 2380.25 | 6850528 |
Tamilnadu | 4 | 1850 | 462.5 | 369635.7 |
West Bengal | 4 | 7186 | 1796.5 | 2229800 |
Uttar Pradesh | 4 | 15917 | 3979.25 | 18053812 |
Rape | 7 | 12532 | 1790.286 | 1945772 |
Attempt to commit Rape | 7 | 2467 | 352.4286 | 313298 |
Kidnapping & Abduction | 7 | 31532 | 4504.571 | 8584115 |
Dowry Deaths | 7 | 4454 | 636.2857 | 714834.2 |
These are the basic statistics (mean and variance) for the states and the types of crime.
ANOVA
Source of Variation | SS | df | MS | F | P-value | F crit |
Rows | 34170466.43 | 6 | 5695077.73 | 2.914105 | 0.03638 | 2.661305 |
Columns | 75339242.39 | 3 | 25113080.8 | 12.85075 | 0.0001 | 3.159908 |
Error | 35177649.86 | 18 | 1954313.88 | |||
Total | 144687358.68 | 27 |
Understanding the ANOVA table: The total SS and the SS for rows and columns are calculated according to the formulas discussed above; the error SS is the difference between the total SS and (row SS + column SS). Degrees of freedom equal the number of terms minus 1: there are 7 states, so df for row SS = 7 − 1 = 6; there are 4 types of crimes, so df = 4 − 1 = 3. There are 28 values in the data, so df for total SS = 27 and df for error SS = 27 − 6 − 3 = 18.
Mean sum of squares (MS) for Rows = RowSS /6
Mean sum of squares (MS) for Columns =SS Columns /3
F for rows = MS for rows / MS error; F for columns = MS for columns / MS error
Tabulated value of F for df 6, 18 at 5% level is =F critical=2.661305
Tabulated value of F for df 3, 18 at 5% level is =F critical=3.159908
The P value for rows is the actual probability that F > 2.914105, which is 0.03638.
The P value for columns is the actual probability that F > 12.85075, which is 0.0001.
Conclusion:
- Based on the critical F value for rows: 2.914105 > 2.661305 means that the hypothesis of equality of means is not true. There is a significant difference between the means of the seven states: the average number of crimes differs significantly across the states.
Based on P value: We have fixed the significance level at 0.05 or 5 % level. This is our desirable probability of rejection. P value for rows =0.03638<.05 implies that hypothesis of equality is not true.
- Based on the critical F value for columns: 12.85075 > 3.159908 means that the hypothesis of equality of means is not true. There is a significant difference between the means of the types of crimes.
Based on P value: We have fixed the significance level at 0.05 or 5 % level. This is our desirable probability of rejection. P value for columns =0.0001<.05 implies that hypothesis of equality is not true.
- Having found significant differences across both the states and the types of crimes, we next ask which state has the minimum number of crimes and which type of crime is committed most or least.
- We examine the averages for the states and for the types of crime.
Graphical presentation of the state averages indicates that the overall numbers of crimes committed in Gujarat and Tamilnadu are the lowest. Using a t test, we check whether there is a significant difference between the average numbers of crimes in these two states. The p value for the t test is 0.905422, so there is no significant difference between the two states: equality holds for them.
Graphical presentation of the crime-type averages shows that the most frequent crime across the states is kidnapping and abduction. The gap between this maximum and the other averages is very large, so we can conclude that kidnapping and abduction is the most frequently committed of these crimes across the seven states.
The least frequent crime is attempt to commit rape. Its average is close to that of dowry deaths, so before concluding we perform a t test. The p value for the t test is 0.475265925 > 0.05, revealing no significant difference between the average numbers of these two types: both crimes are committed almost equally often, and equality of means holds for these two types.
Practice problem: In the accompanying file, the sale of tractors of different types in four regions is given. Write the hypotheses to be tested, analyse the data, and write your conclusions based on the findings.
Regression Analysis
In real life, predictions are essential for planning business and government policies and events, or simply to understand what needs to be done to achieve a goal based on past data and facts. To make this possible, it is essential to model the relationship between the relevant variables.
For example, the government wants to estimate the total food production of the country, or wants to know what quantity of fertilizer should be imported in a year. For this purpose, the government should develop a prediction model for fertilizer. To develop it, it is essential to understand the relationship between crop production and other agricultural predictors such as cropped area, irrigated area and fertilizer consumption, since food production depends on the cropped area and irrigated area for food grains and on fertilizer consumption. The required data for the period 1967-68 to 2010-11 is given in table 1.
The second example concerns industrial output. The Government of India conducts an annual survey, and the relevant data is taken from the Annual Survey of Industries report for 2013-14. To estimate industrial output, data for 4-digit industries are considered. The four digits stand for Section, Division, Group and Class; the codes are decided by the National Industrial Classification (NIC-2008). For example, Section A is Agriculture, Forestry and Fishing; Division 01 is Crop and animal production, hunting and related service activities; 011 is Growing of non-perennial crops; and 0111 is Growing of cereals, leguminous crops and oil seeds. Coding is done for all industries based on their activities. The following data is considered to estimate industrial output for 143 4-digit industries: Number of Factories, Working Capital, Materials Consumed and Total Output, as given in table 2.
To predict crop production or industrial output, we need to develop a mathematical model. The relevant variables are identified as given in the tables. Predictions can be made in the following six steps.
- Check the relations between the variables visually.
- Check the relations between the variables mathematically.
- Decide on tentative model.
- Estimate the parameters of the model.
- Do diagnostic check (fitness of the model)
- Use the model for predictions.
Example 1. Fertilizer Consumption
Step 1. Understanding Relationship Between Variables Visually
A scatter diagram indicates the relationship between two variables. We plot scatter diagrams to understand the relationship of fertilizer consumption with each of the other variables. Three scatter diagrams are plotted in Excel: (Fig 1) Area irrigated for food grains v/s Fertilizer Consumption, (Fig 2) Food Grain Production v/s Fertilizer Consumption, and (Fig 3) Cropped area for food grains v/s Fertilizer Consumption.
Figure 1.
The graph in Figure 1 reveals that fertilizer consumption increases as the area irrigated for food grains increases: more fertilizer is needed if more area is irrigated for food grains. Fertilizer consumption can therefore be predicted if one knows the area irrigated for food grains. This is an increasing linear relationship between fertilizer consumption and irrigated area for food grains.
Figure 2
This indicates that food grain production increases with the increase in fertilizer consumption. Thus, for more food grain production, more fertilizer is needed.
Figure 3
This reveals that fertilizer consumption is not related to the cropped area for food grains; cropped area does not determine fertilizer consumption.
Figures 1, 2 and 3 indicate that fertilizer consumption (the requirement of fertilizer) is related to the area irrigated, and that food production will be higher if fertilizer consumption is higher. In conclusion, the government should decide the fertilizer requirement based on the area irrigated by farmers; at the same time, fertilizer consumption by farmers will determine food grain production for the year.
Step 2. Mathematical relationship between the variables:
Having observed that fertilizer consumption is related to the area irrigated for food grains and to the production of food grains, we check these relationships mathematically by calculating the correlation coefficients between the variables. (The correlation coefficient is discussed in earlier chapters.)
Table 1. Correlation Coefficient Matrix between the variables
| Total Consumption of Fertilizer (000,tonnes) | Area irrigated -food grains (000,hectares) | Total food grains production (000, tonnes) |
Total Consumption of Fertilizer (000,tonnes) | 1 | ||
Area irrigated -food grains (000,hectares) | 0.97 | 1 | |
Total food grains production (000, tonnes) | 0.967 | 0.98 | 1 |
Table 1 indicates that fertilizer consumption is highly correlated with area irrigated for food grains and with food production, with coefficients of 0.97 and 0.967. Food production and irrigation are also highly correlated, with a coefficient of 0.98. If the Government of India wants to increase food production, irrigation facilities in the country need to be improved and fertilizer should be made available to the farmers.
Tentative Mathematical Model:
Having understood the relationship between the variables, the question is how to measure the exact relationship so that the requirement of fertilizer, or the production of food grains, can be predicted for the season or year. This is done by developing a mathematical model of the relationship between the variables. The process of developing the model is known as regression analysis. This is a causal model, as it helps predict the value of one variable for given values of the other variable(s).
For example, the required fertilizer consumption can be predicted for a given area irrigated for food grains; the correlation between the two variables is 0.97. Thus, if we know the total area irrigated in the country, the mathematical model will help predict the fertilizer requirement. Area irrigated is known as the independent variable and fertilizer consumption as the dependent variable.
The scatter diagram in Figure 1 indicates that area irrigated and fertilizer consumption have a linear relationship. So a hypothetical model indicating their relationship can be written as:
Fertilizer consumption = β0 + β1 (Area irrigated)
β0 is the constant (intercept) and β1 is the increase in fertilizer consumption for a one-unit increase in area irrigated.
This is called a simple regression model, as there is only one independent variable: area irrigated.
Estimated model
Using the given data, we can estimate the predicted value of fertilizer consumption. The estimated model is written as:
Fertilizer consumption = b0 + b1 (Area irrigated) + error
Predicted values will always have some error, so the error term is essential. Estimating the model is nothing but deriving the values of the constant b0 and the coefficient b1 from the data; b0 and b1 are estimated by the least squares method, which minimizes the sum of squared errors. Mathematically, the error function can be expressed as:
Sum of squared errors = Sum[Fertilizer consumption − (b0 + b1 × Area irrigated)]^2
Formulas for calculating b0 and b1 are as below (with x = area irrigated, y = fertilizer consumption, and x̄, ȳ their means):
b1 = Sum[(x − x̄)(y − ȳ)] / Sum[(x − x̄)^2]
b0 = ȳ − b1 x̄
Using these values, we calculate the predicted values and compare them with the actual values. Now we shall discuss how to do this in Excel; a Python sketch of the same estimation follows the steps below. The following files contain the data and the output of the simple regression model.
Steps for performing regression analysis:
- Open the data sheet and check whether the Data Analysis add-in is enabled; if not, add it.
- Open the Data Analysis command and click on Regression.
- Choose appropriate options for entering the data.
- Select OK and the output will be available.
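For readers working outside Excel, the same estimates can be reproduced with the least-squares formulas given earlier. A minimal sketch in Python; the arrays hold placeholder values, not the actual dataset:

import numpy as np

area = np.array([30000.0, 32000.0, 35000.0, 37000.0, 40000.0])  # independent variable (placeholder)
fert = np.array([1200.0, 1500.0, 2100.0, 2600.0, 3400.0])       # dependent variable (placeholder)

# b1 = Sum[(x - x̄)(y - ȳ)] / Sum[(x - x̄)^2];  b0 = ȳ - b1 x̄
b1 = np.sum((area - area.mean()) * (fert - fert.mean())) / np.sum((area - area.mean())**2)
b0 = fert.mean() - b1 * area.mean()
print("b0 =", round(float(b0), 2), " b1 =", round(float(b1), 4))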
Interpretation of output:
Regression Statistics | |
Multiple R | 0.969788737 |
R Square | 0.940490195 |
Adjusted R Square | 0.939106246 |
Standard Error | 1810.608672 |
Observations | 45 |
Multiple R = 0.969788737 indicates the correlation between the actual and predicted values of fertilizer consumption. In simple regression, Multiple R equals the correlation coefficient between fertilizer consumption and area irrigated.
R Square gives the proportion of variance explained by the model. In this example, R Square = 0.94 means 94% of the variance in fertilizer consumption is explained by the model.
The ANOVA table gives the overall fitness of the model. Significance F is very small, so the model fit is good.
Model coefficients are given in the table below.
| Coefficients | Standard Error | t Stat | P-value |
Intercept | -19437.9011 | 1202.695 | -16.1619 | 7.01E-20 |
Area irrigated for food grains (000,hectares) | 0.685476491 | 0.026295 | 26.06856 | 5.61E-28 |
The P-value for area irrigated for food grains is very small, meaning that area irrigated is a significant variable in the model; in other words, area irrigated contributes significantly in determining the total consumption of fertilizer. The regression model can be written as:
Fertilizer consumption = −19437.9 + 0.685 × (Area irrigated)
The negative intercept should not be interpreted literally: for irrigated areas near zero the model predicts negative consumption, which in practice simply means essentially no fertilizer consumption. The model should not be used far below the observed range of irrigated area.
The coefficient of area irrigated = 0.685 means that for every additional unit (1000 hectares) of irrigated area, fertilizer consumption increases by 0.685 units, i.e. by 685 tonnes.
Calculation of predicted values:
Using the model above, we calculate the estimated values of fertilizer consumption and compare them with the actual values: for a given value of area irrigated, fertilizer consumption can be calculated with the help of the model. Predicted values are given in the output. When we plot both actual and predicted values on the same graph, as shown in the figure below, it is clear that the two are visually close for most of the years.
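As a small illustration, the estimated equation can be wrapped in a function to generate predicted values; the coefficients are the ones reported in the regression output above:

def predict_fertilizer(area_irrigated):
    # Predicted fertilizer consumption ('000 tonnes) for a given area irrigated ('000 hectares).
    return -19437.9 + 0.685 * area_irrigated

# For example, an irrigated area of 40,000 ('000 hectares) gives about 7962 ('000 tonnes).
print(round(predict_fertilizer(40000), 1))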
Model Fitness:
Having developed the model, it is essential to check how good it is. Goodness of fit decides with what accuracy predictions can be made. Model fitness is measured in the following three ways.
- Predictions made using regression equation are compared with the actual values.
We have already seen in the figure above that, visually, actual and estimated fertilizer consumption are almost similar. So visually we can say that the model fit is good.
- Overall model fit
Overall fitness of the model can be checked based on ANOVA table.
The sum of squares due to the regression model is 2.23E+09. Significance F is very small, thus the model is significant and is accepted. (The F value itself can be verified from the table, as shown below it.)
ANOVA | |||||
| df | SS | MS | F | Significance F |
Regression | 1 | 2.23E+09 | 2.23E+09 | 679.57 | 5.61E-28 |
Residual | 43 | 1.41E+08 | 3278304 | ||
Total | 44 | 2.37E+09 |
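The F statistic is simply the mean square due to regression divided by the mean square of the residuals, and the printed values can be checked with a couple of lines (a sketch; the numbers are taken from the ANOVA table above):

# Verify F = MS(regression) / MS(residual) using the ANOVA values above.
ms_regression = 2.23e9 / 1    # SS regression / df
ms_residual = 1.41e8 / 43     # SS residual / df, about 3278304
print(round(ms_regression / ms_residual, 1))  # about 680, matching F = 679.57 up to rounding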
- Errors in predictions:
Errors are measured as the difference between the actual and predicted values of fertilizer consumption. It is desirable that the errors are random and do not exhibit any systematic pattern. Secondly, the sum of squared errors should be as small as possible.
The graph of errors above does not exhibit any pattern of the original data, and thus the model captures the variations in the data.
The minimum error is −2450.6 and the maximum error is 5326.5.
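Such a residual check can be scripted as well; a minimal sketch assuming the actual and predicted series have been loaded into arrays (the values below are placeholders, not the book's data):

import numpy as np

actual = np.array([1300.0, 1550.0, 2080.0, 2650.0, 3390.0])     # placeholder actual values
predicted = np.array([1250.0, 1600.0, 2050.0, 2600.0, 3420.0])  # placeholder predicted values

errors = actual - predicted
# Errors should hover around zero and show no systematic pattern.
print("min:", errors.min(), " max:", errors.max(), " mean:", round(float(errors.mean()), 2))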
The model fits well and can be used for forecasting fertilizer consumption.
Time Series analysis
Introduction
It is a very usual phenomenon that every citizen of India, including the Government, anxiously awaits the monsoon forecast from the India Meteorological Department (IMD) to plan various activities. Forecasting the future is thus a natural part of our routine life, and forecasts of events or incidents are essential for planning future activities.
How are forecast reports prepared? Forecasts are made by studying the existing patterns in past data. When past data is recorded at equal time intervals, we call it time series data. For example, monthly rainfall in a region, daily sales in a retail shop, weekly travellers in a train or bus, yearly profit of a company, yearly demand for a product and hourly usage of internet data are all time series at different time intervals.
The most commonly observed patterns in time series data are shown in the following figures, Figures 1 to 5.
Figure 1 : Quarterly Agricultural GDP of India for the period of 2004-5 to 2014-15. Source: RBI
Figure 2 Quarterly GDP for Finance, Insurance, Real Estate & Business Services
Source: RBI
Figure 3 Nifty open index. Source: NSE
Figure 4 Index of Industrial Production for Basic Good Source: RBI
Figure 5. Total number of cars sold during 1963 to 2016. Source: Society of Automotive Manufacturers
Exploring and understanding the patterns in a time series and predicting the values of the variable for the next period or periods is what TIME SERIES ANALYSIS is.
Time series analysis
In all the examples above, the values of the variable fluctuate and exhibit patterns, and we need to identify them. Fluctuations in a time series are either short-term or long-term. Fluctuations within a period of less than a year are known as seasonal fluctuations; fluctuations occurring over more than a year are mainly cyclical and trend fluctuations.
Thus the value of the variable in a time series is the combined impact of four components: Trend (T), Seasonal (S), Cyclical (C) and Irregular (I). The impact of these components can be additive (Y = T + S + C + I) or multiplicative (Y = T × S × C × I).
Time series analysis is a method of identifying the presence of these components and developing a mathematical model which can be used to forecast the value of the variable. Forecasting by analysing a time series involves the following major steps: (1) exploring patterns, (2) developing an appropriate model, (3) evaluating the forecasting model, and (4) using the model to forecast the values for the required time period.
I Exploring Patterns in Time Series Data
Before we consider time series data for modelling, it is essential to know the existing patterns of variation in the data. The most commonly used methods are visual examination, the autocorrelation function (ACF) and the partial autocorrelation function (PACF). Visually, we identify the presence of major components such as trend, seasonal or cyclical variation. Autocorrelation and partial autocorrelation help to decide the model: the ACF and PACF indicate which time lags in the series impact the present value. We shall discuss all three methods of exploring the patterns of a time series.
Visual examination:
Figure 1: Examining the time series plot of quarterly agricultural GDP of India for the period 2004-05 to 2014-15, it can be observed that GDP is highest in quarter Q3 of every year and lowest in Q2. The pattern is the same every year though the values differ, so the fluctuations in the series exhibit seasonal variation. Secondly, GDP for both Q3 and Q2 is increasing over the period. The series therefore has the presence of seasonal and trend variation (S+T).
Figure 2: The time series of GDP for Finance, Insurance, Real Estate & Business Services is increasing over the years with minor ups and downs in values, which exhibits an increasing trend. Thus the series has the presence of trend (T) and irregular (I) variations.
Figure 3: In the time series plot of the Nifty index, the index initially increases during 2007 to 2008, falls during 2008-09, and rises and falls again during 2011-12. The Nifty index has the impact of cyclical, trend and irregular fluctuations (C+T+I).
Figure 4: The monthly time series of the Index of Industrial Production for Basic Goods has seasonal variations and is increasing over the years. The series has the presence of seasonal, trend and irregular variations (S+T+I).
Figure 5: The monthly time series of the number of cars sold exhibits cyclical and irregular variations: there is no repetition of a specific pattern within a year and no clear increasing or decreasing trend. We can say that the series has the impact of cyclical and irregular variations.
Remember:
Recurrence of a pattern within a year is called seasonal variation in a series.
Trend in a series is a long-term fluctuation.
Cyclical patterns recur over periods of more than a year, and the period of the cycle is not the same.
Irregular variations in a series are random by nature and do not exhibit any kind of regularity.
Auto correlation function (ACF) and Partial auto correlation function (PACF):
A visual check gives an understanding of which components impact the values of the variable in a series. To develop a mathematical model of time series data, we need to explore the patterns mathematically so as to decide the parameters of the model. We shall now examine the characteristics of time series data mathematically using the ACF and PACF.
ACF: In time series data, we expect the value of the variable at any specific time to be impacted by past values, so autocorrelation is an appropriate measure of the impact of past data. Autocorrelation is the correlation between two sets of observations of the series separated by some time lag. Let Yt be the value of the variable at time t; we expect, for instance, that Yt and Yt−1 are correlated. In a series of weekly sales, the current week's sale is expected to be correlated with last week's, or the current month's sale with last month's. In general, autocorrelation measures the relation between the present value of the variable and a past value. When the autocorrelation is calculated for various time lags, the result is called the autocorrelation function (ACF).
The correlational dependency of order k, between each j-th element of the series and the (j−k)-th element (k is the time lag), is calculated using the formula:
r(k) = Sum[(Yt − Ȳ)(Yt−k − Ȳ)] / Sum[(Yt − Ȳ)^2]
where the numerator sum runs over t = k+1, ..., n, the denominator sum over t = 1, ..., n, and Ȳ is the mean of the series.
PACF: In time series analysis, partial autocorrelations measure the degree of association between Yt and Yt−k when the intermediate time lags 1, 2, ..., k−1 are removed. For example, consider k = 2 and suppose we want to measure the association between Yt and Yt−2. Since Yt−1 sits just one lag from each of them, part of the correlation between Yt and Yt−2 is most likely carried through Yt−1. Thus it is essential to measure the association of Yt and Yt−2 after removing the effect of Yt−1: the partial autocorrelation at lag 2 is the partial correlation between Yt and Yt−2 when the effect of Yt−1 is removed. When the partial autocorrelation is calculated for various time lags, it is known as the partial autocorrelation function (PACF). At lag 1 the partial autocorrelation equals the autocorrelation; for higher lags it is computed recursively from the autocorrelations, and statistical software reports it directly.
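The lag-k autocorrelation formula above is easy to compute directly. A minimal sketch in Python; the short series uses the first few GDP values from Table 1 below, purely for illustration:

import numpy as np

def autocorrelation(y, k):
    # Lag-k autocorrelation: correlation between Y(t) and Y(t-k).
    y = np.asarray(y, dtype=float)
    deviations = y - y.mean()
    return np.sum(deviations[k:] * deviations[:-k]) / np.sum(deviations**2)

series = [1357.45, 1088.79, 1724.01, 1484.01, 1394.04, 1130.23]  # first values of Table 1
print(round(autocorrelation(series, 1), 3))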
We shall now discuss the ACF and PACF for the series above.
Calculation of ACF and PACF: consider the time series of Figure 1, reproduced in Table 1.
Table 1
Year | Quarter | GDP for Agriculture & allied activities ( Yt) | Yt-1 |
2004-05 | Q1 | 1357.45 | |
| Q2 | 1088.79 | 1357.45 |
| Q3 | 1724.01 | 1088.79 |
| Q4 | 1484.01 | 1724.01 |
2005-06 | Q1 | 1394.04 | 1484.01 |
| Q2 | 1130.23 | 1394.04 |
| Q3 | 1857.5 | 1130.23 |
| Q4 | 1563.09 | 1857.5 |
2006-07 | Q1 | 1447.89 | 1563.09 |
| Q2 | 1169.47 | 1447.89 |
| Q3 | 1932.09 | 1169.47 |
| Q4 | 1642.44 | 1932.09 |
2007-08 | Q1 | 1513.36 | 1642.44 |
| Q2 | 1224.18 | 1513.36 |
| Q3 | 2116.49 | 1224.18 |
| Q4 | 1696.77 | 2116.49 |
2008-09 | Q1 | 1557.34 | 1696.77 |
| Q2 | 1237.8 | 1557.34 |
| Q3 | 2038.1 | 1237.8 |
| Q4 | 1723.64 | 2038.1 |
2009-10 | Q1 | 1566.735459 | 1723.64 |
| Q2 | 1262.746229 | 1566.735 |
| Q3 | 2010.850211 | 1262.746 |
| Q4 | 1769.540952 | 2010.85 |
2010-11 | Q1 | 1640.460497 | 1769.541 |
| Q2 | 1356.164349 | 1640.46 |
| Q3 | 2264.437226 | 1356.164 |
| Q4 | 1917.074505 | 2264.437 |
2011-12 | Q1 | 1747.63 | 1917.075 |
| Q2 | 1410.96 | 1747.63 |
| Q3 | 2397.2 | 1410.96 |
| Q4 | 1982.52 | 2397.2 |
2012-13 | Q1 | 1779.47 | 1982.52 |
| Q2 | 1435.96 | 1779.47 |
| Q3 | 2415.56 | 1435.96 |
| Q4 | 2014.1 | 2415.56 |
2013-14 | Q1 | 1850.84 | 2014.1 |
| Q2 | 1508.22 | 1850.84 |
| Q3 | 2504.77 | 1508.22 |
| Q4 | 2141.65 | 2504.77 |
2014-15 | Q1 | 1921.15 | 2141.65 |
| Q2 | 1557.12 | 1921.15 |
Using the formula given above, the autocorrelation is calculated. The autocorrelation for time lag 1 is 0.07.
Similarly, autocorrelations for time lags 2, 3, ... can be calculated, as shown in Table 2.
Table 2. Autocorrelations
Lag | Autocorrelation
1 | .070 | |
2 | -.027 | |
3 | -.003 | |
4 | .836 | |
5 | -.003 | |
6 | -.090 | |
7 | -.060 | |
8 | .685 | |
9 | -.056 | |
10 | -.135 | |
11 | -.108 | |
12 | .537 | |
13 | -.107 | |
14 | -.176 | |
15 | -.145 | |
16 | .404 |
The graph of the autocorrelations in Table 2 shows that at every fourth time lag (4, 8, 12 and 16) the autocorrelation is significantly high, declining gradually as the time lag increases.
Every fourth autocorrelation being significant means that seasonality is present.
The gradual decline means that current values have higher correlation with the recent past than with the distant past; for example, values of 2012-13 are more strongly correlated with values of 2011-12 than with years further away. This indicates trend in the series.
We conclude that the series has a seasonal pattern as well as trend. While modelling the data, both seasonal and trend parameters should be included.
Figure 2 Quarterly GDP for Finance, Insurance, and Real Estate & Business Services
The autocorrelation of this series decreases as the time lag increases, which indicates that the series has trend: autocorrelations are large at short lags and decline steadily as the lag grows.
While modelling such data, one should note that the nearest time lags have a significant impact on the values of the series; each present value is largely a function of the immediate past. This is typical of a series with trend.
Lag | Autocorrelation |
1 | .927 |
2 | .847 |
3 | .772 |
4 | .699 |
5 | .626 |
6 | .549 |
7 | .481 |
8 | .420 |
9 | .356 |
10 | .290 |
11 | .231 |
12 | .177 |
13 | .121 |
14 | .066 |
15 | .015 |
16 | -.030 |
Figure 3 Nifty open index
Lag | Autocorrelation |
1 | .998 |
2 | .995 |
3 | .993 |
4 | .991 |
5 | .988 |
6 | .986 |
7 | .984 |
8 | .982 |
9 | .980 |
10 | .978 |
11 | .976 |
12 | .974 |
13 | .971 |
14 | .969 |
15 | .967 |
16 | .965 |
The ACF for the Nifty open index shows that the autocorrelations are significant at all the time lags considered and are almost identical in value. This implies that each index value is significantly correlated with past values even at large time lags: the series essentially follows its own recent level.
There is no seasonal pattern and no stable trend in the series.
Figure 5. Total no of cars sold during 1963 to 2016
Lag | Autocorrelation |
1 | .545 |
2 | .337 |
3 | .104 |
4 | -.084 |
5 | -.341 |
6 | -.349 |
7 | -.324 |
8 | -.100 |
9 | -.077 |
10 | .116 |
11 | .199 |
12 | .310 |
13 | .065 |
14 | -.096 |
15 | -.235 |
16 | -.217 |
The ACF for this series exhibits a cyclical pattern, and the period of the cycle is not the same.
Seasonal fluctuations are absent in this series and the trend is very marginal.
II Developing Appropriate Model
A tentative model or method is chosen for the time series based on what we explored in the earlier section.
Basic methods of time series analysis are as shown in the chart in figure 5
Smoothing Methods
Smoothing methods are based on smoothing out the variations over past time periods. The smoothed value over the most recent time periods is used to forecast the value of the time series. Various smoothing methods have been suggested by different researchers; they differ in the scheme used to assign weights to the past values. The most widely used smoothing methods are:
- Simple moving average method
- Simple exponential method
- Holt’s exponential method
- Winter’s exponential method
Simple Moving Average Method
This is the simplest method of forecasting, in which fluctuations are smoothed out over the most recent past values. What needs to be decided is how many recent periods to use for smoothing. Let Yt be the value of the variable at time t and k the number of periods used for smoothing. The moving-average forecast for period t+1 is:
Forecast(t+1) = (Yt + Yt-1 + ... + Yt-k+1) / k
Each new data point is included in the average and the earliest one is discarded:
Forecast(t+2) = (Yt+1 + Yt + ... + Yt-k+2) / k
The concept of the moving average can be explained through the example of time series data on annual domestic power consumption in Gujarat state, given in the table below.
Sl No | Year | Domestic Consumption (MUS) Yt | Moving average |
1 | 1988-89 | 1393 | #N/A |
2 | 1989-90 | 1595 | #N/A |
3 | 1990-91 | 1756 | 1581.333 |
4 | 1991-92 | 1942 | 1764.333 |
5 | 1992-93 | 2086 | 1928 |
6 | 1993-94 | 2315 | 2114.333 |
7 | 1994-95 | 2521 | 2307.333 |
8 | 1995-96 | 2838 | 2558 |
9 | 1996-97 | 2968 | 2775.667 |
10 | 1997-98 | 3171 | 2992.333 |
11 | 1998-99 | 3486 | 3208.333 |
12 | 1999-00 | 3699 | 3452 |
13 | 2000-01 | 3981 | 3722 |
14 | 2001-02 | 3922 | 3867.333 |
15 | 2002-03 | 4136 | 4013 |
16 | 2003-04 | 4613 | 4223.667 |
17 | 2004-05 | 5026 | 4591.667 |
18 | 2005-06 | 5490 | 5043 |
19 | 2006-07 | 6097 | 5537.667 |
Consider k = 3, as the first three autocorrelations are significant.
Forecast for 1991-92 = (1393 + 1595 + 1756)/3 = 1581.3
Forecast for 1992-93 = (1595 + 1756 + 1942)/3 = 1764.3. Continuing this until all the data points are used gives the forecast for 2007-08 = 5537.7.
The key question while using the moving average for forecasting is what the smoothing period should be. There is no mathematical formula for this, but one can get a clue from the ACF: time lags with significant autocorrelations can be considered for the smoothing period.
Steps to use Excel for simple moving average:
- Click on data analysis and choose moving average
- Select the input data series
- Enter the desirable value of k in interval box
- Select the output range
The output is as shown in the table; a Python sketch of the same calculation follows.
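In the sketch below, the values are the first entries of the power-consumption table above; the first k − 1 positions have no average (Excel shows #N/A):

consumption = [1393, 1595, 1756, 1942, 2086, 2315]  # first values from the table (MUS)
k = 3

# Moving average: mean of the k most recent observations, matching the Excel output.
for t in range(k - 1, len(consumption)):
    window = consumption[t - k + 1 : t + 1]
    print(round(sum(window) / k, 3))  # 1581.333, 1764.333, 1928.0, 2114.333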
Note: In the simple moving average method, we give equal weight to each of the past values used for smoothing. This may not always be appropriate; thus, moving averages are not preferred when the series has seasonal influences or trend.
Exponential Smoothing
Exponential smoothing is an improvement over the moving average method discussed before. Forecasts made using these methods are weighted averages of past values, with the weights decaying exponentially as the time lag increases. Exponential methods produce quick and reliable forecasts, and for short-term forecasting they are preferred.
Three researchers contributed to the development of exponential methods. The simple exponential method was developed by Brown for time series that have no seasonal or trend variations; Holt developed an exponential method for series with trend; and Winter's method handles both seasonal and trend variations in the data.
We shall discuss all three methods.
Simple Exponential Method
This is the simplest exponential method, as its name suggests; some authors also refer to it as the single exponential method. It is appropriate for forecasting when the time series does not display any specific pattern of trend or seasonal variation. The simple exponential method can be considered for the time series data in examples 3 and 4.
In the simple exponential method, forecasts are made using the formula:
Current forecast = α (Current value) + (1 − α) (Old forecast)
In symbols, the forecast for period t+1 is: F(t+1) = α Yt + (1 − α) F(t).
It can be observed that the forecast is based on the most recent value with weight α, while the weight assigned to the previous forecast is 1 − α. α = 1 implies that the forecast equals the most recent value, and α = 0 (or close to zero) implies that the current forecast equals the old forecast.
Thus 0 < α < 1, and the key point is to find the optimum value of α. Alpha can be decided by iterating over various values and choosing the one for which the mean squared error is minimum, as sketched below. Most software provides the optimum α directly.
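A minimal sketch of this search in Python, assuming the series has been loaded into an array (the values below are placeholders):

import numpy as np

def smooth_forecasts(y, alpha):
    # One-step-ahead forecasts: F(t+1) = alpha*Y(t) + (1 - alpha)*F(t), with F(1) = Y(1).
    forecasts = [y[0]]
    for value in y[:-1]:
        forecasts.append(alpha * value + (1 - alpha) * forecasts[-1])
    return np.array(forecasts)

y = np.array([100.0, 104.0, 101.0, 107.0, 103.0, 108.0])  # placeholder series

# Try alpha = 0.05, 0.10, ..., 0.95 and keep the value with the smallest mean squared error.
best_alpha = min(np.arange(0.05, 1.0, 0.05),
                 key=lambda a: np.mean((y - smooth_forecasts(y, a))**2))
print("optimum alpha:", round(float(best_alpha), 2))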
Example 3. Nifty open index.
To illustrate the simple exponential model, we shall consider the time series of the Nifty open index discussed in example 3. The series has no trend or seasonal variations, so this model is appropriate for forecasting it.
Exponential model can be developed in excel in following steps.
1. Go to Data Analysis.
2. Select Exponential Smoothing.
3. Select the input range (the series to be used for forecasting).
4. Select the output range.
5. Enter the desired smoothing weight; try different values and choose the best one. (Note that Excel's dialog asks for the damping factor, which equals 1 − α.)
6. The output forecasts are obtained.
Note: Double and triple exponential smoothing can be done by repeatedly smoothing the series.
An alternative way of exponential smoothing is to use f(x): click on f(x), choose FORECAST.ETS, enter the required data and obtain the forecast values.