The variance measures the dispersion of a set of data. It is also used to gauge the precision of a sample-based estimate.
Suppose you have 5 respondents who have rated the service quality of their phone provider, on a scale from 1 to 10. Here are the five ratings:
Respondent | Rating |
1 | 9 |
2 | 7 |
3 | 8 |
4 | 2 |
5 | 10 |
Respondent no. 4 is really not happy; respondent no. 5 is very happy. The average rating is 7.2. The variance measures the spread around this average. For each respondent, you calculate the spread from the average (the rating minus 7.2), then square it:
Rating | Spread from average | Square of spread |
9 | 1.8 | 3.24 |
7 | -0.2 | 0.04 |
8 | 0.8 | 0.64 |
2 | -5.2 | 27.04 |
10 | 2.8 | 7.84 |
The variance is the average of the third column in the table above, here 7.76: it is the average of the squared spreads from the average. The more dispersed the data, the farther some values lie from the average, and the larger the variance. If all ratings are equal, each one equals the mean, and the variance is 0.
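The calculation in the table can be reproduced with a few lines of Python. This is a minimal sketch using only the five ratings from the example; the variable names are illustrative. It divides by the number of ratings, as the text does, which corresponds to `statistics.pvariance` in the standard library.

```python
import math
import statistics

ratings = [9, 7, 8, 2, 10]  # the five ratings from the example

# Step 1: the average of the ratings
mean = sum(ratings) / len(ratings)

# Step 2: the spread of each rating from the average, then its square
spreads = [r - mean for r in ratings]
squared_spreads = [s ** 2 for s in spreads]

# Step 3: the variance is the average of the squared spreads
variance = sum(squared_spreads) / len(ratings)

print(f"mean = {mean:.2f}")          # 7.20
print(f"variance = {variance:.2f}")  # 7.76

# Cross-check against the standard library (population variance)
assert math.isclose(variance, statistics.pvariance(ratings))
```

Note that `statistics.variance` would divide by n - 1 instead of n and would therefore return a different value (9.7) for the same ratings.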
The standard deviation is the square root of the variance. It is a crucial ingredient in the calculation of confidence intervals.
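For the five ratings above, the standard deviation can be obtained as follows. This is a small sketch, assuming the same divide-by-n convention as the variance example; `statistics.pstdev` is the matching standard library function.

```python
import math
import statistics

ratings = [9, 7, 8, 2, 10]  # the five ratings from the example

variance = statistics.pvariance(ratings)  # average squared spread: 7.76
std_dev = math.sqrt(variance)             # square root of the variance

print(f"standard deviation = {std_dev:.2f}")  # 2.79

# Same result from the library's population standard deviation
assert math.isclose(std_dev, statistics.pstdev(ratings))
```

Unlike the variance, the standard deviation is expressed in the same unit as the ratings themselves, which makes it easier to read against the 1 to 10 scale.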
Suppose you would like to compare the dispersion of answers to two questions rated on the same scale. What is the best indicator?
Quiz solution – variance
The right indicator is the ratio of the standard deviation to the average, also known as the coefficient of variation. Indeed, the level of the variance (and of the standard deviation) depends partly on the level of the average: the higher the average, the higher the variance tends to be. To compare the dispersion of the two datasets fairly, you need to factor out the level of the average. Hence the need for an indicator that takes both the dispersion and the average into account. Why not the ratio of the variance to the average? Because the variance behaves like a squared average: the ratio of variance to average will tend to be higher for datasets with a higher average, whatever the dispersion.
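The ratio of standard deviation to average can be illustrated with a short Python sketch. The first dataset is the set of ratings from the example; the second is hypothetical, built by halving each rating so that both datasets have the same relative dispersion but different averages. The function name `coefficient_of_variation` is an illustrative choice, not a standard library function.

```python
import math
import statistics

def coefficient_of_variation(values):
    """Ratio of the (population) standard deviation to the average."""
    return statistics.pstdev(values) / statistics.mean(values)

question_a = [9, 7, 8, 2, 10]           # ratings from the example
question_b = [r / 2 for r in question_a]  # hypothetical: same shape, lower average

for name, data in [("A", question_a), ("B", question_b)]:
    print(
        f"question {name}: "
        f"mean = {statistics.mean(data):.2f}, "
        f"variance = {statistics.pvariance(data):.2f}, "
        f"CV = {coefficient_of_variation(data):.2f}"
    )

cv_a = coefficient_of_variation(question_a)
cv_b = coefficient_of_variation(question_b)

# The variance differs by a factor of 4, but the ratio is the same
assert math.isclose(cv_a, cv_b)
```

In this constructed case the variance of question A is four times that of question B, while the coefficient of variation is identical for both, which is the behaviour the quiz solution describes.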