
Python: Numerical Descriptions of the Data

We are going to explore the basics of statistics using Python. We'll go through the following:
  1. Import the data;
  2. Apply summary statistics;
  3. Compute other measures of variability (variance and coefficient of variation);
  4. Compute other measures of position (percentile and decile);
  5. Estimate the skewness and kurtosis; and, as a bonus,
  6. Visualize the histograms.
The data are the volume of palay (rice) production from five regions (Abra, Apayao, Benguet, Ifugao, and Kalinga) of Luzon, Philippines. To import them, execute the following:

# Import the module required
import pandas as pd
# Read the raw data from GitHub; read_csv accepts a URL directly,
# so the Python 2-only urllib.urlopen call is not needed
data_url = "https://raw.githubusercontent.com/alstat/Analysis-with-Programming/master/2014/Python/Numerical-Descriptions-of-the-Data/data.csv"
df = pd.read_csv(data_url)
To check the first and last five entries of the data, use the head() and tail() methods, respectively. To apply the summary statistics, use the describe() method:

df.describe()
# OUTPUT
               Abra        Apayao       Benguet        Ifugao       Kalinga
count     79.000000     79.000000     79.000000     79.000000     79.000000
mean   12874.379747  16860.645570   3237.392405  12414.620253  30446.417722
std    16746.466945  15448.153794   1588.536429   5034.282019  22245.707692
min      927.000000    401.000000    148.000000   1074.000000   2346.000000
25%     1524.000000   3435.500000   2328.000000   8205.000000   8601.500000
50%     5790.000000  10588.000000   3202.000000  13044.000000  24494.000000
75%    13330.500000  33289.000000   3918.500000  16099.500000  52510.500000
max    60303.000000  54625.000000   8813.000000  21031.000000  68663.000000
The descriptions of the row labels:
  • count - number of observations;
  • mean - sample mean;
  • std - standard deviation;
  • min - minimum value;
  • 25% - first quartile;
  • 50% - second quartile or median;
  • 75% - third quartile; and
  • max - maximum value.
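
To make the quartile rows concrete, here is a minimal sketch that builds the same kind of summary for a small made-up sample using only Python's standard library. The sample values and variable names are assumptions for illustration, not part of the palay data. The "inclusive" method is chosen because it interpolates linearly between order statistics, which is the same rule pandas applies by default when computing quantiles.

```python
import statistics

# Hypothetical sample (not taken from the palay data)
sample = [148, 927, 1074, 2346, 3202, 5790, 8813]

# Quartiles by linear interpolation between order statistics
q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")

summary = {
    "count": len(sample),
    "mean": statistics.mean(sample),
    "std": statistics.stdev(sample),  # sample standard deviation (divisor n - 1)
    "min": min(sample),
    "25%": q1,
    "50%": q2,                        # second quartile, i.e. the median
    "75%": q3,
    "max": max(sample),
}

for label, value in summary.items():
    print(label, value)
```

The 50% entry coincides with statistics.median(sample), which is a handy check that the quartiles are computed as intended.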
For sample variance, use the var() method,

df.var()
# OUTPUT
Abra 2.804442e+08
Apayao 2.386455e+08
Benguet 2.523448e+06
Ifugao 2.534400e+07
Kalinga 4.948715e+08
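
As a quick sanity check on these figures, the sample variance is simply the square of the sample standard deviation reported by describe(). The sketch below copies the std values from the describe() output and the variances from the var() output (the dictionary names are illustrative) and confirms that the two agree up to the rounding of the printed values.

```python
import math

# Sample standard deviations copied from the describe() output
std = {"Abra": 16746.466945, "Apayao": 15448.153794, "Benguet": 1588.536429,
       "Ifugao": 5034.282019, "Kalinga": 22245.707692}

# Sample variances copied from the var() output
var = {"Abra": 2.804442e+08, "Apayao": 2.386455e+08, "Benguet": 2.523448e+06,
       "Ifugao": 2.534400e+07, "Kalinga": 4.948715e+08}

# variance = std ** 2, up to the rounding of the printed values
agrees = {region: math.isclose(std[region] ** 2, var[region], rel_tol=1e-6)
          for region in std}
print(agrees)

# Region with the largest variance
most_dispersed = max(var, key=var.get)
print(most_dispersed)
```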
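Therefore, the volume of production in Kalinga is more dispersed, in absolute terms, than that of the other regions. Another measure of variability, one of relative variability, is the coefficient of variation (CV), defined as the ratio of the standard deviation to the mean. It is available as the variation function in the scipy.stats module. Note that variation uses the population standard deviation (divisor n) by default, so its values are slightly smaller than the ratio of std to mean from describe(),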

import scipy.stats as ss
ss.variation(df)
# OUTPUT
array([ 1.29250025, 0.91040816, 0.48756845, 0.40293766, 0.72601197])
The CV is unitless, which makes it suitable for comparing the dispersions of distributions measured in different units. So, assuming the yields from the five regions have different units, Abra is more variable than the other regions relative to its mean. Why? The data in Abra are concentrated at the lower values (see the histogram below), which pulls the sample mean down, while the range from its mean to its maximum is 47428.620253, compared to only 38216.58228 for Kalinga (the region with the largest variance). This makes the standard deviation of production in Abra larger than its mean.
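
The CV values can be recovered by hand from the summary statistics. The sketch below is a rough check, not the SciPy implementation: it copies the count, mean, and std rows from the describe() output, computes std / mean, and then rescales by sqrt((n - 1) / n) to switch from the sample standard deviation to the population one. That rescaling is an assumption about what ss.variation does by default (ddof = 0), and it is the one that reproduces the printed array.

```python
import math

n = 79  # observations per region (count row of describe())

# Sample mean and sample standard deviation copied from describe()
mean = {"Abra": 12874.379747, "Apayao": 16860.645570, "Benguet": 3237.392405,
        "Ifugao": 12414.620253, "Kalinga": 30446.417722}
std = {"Abra": 16746.466945, "Apayao": 15448.153794, "Benguet": 1588.536429,
       "Ifugao": 5034.282019, "Kalinga": 22245.707692}

# Values printed by ss.variation(df), in column order
reported = {"Abra": 1.29250025, "Apayao": 0.91040816, "Benguet": 0.48756845,
            "Ifugao": 0.40293766, "Kalinga": 0.72601197}

# CV based on the sample standard deviation (divisor n - 1)
cv_sample = {r: std[r] / mean[r] for r in mean}

# CV based on the population standard deviation (divisor n)
cv_population = {r: cv_sample[r] * math.sqrt((n - 1) / n) for r in mean}

for r in mean:
    print(r, round(cv_sample[r], 6), round(cv_population[r], 6), reported[r])
```

Either version ranks Abra first, and its sample-based CV exceeds 1, which is another way of saying that its standard deviation is larger than its mean.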

Moving on to measures of position, the 20th percentile of the data, which is the 2nd decile, is simply coded as,

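# axis = 0 gives one percentile per column; recent SciPy versions
# flatten the input when no axis is given
ss.scoreatpercentile(df, 20, axis = 0)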
# OUTPUT
array([ 1484. , 2749.6, 2172.4, 6805.2, 7713.2])
Hence, 20% of the data in Ifugao are less than or equal to 6805.2, and the remaining 80% are more than that.
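
To show what sits behind such a number, here is a minimal sketch of a percentile computed by linear interpolation between order statistics, together with a check that the 20th percentile equals the 2nd decile. The percentile helper and the sample are hypothetical, written for illustration; the interpolation rule is assumed to match the default behaviour of scoreatpercentile.

```python
import statistics

def percentile(values, p):
    """Return the p-th percentile (0 <= p <= 100) by linear interpolation."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * p / 100   # fractional index into ordered
    lower = int(position)                     # index at or just below position
    upper = min(lower + 1, len(ordered) - 1)  # neighbour above, clamped at the end
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

# Hypothetical sample (not taken from the palay data)
sample = [12, 7, 3, 18, 25, 9, 30, 15, 21, 5, 27]

p20 = percentile(sample, 20)
deciles = statistics.quantiles(sample, n=10, method="inclusive")
print(p20, deciles[1])  # 20th percentile and 2nd decile
```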

Next, use the skew() and kurt() methods to compute the unbiased sample skewness and kurtosis, respectively. Note that kurt() reports excess kurtosis, so a normal distribution scores 0 rather than 3,

df.skew()
# OUTPUT
Abra 1.685130
Apayao 0.663640
Benguet 0.864243
Ifugao -0.125157
Kalinga 0.388291
df.kurt()
# OUTPUT
Abra 1.859803
Apayao -1.046026
Benguet 1.592915
Ifugao -0.979704
Kalinga -1.335408
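
For readers who want to see the formulas, the sketch below implements the standard bias-corrected sample skewness (adjusted Fisher-Pearson) and bias-corrected excess kurtosis in plain Python. That pandas uses exactly these formulas is an assumption here, although it is consistent with the "unbiased" estimates mentioned above. The function names and the samples are hypothetical.

```python
import math

def sample_std(values):
    """Sample standard deviation (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def sample_skewness(values):
    """Bias-corrected sample skewness; needs at least 3 observations."""
    n = len(values)
    mean = sum(values) / n
    s = sample_std(values)
    total = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * total

def sample_excess_kurtosis(values):
    """Bias-corrected excess kurtosis; needs at least 4 observations."""
    n = len(values)
    mean = sum(values) / n
    s = sample_std(values)
    total = sum(((x - mean) / s) ** 4 for x in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * total
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

# Hypothetical samples (not taken from the palay data)
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 20]

print(sample_skewness(symmetric), sample_excess_kurtosis(symmetric))
print(sample_skewness(right_skewed), sample_excess_kurtosis(right_skewed))
```

A symmetric sample gives a skewness of zero, while a sample with a long right tail gives a positive value, which is the pattern Abra shows.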
Thus, the data in Abra are positively skewed and leptokurtic (excess kurtosis above 0), which is supported by the following histograms:

df.hist(xrot = 45, sharey = True)
# OUTPUT
array([[Axes(0.125,0.552174;0.215278x0.347826),
Axes(0.404861,0.552174;0.215278x0.347826),
Axes(0.684722,0.552174;0.215278x0.347826)],
[Axes(0.125,0.1;0.215278x0.347826),
Axes(0.404861,0.1;0.215278x0.347826),
Axes(0.684722,0.1;0.215278x0.347826)]], dtype=object)
We can also customize the histogram of Benguet,

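import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams.update({'font.size': 11, 'font.family': 'serif'})
plt.locator_params(nbins = 8)
# Benguet is the third column; .iloc replaces the removed .ix indexer,
# and density replaces the removed normed argument
plt.hist(df.iloc[:, 2], color = 'orange', density = False)
plt.xticks(rotation = 45)
plt.xlabel('Data')
plt.ylabel('Count')
plt.title('Histogram of Benguet', fontsize = 18, verticalalignment = 'bottom', color = 'brown')
plt.grid(True, axis = 'y', which = 'major')
plt.show()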

Other measures of central tendency, such as the geometric and harmonic means, can also be computed using the scipy.stats.mstats.gmean and scipy.stats.mstats.hmean functions, respectively.
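
As a small illustration of how these means relate, the sketch below uses the standard library equivalents (statistics.geometric_mean needs Python 3.8 or later) on a made-up sample rather than the SciPy functions named above. The sample is an assumption chosen so that the results are easy to verify by hand.

```python
import statistics

# Hypothetical sample (not taken from the palay data)
sample = [1, 2, 4]

arithmetic = statistics.mean(sample)            # (1 + 2 + 4) / 3
geometric = statistics.geometric_mean(sample)   # (1 * 2 * 4) ** (1 / 3)
harmonic = statistics.harmonic_mean(sample)     # 3 / (1/1 + 1/2 + 1/4)

print(arithmetic, geometric, harmonic)
```

For positive data the harmonic mean never exceeds the geometric mean, which in turn never exceeds the arithmetic mean, so the three values should come out in decreasing order.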