We are going to explore the basics of statistics using Python. We'll go through the following:
- Import the data;
- Apply summary statistics;
- Compute other measures of variability (variance and coefficient of variation);
- Compute other measures of position (percentile and decile);
- Estimate the skewness and kurtosis; and, as a bonus,
- Visualize the histogram.
Data: volume of palay (rice) production from five provinces (Abra, Apayao, Benguet, Ifugao, and Kalinga) of the Cordillera Administrative Region in northern Luzon, Philippines; they are referred to as regions below. To import the data, execute the following:
```python
# Import the module required
import pandas as pd

# Read the raw data from GitHub; pandas can read a CSV directly from a URL
# (urllib.urlopen only exists in Python 2)
data_url = "https://raw.githubusercontent.com/alstat/Analysis-with-Programming/master/2014/Python/Numerical-Descriptions-of-the-Data/data.csv"
df = pd.read_csv(data_url)
```
To check the first and last five entries of the data, use the `head()` and `tail()` methods, respectively; to compute the summary statistics, use the `describe()` method:

```python
df.describe()

# OUTPUT
#                Abra        Apayao      Benguet        Ifugao       Kalinga
# count     79.000000     79.000000    79.000000     79.000000     79.000000
# mean   12874.379747  16860.645570  3237.392405  12414.620253  30446.417722
# std    16746.466945  15448.153794  1588.536429   5034.282019  22245.707692
# min      927.000000    401.000000   148.000000   1074.000000   2346.000000
# 25%     1524.000000   3435.500000  2328.000000   8205.000000   8601.500000
# 50%     5790.000000  10588.000000  3202.000000  13044.000000  24494.000000
# 75%    13330.500000  33289.000000  3918.500000  16099.500000  52510.500000
# max    60303.000000  54625.000000  8813.000000  21031.000000  68663.000000
```

The descriptions of the row labels:
- `count` - number of observations;
- `mean` - sample mean;
- `std` - standard deviation;
- `min` - minimum value;
- `25%` - first quartile;
- `50%` - second quartile or median;
- `75%` - third quartile; and
- `max` - maximum value.

The variance is obtained with the `var()` method:

```python
df.var()

# OUTPUT
# Abra       2.804442e+08
# Apayao     2.386455e+08
# Benguet    2.523448e+06
# Ifugao     2.534400e+07
# Kalinga    4.948715e+08
```

Kalinga has the largest variance; therefore, the volume of production in Kalinga is more dispersed than the yields in the other regions. Another measure of variability (relative variability) is the coefficient of variation (CV), which is defined as

$$\mathrm{CV} = \frac{s}{\bar{x}},$$

where $s$ is the standard deviation and $\bar{x}$ is the mean. It is computed with the `variation` function in the `scipy.stats` module. Note that `variation` uses the population standard deviation (`ddof = 0`) by default, so its values are slightly smaller than the ratio of `std` to `mean` in the `describe()` output:

```python
import scipy.stats as ss

ss.variation(df)

# OUTPUT
# array([ 1.29250025,  0.91040816,  0.48756845,  0.40293766,  0.72601197])
```

The CV is unitless, which makes it suitable for comparing the dispersion of distributions measured in different units. So, supposing the yields of the five regions were recorded in different units, Abra would be more variable than the other regions. Why? The data in Abra are concentrated on the lower values (see the histograms below), which pulls its mean down, while the distance from its mean to its maximum is 47428.620253, compared with only 38216.58228 for Kalinga (the region with the largest variance). This makes the standard deviation of production in Abra larger than its mean.

To proceed to the measures of position, the $20^{\mathrm{th}}$ percentile of the data, which is the $2^{\mathrm{nd}}$ decile, is simply coded as:

```python
# axis = 0 computes the percentile column by column;
# recent SciPy versions flatten the input when axis is omitted
ss.scoreatpercentile(df, 20, axis = 0)

# OUTPUT
# array([ 1484. ,  2749.6,  2172.4,  6805.2,  7713.2])
```

Hence, 20% of the data in Ifugao are less than or equal to 6805.2, and the remaining 80% are greater than that.

Next, utilize the `skew()` and `kurt()` methods for computing the unbiased skewness and kurtosis, respectively. The `kurt()` method reports excess kurtosis, so a normal distribution scores 0:

```python
df.skew()

# OUTPUT
# Abra       1.685130
# Apayao     0.663640
# Benguet    0.864243
# Ifugao    -0.125157
# Kalinga    0.388291

df.kurt()

# OUTPUT
# Abra       1.859803
# Apayao    -1.046026
# Benguet    1.592915
# Ifugao    -0.979704
# Kalinga   -1.335408
```

Thus, the data in Abra are positively skewed and leptokurtic, which is supported by the following histograms:

```python
df.hist(xrot = 45, sharey = True)

# OUTPUT
# array([[Axes(0.125,0.552174;0.215278x0.347826),
#         Axes(0.404861,0.552174;0.215278x0.347826),
#         Axes(0.684722,0.552174;0.215278x0.347826)],
#        [Axes(0.125,0.1;0.215278x0.347826),
#         Axes(0.404861,0.1;0.215278x0.347826),
#         Axes(0.684722,0.1;0.215278x0.347826)]], dtype=object)
```

[Figure: histograms of the production volume in the five regions]

Do some customization on the histogram of Benguet:

```python
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams.update({'font.size': 11, 'font.family': 'serif'})
plt.locator_params(nbins = 8)
# Benguet is the third column; .ix and normed are no longer available
# in current pandas and matplotlib, so use .iloc and density instead
plt.hist(df.iloc[:, 2], color = 'orange', density = False)
plt.xticks(rotation = 45)
plt.xlabel('Data')
plt.ylabel('Count')
plt.title('Histogram of Benguet', fontsize = 18, verticalalignment = 'bottom', color = 'brown')
plt.grid(True, axis = 'y', which = 'major')
```

![Histogram of Benguet](http://3.bp.blogspot.com/-aSWK1B6rUZI/U6B9UEGjp2I/AAAAAAAACW8/o9hnPu3Yaq8/s1600/hist.png)

Other measures of central tendency, such as the geometric and harmonic means, can also be computed using the `scipy.stats.mstats.gmean` and `scipy.stats.mstats.hmean` functions, respectively.
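To make these two means concrete, here is a minimal sketch that computes them from their definitions. The helper names `geometric_mean` and `harmonic_mean` and the sample values are made up for illustration; they are not part of the palay data set or of SciPy. The optional cross-check at the end assumes that `scipy.stats.gmean` and `scipy.stats.hmean` are available when SciPy is installed.

```python
import math

def geometric_mean(values):
    """n-th root of the product, computed through logs for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return len(values) / sum(1.0 / v for v in values)

# Hypothetical positive sample (both means require strictly positive values)
sample = [1.0, 2.0, 4.0]
gm = geometric_mean(sample)
hm = harmonic_mean(sample)
print(gm, hm)

# Optional cross-check against SciPy, skipped when SciPy is not installed
try:
    from scipy import stats
    assert math.isclose(gm, stats.gmean(sample))
    assert math.isclose(hm, stats.hmean(sample))
except ImportError:
    pass
```

For a positive sample the harmonic mean never exceeds the geometric mean, which in turn never exceeds the arithmetic mean, so comparing the three gives a quick sense of how skewed a column such as Abra is.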