Python and R: Basic Sampling Problem

In this post, I would like to share a simple problem about sampling analysis. And I will demonstrate how to solve this using Python and R. The first two problems are originally from Sampling: Design and Analysis book by Sharon Lohr.

Problems

Let $N=6$ and $n=3$. For purposes of studying sampling distributions, assume that all population values are known.

$y_1 = 98$ $y_2 = 102$ $y_3=154$

$y_4 = 133$ $y_5 = 190$ $y_6=175$

We are interested in $\bar{y}_U$, the population mean. Consider eight possible samples chosen.

Sample No. Sample, $\mathcal{S}$ $P(\mathcal{S})$

1 $\{1,3,5\}$ $1/8$

2 $\{1,3,6\}$ $1/8$

3 $\{1,4,5\}$ $1/8$

4 $\{1,4,6\}$ $1/8$

5 $\{2,3,5\}$ $1/8$

6 $\{2,3,6\}$ $1/8$

7 $\{2,4,5\}$ $1/8$

8 $\{2,4,6\}$ $1/8$
1. What is the value of $\bar{y}_U$?
2. Let $\bar{y}$ be the mean of the sample values. For each sampling plan, find
  1. $\mathrm{E}\bar{y}$;
  2. $\mathrm{Var}\bar{y}$;
  3. $\mathrm{Bias}(\bar{y})$;
  4. $\mathrm{MSE}(\bar{y})$;
Mayr et al. (1994) took an SRS of 240 children who visisted their pediatric outpatient clinic. They found the following frequency distribution for the age (in months) of free (unassisted) walking among the children:

Age (months) 9 10 11 12 13 14 15 16 17 18 19 20

Number of Children 13 35 44 69 36 24 7 3 2 5 1 1

Find the mean and SE of the age for onset of free walking.

Table 1 gives the cultivated area in acres in 1981 for 40 villages in a region. (Theory and Method of Survey) Using the arrangement (random) of data in the table, draw systematic sample of size 8. Use r ((random start) = 2,

Village	$Y_j$	Village	$Y_j$	Village	$Y_j$	Village	$Y_j$
1	105	11	319	21	70	31	16
2	625	12	72	22	249	32	439
3	47	13	109	23	384	33	123
4	312	14	91	24	482	34	207
5	327	15	152	25	378	35	145
6	230	16	189	26	111	36	666
7	240	17	365	27	534	37	338
8	203	18	70	28	306	38	624
9	535	9	249	29	655	39	501
10	275	20	384	30	102	40	962

Solutions

In order to appreciate the codes, I will share some theoretical part of the solution. But our main focus here is to solve this problem computationally using Python and R.

1. The value of $\bar{y}_U$ is coded as follows:
  
  Python Code R Code
2. To obtain the sample using the sample index given in the table in the above question, we do a combination of population index of three elements, ${6\choose 3}$, first. Where the first two combinations are the samples, $\{1,2,3\}$ and $\{1,2,4\}$, and so on. Then from this list of all possible combinations of three elements, we draw those that are listed in the above table as our samples, with first sample index $\{1,3,5\}$, having population units, $\{98, 154, 190\}$. So that the following is the code of this sampling design:
  
  Python Code R Code
  1. Now to obtain the expected value of the average of the sample data, we compute it using $\mathrm{E}\bar{y}=\sum_{k}\bar{y}_k\mathrm{P}(\bar{y}_k)=\sum_{k}\bar{y_k}\mathrm{P}(\mathcal{S}_k)$, $\forall k\in\{1,\cdots,8\}$. So for $k = 1$, $$ \begin{aligned} \bar{y}_1\mathrm{P}(\mathcal{S}_1)&=\frac{98+154+190}{3}\mathrm{P}(\mathcal{S}_1)\\ &=\frac{98+154+190}{3}\left(\frac{1}{8}\right)=18.41667. \end{aligned} $$ Applying this to the remaining $n-1$ $k$s, and summing up the terms gives us the answer to $\mathrm{E}\bar{y}$. So that the following is the equivalent of this:
    
    Python Code R Code From the above code, the output tells us that $\mathrm{E}\bar{y}=140$.
  2. Next is to compute for the variance of $\bar{y}$, which is $\mathrm{Var}\bar{y}=\mathrm{E}\bar{y}^{2}-(\mathrm{E}\bar{y})^2$. So we need a function for $\mathrm{E}\bar{y}^2$, where the first term of this, $k=1$, is $\bar{y}_1^2\mathrm{P}(\mathcal{S}_1)=\left(\frac{98+154+190}{3}\right)^2\mathrm{P}(\mathcal{S}_1)=\left(\frac{98+154+190}{3}\right)^2(\frac{1}{8})=2713.3889$. Applying this to other terms and summing them up, we have following code:
    
    Python Code R Code So that using the above output, 20182.94, and subtracting $(\mathrm{E}\bar{y})^2$ to it, will give us the variance. And hence the succeeding code:
    
    Python Code: R Code: So the variance of the $\bar{y}$ is $18.9444$.
3. The $\mathrm{Bias}$ is just the difference between the estimate and the true value. And since the estimate is unbiased ($\mathrm{E}\bar{y}=142$), so $\mathrm{Bias}=142-142=0$.
4. $\mathrm{MSE}=\mathrm{Var}\bar{y}-(\mathrm{Bias}\bar{y})^2$, and since the $\mathrm{Bias}\bar{y}=0$. So $\mathrm{MSE}=\mathrm{Var}\bar{y}$.
First we need to obtain the probability of each Age, that is by dividing the Number of Children with the total sum of it. That is why, we have p_s function defined below. After obtaining the probabilities, we can then compute the expected value using the expectation function we defined earlier.

Python Code R Code It should be clear in the data that the average age is about 12 months old, where the plot of it is shown below, For the code of the above plot please click here. Next is to compute the standard error which is just the square root of the variance of the sample,

Python Code R Code So the standard variability of the Age is 1.920824.
Let me give you a brief discussion on the systematic sampling to help you understand the code. The idea in systematic sampling is that, given the population units numbered from 1 to $N$, we compute for the sampling interval, given by $k = \frac{N}{n}$, where $n$ is the number of units needed for the sample. After that, we choose for the random start, number between $1$ and $k$. This random start will be the first sample, and then the second unit in the sample is obtained by adding the sampling interval to the random start, and so on. There are two types of systematic sampling namely, Linear and Circular Systematic Samplings. Circular systematic sampling treats the population units numbered from $1$ to $N$ in circular form, so that if the increment step is more than the number of $N$ units, say $N+2$, the sample unit is the $2^{nd}$ element in the population, and so on. The code that I will be sharing can be used both for linear and circular, but for this particular problem only. Since there are rules in linear that are not satisfied in the function, one of which is if $k$ is not a whole number, despite that, however, you can always extend it to a more general function.

Python Code R Code You may notice in the output above, that the index returned in Python is not the same with the index returned in R. This is because Python index starts with 0, while that in R starts with 1. So that's why we have the same population units sampled between the two language despite the differences between the index returned.

Reference

Lohr, Sharon (2009). Sampling: Design and Analysis. Cengage Learning.

Analysis with Programming

Search This Blog