Thursday, 16 April 2015

Python and R: Basic Sampling Problem

In this post, I would like to share a simple problem about sampling analysis. And I will demonstrate how to solve this using Python and R. The first two problems are originally from Sampling: Design and Analysis book by Sharon Lohr.

Problems

  1. Let $N=6$ and $n=3$. For purposes of studying sampling distributions, assume that all population values are known.

    $y_1 = 98$$y_2 = 102$$y_3=154$
    $y_4 = 133$$y_5 = 190$$y_6=175$

    We are interested in $\bar{y}_U$, the population mean. Consider eight possible samples chosen.

    Sample No.Sample, $\mathcal{S}$$P(\mathcal{S})$
    1$\{1,3,5\}$$1/8$
    2$\{1,3,6\}$$1/8$
    3$\{1,4,5\}$$1/8$
    4$\{1,4,6\}$$1/8$
    5$\{2,3,5\}$$1/8$
    6$\{2,3,6\}$$1/8$
    7$\{2,4,5\}$$1/8$
    8$\{2,4,6\}$$1/8$

    1. What is the value of $\bar{y}_U$?
    2. Let $\bar{y}$ be the mean of the sample values. For each sampling plan, find
      1. $\mathrm{E}\bar{y}$;
      2. $\mathrm{Var}\bar{y}$;
      3. $\mathrm{Bias}(\bar{y})$;
      4. $\mathrm{MSE}(\bar{y})$;
  2. Mayr et al. (1994) took an SRS of 240 children who visisted their pediatric outpatient clinic. They found the following frequency distribution for the age (in months) of free (unassisted) walking among the children:

    Age (months)91011121314151617181920
    Number of Children133544693624732511

    Find the mean and SE of the age for onset of free walking.
  3. Table 1 gives the cultivated area in acres in 1981 for 40 villages in a region. (Theory and Method of Survey) Using the arrangement (random) of data in the table, draw systematic sample of size 8. Use r ((random start) = 2,

    Village$Y_j$Village$Y_j$Village$Y_j$Village$Y_j$
    11051131921703116
    262512722224932439
    347131092338433123
    431214912448234207
    5327151522537835145
    6230161892611136666
    7240173652753437338
    820318702830638624
    953592492965539501
    10275203843010240962

Solutions

In order to appreciate the codes, I will share some theoretical part of the solution. But our main focus here is to solve this problem computationally using Python and R.
    1. The value of $\bar{y}_U$ is coded as follows:

      Python Code R Code
    2. To obtain the sample using the sample index given in the table in the above question, we do a combination of population index of three elements, ${6\choose 3}$, first. Where the first two combinations are the samples, $\{1,2,3\}$ and $\{1,2,4\}$, and so on. Then from this list of all possible combinations of three elements, we draw those that are listed in the above table as our samples, with first sample index $\{1,3,5\}$, having population units, $\{98, 154, 190\}$. So that the following is the code of this sampling design:

      Python Code R Code
      1. Now to obtain the expected value of the average of the sample data, we compute it using $\mathrm{E}\bar{y}=\sum_{k}\bar{y}_k\mathrm{P}(\bar{y}_k)=\sum_{k}\bar{y_k}\mathrm{P}(\mathcal{S}_k)$, $\forall k\in\{1,\cdots,8\}$. So for $k = 1$, $$ \begin{aligned} \bar{y}_1\mathrm{P}(\mathcal{S}_1)&=\frac{98+154+190}{3}\mathrm{P}(\mathcal{S}_1)\\ &=\frac{98+154+190}{3}\left(\frac{1}{8}\right)=18.41667. \end{aligned} $$ Applying this to the remaining $n-1$ $k$s, and summing up the terms gives us the answer to $\mathrm{E}\bar{y}$. So that the following is the equivalent of this:

        Python Code R Code From the above code, the output tells us that $\mathrm{E}\bar{y}=140$.
      2. Next is to compute for the variance of $\bar{y}$, which is $\mathrm{Var}\bar{y}=\mathrm{E}\bar{y}^{2}-(\mathrm{E}\bar{y})^2$. So we need a function for $\mathrm{E}\bar{y}^2$, where the first term of this, $k=1$, is $\bar{y}_1^2\mathrm{P}(\mathcal{S}_1)=\left(\frac{98+154+190}{3}\right)^2\mathrm{P}(\mathcal{S}_1)=\left(\frac{98+154+190}{3}\right)^2(\frac{1}{8})=2713.3889$. Applying this to other terms and summing them up, we have following code:

        Python Code R Code So that using the above output, 20182.94, and subtracting $(\mathrm{E}\bar{y})^2$ to it, will give us the variance. And hence the succeeding code:

        Python Code: R Code: So the variance of the $\bar{y}$ is $18.9444$.
    3. The $\mathrm{Bias}$ is just the difference between the estimate and the true value. And since the estimate is unbiased ($\mathrm{E}\bar{y}=142$), so $\mathrm{Bias}=142-142=0$.
    4. $\mathrm{MSE}=\mathrm{Var}\bar{y}-(\mathrm{Bias}\bar{y})^2$, and since the $\mathrm{Bias}\bar{y}=0$. So $\mathrm{MSE}=\mathrm{Var}\bar{y}$.
  1. First we need to obtain the probability of each Age, that is by dividing the Number of Children with the total sum of it. That is why, we have p_s function defined below. After obtaining the probabilities, we can then compute the expected value using the expectation function we defined earlier.

    Python Code R Code It should be clear in the data that the average age is about 12 months old, where the plot of it is shown below, For the code of the above plot please click here. Next is to compute the standard error which is just the square root of the variance of the sample,

    Python Code R Code So the standard variability of the Age is 1.920824.
  2. Let me give you a brief discussion on the systematic sampling to help you understand the code. The idea in systematic sampling is that, given the population units numbered from 1 to $N$, we compute for the sampling interval, given by $k = \frac{N}{n}$, where $n$ is the number of units needed for the sample. After that, we choose for the random start, number between $1$ and $k$. This random start will be the first sample, and then the second unit in the sample is obtained by adding the sampling interval to the random start, and so on. There are two types of systematic sampling namely, Linear and Circular Systematic Samplings. Circular systematic sampling treats the population units numbered from $1$ to $N$ in circular form, so that if the increment step is more than the number of $N$ units, say $N+2$, the sample unit is the $2^{nd}$ element in the population, and so on. The code that I will be sharing can be used both for linear and circular, but for this particular problem only. Since there are rules in linear that are not satisfied in the function, one of which is if $k$ is not a whole number, despite that, however, you can always extend it to a more general function.

    Python Code R Code You may notice in the output above, that the index returned in Python is not the same with the index returned in R. This is because Python index starts with 0, while that in R starts with 1. So that's why we have the same population units sampled between the two language despite the differences between the index returned.

Reference

  1. Lohr, Sharon (2009). Sampling: Design and Analysis. Cengage Learning.

No comments:

Post a Comment