Skip to main content

Posts

Showing posts from 2015

R and Python: Theory of Linear Least Squares

In my previous article, we talked about implementations of linear regression models in R, Python and SAS. On the theoretical sides, however, I briefly mentioned the estimation procedure for the parameter $\boldsymbol{\beta}$. So to help us understand how software does the estimation procedure, we'll look at the mathematics behind it. We will also perform the estimation manually in R and in Python, that means we're not going to use any special packages, this will help us appreciate the theory.

Linear Least Squares Consider the linear regression model, \[ y_i=f_i(\mathbf{x}|\boldsymbol{\beta})+\varepsilon_i,\quad\mathbf{x}_i=\left[ \begin{array}{cccc} 1&x_{11}&\cdots&x_{1p} \end{array}\right],\quad\boldsymbol{\beta}=\left[\begin{array}{c}\beta_0\\\beta_1\\\vdots\\\beta_p\end{array}\right], \] where $y_i$ is the response or the dependent variable at the $i$th case, $i=1,\cdots, N$. The $f_i(\mathbf{x}|\boldsymbol{\beta})$ is the deterministic part of the model that dep…

R, Python, and SAS: Getting Started with Linear Regression

Consider the linear regression model, $$ y_i=f_i(\boldsymbol{x}|\boldsymbol{\beta})+\varepsilon_i, $$ where $y_i$ is the response or the dependent variable at the $i$th case, $i=1,\cdots, N$ and the predictor or the independent variable is the $\boldsymbol{x}$ term defined in the mean function $f_i(\boldsymbol{x}|\boldsymbol{\beta})$. For simplicity, consider the following simple linear regression (SLR) model, $$ y_i=\beta_0+\beta_1x_i+\varepsilon_i. $$ To obtain the (best) estimate of $\beta_0$ and $\beta_1$, we solve for the least residual sum of squares (RSS) given by, $$ S=\sum_{i=1}^{n}\varepsilon_i^2=\sum_{i=1}^{n}(y_i-\beta_0-\beta_1x_i)^2. $$ Now suppose we want to fit the model to the following data, Average Heights and Weights for American Women, where weight is the response and height is the predictor. The data is available in R by default.

Parametric Inference: Karlin-Rubin Theorem

A family of pdfs or pmfs $\{g(t|\theta):\theta\in\Theta\}$ for a univariate random variable $T$ with real-valued parameter $\theta$ has a monotone likelihood ratio (MLR) if, for every $\theta_2>\theta_1$, $g(t|\theta_2)/g(t|\theta_1)$ is a monotone (nonincreasing or nondecreasing) function of $t$ on $\{t:g(t|\theta_1)>0\;\text{or}\;g(t|\theta_2)>0\}$. Note that $c/0$ is defined as $\infty$ if $0< c$. Consider testing $H_0:\theta\leq \theta_0$ versus $H_1:\theta>\theta_0$. Suppose that $T$ is a sufficient statistic for $\theta$ and the family of pdfs or pmfs $\{g(t|\theta):\theta\in\Theta\}$ of $T$ has an MLR. Then for any $t_0$, the test that rejects $H_0$ if and only if $T >t_0$ is a UMP level $\alpha$ test, where $\alpha=P_{\theta_0}(T >t_0)$. Example 1
To better understand the theorem, consider a single observation, $X$, from $\mathrm{n}(\theta,1)$, and test the following hypotheses: $$ H_0:\theta\leq \theta_0\quad\mathrm{versus}\quad H_1:\theta>\theta_0. …

Parametric Inference: Likelihood Ratio Test Problem 2

More on Likelihood Ratio Test, the following problem is originally from Casella and Berger (2001), exercise 8.12.

Problem For samples of size $n=1,4,16,64,100$ from a normal population with mean $\mu$ and known variance $\sigma^2$, plot the power function of the following LRTs (Likelihood Ratio Tests). Take $\alpha = .05$. $H_0:\mu\leq 0$ versus $H_1:\mu>0$$H_0:\mu=0$ versus $H_1:\mu\neq 0$ Solution The LRT statistic is given by $$ \lambda(\mathbf{x})=\frac{\displaystyle\sup_{\mu\leq 0}\mathcal{L}(\mu|\mathbf{x})}{\displaystyle\sup_{-\infty<\mu<\infty}\mathcal{L}(\mu|\mathbf{x})}, \;\text{since }\sigma^2\text{ is known}. $$ The denominator can be expanded as follows: $$ \begin{aligned} \sup_{-\infty<\mu<\infty}\mathcal{L}(\mu|\mathbf{x})&=\sup_{-\infty<\mu<\infty}\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(x_i-\mu)^2}{2\sigma^2}\right]\\ &=\sup_{-\infty<\mu<\infty}\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\displaystyle\sum_{i=1}^{n}\…

Parametric Inference: Likelihood Ratio Test Problem 1

Another post for mathematical statistics, the problem below is originally from Casella and Berger (2001) (see Reference 1), exercise 8.6.

ProblemSuppose that we have two independent random samples $X_1,\cdots, X_n$ are exponential($\theta$), and $Y_1,\cdots, Y_m$ are exponential($\mu$). Find the LRT (Likelihood Ratio Test) of $H_0:\theta=\mu$ versus $H_1:\theta\neq\mu$. Show that the test in part (a) can be based on the statistic $$ T=\frac{\sum X_i}{\sum X_i+\sum Y_i}. $$ Find the distribution of $T$ when $H_0$ is true.Solution The Likelihood Ratio Test is given by $$ \lambda(\mathbf{x},\mathbf{y}) = \frac{\displaystyle\sup_{\theta = \mu,\mu>0}\mathrm{P}(\mathbf{x},\mathbf{y}|\theta,\mu)}{\displaystyle\sup_{\theta > 0,\mu>0}\mathrm{P}(\mathbf{x}, \mathbf{y}|\theta,\mu)}, $$ where the denominator is evaluated as follows: $$ \sup_{\theta > 0,\mu>0}\mathrm{P}(\mathbf{x}, \mathbf{y}|\theta,\mu)= \sup_{\theta > 0}\mathrm{P}(\mathbf{x}|\theta)\sup_{\mu > 0}\mathrm{P…

Parametric Inference: The Power Function of the Test

In Statistics, we model random phenomenon and make conclusions about its population. For example, in an experiment of determining the true heights of the students in the university. Suppose we take sample from the population of the students, and consider testing the null hypothesis that the average height is 5.4 ft against an alternative hypothesis that the average height is greater than 5.4 ft. Mathematically, we can represent this as $H_0:\theta=\theta_0$ vs $H_1:\theta>\theta_0$, where $\theta$ is the true value of the parameter, and $\theta_0=5.4$ is the testing value set by the experimenter. And because we only consider subset (the sample) of the population for testing the hypotheses, then we expect for errors we commit. To understand these errors, consider if the above test results into rejecting $H_0$ given that $\theta\in\Theta_0$, where $\Theta_0$ is the parameter space of the null hypothesis, in other words we mistakenly reject $H_0$, then in this case we committed a Type…

Parametric Inference: Likelihood Ratio Test by Example

Hypothesis testing have been extensively used on different discipline of science. And in this post, I will attempt on discussing the basic theory behind this, the Likelihood Ratio Test (LRT) defined below from Casella and Berger (2001), see reference 1. Definition. The likelihood ratio test statistic for testing $H_0:\theta\in\Theta_0$ versus $H_1:\theta\in\Theta_0^c$ is \begin{equation} \label{eq:lrt} \lambda(\mathbf{x})=\frac{\displaystyle\sup_{\theta\in\Theta_0}L(\theta|\mathbf{x})}{\displaystyle\sup_{\theta\in\Theta}L(\theta|\mathbf{x})}. \end{equation} A likelihood ratio test (LRT) is any test that has a rejection region of the form $\{\mathbf{x}:\lambda(\mathbf{x})\leq c\}$, where $c$ is any number satisfying $0\leq c \leq 1$. The numerator of equation (\ref{eq:lrt}) gives us the supremum probability of the parameter, $\theta$, over the restricted domain (null hypothesis, $\Theta_0$) of the parameter space $\Theta$, that maximizes the joint probability of the sample, $\mathbf{…

SAS®: Getting Started with PROC IML

Another powerful procedure of SAS, my favorite one, that I would like to share is the PROC IML (Interactive Matrix Language). This procedure treats all objects as a matrix, and is very useful for doing scientific computations involving vectors and matrices. To get started, we are going to demonstrate and discuss the following: Creating and Shaping Matrices;Matrix Query;Subscripts;Descriptive Statistics;Set Operations;Probability Functions and Subroutine;Linear Algebra;Reading and Creating Data; Above outline is based on the IML tip sheet (see Reference 1). So to begin on the first bullet, consider the following code:

Python and R: Basic Sampling Problem

In this post, I would like to share a simple problem about sampling analysis. And I will demonstrate how to solve this using Python and R. The first two problems are originally from Sampling: Design and Analysis book by Sharon Lohr.

ProblemsLet $N=6$ and $n=3$. For purposes of studying sampling distributions, assume that all population values are known.

$y_1 = 98$$y_2 = 102$$y_3=154$$y_4 = 133$$y_5 = 190$$y_6=175$
We are interested in $\bar{y}_U$, the population mean. Consider eight possible samples chosen.

Sample No.Sample, $\mathcal{S}$$P(\mathcal{S})$1$\{1,3,5\}$$1/8$2$\{1,3,6\}$$1/8$3$\{1,4,5\}$$1/8$4$\{1,4,6\}$$1/8$5$\{2,3,5\}$$1/8$6$\{2,3,6\}$$1/8$7$\{2,4,5\}$$1/8$8$\{2,4,6\}$$1/8$

Probability Theory: Convergence in Distribution Problem

Let's solve some theoretical problem in probability, specifically on convergence. The problem below is originally from Exercise 5.42 of Casella and Berger (2001). And I just want to share my solution on this. If there is an incorrect argument below, I would be happy if you could point that to me.

Problem Let $X_1, X_2,\cdots$ be iid (independent and identically distributed) and $X_{(n)}=\max_{1\leq i\leq n}x_i$. If $x_i\sim$ beta(1,$\beta$), find a value of $\nu$ so that $n^{\nu}(1-X_{(n)})$ converges in distribution;If $x_i\sim$ exponential(1), find a sequence $a_n$ so that $X_{(n)}-a_n$ converges in distribution.SolutionLet $Y_n=n^{\nu}(1-X_{(n)})$, we say that $Y_n\rightarrow Y$ in distribution. If $$\lim_{n\rightarrow \infty}F_{Y_n}(y)=F_Y(y).$$ Then, $$ \begin{aligned} \lim_{n\rightarrow\infty}F_{Y_n}(y)&=\lim_{n\rightarrow\infty}P(Y_n\leq y)=\lim_{n\rightarrow\infty}P(n^{\nu}(1-X_{(n)})\leq y)\\ &=\lim_{n\rightarrow\infty}P\left(1-X_{(n)}\leq \frac{y}{n^{\nu}}\right…

R: How to Layout and Design an Infographic

As promised from my recent article, here's my tutorial on how to layout and design an infographic in R. This article will serve as a template for more infographic design that I plan to share on future posts. Hence, we will go through the following sections:
Layout - mainly handles by grid package.Design - style of the elements in the layout. Texts - use extrafont package for custom fonts;Shapes (lines and point characters) - use grid, although this package has been removed from CRAN (as of February 26, 2015), the compressed file of the source code of the package is still available. But if I am not mistaken, by default this package is included in R. You might check it first before installing.Plots - several choices for plotting data in R: base plot, lattice, or ggplot2 package. The Infographic We aim to obtain the following layout and design in the final output of our code:

Philippine Infographic: Recapitulation on Incidents Involving Motorcycle Riding in Tandem Criminals for 2011-2013

The Philippine government has launched Open Data Philippines (data.gov.ph) last year, January 16, 2014. Accordingly, the data.gov.ph aims to make national government data searchable, accessible, and useful, with the help of the different agencies of government, and with the participation of the public. This website consolidates the data sets of different government agencies, allowing users to find specific information from a rich and continuously growing collection of public data sets.

Data.gov.ph provides information on how to access these datasets and tools, such infographics and other applications, to make the information easy to understand. Users may not only view the datasets, but also share and download them as spreadsheets and other formats, for their own use.

The primary goal of data.gov.ph is to foster a citizenry empowered to make informed decisions, and to promote efficiency and transparency in government. For more, check out the video:

Python: Getting Started with Data Analysis

Analysis with Programming has recently been syndicated to Planet Python. And as a first post being a contributing blog on the said site, I would like to share how to get started with data analysis on Python. Specifically, I would like to do the following:
Importing the data Importing CSV file both locally and from the web;Data transformation;Descriptive statistics of the data;Hypothesis testing One-sample t test;Visualization; andCreating custom function. Importing the data This is the crucial step, we need to import the data in order to proceed with the succeeding analysis. And often times data are in CSV format, if not, at least can be converted to CSV format. In Python we can do this using the following codes:

Multiple Random Variables Problems

To probability lovers, I just want to share (and discuss) few simple problems I solved in Chapter 4 of Casella, G. and Berger, R.L. (2002). Statistical Inference.
A random point $(X,Y)$ is distributed uniformly on the square with vertices $(1, 1),(1,-1),(-1,1),$ and $(-1,-1)$. That is, the joint pdf is $f(x,y)=\frac{1}{4}$ on the square. Determine the probabilities of the following events. $X^2 + Y^2 < 1$$2X-Y>0$$|X+Y|<1$ (modified since the original $|X+Y|<2$ is trivial.)Solutions:$X^2 + Y^2 < 1$
We need to consider the boundary of this inequality first in the unit square, so below is the plot of $X^2 + Y^2 = 1$,

New Toy: SAS® University Edition

So I started using SAS® University Edition which is a FREE version of SAS® software. Again it's FREE, and that's the main reason why I want to relearn the language. The software was announced on March 24, 2014 and the download went available on May of that year. And for that, I salute Dr. Jim Goodnight. At least we can learn SAS® without paying for the expensive price tag, especially for single user like me.

The software requires a virtual machine, where it runs on top of that; and a 64-bit processor. To install, just follow the instruction in this video. Although the installation in the video is done in Windows, it also works on Mac. Below is the screenshot of my SAS® Studio running on Safari.

R: Canonical Correlation Analysis on Imaging

In imaging, we deal with multivariate data, like in array form with several spectral bands. And trying to come up with interpretation across correlations of its dimensions is very challenging, if not impossible. For example let's recall the number of spectral bands of AVIRIS data we used in the previous post. There are 152 bands, so in total there are 152$\cdot$152 = 23104 correlations of pairs of random variables. How will you be able to interpret that huge number of correlations?

To engage on this, it might be better if we group these variables into two and study the relationship between these sets of variables. Such statistical procedure can be done using the canonical correlation analysis (CCA). An example of this on health sciences (from Reference 2) is variables related to exercise and health. On one hand you have variables associated with exercise, observations such as the climbing rate on a stair stepper, how fast you can run, the amount of weight lifted on bench press, …