## Posts

Showing posts from 2015

### R and Python: Gradient Descent

One of the problems often dealt in Statistics is minimization of the objective function. And contrary to the linear models, there is no analytical solution for models that are nonlinear on the parameters such as logistic regression, neural networks, and nonlinear regression models (like Michaelis-Menten model). In this situation, we have to use mathematical programming or optimization. And one popular optimization algorithm is the gradient descent, which we're going to illustrate here. To start with, let's consider a simple function with closed-form solution given by $$f(\beta) \triangleq \beta^4 - 3\beta^3 + 2.$$ We want to minimize this function with respect to $\beta$. The quick solution to this, as what calculus taught us, is to compute for the first derivative of the function, that is $$\frac{\text{d}f(\beta)}{\text{d}\beta}=4\beta^3-9\beta^2.$$ Setting this to 0 to obtain the stationary point gives us \begin{align}…

### R and Python: Theory of Linear Least Squares

In my previous article, we talked about implementations of linear regression models in R, Python and SAS. On the theoretical sides, however, I briefly mentioned the estimation procedure for the parameter $\boldsymbol{\beta}$. So to help us understand how software does the estimation procedure, we'll look at the mathematics behind it. We will also perform the estimation manually in R and in Python, that means we're not going to use any special packages, this will help us appreciate the theory.

Linear Least Squares Consider the linear regression model, $y_i=f_i(\mathbf{x}|\boldsymbol{\beta})+\varepsilon_i,\quad\mathbf{x}_i=\left[ \begin{array}{cccc} 1&x_{11}&\cdots&x_{1p} \end{array}\right],\quad\boldsymbol{\beta}=\left[\begin{array}{c}\beta_0\\\beta_1\\\vdots\\\beta_p\end{array}\right],$ where $y_i$ is the response or the dependent variable at the $i$th case, $i=1,\cdots, N$. The $f_i(\mathbf{x}|\boldsymbol{\beta})$ is the deterministic part of the model that dep…

### R, Python, and SAS: Getting Started with Linear Regression

Consider the linear regression model, $$y_i=f_i(\boldsymbol{x}|\boldsymbol{\beta})+\varepsilon_i,$$ where $y_i$ is the response or the dependent variable at the $i$th case, $i=1,\cdots, N$ and the predictor or the independent variable is the $\boldsymbol{x}$ term defined in the mean function $f_i(\boldsymbol{x}|\boldsymbol{\beta})$. For simplicity, consider the following simple linear regression (SLR) model, $$y_i=\beta_0+\beta_1x_i+\varepsilon_i.$$ To obtain the (best) estimate of $\beta_0$ and $\beta_1$, we solve for the least residual sum of squares (RSS) given by, $$S=\sum_{i=1}^{n}\varepsilon_i^2=\sum_{i=1}^{n}(y_i-\beta_0-\beta_1x_i)^2.$$ Now suppose we want to fit the model to the following data, Average Heights and Weights for American Women, where weight is the response and height is the predictor. The data is available in R by default.

### Parametric Inference: Karlin-Rubin Theorem

A family of pdfs or pmfs $\{g(t|\theta):\theta\in\Theta\}$ for a univariate random variable $T$ with real-valued parameter $\theta$ has a monotone likelihood ratio (MLR) if, for every $\theta_2>\theta_1$, $g(t|\theta_2)/g(t|\theta_1)$ is a monotone (nonincreasing or nondecreasing) function of $t$ on $\{t:g(t|\theta_1)>0\;\text{or}\;g(t|\theta_2)>0\}$. Note that $c/0$ is defined as $\infty$ if $0< c$. Consider testing $H_0:\theta\leq \theta_0$ versus $H_1:\theta>\theta_0$. Suppose that $T$ is a sufficient statistic for $\theta$ and the family of pdfs or pmfs $\{g(t|\theta):\theta\in\Theta\}$ of $T$ has an MLR. Then for any $t_0$, the test that rejects $H_0$ if and only if $T >t_0$ is a UMP level $\alpha$ test, where $\alpha=P_{\theta_0}(T >t_0)$. Example 1

### SAS®: Getting Started with PROC IML

Another powerful procedure of SAS, my favorite one, that I would like to share is the PROC IML (Interactive Matrix Language). This procedure treats all objects as a matrix, and is very useful for doing scientific computations involving vectors and matrices. To get started, we are going to demonstrate and discuss the following: Creating and Shaping Matrices;Matrix Query;Subscripts;Descriptive Statistics;Set Operations;Probability Functions and Subroutine;Linear Algebra;Reading and Creating Data; Above outline is based on the IML tip sheet (see Reference 1). So to begin on the first bullet, consider the following code:

### Python and R: Basic Sampling Problem

In this post, I would like to share a simple problem about sampling analysis. And I will demonstrate how to solve this using Python and R. The first two problems are originally from Sampling: Design and Analysis book by Sharon Lohr.

ProblemsLet $N=6$ and $n=3$. For purposes of studying sampling distributions, assume that all population values are known.

$y_1 = 98$$y_2 = 102$$y_3=154$$y_4 = 133$$y_5 = 190$$y_6=175 We are interested in \bar{y}_U, the population mean. Consider eight possible samples chosen. Sample No.Sample, \mathcal{S}$$P(\mathcal{S})$1$\{1,3,5\}$$1/82\{1,3,6\}$$1/8$3$\{1,4,5\}$$1/84\{1,4,6\}$$1/8$5$\{2,3,5\}$$1/86\{2,3,6\}$$1/8$7$\{2,4,5\}$$1/88\{2,4,6\}$$1/8$

### Probability Theory: Convergence in Distribution Problem

Let's solve some theoretical problem in probability, specifically on convergence. The problem below is originally from Exercise 5.42 of Casella and Berger (2001). And I just want to share my solution on this. If there is an incorrect argument below, I would be happy if you could point that to me.

Problem Let $X_1, X_2,\cdots$ be iid (independent and identically distributed) and $X_{(n)}=\max_{1\leq i\leq n}x_i$. If $x_i\sim$ beta(1,$\beta$), find a value of $\nu$ so that $n^{\nu}(1-X_{(n)})$ converges in distribution;If $x_i\sim$ exponential(1), find a sequence $a_n$ so that $X_{(n)}-a_n$ converges in distribution.SolutionLet $Y_n=n^{\nu}(1-X_{(n)})$, we say that $Y_n\rightarrow Y$ in distribution. If $$\lim_{n\rightarrow \infty}F_{Y_n}(y)=F_Y(y).$$ Then, \begin{aligned} \lim_{n\rightarrow\infty}F_{Y_n}(y)&=\lim_{n\rightarrow\infty}P(Y_n\leq y)=\lim_{n\rightarrow\infty}P(n^{\nu}(1-X_{(n)})\leq y)\\ &=\lim_{n\rightarrow\infty}P\left(1-X_{(n)}\leq \frac{y}{n^{\nu}}\right… ### R: How to Layout and Design an Infographic As promised from my recent article, here's my tutorial on how to layout and design an infographic in R. This article will serve as a template for more infographic design that I plan to share on future posts. Hence, we will go through the following sections: Layout - mainly handles by grid package.Design - style of the elements in the layout. Texts - use extrafont package for custom fonts;Shapes (lines and point characters) - use grid, although this package has been removed from CRAN (as of February 26, 2015), the compressed file of the source code of the package is still available. But if I am not mistaken, by default this package is included in R. You might check it first before installing.Plots - several choices for plotting data in R: base plot, lattice, or ggplot2 package. The Infographic We aim to obtain the following layout and design in the final output of our code: ### Philippine Infographic: Recapitulation on Incidents Involving Motorcycle Riding in Tandem Criminals for 2011-2013 The Philippine government has launched Open Data Philippines (data.gov.ph) last year, January 16, 2014. Accordingly, the data.gov.ph aims to make national government data searchable, accessible, and useful, with the help of the different agencies of government, and with the participation of the public. This website consolidates the data sets of different government agencies, allowing users to find specific information from a rich and continuously growing collection of public data sets. Data.gov.ph provides information on how to access these datasets and tools, such infographics and other applications, to make the information easy to understand. Users may not only view the datasets, but also share and download them as spreadsheets and other formats, for their own use. The primary goal of data.gov.ph is to foster a citizenry empowered to make informed decisions, and to promote efficiency and transparency in government. For more, check out the video: ### Python: Getting Started with Data Analysis Analysis with Programming has recently been syndicated to Planet Python. And as a first post being a contributing blog on the said site, I would like to share how to get started with data analysis on Python. Specifically, I would like to do the following: Importing the data Importing CSV file both locally and from the web;Data transformation;Descriptive statistics of the data;Hypothesis testing One-sample t test;Visualization; andCreating custom function. Importing the data This is the crucial step, we need to import the data in order to proceed with the succeeding analysis. And often times data are in CSV format, if not, at least can be converted to CSV format. In Python we can do this using the following codes: ### Multiple Random Variables Problems To probability lovers, I just want to share (and discuss) few simple problems I solved in Chapter 4 of Casella, G. and Berger, R.L. (2002). Statistical Inference. A random point (X,Y) is distributed uniformly on the square with vertices (1, 1),(1,-1),(-1,1), and (-1,-1). That is, the joint pdf is f(x,y)=\frac{1}{4} on the square. Determine the probabilities of the following events. X^2 + Y^2 < 12X-Y>0|X+Y|<1$(modified since the original$|X+Y|<2$is trivial.)Solutions:$X^2 + Y^2 < 1$We need to consider the boundary of this inequality first in the unit square, so below is the plot of$X^2 + Y^2 = 1$, ### New Toy: SAS® University Edition So I started using SAS® University Edition which is a FREE version of SAS® software. Again it's FREE, and that's the main reason why I want to relearn the language. The software was announced on March 24, 2014 and the download went available on May of that year. And for that, I salute Dr. Jim Goodnight. At least we can learn SAS® without paying for the expensive price tag, especially for single user like me. The software requires a virtual machine, where it runs on top of that; and a 64-bit processor. To install, just follow the instruction in this video. Although the installation in the video is done in Windows, it also works on Mac. Below is the screenshot of my SAS® Studio running on Safari. ### R: Canonical Correlation Analysis on Imaging In imaging, we deal with multivariate data, like in array form with several spectral bands. And trying to come up with interpretation across correlations of its dimensions is very challenging, if not impossible. For example let's recall the number of spectral bands of AVIRIS data we used in the previous post. There are 152 bands, so in total there are 152$\cdot\$152 = 23104 correlations of pairs of random variables. How will you be able to interpret that huge number of correlations?

To engage on this, it might be better if we group these variables into two and study the relationship between these sets of variables. Such statistical procedure can be done using the canonical correlation analysis (CCA). An example of this on health sciences (from Reference 2) is variables related to exercise and health. On one hand you have variables associated with exercise, observations such as the climbing rate on a stair stepper, how fast you can run, the amount of weight lifted on bench press, …