One of the problems we often deal with in statistics is the minimization of an objective function. In contrast to linear models, there is no analytical solution for models that are nonlinear in the parameters, such as logistic regression, neural networks, and nonlinear regression models (like the Michaelis-Menten model). In these situations we have to resort to mathematical programming, or optimization; one popular optimization algorithm is gradient descent, which we illustrate here. To start with, let's consider a simple function with a closed-form solution, given by
\begin{equation}
f(\beta) \triangleq \beta^4 - 3\beta^3 + 2.
\end{equation}
The plot below shows the minimum of the function at \hat{\beta}=\frac{9}{4} (the red vertical line).
We want to minimize this function with respect to \beta. The quick solution, as calculus teaches us, is to compute the first derivative of the function, that is
\begin{equation}
\frac{\text{d}f(\beta)}{\text{d}\beta}=4\beta^3-9\beta^2.
\end{equation}
Setting this to zero to obtain the stationary points gives us
\begin{align}
\frac{\text{d}f(\beta)}{\text{d}\beta}&\overset{\text{set}}{=}0\nonumber\\
4\hat{\beta}^3-9\hat{\beta}^2&=0\nonumber\\
\hat{\beta}^2(4\hat{\beta}-9)&=0\nonumber\\
\hat{\beta}&=\frac{9}{4},
\end{align}
where the other stationary point, \hat{\beta}=0, is discarded because the derivative does not change sign there (the function keeps decreasing through it), so it is not a minimum.
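As a quick check, R's built-in one-dimensional optimizer returns the same minimizer. This is only a verification sketch; the choice of optimize() and of the search interval is mine and not part of the original post.
# Numerical check of the analytical minimizer 9/4 = 2.25
f <- function(beta) beta^4 - 3 * beta^3 + 2
optimize(f, interval = c(-2, 4))
# $minimum is approximately 2.25 and $objective approximately -6.543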
library(ggplot2)
f <- function (beta) beta**4 - 3 * beta**3 + 2
beta <- seq(-2, 4, length.out = 100)
qplot(beta, f(beta), xlab = expression(beta), ylab = expression(f(beta)), geom = 'line') +
  geom_vline(xintercept = 9/4, col = 'red')
That was the analytical approach. Gradient descent instead searches for the minimum iteratively, as follows:
- Initialize \mathbf{x}_{r}, r=0
- while \lVert \mathbf{x}_{r}-\mathbf{x}_{r+1}\rVert > \nu
- \mathbf{x}_{r+1}\leftarrow \mathbf{x}_{r} - \gamma\nabla f(\mathbf{x}_r)
- r\leftarrow r + 1
- end while
- return \mathbf{x}_{r} and r
where \nabla f(\mathbf{x}_r) is the gradient of the cost function, \gamma is the learning-rate parameter of the algorithm, and \nu is the precision parameter. For the function above, let the initial guess be \hat{\beta}_0=4 and \gamma=.001 with \nu=.00001. Then \nabla f(\hat{\beta}_0)=112, so that
\hat{\beta}_1=\hat{\beta}_0-.001(112)=3.888.
And |\hat{\beta}_1 - \hat{\beta}_0| = 0.112 > \nu. Repeat the process until, at some r, |\hat{\beta}_{r}-\hat{\beta}_{r+1}| \ngtr \nu. It turns out that 350 iterations are needed to satisfy the desired inequality, with estimated minimum \hat{\beta}_{350}=2.250483\approx\frac{9}{4}; the steps are shown in the following figure.
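To make the update rule concrete before the full plotting scripts below, here is a minimal sketch of the loop as a reusable R function; the name gradient_descent and its argument names are my own and do not appear in the original scripts.
# Minimal gradient descent for a univariate function
# f_prime: derivative of the objective, beta0: initial guess,
# gamma: learning rate, nu: precision
gradient_descent <- function(f_prime, beta0, gamma, nu) {
  beta_old <- Inf; beta_new <- beta0; r <- 0
  while (abs(beta_new - beta_old) > nu) {
    beta_old <- beta_new
    beta_new <- beta_old - gamma * f_prime(beta_old)
    r <- r + 1
  }
  list(estimate = beta_new, iterations = r)
}
gradient_descent(function(b) 4 * b^3 - 9 * b^2, beta0 = 4, gamma = .001, nu = .00001)
# per the post: 350 iterations, estimate 2.250483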
R Script with Plot
Python Script
# Set the initial guess and other parameters
# (assumes ggplot2, f, and beta from the first snippet are already loaded)
beta_old <- 0; beta_new <- 4; gamma <- 0.001; precision <- 0.00001
# Derivative of the function f
f_prime <- function (beta) 4 * beta**3 - 9 * beta**2
# Plot
dframe <- data.frame(beta = rep(beta, 2), curves = c(f(beta), f_prime(beta)), type = rep(c("f", "f'"), each = 100))
p <- ggplot(data = dframe, aes(x = beta, y = curves, colour = type)) + geom_line() +
  geom_segment(aes(x = beta_new, y = f(beta_new), xend = beta_new, yend = f_prime(beta_new), colour = 'orange'), show.legend = TRUE) +
  xlab(expression(beta)) + ylab('y') +
  theme(legend.position = c(.2, .85),
        legend.background = element_rect(fill = 'gray')) +
  scale_colour_manual(
    name = 'Curves and Segments',
    labels = c(
      expression(f(beta)),
      expression(
        paste(frac(df(beta), paste('d', beta)), sep = "")
      ),
      expression(
        bar(paste(f(beta[r]), paste(f, "'", (beta[r]), sep = "")))
      )
    ),
    values = c('#FF033E', '#318CE7', '#FFBF00'),
    guide = guide_legend(direction = "horizontal", title.position = "top",
                         label.position = "right",
                         label.theme = element_text(angle = 0, size = 10))
  )
# Perform Gradient Descent
i <- 0 # Counter for number of iterations
p <- p + geom_segment(x = beta_new, y = f(beta_new), xend = beta_new, yend = f_prime(beta_new), colour = '#FFA812')
while (abs(beta_new - beta_old) > precision) {
  beta_old <- beta_new
  beta_new <- beta_old - gamma * f_prime(beta_old)
  p <- p + geom_segment(x = beta_new, y = f(beta_new), xend = beta_new, yend = f_prime(beta_new), colour = '#FFA812')
  i <- i + 1
}
# Print results
p; i; beta_new
# OUTPUT
[1] 350
[1] 2.250483
import numpy as np
# Set the initial guess and other parameters
beta_old = 0; beta_new = 4; gamma = 0.001; precision = 0.00001
# Derivative of the function f
def f_prime(beta):
    return 4 * beta**3 - 9 * beta**2
i = 0  # Counter for number of iterations
while np.abs(beta_new - beta_old) > precision:
    beta_old = beta_new
    beta_new = beta_old - gamma * f_prime(beta_old)
    i = i + 1
# Print Results
print(i)
print(beta_new)
# OUTPUT
350
2.25048291469
Obviously the convergence is slow, and we can adjust this by tuning the learning-rate parameter. For example, if we increase it to \gamma=.01 (change gamma to .01 in the codes above), the algorithm converges at the 42nd iteration. To support that claim, see the steps of its gradient in the plot below.
If we change the starting value from 4 to .1 (change beta_new to .1) while keeping \gamma=.01, the algorithm converges at the 173rd iteration with estimate \hat{\beta}_{173}=2.249962\approx\frac{9}{4} (see the plot below).
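Both experiments can be reproduced without the plotting code by reusing the gradient_descent helper sketched earlier (again, that helper is my own convenience function, not part of the original scripts); the iteration counts in the comments are the ones reported in this post.
f_prime <- function(beta) 4 * beta^3 - 9 * beta^2
# Larger learning rate, same starting value: converges in about 42 iterations
gradient_descent(f_prime, beta0 = 4, gamma = .01, nu = .00001)
# Same learning rate, smaller starting value: converges in about 173 iterations
gradient_descent(f_prime, beta0 = .1, gamma = .01, nu = .00001)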
Now let's consider another function, known as the Rosenbrock function, defined as
\begin{equation}
f(\mathbf{w})\triangleq(1 - w_1) ^ 2 + 100 (w_2 - w_1^2)^2.
\end{equation}
The gradient is
\begin{align}
\nabla f(\mathbf{w})&=[-2(1 - w_1) - 400(w_2 - w_1^2) w_1]\mathbf{i}+200(w_2-w_1^2)\mathbf{j}\nonumber\\
&=\left[\begin{array}{c}
-2(1 - w_1) - 400(w_2 - w_1^2) w_1\\
200(w_2-w_1^2)
\end{array}\right].
\end{align}
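Before using this gradient in the descent, it is worth verifying the algebra numerically. Here is a minimal R sketch; the helper names rosen, rosen_grad, and num_grad, the finite-difference step h, and the test point are my own choices.
# Rosenbrock function and its analytic gradient
rosen <- function(w) (1 - w[1])^2 + 100 * (w[2] - w[1]^2)^2
rosen_grad <- function(w) {
  c(-2 * (1 - w[1]) - 400 * (w[2] - w[1]^2) * w[1],
    200 * (w[2] - w[1]^2))
}
# Central finite-difference approximation of the gradient
num_grad <- function(f, w, h = 1e-6) {
  sapply(seq_along(w), function(j) {
    e <- replace(numeric(length(w)), j, h)
    (f(w + e) - f(w - e)) / (2 * h)
  })
}
w <- c(.5, .5)
rosen_grad(w)       # -51  50
num_grad(rosen, w)  # should agree to several decimal places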
Let the initial guess be \hat{\mathbf{w}}_0=\left[\begin{array}{c}-1.8\\-.8\end{array}\right], \gamma=.0002, and \nu=.00001. Then \nabla f(\hat{\mathbf{w}}_0)=\left[\begin{array}{c} -2914.4\\-808.0\end{array}\right], so that
\begin{equation}\nonumber
\hat{\mathbf{w}}_1=\hat{\mathbf{w}}_0-\gamma\nabla f(\hat{\mathbf{w}}_0)=\left[\begin{array}{c} -1.21712
\\-0.63840\end{array}\right].
\end{equation}
And \lVert\hat{\mathbf{w}}_0-\hat{\mathbf{w}}_1\rVert=0.6048666>\nu. Repeat the process until, at some r, \lVert\hat{\mathbf{w}}_r-\hat{\mathbf{w}}_{r+1}\rVert\ngtr \nu. It turns out that 23,374 iterations are needed to satisfy the desired inequality, with estimate \hat{\mathbf{w}}_{23374}=\left[\begin{array}{c} 0.9464841\\0.8956111\end{array}\right]; the contour plot is depicted in the figure below.
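As a quick sanity check on these numbers, the first update can be reproduced with the rosen_grad helper sketched above; the learning rate is the \gamma=.0002 from the text.
w0 <- c(-1.8, -.8)
rosen_grad(w0)                     # -2914.4  -808.0, matching the gradient above
w1 <- w0 - .0002 * rosen_grad(w0)
w1                                 # -1.21712 -0.63840
sqrt(sum((w0 - w1)^2))             # 0.6048666, still larger than nu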
R Script with Contour Plot
Python Script
Notice that I did not use ggplot2 for the contour plot; this is because the plot has to be updated 23,374 times just to accommodate the arrows tracing the trajectory of the gradient vectors, and ggplot2 is simply too slow for that. Finally, we can also visualize the gradient points on the surface, as shown in the following figure.
library(plot3D)
x <- seq(-2, 2, length.out = 15)
y <- seq(-1, 3, length.out = 15)
# Rosenbrock Function
f1 <- function (x, y) {
  out <- (1 - x) ** 2 + 100 * (y - x**2)**2
  return (out)
}
z <- outer(x, y, f1)
# Initial Guess
w_old <- matrix(c(0, 0)); w_new <- matrix(c(-1.8, -.8))
gamma <- .0002 # set the learning rate
precision <- .00001 # set the precision
# Partial derivatives of f1 with respect to w1 and w2
f1_primew1 <- function (w1, w2) {
  out <- -2 * (1 - w1) - 400 * (w2 - w1**2) * w1
  return (out)
}
f1_primew2 <- function (w1, w2) {
  out <- 200 * (w2 - w1**2)
  return (out)
}
# Gradient Vector
g_vec <- function (w1, w2) {
  matrix(c(f1_primew1(w1, w2), f1_primew2(w1, w2)))
}
# Contour Plot
contour2D(z = z, x = x, y = y, col = 'black', colkey = FALSE)
i <- 2; cx <- cy <- sx <- sy <- sz <- numeric()
cx[1] <- w_new[1, ]; cy[1] <- w_new[2, ]
# Perform Gradient Descent
while (norm(w_new - w_old, 'F') > precision) {
  w_old <- w_new
  w_new <- w_old - gamma * g_vec(w_old[1, ], w_old[2, ])
  cx[i] <- sx[i - 1] <- w_new[1, ]; cy[i] <- sy[i - 1] <- w_new[2, ]
  sz[i - 1] <- f1(sx[i - 1], sy[i - 1])
  arrows2D(cx[i - 1], cy[i - 1], cx[i], cy[i], add = TRUE)
  i <- i + 1
}
i - 2; w_new
# OUTPUT
[1] 23374
          [,1]
[1,] 0.9464841
[2,] 0.8956111
import numpy as np
# Initial Guess
w_old = np.array([0, 0]).reshape((2, 1))
w_new = np.array([-1.8, -.8]).reshape((2, 1))
gamma = .0002 # set the learning rate
nu = .00001 # set the precision
# Partial derivatives with respect to w1 and w2
def f1_primew1(w1, w2):
    return -2 * (1 - w1) - 400 * (w2 - w1**2) * w1
def f1_primew2(w1, w2):
    return 200 * (w2 - w1**2)
# Gradient Vector
def g_vec(w1, w2):
    return np.array([f1_primew1(w1, w2), f1_primew2(w1, w2)]).reshape((2, 1))
# Perform Gradient Descent
r = 0
while np.linalg.norm(w_old - w_new) > nu:
    w_old = w_new
    w_new = w_old - gamma * g_vec(w_old[0, :], w_old[1, :])
    r = r + 1
print(r)
print(w_new)
# OUTPUT
23374
[[ 0.94648413]
 [ 0.89561108]]
R Script
library(plot3D)
# (uses x, y, z, sx, sy, and sz computed by the contour-plot script above)
persp3D(x, y, z, col = ramp.col(n = 50, col = c("#FF033E", "#FFBF00", "#FF7E00", "#08E8DE", "#00FFFF", "#03C03C"), alpha = .1),
        border = "#808080", theta = 20, phi = 20, colkey = FALSE)
contour3D(x, y, z = 0, colvar = z, col = c("#FF7E00", "#FF033E"), alpha = .3, add = TRUE, colkey = FALSE)
points3D(sx, sy, sz, pch = 20, col = 'black', add = TRUE)

In a future blog post, I hope to apply this algorithm to statistical models like linear and nonlinear regression models for a simple illustration.