Loading web-font TeX/Math/Italic
Skip to main content

Python and R: Is Python really faster than R?

A friend of mine asked me to code the following in R:
  1. Generate samples of size 10 from Normal distribution with \mu = 3 and \sigma^2 = 5;
  2. Compute the \bar{x} and \bar{x}\mp z_{\alpha/2}\displaystyle\frac{\sigma}{\sqrt{n}} using the 95% confidence level;
  3. Repeat the process 100 times; then
  4. Compute the percentage of the confidence intervals containing the true mean.
 So here is what I got,

case <- function(n = 10, mu = 3, sigma = sqrt(5), p = 0.025, rep = 100){
m <- matrix(NA, nrow = rep, ncol = 4)
for(i in 1:rep){
norm <- rnorm(mean = mu, sd = sigma, n = n)
xbar <- mean(norm)
low <- xbar - qnorm(p = 1 - p) * (sigma/sqrt(n))
up <- xbar + qnorm(p = 1 - p) * (sigma/sqrt(n))
if((mu > low) && (mu < up)){
rem <- 1
} else {
rem <- 0
}
m[i, ] <- c(xbar, low, up, rem)
}
inside <- sum(m[, 4])
per <- inside / rep
desc <- paste("There are ", inside, " confidence intervals that contain ",
"the true mean (", mu, "), that is ", per, " percent of the total CIs", sep = "")
return(list(Matrix = m, Decision = desc))
}
view raw pyvsr1.R hosted with ❤ by GitHub
Staying with the default values, one would obtain

$Matrix
[,1] [,2] [,3] [,4]
[1,] 3.8195089 2.4336051 5.205413 1
[2,] 4.6144773 3.2285735 6.000381 0
[3,] 2.0731334 0.6872296 3.459037 1
[4,] 3.6641014 2.2781976 5.050005 1
[5,] 3.8393909 2.4534871 5.225295 1
: : : : :
: : : : :
[95,] 1.9025146 0.5166108 3.288418 1
[96,] 2.4069555 1.0210517 3.792859 1
[97,] 2.3141211 0.9282173 3.700025 1
[98,] 3.2102493 1.8243455 4.596153 1
[99,] 2.7873416 1.4014378 4.173245 1
[100,] 3.1422372 1.7563333 4.528141 1
$Decision
[1] "There are 95 confidence intervals that contain the true mean (3), that is 0.95 percent of the total CIs"
view raw pyvsr2.R hosted with ❤ by GitHub
The output is a list of Matrix and Decision, wherein the first column of the first list (Matrix) is the computed \bar{x}; the second and third columns are the lower and upper limits of the confidence interval, respectively; and the fourth column, is an array of ones -- if true mean is contained in the interval and zeros -- true mean not contained.

Now how fast it would be if I were to code this in Python?

I do have a prior knowledge that Python beats R in terms of speed (confirmed from Nathan's post), but out of curiosity I wasn't satisfied with that fact; and leads me to the following Python equivalent,

import numpy as np
import scipy.stats as ss
def case(n = 10, mu = 3, sigma = np.sqrt(5), p = 0.025, rep = 100):
m = np.zeros((rep, 4))
for i in range(rep):
norm = np.random.normal(loc = mu, scale = sigma, size = n)
xbar = np.mean(norm)
low = xbar - ss.norm.ppf(q = 1 - p) * (sigma / np.sqrt(n))
up = xbar + ss.norm.ppf(q = 1 - p) * (sigma / np.sqrt(n))
if (mu > low) & (mu < up):
rem = 1
else:
rem = 0
m[i, :] = [xbar, low, up, rem]
inside = np.sum(m[:, 3])
per = inside / rep
desc = "There are " + str(inside) + " confidence intervals that contain " \
"the true mean (" + str(mu) + "), that is " + str(per) + " percent of the total CIs"
return {"Matrix": m, "Decision": desc}
view raw pyvsr3.py hosted with ❤ by GitHub
Computing the elapsed time, we have
  • R
    system.time(case())
    #Output
    user system elapsed
    0.008 0.000 0.008
    view raw pyvsr4.R hosted with ❤ by GitHub
  • Python
    import time
    t0 = time.time()
    case()
    time.time() - t0
    #Output
    0.08859896659851074
    view raw pyvsr5.py hosted with ❤ by GitHub
As you can see, R executes at 0.008 seconds while Python runs at 0.089 seconds. I am surprised by this fact! I mean, what is happening with my Python? Firing up to 100000 repetitions,

system.time(case(rep = 100000))
#Output
user system elapsed
7.076 0.000 7.066
view raw pyvsr6.R hosted with ❤ by GitHub
and Python,

import time
t0 = time.time()
case(rep = 100000)
time.time() - t0
#Output
64.88077402114868
view raw pyvsr7.py hosted with ❤ by GitHub
Gets even worst! 64 seconds over 7 seconds? That's a huge difference. I don't know what is happening here, but I did my best to literally translate the R codes to Python, and yet R?

Any thoughts guys, especially to the Python gurus?

UPDATE:

I just want to include some great suggestions from the comments below. From Chad Fulton, the above python code can be optimized into the following:

import numpy as np
import scipy.stats as ss
def case2(n = 10, mu = 3, sigma = np.sqrt(5), p = 0.025, rep = 100):
scaled_crit = ss.norm.ppf(q = 1 - p) * (sigma / np.sqrt(n))
norm = np.random.normal(loc = mu, scale = sigma, size = (rep, n))
xbar = norm.mean(1)
low = xbar - scaled_crit
up = xbar + scaled_crit
rem = (mu > low) & (mu < up)
m = np.c_[xbar, low, up, rem]
inside = np.sum(m[:, 3])
per = inside / rep
desc = "There are " + str(inside) + " confidence intervals that contain " \
"the true mean (" + str(mu) + "), that is " + str(per) + " percent of the total CIs"
return {"Matrix": m, "Decision": desc}
view raw pyvsr8.py hosted with ❤ by GitHub
Translated to the proceeding R code by Willem Ligtenberg

case2 <- function(n = 10, mu = 3, sigma = sqrt(5), p = 0.025, rep = 100){
scaledCrit <- qnorm(p = 1 - p) * (sigma/sqrt(n))
norm <- matrix(data = rnorm(mean = mu, sd = sigma, n = n*rep), ncol = n, nrow = rep)
xbar <- rowMeans(norm)
low <- xbar - scaledCrit
up <- xbar + scaledCrit
rem <- (mu > low) & (mu < up)
m <- cbind(xbar, low, up, rem)
inside <- sum(m[, 4])
per <- inside / rep
desc <- paste("There are ", inside, " confidence intervals that contain ",
"the true mean (", mu, "), that is ", per, " percent of the total CIs", sep = "")
return(list(Matrix = m, Decision = desc))
}
view raw pyvsr9.R hosted with ❤ by GitHub
And another version by wiekvoet using data frame,

case3 <- function(n = 10, mu = 3, sigma = sqrt(5), p = 0.025, rep = 100){
xbar <- rowMeans(
matrix(rnorm(mean = mu, sd = sigma, n = n*rep),nrow=rep)
)
q <- qnorm(p = 1 - p) * (sigma/sqrt(n))
m <- data.frame(
xbar,
low=xbar+q,
high=xbar-q,
rem= abs(mu-xbar) < q )
inside <- sum(m$rem)
per <- inside / rep
desc <- paste("There are ", inside, " confidence intervals that contain ",
"the true mean (", mu, "), that is ", per, " percent of the total CIs", sep = "")
return(list(Matrix = m, Decision = desc))
}
view raw pyvsr10.R hosted with ❤ by GitHub
Taking the elapsed time, we have

import time
t0 = time.time()
case2(rep = 100000)
time.time() - t0
#Output
0.1463780403137207
view raw pyvsr11.py hosted with ❤ by GitHub
And in R,

system.time(case2(rep = 100000))
#Output
user system elapsed
0.296 0.016 0.315
system.time(case3(rep = 100000))
#Output
user system elapsed
0.300 0.024 0.322
view raw pyvsr12.R hosted with ❤ by GitHub
Python :D