New Toy: SAS® University Edition

I recently started using SAS® University Edition, a FREE version of SAS® software. Again, it's FREE, and that's the main reason I want to relearn the language. The software was announced on March 24, 2014, and the download became available in May of that year. For that, I salute Dr. Jim Goodnight. At last we can learn SAS® without paying the expensive price tag, which matters especially for a single user like me.

The software runs inside a virtual machine and requires a 64-bit processor. To install it, just follow the instructions in this video. Although the installation in the video is done on Windows, it also works on a Mac. Below is a screenshot of my SAS® Studio running in Safari.

What's in the box?

The software includes the following components:

Base SAS® - Make programming fast and easy with the SAS® programming language, ODS graphics and reporting procedures;
SAS/STAT® - Trust SAS® proven reliability with a wide variety of statistical methods and techniques;
SAS/IML® - Use this matrix programming language for more specialized analyses and data exploration;
SAS Studio - Reduce your programming time with autocomplete for hundreds of SAS® statements and procedures, as well as built-in syntax help;
SAS/ACCESS® - Seamlessly connect with your data, no matter where it resides.

For more about SAS® University Edition, please refer to the fact sheet.

If you've been following this blog, you know I have been promoting free software (R, Python, and C/C++) for analysis. The introduction of SAS® University Edition means one thing: a new topic to discuss in succeeding posts. So let's welcome this software by doing some analysis with it.

Analysis

Our goal here is to cover the basics needed to proceed with an analysis: 1. Importing and transforming the data; 2. Descriptive statistics; 3. Hypothesis testing: one-sample t test; 4. Creating a function; and 5. Visualization.

Data

We'll again use the quarterly Volume of Palay Production (1994 to 2013) from the Cordillera Administrative Region (CAR), Philippines. To reproduce this article, please click here to download the data.

Importing and transforming the data
Working in SAS® Studio requires you to upload your data into it. To do this, go to the sidebar, click on the Folders tab, and there you will find the "up arrow" for upload. See the picture below.

You are now set to import the data. In my case, the location of the uploaded data, as seen in the picture above, is "/folders/myfolders/palay.csv".

In SAS®, proc stands for procedure; in this case we run the import procedure. out names the SAS® data set to create; here we save it in the "Work" library under the name "palay". getnames determines whether to generate SAS® variable names from the data values in the first record of the imported file. Finally, datarow starts reading data from the specified row number in the delimited text file.
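
The import step described above might look like the following minimal sketch. The file path, out, getnames, and datarow come from the text; dbms=csv, the replace option, and the value datarow=2 (assuming the first row of the file holds the column names) are my assumptions.

```sas
/* Import the uploaded CSV into the Work library as the data set "palay". */
proc import datafile="/folders/myfolders/palay.csv"
    out=work.palay   /* output data set: library Work, name palay */
    dbms=csv         /* assumption: the file is comma-delimited */
    replace;         /* assumption: overwrite work.palay if it already exists */
    getnames=yes;    /* take variable names from the first record */
    datarow=2;       /* assumption: data values start on the second row */
run;
```

After this runs, work.palay should appear under Libraries in the sidebar of SAS® Studio.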

I want to emphasize that descriptions of the arguments of the statements and procedures above are available in the software itself. SAS® Studio's autocomplete for hundreds of SAS® statements and procedures is very handy, so in the code that follows we will describe selected statements only. Below is the autocomplete feature of SAS® Studio in action.

Now that we have the data in our workspace, let's do some transformation on it. In R, we always start by viewing the head of the data, that is, its first few observations, and we code it as head(data). Keeping that habit, we can do the same in SAS®. Here are the first five observations:

Obs	Abra	Apayao	Benguet	Ifugao	Kalinga	Mt_Province
1	1243	2934	148	3300	10553	2675
2	4158	9235	4287	8063	35257	1920
3	1787	1922	1955	1074	4544	6955
4	17152	14501	3536	19607	31687	2715
5	1266	2385	2530	3315	8520	2601
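
A listing like the one above can be produced with the print procedure and the obs= data set option. This is a minimal sketch, assuming the data was imported as work.palay.

```sas
/* Print the first five observations, similar to head(data) in R. */
proc print data=work.palay(obs=5);
run;
```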

If you want to start and end on specific rows, you can set both the first and the last observation to read. In this case, we go from the 5th row to the 10th row:

Obs	Abra	Apayao	Benguet	Ifugao	Kalinga	Mt_Province
5	1266	2385	2530	3315	8520	2601
6	5576	7452	771	13134	28252	1242
7	927	1099	2796	5134	3106	9145
8	21540	17038	2463	14226	36238	2465
9	1039	1382	2592	6842	4973	2624
10	5424	10588	1064	13828	40140	1237
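
The row range above can be sketched with the firstobs= and obs= data set options, again assuming the data set is work.palay. Note that obs= gives the number of the last observation to read, not a count of rows.

```sas
/* Print observations 5 through 10. */
proc print data=work.palay(firstobs=5 obs=10);
run;
```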

Now, what about playing with the variables of the data? Say we want to view a specific column only, for example observations 15 to 20 of the Benguet variable. How do we do that? Well, I humbly present to you the following:

Obs	Benguet
15	2847
16	2942
17	2119
18	734
19	2302
20	2598
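
One way to get the single-column listing above is to combine the row options with keep=. This is a sketch under the assumption that the data set is work.palay; a var Benguet; statement inside proc print would be an equally valid alternative.

```sas
/* Print observations 15 through 20 of the Benguet variable only. */
proc print data=work.palay(firstobs=15 obs=20 keep=Benguet);
run;
```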

To view multiple columns, simply enumerate the variable names using either keep, which keeps the listed variables, or drop, which excludes the listed variables from the printed output.

Obs	Abra	Apayao	Benguet	Ifugao	Kalinga
15	1048	1427	2847	5526	4402
16	25679	15661	2942	14452	33717
17	1055	2191	2119	5882	7352
18	5437	6461	734	10477	24494
19	1029	1183	2302	6438	3316
20	23710	12222	2598	8446	26659
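
Both routes to the listing above can be sketched as follows, assuming the data set is work.palay. The two steps should print the same five columns.

```sas
/* keep=: name the variables to print. */
proc print data=work.palay(firstobs=15 obs=20
                           keep=Abra Apayao Benguet Ifugao Kalinga);
run;

/* drop=: name the variable to leave out; everything else is printed. */
proc print data=work.palay(firstobs=15 obs=20 drop=Mt_Province);
run;
```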

I think the above are enough demonstrations of data transformation.

Perform descriptive statistics
As always, the next step is to look at the descriptive statistics of the data:

Variable	N	Mean	Std Dev	Minimum	Maximum
Abra	79	12874.38	16746.47	927.0000000	60303.00
Apayao	79	16860.65	15448.15	401.0000000	54625.00
Benguet	79	3237.39	1588.54	148.0000000	8813.00
Ifugao	79	12414.62	5034.28	1074.00	21031.00
Kalinga	79	30446.42	22245.71	2346.00	68663.00
Mt_Province	79	4506.20	3815.71	382.0000000	13038.00
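
The summary table above has the columns that the means procedure prints by default (N, Mean, Std Dev, Minimum, Maximum), so a minimal sketch, assuming the data set is work.palay, could be:

```sas
/* Default descriptive statistics for every numeric variable. */
proc means data=work.palay;
run;
```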

If you want to view fewer or more statistics, you can list the statistic keywords you need in the procedure statement.
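
As a hedged illustration, the sketch below asks proc means for a custom set of statistics. The particular keywords chosen here are my own selection, not necessarily the ones used for this post, and the data set is assumed to be work.palay.

```sas
/* Request specific statistics by listing their keywords. */
proc means data=work.palay n mean median std min max skewness kurtosis;
    var Abra Apayao Benguet;   /* optional: restrict to selected variables */
run;
```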

We'll end this section with a scatter plot matrix.

As a quick analysis based on the scatter plot matrix, we see a strong positive relationship between Kalinga and Apayao, and a relationship between Ifugao and Benguet.
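
A scatter plot matrix like the one discussed above can be drawn with the sgscatter procedure. This is a minimal sketch assuming the data set is work.palay; the diagonal histograms are an optional extra of mine.

```sas
/* Scatter plot matrix of all six provinces. */
proc sgscatter data=work.palay;
    matrix Abra Apayao Benguet Ifugao Kalinga Mt_Province
        / diagonal=(histogram);   /* optional: histograms on the diagonal */
run;
```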

Hypothesis testing: One-sample t test
Let's perform a simple hypothesis test, the one-sample t test. Using a 0.05 level of significance, we'll test whether the true mean of Abra is not equal to 15000.

N	Mean	Std Dev	Std Err	Minimum	Maximum
79	12874.4	16746.5	1884.1	927.0	60303.0

Mean	95% CL Mean	Std Dev	95% CL Std Dev
12874.4	9123.4	16625.4	16746.5	14480.9	19859.1

DF	t Value	Pr > |t|
78	-1.13	0.2627

From the above numerical output, we see that the p-value = 0.2627 is greater than $\alpha = 0.05$, hence there is not sufficient evidence to conclude that the average volume of palay production in Abra is not equal to 15000. Graphically, the observations of the Abra variable do not appear to be normally distributed based on the Q-Q plot. Reading a Q-Q plot is subjective, but here the points clearly deviate from the line.
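
The test above can be sketched with the ttest procedure. The null value 15000 and the 0.05 level come from the text; the data set name work.palay and the explicit sides=2 (which is already the default) are assumptions of mine.

```sas
/* One-sample, two-sided t test of H0: mean of Abra = 15000. */
proc ttest data=work.palay h0=15000 alpha=0.05 sides=2;
    var Abra;
run;
```

As a sanity check, the reported t value follows directly from the printed summary: $$ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{12874.4 - 15000}{1884.1} \approx -1.13 $$ where 1884.1 is the standard error $16746.5/\sqrt{79}$.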
Creating a function
Let's create a function using the fcmp procedure. For illustration purposes, consider the standard normal density function, $$ \phi(x) = \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{x^2}{2}\right\} $$ In SAS®, this can be coded with fcmp.
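
Here is a minimal sketch of such a definition. The function name stdnorm is the one used later in this post; the storage location work.funcs.math is a hypothetical name chosen for illustration.

```sas
/* Define stdnorm(x), the standard normal density, with proc fcmp. */
proc fcmp outlib=work.funcs.math;   /* hypothetical library.dataset.package */
    function stdnorm(x);
        pi = constant("pi");
        return (exp(-(x**2) / 2) / sqrt(2 * pi));
    endsub;
run;

/* Tell SAS where to look for user-defined functions. */
options cmplib=work.funcs;
```

The cmplib= option has to point at the same data set named in outlib=, otherwise later steps will not find stdnorm.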

To generate data from this function using a do loop, consider the following:

Obs	x	y
1	-5.0	.000001487
2	-4.9	.000002439
3	-4.8	.000003961
4	-4.7	.000006370
5	-4.6	.000010141
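
Output like the above can be generated with a data step along these lines. The names sn_data and stdnorm come from this post; the upper bound of 5 for the loop and the library work.funcs are assumptions, and the sketch presumes stdnorm has already been compiled into that library.

```sas
/* Assumption: stdnorm was stored in work.funcs by proc fcmp. */
options cmplib=work.funcs;

/* Evaluate the function on a grid starting at -5 in steps of 0.1. */
data sn_data;
    do x = -5 to 5 by 0.1;   /* assumption: the grid ends at 5 */
        y = stdnorm(x);
        output;              /* write one observation per iteration */
    end;
run;

/* Show the first five generated observations. */
proc print data=sn_data(obs=5);
run;
```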

And that's how you create and use a function in SAS®. For me, the function definition procedure fcmp is the best procedure included in SAS® version 9.2, and I'm just lucky to be relearning this language with this feature available, especially since it is FREE in SAS® Studio.
Visualization
Now it's time for us to create some visual art, and SAS®, being proprietary software, has a lot to offer. We've demonstrated a few plots above already; this time let's plot the data points of sn_data generated from the stdnorm function we defined earlier.
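
A minimal sketch of that plot with the sgplot procedure, assuming sn_data holds the variables x and y as shown in the earlier output:

```sas
/* Plot the generated points of the standard normal density. */
proc sgplot data=sn_data;
    scatter x=x y=y;
run;
```

Swapping scatter for series would connect the points into a smooth-looking curve instead.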

For other types of plots, simply go to the Snippets tab in the sidebar of SAS® Studio, and there you will find template code for different types of plots. See the picture below.

I will end this section with a histogram and a series plot.
- Histogram
- Historical
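
Both plots can be sketched with sgplot. The choice of the Abra variable, the data set name palay_ts, and the index variable quarter_index are hypothetical; the imported data shows no date column, so the observation number stands in for the quarter.

```sas
/* Histogram of Abra with a fitted normal curve overlaid. */
proc sgplot data=work.palay;
    histogram Abra;
    density Abra / type=normal;
run;

/* Add a running index to use as the time axis (hypothetical names). */
data palay_ts;
    set work.palay;
    quarter_index = _n_;   /* 1 = first quarter in the file */
run;

/* Series plot of Abra over time. */
proc sgplot data=palay_ts;
    series x=quarter_index y=Abra;
run;
```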

Conclusion

In conclusion, it wasn't difficult for me to relearn SAS®, not only because I used it for a few papers back in college, but also because I have a programming background in R and Python, which I used as a basis for understanding the grammar of the language. Overall, the SAS® language is a high-level language: as we saw above, a simple statement gives you complete results with graphics, without lengthy code. And although I use R and Python as my primary tools for research, I am happy to add SAS® to them. Despite the popularity of R in analysis, I look forward to seeing more learners, students, researchers, and even bloggers using SAS®. That way, we can share ideas and techniques between the R, SAS®, and Python communities.

What about you? How's your experience with SAS® University Edition?

Data Source

Philippine Bureau of Agricultural Statistics

Reference

SAS® Documentation
r4stats.com: Data Import. From http://r4stats.com/examples/data-import/ (accessed January 15, 2015)
SAS Learning Module: Subsetting data in SAS. From http://www.ats.ucla.edu/stat/sas/modules/subset.htm (accessed January 15, 2015)
