Thursday, 15 January 2015

New Toy: SAS® University Edition

I recently started using SAS® University Edition, which is a FREE version of SAS® software. Again, it's FREE, and that's the main reason I want to relearn the language. The software was announced on March 24, 2014, and the download became available in May of that year. For that, I salute Dr. Jim Goodnight. At least we can learn SAS® without paying the hefty price tag, which matters especially for a single user like me.

The software runs on top of a virtual machine and requires a 64-bit processor. To install it, just follow the instructions in this video. Although the installation in the video is done on Windows, it also works on Mac. Below is a screenshot of my SAS® Studio running in Safari.

What's in the box?

The software includes the following libraries:
  1. Base SAS® - Make programming fast and easy with the SAS® programming language, ODS graphics and reporting procedures;
  2. SAS/STAT® - Trust SAS® proven reliability with a wide variety of statistical methods and techniques;
  3. SAS/IML® - Use this matrix programming language for more specialized analyses and data exploration;
  4. SAS Studio - Reduce your programming time with autocomplete for hundreds of SAS® statements and procedures, as well as built-in syntax help;
  5. SAS/ACCESS® - Seamlessly connect with your data, no matter where it resides.
For more about SAS® University Edition please refer to the fact sheet.

If you've been following this blog, you know that I have been promoting free software (R, Python, and C/C++) for analysis. The introduction of SAS® University Edition means only one thing: a new topic to discuss in succeeding posts. So let's welcome this software by doing some analysis with it.

Analysis

Our goal here is to cover the basics needed to proceed with an analysis: 1. Importing and transforming the data; 2. Descriptive statistics; 3. Hypothesis testing: One-sample t test; 4. Creating a function; and 5. Visualization.

Data

We'll again use the Volume of Palay Production (1994 to 2013, quarterly) from the Cordillera Administrative Region (CAR), Philippines. To reproduce this article, please click here to download the data.
  1. Importing and transforming the data
    Working in SAS® Studio requires you to upload your data first. To do this, go to the sidebar, click on the Folders tab, and there you will find the "up arrow" for upload. See the picture below.
    You are now set to import the data using the following code. In my case, the location of the uploaded data, as seen in the above photo, is "/folders/myfolders/palay.csv",
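    A minimal sketch of such an import step, assuming the file path mentioned above, a header in the first row, and data starting on the second row (the replace option and the datarow value are assumptions):

    ```sas
    /* Import the uploaded CSV into the WORK library as the data set "palay". */
    proc import datafile="/folders/myfolders/palay.csv"
        out=work.palay    /* output SAS data set */
        dbms=csv          /* the source is a comma-delimited file */
        replace;          /* assumption: overwrite work.palay if it already exists */
        getnames=yes;     /* take variable names from the first record */
        datarow=2;        /* assumption: data values start on row 2 */
    run;
    ```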

    In SAS®, proc refers to a procedure; in this case we perform the import procedure. out names the output SAS® data set; here we save it in the "Work" library under the name "palay". getnames determines whether to generate SAS® variable names from the data values in the first record of the imported file. Finally, datarow starts reading data from the specified row number in the delimited text file.

    I want to emphasize that descriptions of the arguments of the statements and procedures above are available in the software itself. SAS® Studio's autocomplete for hundreds of SAS® statements and procedures is very handy, so in the code that follows we will describe selected statements only. Below is the autocomplete feature of SAS® Studio in action,
    Now that we have the data in our workspace, let's do some transformation on it. In R, we always start by viewing the head of the data, that is, the first few observations, which we code as head(data). Keeping that habit, here's how to do it in SAS®, in this case for the first five observations,
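    One way to get this head-like view, sketched here under the assumption that the data set is work.palay, is the obs= data set option on proc print:

    ```sas
    /* Print only the first five observations, similar to head(data) in R. */
    proc print data=work.palay(obs=5);
    run;
    ```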

    Obs Abra Apayao Benguet Ifugao Kalinga Mt_Province
    1 1243 2934 148 3300 10553 2675
    2 4158 9235 4287 8063 35257 1920
    3 1787 1922 1955 1074 4544 6955
    4 17152 14501 3536 19607 31687 2715
    5 1266 2385 2530 3315 8520 2601
    If you want to start and end at specific rows, you can do the following. In this case, from the 5th row to the 10th row:
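    A possible sketch, again assuming the data set work.palay: firstobs= gives the first row to read, and obs= gives the number of the last row to read (not a count of rows).

    ```sas
    /* Print observations 5 through 10. */
    proc print data=work.palay(firstobs=5 obs=10);
    run;
    ```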

    Obs Abra Apayao Benguet Ifugao Kalinga Mt_Province
    5 1266 2385 2530 3315 8520 2601
    6 5576 7452 771 13134 28252 1242
    7 927 1099 2796 5134 3106 9145
    8 21540 17038 2463 14226 36238 2465
    9 1039 1382 2592 6842 4973 2624
    10 5424 10588 1064 13828 40140 1237
    Now, what about playing with the variables of the data? Say we want to view a specific column only, for example observations 15 to 20 of the Benguet variable. How is that done? Well, I humbly present to you the following code,
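    A minimal sketch of one way to do this, assuming the data set work.palay; the var statement is my choice here for selecting the column:

    ```sas
    /* Print observations 15 through 20 of the Benguet variable only. */
    proc print data=work.palay(firstobs=15 obs=20);
        var Benguet;
    run;
    ```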

    Obs Benguet
    15 2847
    16 2942
    17 2119
    18 734
    19 2302
    20 2598
    To view multiple columns, simply enumerate the variable names using either keep, which keeps the listed variables in the output, or drop, which excludes the listed variables from the printing.
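    Both options can be sketched as follows, assuming the data set work.palay and the same rows 15 to 20 as before. The two steps should print the same five columns:

    ```sas
    /* keep=: list the variables to print. */
    proc print data=work.palay(firstobs=15 obs=20
                               keep=Abra Apayao Benguet Ifugao Kalinga);
    run;

    /* drop=: list the variables to leave out; everything else is printed. */
    proc print data=work.palay(firstobs=15 obs=20 drop=Mt_Province);
    run;
    ```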

    Obs Abra Apayao Benguet Ifugao Kalinga
    15 1048 1427 2847 5526 4402
    16 25679 15661 2942 14452 33717
    17 1055 2191 2119 5882 7352
    18 5437 6461 734 10477 24494
    19 1029 1183 2302 6438 3316
    20 23710 12222 2598 8446 26659
    I think the demonstrations above are enough for data transformation.
  2. Perform descriptive statistics
    As always, the next step is to look at the descriptive statistics of the data, and here's how to do it,
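    A minimal sketch, assuming the data set work.palay. With no statistics listed, proc means reports N, mean, standard deviation, minimum, and maximum for every numeric variable:

    ```sas
    /* Default descriptive statistics for all numeric variables. */
    proc means data=work.palay;
    run;
    ```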

    Variable N Mean Std Dev Minimum Maximum
    Abra 79 12874.38 16746.47 927.0000000 60303.00
    Apayao 79 16860.65 15448.15 401.0000000 54625.00
    Benguet 79 3237.39 1588.54 148.0000000 8813.00
    Ifugao 79 12414.62 5034.28 1074.00 21031.00
    Kalinga 79 30446.42 22245.71 2346.00 68663.00
    Mt_Province 79 4506.20 3815.71 382.0000000 13038.00
    In case you want to view fewer or more statistics, you can try
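    For instance, statistic keywords can be listed on the proc means statement. The particular selection below is only an illustration, and the data set name work.palay is assumed:

    ```sas
    /* Request a custom set of statistics instead of the defaults. */
    proc means data=work.palay n mean median std var min max range;
    run;
    ```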

    We'll end this section with the following scatter plot matrix,
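    One way to draw such a matrix, sketched here under the assumption that the data set is work.palay, is the matrix statement of proc sgscatter:

    ```sas
    /* Scatter plot matrix of the six provinces. */
    proc sgscatter data=work.palay;
        matrix Abra Apayao Benguet Ifugao Kalinga Mt_Province;
    run;
    ```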
    A quick analysis of the scatter plot matrix above shows a strong positive relationship between Kalinga and Apayao, and a relationship between Ifugao and Benguet.
  3. Hypothesis testing: One-sample t test
    Let's perform a simple hypothesis test, the one-sample t test. Using a 0.05 level of significance, we'll test whether the true mean of Abra is not equal to 15000.
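    A minimal sketch of this test with proc ttest, assuming the data set work.palay. h0= sets the hypothesized mean, alpha= the significance level, and sides=2 requests the two-sided alternative:

    ```sas
    /* Two-sided one-sample t test of H0: mean of Abra = 15000. */
    proc ttest data=work.palay h0=15000 alpha=0.05 sides=2;
        var Abra;
    run;
    ```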

    N Mean Std Dev Std Err Minimum Maximum
    79 12874.4 16746.5 1884.1 927.0 60303.0
    Mean 95% CL Mean Std Dev 95% CL Std Dev
    12874.4 9123.4 16625.4 16746.5 14480.9 19859.1
    DF t Value Pr > |t|
    78 -1.13 0.2627
    From the above numerical output, we see that the p-value = 0.2627 is greater than $\alpha = 0.05$, hence there is insufficient evidence to conclude that the average volume of palay production is not equal to 15000. Graphically, the observations of the Abra variable do not appear to be normally distributed based on the Q-Q plot; that judgment is subjective, but the points clearly deviate from the line.
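    The t value in the output above can be checked by hand from the reported mean, standard error, and the hypothesized mean $\mu_0 = 15000$: $$ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{12874.4 - 15000}{1884.1} \approx -1.13, $$ where $s/\sqrt{n} = 16746.5/\sqrt{79} \approx 1884.1$ is the standard error and the degrees of freedom are $n - 1 = 78$.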
  4. Creating a function
    Let's create a function using the fcmp procedure. For illustration purposes, consider the standard normal density function, $$ \phi(x) = \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{x^2}{2}\right\} $$ In SAS® we code it as follows,
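    A minimal sketch of such a definition. The function name stdnorm is the one used later in this post, while the storage location work.funcs.math is an assumption made for illustration:

    ```sas
    /* Define stdnorm(x), the standard normal density, and store it in */
    /* the package "math" of the data set work.funcs (assumed names).  */
    proc fcmp outlib=work.funcs.math;
        function stdnorm(x);
            pi = constant('pi');
            return (exp(-(x**2) / 2) / sqrt(2 * pi));
        endsub;
    run;

    /* Tell SAS where to look for user-defined functions. */
    options cmplib=work.funcs;
    ```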

    To generate data from this function using a do loop, consider the following:
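    A sketch of such a loop, assuming stdnorm has already been compiled with proc fcmp into work.funcs (an assumed location) and that x runs from -5 to 5 in steps of 0.1 (the upper limit is an assumption):

    ```sas
    /* Make the compiled function library visible to the data step. */
    options cmplib=work.funcs;

    /* Evaluate stdnorm on a grid of x values, one observation per value. */
    data sn_data;
        do x = -5 to 5 by 0.1;
            y = stdnorm(x);
            output;
        end;
    run;

    /* Show the first five generated observations. */
    proc print data=sn_data(obs=5);
    run;
    ```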

    Obs x y
    1 -5.0 .000001487
    2 -4.9 .000002439
    3 -4.8 .000003961
    4 -4.7 .000006370
    5 -4.6 .000010141
    And that's how you create and use a function in SAS®. For me, the function definition procedure fcmp is the best procedure to have been included in SAS® version 9.2, and I'm just lucky to be relearning this language with this feature available, especially since it is FREE in SAS® Studio.
  5. Visualization
    Now it's time for us to create some visual art, and SAS®, being proprietary software, has a lot to offer. We've demonstrated a few plots above already; this time let's plot the data points of sn_data generated from the stdnorm function we defined earlier. Here it is,
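    One way to draw these points, sketched with proc sgplot and assuming sn_data contains the variables x and y:

    ```sas
    /* Scatter plot of the generated (x, y) points. */
    proc sgplot data=sn_data;
        scatter x=x y=y;
    run;
    ```

    Replacing scatter with series would connect the points into a smooth curve instead.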
    For other types of plots, simply go to the Snippets tab in the sidebar of SAS® Studio, and there you will find template code for different types of plots. See the picture below,
    I will end this section with a histogram and a series plot.
    • Histogram
    • Historical
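    Both plots can be sketched with proc sgplot. The choice of the Abra variable, the helper data set palay_ts, and the index variable quarter are assumptions made for illustration:

    ```sas
    /* Histogram of Abra with a kernel density overlay. */
    proc sgplot data=work.palay;
        histogram Abra;
        density Abra / type=kernel;
    run;

    /* Add a running quarter index so the series has an x axis. */
    data palay_ts;
        set work.palay;
        quarter = _n_;
    run;

    /* Series (historical) plot of Abra over the quarters. */
    proc sgplot data=palay_ts;
        series x=quarter y=Abra;
    run;
    ```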

Conclusion

In conclusion, it wasn't difficult for me to relearn SAS®, not only because I used it in a few papers back in college, but also because I have a programming background in R and Python, which I used as a basis for understanding the grammar of the language. Overall, the SAS® language is a high-level language: as we saw above, a simple statement gives you complete results with graphics, without lengthy code. And although I use R and Python as my primary tools for research, I am happy to add SAS® to them. Despite the popularity of R in analysis, I look forward to seeing more learners, students, researchers, and even bloggers using SAS®. That way, we can share ideas and techniques among the R, SAS®, and Python communities.

What about you? How's your experience with SAS® University Edition?

Data Source

Reference

  1. SAS® Documentation
  2. r4stats.com: Data Import. From http://r4stats.com/examples/data-import/ (accessed January 15, 2015)
  3. SAS Learning Module: Subsetting data in SAS. From http://www.ats.ucla.edu/stat/sas/modules/subset.htm (accessed January 15, 2015)
