Thursday, 28 November 2013

Book Review: Practical Data Analysis by Hector Cuesta

I have been reading this book since last week, and now I want to share my thoughts about it. I was excited to review this because I've never heard most of the tools it features, like OpenRefine, MongoDB, and MapReduce. The book has 360 pages and surprisingly it covers a lot of topics. Along with that, is the Github repository for all the codes. 

Practical Data Analysis is all about applications of statistical methodologies on computer science. I find it very useful since this was not taught in my statistics class. In college, we only practice statistics on fields like sociology, psychology, agriculture, economics, chemistry, biology, industrial engineering, and many others, but we were not onto computer science, we only deal with it when coding in R or SAS. Hal Varian once said in this video that,

. . . we've got at least hundred statisticians on Google . . .

And I was curious about that, I mean, what are they doing on Google? What are the statistical tools do they use? Thanks to this book, Hector Cuesta utilized Dynamic Time Warping (DTW) for illustrating the image similarity search which is used by Google for searching images, by using time series for comparing the distance between the photo pixels; another is classifying spam from not spam emails based on the subject line of the messages, where he demonstrates the application of Na├»ve Bayes algorithm for text classification (isn't that cool?); he also talk about Kernel Ridge Regression for predicting gold price using time series; the Principal Component Analysis (PCA) and Multidimensional Scaling (MDS) for dimensionality reduction; and then on the later chapters, it's all about "Hacking" just as what John D. Cook described on his review. Hacking data from social networking sites like facebook and twitter, how to visualize these using Gephi and make an analysis about it. 

Further, I also learned new visualization tool, which is making some noise on R-bloggers, that is the D3.js, I am loving it now. The brief introduction on every topic is well-explained, without needing to google it for more readings; and the step-by-step procedure for programming is easy to follow.

Sunday, 24 November 2013

R: Mapping Super Typhoon Yolanda (Haiyan) Track

After reading Enrico Tonini post, I decided to map the super typhoon Haiyan track using OpenStreetMap, maptools, and ggplot2. If mapping with googleVis was possible with 13 lines only, that can also be achieved with the packages I used; but because I play with the aesthetic of the map, thus I got more than that. The data was taken from Weather Underground, and just to be consistent with the units from JMA Best Track Data, which I utilized for mapping typhoon Labuyo (Utor), the wind speed in miles per hour (mph) was converted to knots. So here is the final output,

Labels:
  • TD - Tropical Depression
  • TS - Tropical Storm
  • TY - Typhoon
  • STY - Super Typhoon
As for the super typhoon, please don't visit my country again. I would like to thank all who prayed for Philippines, especially for countries who helped us recover from this tragedy.

Codes:

Friday, 15 November 2013

Python: Venn Diagram

Venn Diagram is very useful for visualizing operations between events/sets. So in this post, we will learn how to visualize one in Python. First, we need to install the module matplotlib-venn. Open the terminal or command prompt, and run the following code:

Now that we have it, here are the three set operations I visualized:
$A\cup B$
$A\cap B$

$A^c\cap B$

Wednesday, 6 November 2013

R: Mapping Philippine Earthquakes (October 2013)

Last month, October 15, 2013 around 8:12 am (Philippine Time), a magnitude 7.2 earthquake hit Bohol island, detroying several infrastructures and killing hundreds of residents. The Philippine Institute of Volcanology and Seismology or PhiVolcs recorded more than 3000 aftershocks, but only a fraction of these is available in their Earthquake Bulletin. There are 448 data points in total for last month's earthquakes, and here is the final output,
Adding layer for two-dimensional density of the data points, we have