I have been reading this book since last week, and now I want to share my thoughts about it. I was excited to review it because I had never heard of most of the tools it features, such as OpenRefine, MongoDB, and MapReduce. The book runs to 360 pages and covers a surprising number of topics. It also comes with a GitHub repository containing all the code.
Practical Data Analysis is all about applying statistical methods to computer science. I find it very useful because this was not taught in my statistics classes. In college, we practiced statistics only in fields like sociology, psychology, agriculture, economics, chemistry, biology, industrial engineering, and many others, but never in computer science; we only dealt with it when coding in R or SAS. Hal Varian once said in this video that,
. . . we've got at least hundred statisticians on Google . . .
And I was curious about that. I mean, what are they doing at Google? What statistical tools do they use? This book gives some answers. Hector Cuesta uses Dynamic Time Warping (DTW) to illustrate image similarity search, which Google uses for searching images, by treating the pixels of each photo as a time series and comparing the distance between them. He classifies emails as spam or not spam based on the subject line of the messages, demonstrating the Naïve Bayes algorithm for text classification (isn't that cool?). He also talks about Kernel Ridge Regression for predicting the gold price from a time series, and about Principal Component Analysis (PCA) and Multidimensional Scaling (MDS) for dimensionality reduction. The later chapters are all about "hacking", just as John D. Cook described in his review: pulling data from social networking sites like Facebook and Twitter, visualizing it with Gephi, and analyzing it.
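To make the DTW idea a bit more concrete, here is a minimal sketch of the classic dynamic-programming version of the distance in plain Python. This is not the book's code. The function name `dtw_distance` and the tiny "pixel" sequences are my own made-up illustration, and I am assuming each image has already been flattened into a one-dimensional list of intensity values.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    Uses the absolute difference as the local cost and the standard
    O(len(a) * len(b)) dynamic-programming table.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: repeat b[j-1], repeat a[i-1], or advance both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical flattened pixel intensities (made-up values, not real images)
query = [10, 10, 200, 200, 10]
candidates = {
    "stretched_copy": [10, 10, 10, 200, 200, 200, 10],
    "different": [200, 10, 10, 10, 200],
}

# Rank candidates by DTW distance to the query; smaller means more similar
ranking = sorted(candidates, key=lambda name: dtw_distance(query, candidates[name]))
print(ranking)
```

The point of the warping is that a stretched version of the same pattern still gets a small distance, which a plain element-by-element comparison would not give you.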