I have been reading this book since last week, and now I want to share my thoughts about it. I was excited to review it because I had never heard of most of the tools it features, like OpenRefine, MongoDB, and MapReduce. The book has 360 pages and, surprisingly, covers a lot of topics. It also comes with a GitHub repository containing all the code.
Practical Data Analysis is all about applying statistical methodologies to computer science. I find it very useful since this was not taught in my statistics classes. In college, we only practiced statistics in fields like sociology, psychology, agriculture, economics, chemistry, biology, industrial engineering, and many others, but not in computer science; we only dealt with it when coding in R or SAS. Hal Varian once said in this video that,
. . . we've got at least hundred statisticians on Google . . .
And I was curious about that. I mean, what are they doing at Google? What statistical tools do they use? This book gave me some answers. Hector Cuesta uses Dynamic Time Warping (DTW) to illustrate image similarity search, which is used by Google for searching images, by using time series to compare the distance between the photo pixels. Another example is classifying emails as spam or not spam based on the subject line of the messages, where he demonstrates the application of the Naïve Bayes algorithm to text classification (isn't that cool?). He also talks about Kernel Ridge Regression for predicting the gold price using time series, and about Principal Component Analysis (PCA) and Multidimensional Scaling (MDS) for dimensionality reduction. Then, in the later chapters, it's all about "Hacking", just as John D. Cook described in his review: hacking data from social networking sites like Facebook and Twitter, visualizing it with Gephi, and analyzing it.
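For readers who had not met DTW before, here is a minimal sketch of the idea in Python. It is not the code from the book: the function name `dtw_distance` and the toy sequences are illustrative assumptions, and the sketch uses plain lists of numbers rather than image pixels. It only shows how DTW lines up two series that have the same shape but different pacing.

```python
def dtw_distance(a, b):
    """Return the Dynamic Time Warping distance between two numeric sequences.

    Hypothetical helper for illustration; uses absolute difference as the
    point-to-point cost and the classic O(len(a) * len(b)) dynamic program.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i items of a with the first j items of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: step in a only, step in b only, step in both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Toy data (made up): the second series is the first with one value repeated.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted_level = dtw_distance([0, 0], [1, 1])
print(same_shape, shifted_level)
```

The repeated value costs nothing because DTW is allowed to match one point against several, whereas a series sitting at a different level still ends up far away. That is why it suits comparing sequences that are stretched or compressed versions of each other.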
Some of the issues I found are inconsistencies in file names between the GitHub repository and the book, which can get confusing: for example, pokemonByType.csv in GitHub is named sumPokemon.csv in the book. In Chapter 2, working with OpenRefine, the column names of the Excel data in GitHub are in a different language (I think Spanish), while in the book they are in English. Another issue is with the code: the D3.js charts in Chapter 3, such as the bar and pie charts, did not work on my machine. I am new to D3.js, so I was not able to fix them immediately, but I got a quick response after sending an issue to the author. He even said, "if I can help you in anything else don't hesitate to ask." So there is nothing to worry about; these are minor issues, and I mention them just to caution you.