Friday, 21 March 2014

R: Text Mining on Twitter #PrayForMH370 Malaysia Airlines

Warning: Twitter has redesigned the interface of its developers' page, so the screenshots below are now out of date. This does not affect the code, so you can still use it.

It has been two weeks since search and rescue operations began for Malaysia Airlines Flight MH370, which vanished from radar on March 8, 2014. Wherever the passengers and crew are, we hope and pray for them.

Photo from VENUS - Wall of Hope & Prayers for MH370

In this post, we are going to do text mining on tweets containing #PrayForMH370 from March 8 to March 20, 2014, using the Twitter API. First, we need to authenticate with the Twitter API to obtain the data. In the tutorial that follows, the idea and code for Twitter authentication are based on Julianhi's amazing blog, and I am going to replicate his code here to keep a copy of it.

Go to Twitter's developers' page and sign in with your Twitter account. When you're done, click on your profile picture in the top right corner, like the one below.

Click on My applications, then Create New App. Fill in what's necessary (i.e. Name, Description and Website).

For the Website field, Julianhi recommends entering any valid URL; the name and description can be anything you want. To finish, check Yes, I agree under Developer Rules of the Road, and click on Create your Twitter application. You will have something like this,

Keep this open in your browser, and go to your R script. Copy and paste the following code,

Replace apiKey and apiSecret with the values on the API Keys tab of your Twitter app (see the photo above, and refer back to your browser), then run the code above. If you encounter an error that says Error: Unauthorized, make sure to remove any spaces at the end of your apiKey or apiSecret, inside the quotation marks.
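As a small safeguard against that copy-and-paste problem, the keys can be trimmed in code before they are used. This is only a sketch: the key values below are placeholders, not real credentials, and trimws() assumes R 3.2.0 or later.

```r
# Placeholder credentials (not real) with stray spaces left over from copy and paste
apiKey    <- " yourAPIkey "
apiSecret <- "yourAPIsecret  "

# trimws() removes leading and trailing whitespace from each string
apiKey    <- trimws(apiKey)
apiSecret <- trimws(apiSecret)

# Sanity check: no whitespace should remain anywhere in either string
stopifnot(!grepl("\\s", apiKey), !grepl("\\s", apiSecret))
```

If the check passes silently, the strings are clean and can be handed to the authentication step.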

This happens especially when you copy and paste (I'm guilty of that). If there is no error, twitCred$handshake will return a link. Copy and paste this link into your browser, then authorize the app,

After clicking Authorize app, take note of the PIN and enter it in the R console; finally, run the last function (line 24 above). Now that we have access, we can extract the data. For this post, we will make a word cloud for #PrayforMH370 and, as a bonus, #MH370 over the said inclusive dates. Here it is, run the following:
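For readers who want to see what sits underneath a word cloud, it boils down to a word-frequency table. Here is a rough base R sketch of that step, independent of the Twitter API; the sample tweets are made up for illustration, and this is not the code used in this post.

```r
# Made-up sample texts standing in for downloaded tweets
tweets <- c("Pray for MH370 #PrayForMH370",
            "Hope and pray for the passengers #PrayForMH370",
            "Still no news... we pray #PrayForMH370")

# Lower-case everything and strip punctuation (this also removes the "#")
clean <- tolower(tweets)
clean <- gsub("[[:punct:]]", "", clean)

# Split on whitespace and drop any empty strings
words <- unlist(strsplit(clean, "\\s+"))
words <- words[nzchar(words)]

# Count each word, most frequent first
freq <- sort(table(words), decreasing = TRUE)
head(freq)
```

A table like freq (the words and their counts) is the kind of input a plotting function such as wordcloud() from the wordcloud package expects.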

The above code for creating the word cloud is originally from the Mining twitter with R site, and below is its output,

If you look at line 6 of the code above, I set n = 1000. This line will likely give you a warning that says,

Just ignore this, as it simply means that the search found only 599 tweets in total using #PrayforMH370 for the inclusive dates. Now, let's investigate the word cloud above. I know very little Bahasa Melayu (the Malay language), but I can tell that the words drawn largest are likely to be Malay prepositions or, in computing terms, Malay stop words. This is because the code removes only 'prayformh370', 'prayformh' and stopwords('english') (English stop words); Malay stop words are left in, as Malay is not yet supported by the stopwords function. Hence, this is not a good word cloud for exploring the tweets. So I looked for a list of Malay stop words on the web, which led me to this site. Since this is what we are looking for, we import the list using the XML package.
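To see why the extra list matters, here is a minimal base R sketch of filtering a frequency table against a combined stop word list. The words, counts and both stop word vectors are made up for illustration; a real Malay list would be much longer, and english_stopwords is just a stand-in for stopwords('english').

```r
# Word counts as they might come out of a tweet corpus (made-up numbers)
freq <- c(dengannya = 12, telah = 9, bagi = 7, solat = 5, waktu = 4, the = 3)

# A tiny, hypothetical Malay stop word list
malay_stopwords <- c("dengannya", "telah", "bagi")

# Stand-in for the English list returned by stopwords("english")
english_stopwords <- c("the", "and", "for")

# Combine both lists and keep only the words that are not stop words
all_stopwords <- c(malay_stopwords, english_stopwords)
filtered <- freq[!names(freq) %in% all_stopwords]
filtered
```

Only the content words survive the filter, which is what lets them dominate the cloud instead of the stop words.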

Supplying the previous code with the Malay stop words, we have

There is the difference: dengannya, telah and bagi were actually stop words. And based on this multilingual guy (Google Translate), "waktu solat masuk" means "prayer time in". You can play with the words above. For #MH370, using the same code with the Malay list and all available stop words from the stopwords function, we have

And why is it square rather than circular? That is due to my small screen and the large number of data points (1000).

So, from the two plots we have (#PrayForMH370 and #MH370), we can say, subjectively, that tweets under #MH370 were probably mostly retweets of media tweets, since there are words like cnn and malaysia, along with links like httpcodjczaqtm. There is also the name of the CNN anchor, andersoncooper, and the word sobrevolando (Spanish for "flying over"; Google knows it well). We can also see the word probably, since there were many guesses as to what happened when this news came out. And of course, most Twitter users from Malaysia during the inclusive dates (March 8 to 20) were expressing their prayers for the lost plane. That's what I see; what about your exploration?
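A token like httpcodjczaqtm is most likely what is left of a link once punctuation has been stripped. If you want to keep such remnants out of the cloud, one option is to remove URLs before removing punctuation. The sketch below uses a made-up tweet with a hypothetical account name and link; it also includes a crude retweet check, assuming retweets start with "RT @".

```r
# Made-up tweet text with a hypothetical account name and shortened link
tweet <- "RT @somenews: Search for #MH370 continues http://t.co/abc123XYZ"

# Stripping punctuation first fuses the link into a single pseudo-word
naive <- gsub("[[:punct:]]", "", tolower(tweet))

# Removing the URL first avoids that
no_url <- gsub("http\\S+", "", tweet)
better <- trimws(gsub("[[:punct:]]", "", tolower(no_url)))

# Crude retweet check: does the text start with "RT @"?
is_retweet <- grepl("^RT @", tweet)

naive
better
is_retweet
```

Counting how many tweets pass the retweet check would be one way to back up the guess about retweets with a number.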
