2017/02/19

Extracting Twitter Data using R

Of late I have been spending quite a bit of time reading about Big Data technologies (HDFS, Pig, Hive, Impala, etc.), Oracle Data Visualization Desktop (Oracle DVD) and R.

To try out Big Data techniques, I have been looking around for large data sets. I got this crazy idea of extracting Twitter data, analyzing it with Oracle DVD and generating cool visuals.

But then I was stuck: I did not know how to pull data out of Twitter.

Googling around, I found this excellent post, which details a step-by-step process to extract Twitter data using R.

The prerequisites are:
  • R (v3.3) installed on your desktop
  • A Twitter account, which is needed to create a Twitter application

STEPS TO CREATE A TWITTER APPLICATION
Navigate to My Applications in the upper right hand corner.
Create a new application.
Fill out the new app form. The name must be unique, i.e., no one else may have used it for their Twitter app. Give a brief description of the app; you can change this later if needed. Enter your website or blog address. The Callback URL can be left blank. Once you’ve done this, read the “Developer Rules Of The Road” blurb, check the “Yes, I agree” box, fill in the CAPTCHA and click the “Create Your Twitter Application” button.
Scroll down and click on the “Create my access token” button.
Note the values of the consumer key and consumer secret, along with the access token and access token secret, and keep them handy for future use. You should keep these secret. If anyone were to get these keys, they could effectively access your Twitter account.

Install and Load Required Packages

R comes with a standard set of packages. A number of other packages are available for download and installation. For the purpose of this post, we will need the following packages:
–  ROAuth: Provides an interface to the OAuth 1.0 specification, allowing users to authenticate via OAuth to the server of their choice.
–  twitteR: Provides an interface to the Twitter web API.
Install and load the required packages:
install.packages("twitteR")
install.packages("ROAuth")
library("twitteR")
library("ROAuth")

Creating Twitter Authentication Process


The procedure worked for most of the bits, except for the Twitter authentication step. Rather than using "TwitterOAuth" for authentication, which was not working, I had to replace this step with the following:

library("base64enc")

 setup_twitter_oauth(Consumer_key,Consumer_secret,access_token,access_token_secret)

Here Consumer_key, Consumer_secret, access_token and access_token_secret are variables that you must define yourself and assign the values shown for your Twitter app.
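
One way to avoid typing the keys straight into a script is to read them from environment variables. The snippet below is only a sketch: the helper name read_twitter_credentials and the four TWITTER_* variable names are my own assumptions, not something the twitteR package defines. Only setup_twitter_oauth (shown in the trailing comments) comes from the package.

```r
# Hypothetical helper: collect the four credentials from environment
# variables. The variable names are assumptions made for this sketch.
read_twitter_credentials <- function() {
  vars <- c(
    consumer_key        = "TWITTER_CONSUMER_KEY",
    consumer_secret     = "TWITTER_CONSUMER_SECRET",
    access_token        = "TWITTER_ACCESS_TOKEN",
    access_token_secret = "TWITTER_ACCESS_TOKEN_SECRET"
  )
  values <- Sys.getenv(vars, unset = "")
  names(values) <- names(vars)

  # Stop early with a readable message if anything is not set.
  missing_vars <- vars[values == ""]
  if (length(missing_vars) > 0) {
    stop("Missing environment variables: ",
         paste(missing_vars, collapse = ", "))
  }
  as.list(values)
}

# Usage (needs the twitteR package and real credentials):
# creds <- read_twitter_credentials()
# twitteR::setup_twitter_oauth(creds$consumer_key, creds$consumer_secret,
#                              creds$access_token, creds$access_token_secret)
```

This keeps the secrets out of any script you might share, which fits the earlier advice to keep the keys private.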


After this it all worked fine: I was able to connect to Twitter and extract the data I was looking for.


Extract Tweets


From the R CLI, search Twitter for a hashtag and write the results to a text file:


> tweets <- searchTwitter("#bigdata", n=100)

To verify the contents of the extract:

> print(head(tweets,2))
[[1]]
[1] "smuddu: Session at #gitpro2017 on  how to get ROI from Bigdata and ML? @CIOonline @strataconf #MachineLearning @bigdata #bigdata @hadoop @awscloud"

[[2]]
[1] "alevergara78: RT @bigdata: . @AnimaAnandkumar on distributed deep learning using MXNet \xed��\xed�\u008f  \xed��\xed�\u0092 #deeplearning sessions at #stratahadoop San Jose https://t.co…"

The steps below dump the contents of the R list "tweets" into a file:
> sink("c:/Users/mah/Documents/tweets.txt")
> print(tweets)
> sink()
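
sink() together with print() writes the tweets exactly as they appear on the console, list indices and all. If the file is meant for a tool that reads CSV, a data frame written with write.csv may be easier to work with. twitteR provides twListToDF() to turn the list returned by searchTwitter into a data frame. The sketch below uses a small made-up data frame as a stand-in for that result so that it runs without a Twitter connection; the sample rows, the tweets.csv file name and the temporary directory are all assumptions.

```r
# Stand-in for twListToDF(tweets): a tiny made-up data frame with two of
# the columns that function returns (screenName and text).
tweets_df <- data.frame(
  screenName = c("user_a", "user_b"),
  text = c("First sample tweet #bigdata",
           "Second sample tweet, with a comma"),
  stringsAsFactors = FALSE
)

# Write one row per tweet; fields containing commas are quoted automatically.
out_file <- file.path(tempdir(), "tweets.csv")
write.csv(tweets_df, out_file, row.names = FALSE, fileEncoding = "UTF-8")

# Read the file back to confirm it round-trips.
read_back <- read.csv(out_file, stringsAsFactors = FALSE,
                      fileEncoding = "UTF-8")
print(read_back)
```

With real data, replacing the stand-in with tweets_df <- twListToDF(tweets) should be the only change needed.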


The next step is to analyze the data using Oracle DVD, but that is for another post.




