FORMCEPT’s Exclusive Analysis of GST Tweets – A Step by Step Approach

A while back, when the buzz around GST was at its peak, we at FORMCEPT published an analysis of over 20K tweets on GST; the insights are captured in the image. Many of you may want to know how we conducted this analysis and how you can run similar analyses yourselves. In this blog, we walk you through the end-to-end sentiment analysis we performed for GST on Twitter and introduce some of our APIs and other features that you can use. The analysis went through four stages:

Methodology and Approach
Data Collection
Data Transformation
Insight Generation

Methodology and Approach

First, we planned the approach and methodology for the analysis.

Nature and Size of Sample: To keep the insights as relevant and recent as possible, we decided on a sample of at least 20,000 tweets from 3 July 2017 to 9 July 2017, the week immediately after GST was launched on 1 July 2017.

Parameters: After considering about 12 candidate parameters for developing insights, we chose to focus on the following: location (city), most mentioned users, most influential users, most frequently used hashtags (apart from #GST), and most used words.

We will now discuss each of the remaining steps one by one along with code snippets, wherever required.

Data Collection

After planning the basics, the next step was to filter tweets by #GST and compile a database of the required number of tweets. FORMCEPT's Twitter Fetcher API is the tool for this. To crawl data from Twitter, you need an account with one or more applications; each application can crawl a maximum of 200 tweets per 15 minutes. The good news is that with our Twitter Fetcher API you need not build the application from scratch: simply build on top of the API, which already has the necessary specifications built in, and you are good to go.
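As a rough worked example of what this limit implies, the sketch below estimates how long it would take to collect the 20,000-tweet sample at 200 tweets per 15-minute window. The function name, and the assumption that several applications run in parallel and each reaches its full limit, are ours and for illustration only.

```python
import math

TWEETS_PER_WINDOW = 200  # per application, as stated in the text
WINDOW_MINUTES = 15


def estimated_crawl_hours(total_tweets: int, applications: int = 1) -> float:
    """Estimate crawl time, assuming every application hits its full limit."""
    tweets_per_window = TWEETS_PER_WINDOW * applications
    windows = math.ceil(total_tweets / tweets_per_window)
    return windows * WINDOW_MINUTES / 60


print(estimated_crawl_hours(20_000))     # one application
print(estimated_crawl_hours(20_000, 4))  # four applications in parallel
```

Under these assumptions a single application needs 100 windows, or about 25 hours, which is why spreading the crawl over several applications or several days is worth planning for.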

Using the Twitter Fetcher API requires the user to send certain parameters; a sample is shown in the table below. The first four (access_key, access_secret, consumer_key, and consumer_secret) are credentials, called 'tokens', that Twitter provides for your application. The rest are parameters that come bundled with the API.

Parameter          Value
access_key         <your access key>
access_secret      <your access key secret>
consumer_key       <your consumer key>
consumer_secret    <your consumer key secret>
language           <language of choice for tweet> (e.g. "en" for English)
min_tweet_length   <minimum length of the tweet> (e.g. 60 for 60 chars)
retry              <whether to retry or not in case of exception> (e.g. true)
tweets             <total number of tweets> (e.g. 500)
txt_hashtags       <hashtags of interest> (e.g. GST, IncomeTax)

To know more about getting your Twitter application credentials click here.
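To make the parameter set concrete, here is a minimal sketch that assembles the parameters from the table into a single JSON payload. Only the parameter names and example values come from the table; how the payload is actually submitted (endpoint, transport, and whether txt_hashtags is a list or a comma-separated string) is not covered here, so treat those details as assumptions.

```python
import json

# Credentials issued by Twitter for your application (placeholders).
credentials = {
    "access_key": "<your access key>",
    "access_secret": "<your access key secret>",
    "consumer_key": "<your consumer key>",
    "consumer_secret": "<your consumer key secret>",
}

# Crawl settings, using the example values from the table.
settings = {
    "language": "en",
    "min_tweet_length": 60,
    "retry": True,
    "tweets": 500,
    "txt_hashtags": ["GST", "IncomeTax"],  # assumed to be a list
}

params = {**credentials, **settings}
payload = json.dumps(params, indent=2)
print(payload)
```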

Data Transformation

The Twitter Fetcher API parses and structures the relevant information for the user. The image below shows a specimen of information processed by the API:

Input: #GST is going to implmnt at Home #JammuKashmir tonight 00:01 hrs... It is the historical moment after 1947 #DelhiJammuSrinagarParallelPower

Output: <hashtag> is going to implement at home <hashtag> tonight 00:01 hrs.it is the historical moment after <date><hashtag>

However, tweets are always a challenge when it comes to analysis, since most Twitter users tend to write in shorthand lingo - such as 'm' instead of 'am', 'plz' instead of 'please', 'omg' instead of 'oh my God', and so on.

FORMCEPT's Text Normalization API addresses this problem. To use it, the user first specifies which entities in the tweets need to be normalized. For this analysis, we needed to normalize hyperlinks, remove emojis, and remove repeated and trailing white space and punctuation. Accordingly, we set the corresponding variables to true and the rest to false.
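The flag-driven idea can be illustrated with a small stand-alone sketch. This is not the Text Normalization API itself: the function name, the flag names, the `<url>` placeholder and the emoji ranges (which are not exhaustive) are all assumptions made for illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# A few common emoji code-point ranges plus the variation selector.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF\uFE0F]"
)


def normalize(text, *, hyperlinks=True, emojis=True,
              whitespace=True, punctuation=True):
    """Apply only the normalization steps whose flag is set to True."""
    if hyperlinks:
        text = URL_RE.sub("<url>", text)   # replace links with a placeholder
    if emojis:
        text = EMOJI_RE.sub("", text)      # drop emoji characters
    if whitespace:
        text = re.sub(r"\s+", " ", text).strip()  # collapse runs of white space
    if punctuation:
        text = re.sub(r"[\s.,;:!?]+$", "", text)  # strip trailing punctuation
    return text


print(normalize("GST is live!! \U0001F389  read https://t.co/abc   now..."))
```

Keeping each step behind its own flag mirrors the "set to true, rest to false" configuration described above and makes it easy to switch steps on or off per analysis.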

Insight Generation

"The Purpose of Computing is Insight, Not Numbers"

- Richard Hamming

Our initial analysis of the #GST tweets led to a plethora of information, which we then boiled down to the most interesting insights:

Sentiment Analysis Across Cities
Top Influencers - Most Retweeted
Top User Mentions - Most Tagged
Top Hashtags
Top Words

Top Words

To identify the words most frequently used in the #GST tweets, we first pre-processed the tweets so that the analysis would be accurate and relevant. This included:

Removing stop-words from the tweets (e.g. 'a', 'an', 'the', 'to', etc.)
Filtering out words whose frequency fell outside a predefined range (e.g. with a range of X to Y, words that occurred fewer than X times or more than Y times were automatically filtered out)

We then split the pre-processed tweets into individual words by identifying word boundaries and counted the occurrences of each word across the entire sample. The 10 words with the highest counts made the top 10.
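The steps above can be sketched in a few lines of Python. The stop-word list, the function name and the default frequency bounds are illustrative assumptions; a real analysis would use a fuller stop-word list and bounds tuned to the sample.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "to", "is", "of", "and", "in", "on", "for"}


def top_words(tweets, min_count=1, max_count=None, n=10):
    """Count words across tweets, filter by frequency range, return the top n."""
    counts = Counter()
    for tweet in tweets:
        # Drop placeholders such as <hashtag> or <date> left by pre-processing.
        text = re.sub(r"<\w+>", " ", tweet.lower())
        words = re.findall(r"\b[a-z]+\b", text)  # split on word boundaries
        counts.update(w for w in words if w not in STOP_WORDS)
    kept = {
        word: count for word, count in counts.items()
        if count >= min_count and (max_count is None or count <= max_count)
    }
    return Counter(kept).most_common(n)


sample = ["GST is the new tax", "New tax, new rules", "the GST rollout"]
print(top_words(sample, min_count=2))
```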

Sentiment Analysis

GST was an issue on which many opinions, both in favour and against, were voiced across all media, so we considered it imperative to run a sentiment analysis of the tweets and classify them into three categories: positive, negative and neutral.

For this, we used a pre-trained machine learning model, FORMCEPT's Sentiment Analysis API. It had been trained on a large amount of data to detect sentiment in a piece of text and classify it as positive, negative or neutral.

The model takes a single parameter, the input text, which in this case is the test data, i.e. the #GST tweets. Since our API also returns the geo-location of each tweet, we further grouped the sentiment distribution across the top metropolitan cities.
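The grouping step can be sketched as follows, assuming each tweet has already been labelled and tagged with a city. The record layout, the sample data and the function name are hypothetical; they do not reflect the actual response format of the Sentiment Analysis API.

```python
from collections import Counter, defaultdict

# Hypothetical output of the sentiment step: one record per tweet.
scored_tweets = [
    {"city": "Mumbai", "sentiment": "positive"},
    {"city": "Mumbai", "sentiment": "negative"},
    {"city": "Mumbai", "sentiment": "positive"},
    {"city": "Delhi", "sentiment": "neutral"},
    {"city": "Delhi", "sentiment": "negative"},
]


def sentiment_share_by_city(records):
    """Return {city: {label: percentage of that city's tweets}}."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record["city"]][record["sentiment"]] += 1
    shares = {}
    for city, labels in counts.items():
        total = sum(labels.values())
        shares[city] = {
            label: round(100 * n / total, 1) for label, n in labels.items()
        }
    return shares


for city, share in sentiment_share_by_city(scored_tweets).items():
    print(city, share)
```

Reporting percentages rather than raw counts keeps cities with very different tweet volumes comparable.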

Top Hashtags

Hashtags are critical to keyword search and hence determine a tweet's reach and discoverability, which prompted us to analyze the top hashtags used (apart from #GST). We accomplished this using regular expressions. First, we extracted all the hashtags used across the GST tweets, and then we counted the occurrences of each hashtag. The 10 hashtags with the highest counts were selected as the top hashtags.
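A minimal sketch of this regular-expression approach is shown below. The pattern, the case-insensitive matching and the function name are our assumptions; the exact expressions used in the analysis may differ.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")


def top_hashtags(tweets, exclude=("gst",), n=10):
    """Count every hashtag occurrence, ignoring case and excluded tags."""
    counts = Counter()
    for tweet in tweets:
        tags = [tag.lower() for tag in HASHTAG_RE.findall(tweet)]
        counts.update(tag for tag in tags if tag not in exclude)
    return counts.most_common(n)


sample = [
    "#GST rollout tonight #JammuKashmir",
    "#gst and #IncomeTax explained #jammukashmir",
]
print(top_hashtags(sample))
```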

Top Influencers

Influencers are opinion leaders who shape the opinions of the crowd with their arguments and ideas. An influencer's opinions are usually shared by their followers; on Twitter, this typically shows up as retweets. Accordingly, we identified influencers by counting the retweets per user and ranked them so that the user with the most retweets is #1.
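One way to sketch this ranking is to assume that retweets appear in the classic "RT @user: ..." text form and to count how often each original author is retweeted. That text convention, the sample handles and the function name are assumptions for illustration; retweet metadata from the API, if available, would be a more reliable source.

```python
import re
from collections import Counter

# Matches the original author in tweets of the form "RT @user: ...".
RETWEET_RE = re.compile(r"^RT @(\w+):")


def rank_influencers(tweets, n=10):
    """Return (rank, user, retweet_count) tuples, most retweeted first."""
    counts = Counter()
    for tweet in tweets:
        match = RETWEET_RE.match(tweet)
        if match:
            counts[match.group(1).lower()] += 1
    return [
        (rank, user, count)
        for rank, (user, count) in enumerate(counts.most_common(n), start=1)
    ]


sample = [
    "RT @tax_desk: GST rates explained",
    "RT @Tax_Desk: GST rates explained",
    "RT @policy_watch: What GST means for traders",
    "My own thoughts on GST",
]
print(rank_influencers(sample))
```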

Top Mentions

Twitter mentions are a significant indicator of how readily prominent users come to the public's mind. To identify the top mentions, we again used regular expressions.
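A minimal sketch of mention extraction with a regular expression follows. The pattern, the sample handles and the function name are illustrative assumptions; the lookbehind simply prevents email addresses from being counted as mentions.

```python
import re
from collections import Counter

# "@" followed by a handle, but not when "@" sits inside a word (e.g. an email).
MENTION_RE = re.compile(r"(?<!\w)@(\w+)")


def top_mentions(tweets, n=10):
    """Count every @mention across the tweets, ignoring case."""
    counts = Counter()
    for tweet in tweets:
        counts.update(handle.lower() for handle in MENTION_RE.findall(tweet))
    return counts.most_common(n)


sample = [
    "@tax_desk thanks for the #GST explainer",
    "cc @Tax_Desk @policy_watch",
    "write to us at contactus@formcept.com",
]
print(top_mentions(sample))
```

Note that this simple pattern also counts the handle in a retweet prefix such as "RT @user:", so retweets may need to be filtered out first if mentions and retweets are to be reported separately.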

Other Analysis

We also analyzed the shift in trending hashtags over the week that immediately followed the announcement of GST. As shown in the image below, the 'GST Song', which was launched a day or two after the announcement, was the most referenced topic on Twitter in the first few days. After that, the tweets shifted towards the significant implications of GST for the economy as a whole.

Beyond the core analysis above, we also ran a statistical analysis to find the number of unique hashtags, unique user mentions, and unique influencers across all 23K tweets. Infographics for all of these analyses can be found here. Stay tuned to our blog for more such analyses and insights from FORMCEPT. To know more about what we do and how we can help you, please visit www.formcept.com or write to us at contactus@formcept.com.
