Gaurav Singhal

Building a Twitter Sentiment Analysis in Python

  • Jul 1, 2020
  • 10 Min read
  • 46,886 Views
Data
Data Analytics
Machine Learning
Python

Introduction

The ability to categorize opinions expressed in the text of tweets—and especially to determine whether the writer's attitude is positive, negative, or neutral—is highly valuable. In this guide, we will use the process known as sentiment analysis to categorize the opinions of people on Twitter towards a hypothetical topic called #hashtag.

There are different ordinal scales used to categorize tweets. A five-point ordinal scale includes five categories: Highly Negative, Slightly Negative, Neutral, Slightly Positive, and Highly Positive. A three-point ordinal scale includes Negative, Neutral, and Positive; and a two-point ordinal scale includes Negative and Positive. In this guide, we will use a three-point ordinal scale to categorize tweets with #hashtag.

Getting Started

Sentiment analysis involves natural language processing because it deals with human-written text. You'll have to download a few Python libraries to work with the code. Use pip install <library> to install them.
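For the imports used in this guide, the install command would likely be `pip install pandas numpy nltk scikit-learn`. NLTK also needs its corpus data on disk before the stopword list and tokenizer work. The sketch below shows one way to fetch it. The resource names are assumptions based on what this guide imports (newer NLTK releases may ask for `punkt_tab` instead of `punkt`), and the helper name `ensure_nltk_data` is hypothetical.

```python
# Resource names are assumptions based on this guide's imports:
# stopwords -> nltk.corpus.stopwords, punkt -> word_tokenize,
# wordnet -> WordNetLemmatizer. Newer NLTK versions may need "punkt_tab".
NLTK_RESOURCES = ["stopwords", "punkt", "wordnet"]


def ensure_nltk_data(resources=NLTK_RESOURCES):
    """Download each NLTK resource and report which downloads succeeded."""
    import nltk  # imported lazily so this file loads before nltk is installed

    return {name: nltk.download(name, quiet=True) for name in resources}


# Run once, before stopwords.words('english') is first called:
# ensure_nltk_data()
```

The call is left commented out because it needs network access; run it once per environment.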

Setting Up

To train a machine learning model, we need data. You can download the dataset to use in this guide here.

Import the required libraries:

```python
import pandas as pd
import numpy as np
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
# ML Libraries
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Global Parameters
stop_words = set(stopwords.words('english'))
```

Loading the Dataset

After you download the CSV, you'll see that there are 1.6 million tweets already coded into three categories by hand.

This dataset encodes the target variable on a three-point ordinal scale: 0 = negative, 2 = neutral, 4 = positive.

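```python
def load_dataset(filename, cols):
    # Note: read_csv treats the first row as a header by default;
    # pass header=None if your CSV has no header row.
    dataset = pd.read_csv(filename, encoding='latin-1')
    dataset.columns = cols
    return dataset
```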

The dataset has six columns: 'target', 't_id', 'created_at', 'query', 'user', and 'text'. We are only interested in 'text' and 'target', but you can keep other columns if you like. A small helper function makes dropping columns reusable.

```python
def remove_unwanted_cols(dataset, cols):
    # Deletes the columns in place and returns the same DataFrame
    for col in cols:
        del dataset[col]
    return dataset
```
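Deleting columns after loading means the whole file is read into memory first. As an alternative sketch, pandas can skip the unwanted columns while reading. This assumes the CSV has no header row; the helper name `load_text_and_target` and the two sample rows are made up for illustration.

```python
import io

import pandas as pd

COLS = ['target', 't_id', 'created_at', 'query', 'user', 'text']


def load_text_and_target(source, cols=COLS):
    """Read a header-less CSV and keep only the 'target' and 'text' columns."""
    return pd.read_csv(
        source,
        encoding='latin-1',
        header=None,                 # assumption: the file has no header row
        names=cols,                  # assign the six column names
        usecols=['target', 'text'],  # parse only the columns we need
    )


# Two made-up rows standing in for the real file
sample_csv = io.BytesIO(
    b'0,1,Mon Apr 06,NO_QUERY,alice,"so sad today"\n'
    b'4,2,Mon Apr 06,NO_QUERY,bob,"what a great day"\n'
)
df = load_text_and_target(sample_csv)
print(df)
```

On a file with 1.6 million rows, parsing two columns instead of six can noticeably reduce memory use.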

Pre-processing Tweets

Pre-processing is one of the essential steps in any natural language processing (NLP) task. Data scientists rarely get filtered, ready-to-use data, so the raw text usually needs several processing steps before a model can use it.

  • Letter casing: Converting all letters to either upper case or lower case.
  • Tokenizing: Turning the tweets into tokens. Tokens are words separated by spaces in a text.
  • Noise removal: Eliminating unwanted characters, such as HTML tags, punctuation marks, special characters, and extra whitespace.
  • Stopword removal: Some words do not contribute much to the machine learning model, so it's good to remove them. A list of stopwords can be defined by the nltk library, or it can be business-specific.
  • Normalization: Normalization generally refers to a series of related tasks meant to put all text on the same level. Converting text to lower case, removing special characters, and removing stopwords will remove basic inconsistencies. Normalization improves text matching.
  • Stemming: Eliminating affixes (circumfixes, suffixes, prefixes, infixes) from a word in order to obtain a word stem. The Porter stemmer is one of the most widely used techniques because it is very fast. Generally, stemming chops off the end of the word, which works well in most cases.
    • Example: Working -> Work
  • Lemmatization: The goal is the same as with stemming, but stemming sometimes loses the actual meaning of the word. Lemmatization uses a vocabulary and morphological analysis to return the base or dictionary form of a word, also known as the lemma.
    • Example: Better -> Good.

```python
def preprocess_tweet_text(tweet):
    # Lowercase first so tokens match NLTK's lowercase stopword list
    tweet = tweet.lower()
    # Remove URLs
    tweet = re.sub(r"http\S+|www\S+", '', tweet, flags=re.MULTILINE)
    # Remove user @ references and '#' from tweet
    tweet = re.sub(r'@\w+|#', '', tweet)
    # Remove punctuation
    tweet = tweet.translate(str.maketrans('', '', string.punctuation))
    # Tokenize and remove stopwords
    tweet_tokens = word_tokenize(tweet)
    filtered_words = [w for w in tweet_tokens if w not in stop_words]

    # Optional normalization: uncomment ONE of the two options below
    # and return its result instead of filtered_words.
    # ps = PorterStemmer()
    # stemmed_words = [ps.stem(w) for w in filtered_words]
    # lemmatizer = WordNetLemmatizer()
    # lemma_words = [lemmatizer.lemmatize(w, pos='a') for w in filtered_words]

    return " ".join(filtered_words)
```
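To see what the cleaning steps do to a single tweet, here is a standard-library-only sketch of the same idea. It is not the guide's function: it splits on whitespace instead of calling `word_tokenize`, and `SAMPLE_STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English list.

```python
import re
import string

# Hypothetical mini stopword list; the guide uses NLTK's English list instead.
SAMPLE_STOP_WORDS = {"the", "is", "a", "at", "this", "so", "to"}


def clean_tweet(tweet):
    """Lowercase, strip URLs/mentions/punctuation, then drop stopwords."""
    tweet = tweet.lower()
    tweet = re.sub(r"http\S+|www\S+", "", tweet)   # remove URLs
    tweet = re.sub(r"@\w+|#", "", tweet)           # remove @user and '#'
    tweet = tweet.translate(str.maketrans("", "", string.punctuation))
    tokens = tweet.split()                          # simple whitespace tokens
    return [t for t in tokens if t not in SAMPLE_STOP_WORDS]


sample = "@alice This is SO good!!! Read more at https://example.com #hashtag"
print(clean_tweet(sample))
# → ['good', 'read', 'more', 'hashtag']
```

Note that "This" and "SO" are only removed because lowercasing happens before the stopword filter.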

Stemming is faster than lemmatization. You can uncomment the code and see how results change. Note: Do not apply both. Remember that stemming and lemmatization are normalization techniques, and it is recommended to use only one approach to normalize. Let your project requirements guide your decision, or you can always do experiments and see which one gives better results. In this case, stemming and lemmatizing yield almost the same accuracy.
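As a quick illustration of stemming on its own, the sketch below runs NLTK's `PorterStemmer` on a few word forms. The stemmer is rule-based, so it should not need any corpus download, whereas `WordNetLemmatizer` needs the WordNet data and is left out here. The word list is an arbitrary example.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Different surface forms of the same verb, plus a plural noun
words = ["working", "worked", "works", "studies"]
stems = [ps.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

The three forms of "work" collapse to one stem, which is the point of normalization. A stem does not have to be a dictionary word: "studies" comes out as "studi".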

  • Vectorizing data: Vectorizing is the process of converting tokens to numbers. It is an important step because machine learning algorithms work with numbers, not text.

In this guide, you'll implement vectorization using tf-idf. There are other techniques as well, such as Bag of Words and N-grams.
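To make the weighting concrete, here is a hand-rolled sketch of tf-idf in plain Python. It assumes the formulas scikit-learn documents for `TfidfVectorizer` with `sublinear_tf=True` and otherwise default settings: term frequency `1 + ln(tf)`, smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document. Tokenization is a simple whitespace split, which differs from the library's tokenizer, and the corpus is made up.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Return one {term: weight} dict per document."""
    docs = [doc.split() for doc in corpus]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Sublinear tf dampens the effect of repeated terms
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


corpus = ["good movie", "bad movie", "good good day"]
vectors = tfidf(corpus)
for text, vec in zip(corpus, vectors):
    print(text, {t: round(w, 3) for t, w in vec.items()})
```

In the second document, "bad" gets a higher weight than "movie" because it appears in fewer documents, which is the behavior tf-idf is designed to produce.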

```python
def get_feature_vector(train_fit):
    vector = TfidfVectorizer(sublinear_tf=True)
    vector.fit(train_fit)
    return vector
```

Important Note: I am using the dataset as the corpus to fit the tf-idf vectorizer. The same fitted vectorizer should be used for both training and testing so that the feature columns line up.
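The small sketch below shows why the same fitted vectorizer matters: the vocabulary is fixed at `fit` time, so every later `transform` call produces the same columns, and words that were not seen during fitting are simply ignored. The texts are made-up examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["good movie", "bad movie", "great day"]
new_texts = ["good day", "unseen words only"]

vectorizer = TfidfVectorizer(sublinear_tf=True)
vectorizer.fit(train_texts)          # vocabulary is learned here, once

X_train = vectorizer.transform(train_texts)
X_new = vectorizer.transform(new_texts)   # reuses the learned vocabulary

print(sorted(vectorizer.vocabulary_))
print(X_train.shape, X_new.shape)
```

Both matrices have five columns, one per vocabulary word. The second new text contains no known words, so its row is all zeros. Fitting a fresh vectorizer on the new texts would instead produce columns the trained model cannot interpret.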

The target column consists of the integer values 0, 2, and 4, but users do not usually want their results in this form. To make the integer results easier to read, you can implement a small helper function.

```python
def int_to_string(sentiment):
    if sentiment == 0:
        return "Negative"
    elif sentiment == 2:
        return "Neutral"
    else:
        return "Positive"
```

Bringing Everything Together

In this section, we will call all the functions that you have created. You'll see Naive Bayes and Logistic Regression algorithms for predictions. These two algorithms are quite popular in NLP, although you can try out other options too.

```python
# Load dataset
dataset = load_dataset("data/training.csv", ['target', 't_id', 'created_at', 'query', 'user', 'text'])
# Remove unwanted columns (in place); only 'target' and 'text' remain
dataset = remove_unwanted_cols(dataset, ['t_id', 'created_at', 'query', 'user'])
# Preprocess data
dataset['text'] = dataset['text'].apply(preprocess_tweet_text)

# The same tf-idf vectorizer is reused later to score unseen trending tweets
# Column 0 is 'target', column 1 is 'text'
tf_vector = get_feature_vector(np.array(dataset.iloc[:, 1]).ravel())
X = tf_vector.transform(np.array(dataset.iloc[:, 1]).ravel())
y = np.array(dataset.iloc[:, 0]).ravel()
# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

# Training Naive Bayes model
NB_model = MultinomialNB()
NB_model.fit(X_train, y_train)
y_predict_nb = NB_model.predict(X_test)
print(accuracy_score(y_test, y_predict_nb))

# Training Logistic Regression model
LR_model = LogisticRegression(solver='lbfgs')
LR_model.fit(X_train, y_train)
y_predict_lr = LR_model.predict(X_test)
print(accuracy_score(y_test, y_predict_lr))
```

Naive Bayes reaches nearly 76% accuracy, and Logistic Regression nearly 79%. These figures were recorded without stemming or lemmatization. Using better techniques, you might get better accuracy.
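If you want to try the workflow without downloading the full CSV, the sketch below runs the same two stages, tf-idf followed by Naive Bayes, on four made-up sentences. It uses scikit-learn's `make_pipeline` to bundle the vectorizer and the classifier, which is a swapped-in convenience rather than the approach used in this guide, and the toy data is far too small to say anything about accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training examples labelled on the guide's scale (0 = negative, 4 = positive)
train_texts = [
    "love this great phone",
    "wonderful happy day",
    "hate this awful phone",
    "terrible sad day",
]
train_labels = [4, 4, 0, 0]

# The pipeline fits the vectorizer and the classifier together,
# and applies the same fitted vectorizer at prediction time.
model = make_pipeline(TfidfVectorizer(sublinear_tf=True), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["great happy phone", "awful sad day"])
print(predictions)
```

Because the pipeline accepts raw strings, there is no separate `transform` call to forget when scoring new text.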

Testing on Real-time Feeds

This step is completely optional and will only apply if you have read and implemented the guide Building a Twitter Bot with Python.

```python
test_file_name = "trending_tweets/08-04-2020-1586291553-tweets.csv"
test_ds = load_dataset(test_file_name, ["t_id", "hashtag", "created_at", "user", "text"])
test_ds = remove_unwanted_cols(test_ds, ["t_id", "created_at", "user"])

# Creating text feature (column 0 is 'hashtag', column 1 is 'text')
test_ds['text'] = test_ds["text"].apply(preprocess_tweet_text)
test_feature = tf_vector.transform(np.array(test_ds.iloc[:, 1]).ravel())

# Using Logistic Regression model for prediction
test_prediction_lr = LR_model.predict(test_feature)

# Aggregating results per hashtag: max() keeps the highest predicted label in each group
test_result_ds = pd.DataFrame({'hashtag': test_ds.hashtag, 'prediction': test_prediction_lr})
test_result = test_result_ds.groupby(['hashtag']).max().reset_index()
test_result.columns = ['hashtag', 'predictions']
test_result.predictions = test_result['predictions'].apply(int_to_string)

print(test_result)
```

Replace the file name with your own in the test_file_name variable.
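One thing to be aware of: `groupby(...).max()` labels a hashtag with the highest prediction found in its group, so a single positive tweet is enough to mark the whole hashtag as positive. A possible alternative, sketched below on made-up predictions, is a majority vote that takes the most frequent label per hashtag.

```python
import pandas as pd

# Made-up predictions: '#a' is mostly negative, '#b' is all positive
predictions = pd.DataFrame({
    'hashtag': ['#a', '#a', '#a', '#b', '#b'],
    'prediction': [0, 0, 4, 4, 4],
})

grouped = predictions.groupby('hashtag')['prediction']

# Highest label per hashtag (what max() does)
by_max = grouped.max()
# Most frequent label per hashtag; ties resolve to the smallest label
by_vote = grouped.agg(lambda s: s.mode().iloc[0])

print(by_max.to_dict())   # → {'#a': 4, '#b': 4}
print(by_vote.to_dict())  # → {'#a': 0, '#b': 4}
```

Which aggregation fits best depends on what you want the per-hashtag label to mean, so treat the vote as one option to experiment with.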

Conclusion

I hope you enjoyed reading this guide. Sentiment analysis is a popular project that almost every data scientist will do at some point. It can solve a lot of problems, depending on how you want to use it.

I highly recommend trying different vectorizing techniques and applying feature extraction and feature selection to the dataset. Try implementing more machine learning models, and you might be able to get accuracy over 85%.

If you have any questions, feel free to reach out to me at CodeAlphabet.