The ability to categorize opinions expressed in the text of tweets—and especially to determine whether the writer's attitude is positive, negative, or neutral—is highly valuable. In this guide, we will use the process known as sentiment analysis to categorize the opinions of people on Twitter towards a hypothetical topic called #hashtag.
There are different ordinal scales used to categorize tweets. A five-point ordinal scale includes five categories: Highly Negative, Slightly Negative, Neutral, Slightly Positive, and Highly Positive. A three-point ordinal scale includes Negative, Neutral, and Positive; and a two-point ordinal scale includes Negative and Positive. In this guide, we will use a three-point ordinal scale to categorize tweets with #hashtag.
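To make the relationship between these scales concrete, here is a minimal sketch that collapses five-point labels onto the three-point scale. The mapping table and the function name are hypothetical and exist only for illustration; the dataset used in this guide is already on a three-point scale.

```python
# Hypothetical mapping for illustration; label names mirror the scales described above.
FIVE_TO_THREE = {
    "Highly Negative": "Negative",
    "Slightly Negative": "Negative",
    "Neutral": "Neutral",
    "Slightly Positive": "Positive",
    "Highly Positive": "Positive",
}


def collapse_to_three_point(label):
    """Map a five-point label onto the three-point scale."""
    return FIVE_TO_THREE[label]


labels = ["Highly Positive", "Neutral", "Slightly Negative"]
print([collapse_to_three_point(label) for label in labels])
```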
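Sentiment analysis involves natural language processing because it deals with human-written text. You'll need a few Python libraries (pandas, numpy, nltk, and scikit-learn) to work with the code. Use pip install <library> to install them. The NLTK stop word list and tokenizer models are separate downloads, which you can fetch once with nltk.download.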
To train a machine learning model, we need data. You can download the dataset to use in this guide here.
Importing the required libraries.
    import pandas as pd
    import numpy as np
    import re
    import string
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from nltk.stem import PorterStemmer
    from nltk.stem import WordNetLemmatizer

    # ML Libraries
    from sklearn.metrics import accuracy_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    # Global Parameters
    stop_words = set(stopwords.words('english'))
After you download the CSV, you'll see that there are 1.6 million tweets already coded into three categories by hand.
The dataset encodes the target variable on a three-point ordinal scale: 0 = negative, 2 = neutral, 4 = positive.
    def load_dataset(filename, cols):
        # Read the CSV and assign the given column names
        dataset = pd.read_csv(filename, encoding='latin-1')
        dataset.columns = cols
        return dataset
The dataset has six columns: 'target', 't_id', 'created_at', 'query', 'user', and 'text'. We are only interested in 'text' and 'target', although you can keep other columns if you like. A small helper function makes dropping columns reusable.
    def remove_unwanted_cols(dataset, cols):
        # Delete each listed column in place and return the same frame
        for col in cols:
            del dataset[col]
        return dataset
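The helper above deletes columns in place, so the frame passed in is modified. If you would rather keep the original frame intact, a possible alternative is pandas' DataFrame.drop, which returns a new frame. The sketch below uses a made-up two-row frame with the guide's column names; drop_columns is a hypothetical helper name, not part of the original code.

```python
import pandas as pd


def drop_columns(dataset, cols):
    """Return a copy of `dataset` without `cols` (hypothetical helper)."""
    return dataset.drop(columns=cols)


# Toy frame with the same column names as the guide's dataset; the values are made up.
toy = pd.DataFrame({
    "target": [0, 4],
    "t_id": [1, 2],
    "created_at": ["Mon", "Tue"],
    "query": ["NO_QUERY", "NO_QUERY"],
    "user": ["a", "b"],
    "text": ["bad day", "great day"],
})

slim = drop_columns(toy, ["t_id", "created_at", "query", "user"])
print(list(slim.columns))  # ['target', 'text']
print(toy.shape)           # the original frame is untouched: (2, 6)
```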
Preprocessing is one of the essential steps in any natural language processing (NLP) task. Data scientists rarely get filtered, ready-to-use data, so a lot of processing has to happen before the text is workable.
Stop words can be taken from the nltk library, or they can be business-specific.
    def preprocess_tweet_text(tweet):
        # Lowercase first so stop word matching is case-insensitive
        tweet = tweet.lower()
        # Remove urls
        tweet = re.sub(r"http\S+|www\S+|https\S+", '', tweet, flags=re.MULTILINE)
        # Remove user @ references and '#' from tweet
        tweet = re.sub(r'\@\w+|\#', '', tweet)
        # Remove punctuations
        tweet = tweet.translate(str.maketrans('', '', string.punctuation))
        # Remove stopwords
        tweet_tokens = word_tokenize(tweet)
        filtered_words = [w for w in tweet_tokens if w not in stop_words]

        # Optional normalization: uncomment only ONE of the two options below
        # ps = PorterStemmer()
        # filtered_words = [ps.stem(w) for w in filtered_words]
        # lemmatizer = WordNetLemmatizer()
        # filtered_words = [lemmatizer.lemmatize(w, pos='a') for w in filtered_words]

        return " ".join(filtered_words)
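To see what each cleaning step does to a single tweet, here is a minimal standard-library sketch of the same pipeline. It is an approximation, not the guide's function: a whitespace split stands in for word_tokenize, and STOP_WORDS is a tiny hypothetical list so the example runs without any NLTK data. The sample tweet and URL are made up.

```python
import re
import string

# Tiny hypothetical stop word set, used only so the sketch runs without NLTK data.
STOP_WORDS = {"the", "is", "at", "a", "this"}


def clean_tweet(tweet):
    tweet = tweet.lower()                                # normalize casing
    tweet = re.sub(r"http\S+|www\S+", "", tweet)         # strip URLs
    tweet = re.sub(r"@\w+|#", "", tweet)                 # strip @mentions and the '#' sign
    tweet = tweet.translate(str.maketrans("", "", string.punctuation))
    tokens = tweet.split()                               # whitespace split stands in for word_tokenize
    return " ".join(w for w in tokens if w not in STOP_WORDS)


print(clean_tweet("@Bob This #hashtag is GREAT! See https://example.com/x"))
# hashtag great see
```

Note that the hashtag word itself survives because only the '#' character is removed, while the whole @mention is dropped.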
Stemming is faster than lemmatization. You can uncomment the code and see how results change. Note: Do not apply both. Remember that stemming and lemmatization are normalization techniques, and it is recommended to use only one approach to normalize. Let your project requirements guide your decision, or you can always do experiments and see which one gives better results. In this case, stemming and lemmatizing yield almost the same accuracy.
In this guide, you'll implement vectorization using tf-idf. There are other techniques as well, such as Bag of Words and N-grams.
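For intuition about what a tf-idf weight is, the sketch below computes the weights by hand in plain Python. It assumes the formula scikit-learn documents for TfidfVectorizer with its default smoothed idf and L2 normalization, combined with sublinear term frequency (1 + ln(tf)). Tokenization is simplified to whitespace splitting, so the numbers will not necessarily match the library on real text. The function name and the three toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Sublinear tf-idf with smoothed idf and L2 normalization, one dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: terms that appear in fewer documents get a larger weight
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Sublinear tf dampens the effect of repeated words
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = ["good good movie", "bad movie", "good day"]
for vec in tfidf_vectors(docs):
    print({t: round(w, 3) for t, w in vec.items()})
```

In the second document, "bad" appears in only one document while "movie" appears in two, so "bad" ends up with the larger weight.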
    def get_feature_vector(train_fit):
        # Fit a tf-idf vectorizer on the given corpus
        vector = TfidfVectorizer(sublinear_tf=True)
        vector.fit(train_fit)
        return vector
Important note: I am using the whole dataset as the corpus to fit the tf-idf vectorizer. The same fitted vectorizer must be used to transform both the training and the test data, so that both share one vocabulary and feature layout.
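One way to guarantee that the same fitted vectorizer is reused is to bundle it with the classifier in a scikit-learn pipeline. This is an alternative arrangement, not the approach this guide takes, and the four training sentences below are made up purely so the sketch runs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy examples; 0 = negative, 4 = positive, as in the guide's dataset.
train_texts = ["love this phone", "great great day", "hate this phone", "awful day"]
train_labels = [4, 4, 0, 0]

# The pipeline fits the vectorizer and the classifier together
model = make_pipeline(TfidfVectorizer(sublinear_tf=True), MultinomialNB())
model.fit(train_texts, train_labels)

# Unseen text is transformed with the vocabulary learned during fit;
# words that were never seen (such as "weather") are ignored.
new_texts = ["great phone", "awful weather"]
predictions = model.predict(new_texts)
print(predictions)

# The feature layout of new data matches the fitted vocabulary
vectorizer = model.named_steps["tfidfvectorizer"]
print(vectorizer.transform(new_texts).shape[1] == len(vectorizer.vocabulary_))
```

Because the vectorizer lives inside the pipeline, calling predict on new text cannot accidentally use a differently fitted vocabulary.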
The target column contains the integer values 0, 2, and 4, but users do not usually want their results in this form. A small function converts the integer results into labels that are easier to read.
    def int_to_string(sentiment):
        if sentiment == 0:
            return "Negative"
        elif sentiment == 2:
            return "Neutral"
        else:
            return "Positive"
In this section, we will call all the functions that you have created. You'll see Naive Bayes and Logistic Regression algorithms for predictions. These two algorithms are quite popular in NLP, although you can try out other options too.
    # Load dataset
    dataset = load_dataset("data/training.csv", ['target', 't_id', 'created_at', 'query', 'user', 'text'])
    # Remove unwanted columns from dataset
    dataset = remove_unwanted_cols(dataset, ['t_id', 'created_at', 'query', 'user'])
    # Preprocess data
    dataset.text = dataset['text'].apply(preprocess_tweet_text)

    # Split dataset into Train, Test
    # Same tf vector will be used for Testing sentiments on unseen trending data
    tf_vector = get_feature_vector(np.array(dataset.iloc[:, 1]).ravel())
    X = tf_vector.transform(np.array(dataset.iloc[:, 1]).ravel())
    y = np.array(dataset.iloc[:, 0]).ravel()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

    # Training Naive Bayes model
    NB_model = MultinomialNB()
    NB_model.fit(X_train, y_train)
    y_predict_nb = NB_model.predict(X_test)
    print(accuracy_score(y_test, y_predict_nb))

    # Training Logistic Regression model
    LR_model = LogisticRegression(solver='lbfgs')
    LR_model.fit(X_train, y_train)
    y_predict_lr = LR_model.predict(X_test)
    print(accuracy_score(y_test, y_predict_lr))
Naive Bayes gives nearly 76% accuracy, and Logistic Regression gives nearly 79%. These figures were recorded without stemming or lemmatization. With better techniques, you might get better accuracy.
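Accuracy is simply the share of predictions that match the true labels, and a single figure can hide weak performance on one class. The sketch below computes accuracy and per-class recall by hand on made-up labels; the helper names are hypothetical, and the numbers are unrelated to the results reported above.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true class, the share of its examples that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up labels for illustration only (0 = negative, 4 = positive).
y_true = [0, 0, 0, 4, 4, 4, 4, 4]
y_pred = [0, 4, 4, 4, 4, 4, 4, 4]

print(accuracy(y_true, y_pred))          # 0.75
print(per_class_recall(y_true, y_pred))  # class 0 is recalled only a third of the time
```

Here the overall accuracy is 75%, yet only one of the three negative examples is classified correctly, which is worth checking for before trusting a headline number.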
This step is completely optional and will only apply if you have read and implemented the guide Building a Twitter Bot with Python.
    test_file_name = "trending_tweets/08-04-2020-1586291553-tweets.csv"
    test_ds = load_dataset(test_file_name, ["t_id", "hashtag", "created_at", "user", "text"])
    test_ds = remove_unwanted_cols(test_ds, ["t_id", "created_at", "user"])

    # Creating text feature
    test_ds.text = test_ds["text"].apply(preprocess_tweet_text)
    test_feature = tf_vector.transform(np.array(test_ds.iloc[:, 1]).ravel())

    # Using Logistic Regression model for prediction
    test_prediction_lr = LR_model.predict(test_feature)

    # Aggregate per hashtag: max() keeps the highest predicted label for each hashtag
    test_result_ds = pd.DataFrame({'hashtag': test_ds.hashtag, 'prediction': test_prediction_lr})
    test_result = test_result_ds.groupby(['hashtag']).max().reset_index()
    test_result.columns = ['hashtag', 'predictions']
    test_result.predictions = test_result['predictions'].apply(int_to_string)

    print(test_result)
Replace the file name with your own in the test_file_name variable.
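The prediction script aggregates each hashtag with max(), which reports the most positive label that any of its tweets received. If you would rather report the most common sentiment per hashtag, a majority vote is one possible alternative. The sketch below uses only the standard library and made-up predictions; majority_sentiment is a hypothetical helper, not part of the original code.

```python
from collections import Counter, defaultdict


def majority_sentiment(hashtags, predictions):
    """Return the most frequent predicted label for each hashtag."""
    by_tag = defaultdict(list)
    for tag, pred in zip(hashtags, predictions):
        by_tag[tag].append(pred)
    # most_common(1) yields the label with the highest count;
    # ties resolve to whichever tied label was seen first.
    return {tag: Counter(preds).most_common(1)[0][0] for tag, preds in by_tag.items()}


# Made-up predictions for two hypothetical hashtags (0 = negative, 4 = positive).
hashtags = ["#a", "#a", "#a", "#b", "#b"]
predictions = [0, 0, 4, 4, 4]

print(majority_sentiment(hashtags, predictions))  # {'#a': 0, '#b': 4}
```

With max(), the hashtag "#a" in this toy data would be reported as positive because one of its three tweets was, while the majority vote reports it as negative.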
I hope you enjoyed reading this guide. Sentiment analysis is a popular project that almost every data scientist will do at some point. It can solve a lot of problems, depending on how you want to use it.
I highly recommend using different vectorizing techniques and applying feature extraction and feature selection to the dataset. Try to implement more machine learning models and you might be able to get accuracy over 85%.
If you have any questions, feel free to reach out to me at CodeAlphabet.