Natural Language Processing (or NLP) is the science of dealing with human language or text data. One of the NLP applications is Topic Identification, which is a technique used to discover topics across text documents.
In this guide, we will learn about the fundamentals of topic identification and modeling. Using the bag-of-words approach and simple NLP models, we will learn how to identify topics from texts.
We will start by importing the libraries we will be using in this guide.
    import nltk
    from nltk.tokenize import word_tokenize
    from collections import Counter
    nltk.download('punkt')  # tokenizer models used by word_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)
    nltk.download('wordnet')  # download if using this module for the first time

    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import stopwords
    nltk.download('stopwords')  # download if using this module for the first time

    # For Gensim
    import gensim
    import string
    from gensim import corpora
    from gensim.corpora.dictionary import Dictionary
Bag-of-words is a simple method for identifying topics in a document. It works on the assumption that the higher the frequency of a term, the higher its importance. We will see how to implement this using the text example given below:
    text1 = "Avengers: Infinity War was a 2018 American superhero film based on the Marvel Comics superhero team the Avengers. It is the 19th film in the Marvel Cinematic Universe (MCU). The running time of the movie was 149 minutes and the box office collection was around 2 billion dollars. (Source: Wikipedia)"
    print(text1)

Output:

    Avengers: Infinity War was a 2018 American superhero film based on the Marvel Comics superhero team the Avengers. It is the 19th film in the Marvel Cinematic Universe (MCU). The running time of the movie was 149 minutes and the box office collection was around 2 billion dollars. (Source: Wikipedia)
The text is about the Avengers movie 'Infinity War'. To begin with, we will split it into tokens. The first line of code below tokenizes the text, the second line converts the tokens to lowercase, and the third line prints the output.
    tokens = word_tokenize(text1)
    lowercase_tokens = [t.lower() for t in tokens]
    print(lowercase_tokens)

Output:

    ['avengers', ':', 'infinity', 'war', 'was', 'a', '2018', 'american', 'superhero', 'film', 'based', 'on', 'the', 'marvel', 'comics', 'superhero', 'team', 'the', 'avengers', '.', 'it', 'is', 'the', '19th', 'film', 'in', 'the', 'marvel', 'cinematic', 'universe', '(', 'mcu', ')', '.', 'the', 'running', 'time', 'of', 'the', 'movie', 'was', '149', 'minutes', 'and', 'the', 'box', 'office', 'collection', 'was', 'around', '2', 'billion', 'dollars', '.', '(', 'source', ':', 'wikipedia', ')']
The list of tokens generated above can be passed as an initialization argument for the 'Counter' class, which has already been imported at the beginning from the library module 'collections'.
The first line of code below creates a Counter object, 'bagofwords_1', that lets us see each token and its frequency. The second line prints the 10 most common tokens along with their frequencies.
    bagofwords_1 = Counter(lowercase_tokens)
    print(bagofwords_1.most_common(10))

Output:

    [('the', 7), ('was', 3), ('.', 3), ('avengers', 2), (':', 2), ('superhero', 2), ('film', 2), ('marvel', 2), ('(', 2), (')', 2)]
The output generated above is interesting but not useful for topic identification. This is because tokens like 'the' and 'was' are common words, and punctuation marks carry no meaning, so neither helps much in identifying the topics. To overcome this, we will do text preprocessing.
The first line of code below creates a list called 'alphabets' by looping over 'lowercase_tokens' and keeping only the purely alphabetic tokens. The second and third lines remove the English stopwords, and the fourth line prints the new list called 'stopwords_removed'.
    alphabets = [t for t in lowercase_tokens if t.isalpha()]

    words = stopwords.words("english")
    stopwords_removed = [t for t in alphabets if t not in words]

    print(stopwords_removed)

Output:

    ['avengers', 'infinity', 'war', 'american', 'superhero', 'film', 'based', 'marvel', 'comics', 'superhero', 'team', 'avengers', 'film', 'marvel', 'cinematic', 'universe', 'mcu', 'running', 'time', 'movie', 'minutes', 'box', 'office', 'collection', 'around', 'billion', 'dollars', 'source', 'wikipedia']
We have completed the initial text preprocessing steps, but more can still be done. One such important technique is Word Lemmatization, which is the process of reducing words to their base or dictionary form (the lemma), for example 'avengers' to 'avenger'. This is done in the code below.
The first line of code instantiates the WordNetLemmatizer. The second line uses the '.lemmatize()' method to create a new list called lem_tokens, while the third line calls in the Counter class and creates a new Counter called bag_words. Finally, the fourth line prints the six most common tokens.
    lemmatizer = WordNetLemmatizer()

    lem_tokens = [lemmatizer.lemmatize(t) for t in stopwords_removed]

    bag_words = Counter(lem_tokens)
    print(bag_words.most_common(6))

Output:

    [('avenger', 2), ('superhero', 2), ('film', 2), ('marvel', 2), ('infinity', 1), ('war', 1)]
The above output is far more useful. We don't have stopwords like 'the' and 'was', and by looking at the new set of common words, we can easily identify that the topic of our text is Avengers.
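The preprocessing and counting steps above can be gathered into one small helper. The following is only a minimal sketch under stated assumptions: it uses a simple regular-expression tokenizer instead of 'word_tokenize', a tiny hand-written stopword set (the hypothetical SMALL_STOPWORDS) instead of the NLTK list, and it skips lemmatization, so that it runs with the standard library alone. The function name 'top_terms' is also our own.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword set for illustration only;
# the guide itself uses nltk's stopwords.words("english").
SMALL_STOPWORDS = {"the", "was", "a", "on", "it", "is", "in", "of", "and"}


def top_terms(text, stop_words=SMALL_STOPWORDS, n=3):
    """Return the n most frequent alphabetic, non-stopword tokens."""
    # Simple regex tokenizer (an assumption; not nltk's word_tokenize).
    tokens = re.findall(r"\w+", text.lower())
    # Keep purely alphabetic tokens, mirroring the isalpha() filter above.
    alphabetic = [t for t in tokens if t.isalpha()]
    kept = [t for t in alphabetic if t not in stop_words]
    return Counter(kept).most_common(n)


sample = "The film was a superhero film. The superhero team was in the film."
print(top_terms(sample))  # [('film', 3), ('superhero', 2), ('team', 1)]
```

Swapping in the NLTK tokenizer, stopword list, and lemmatizer from the earlier steps should give results closer to the ones shown in this guide.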
We have seen how bag-of-words can be used after preprocessing to identify topics in a corpus. We will now learn about another powerful NLP library called 'gensim' for topic modeling.
Gensim is an open source NLP library which can be used for creating and querying a corpus. It works by representing words and documents as vectors, which are then used to perform topic modeling.
Word vectors are multi-dimensional mathematical representations of words created using deep learning methods. They give us insight into relationships between terms in a corpus. For example, the distance between the two words 'India' and 'New Delhi' might be similar to the distance between 'China' and 'Beijing', as these represent the 'Country-Capital' vectors.
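To make the 'Country-Capital' idea concrete, here is a small sketch with made-up three-dimensional vectors. The numbers are purely hypothetical and chosen by hand; real word vectors are learned from data and usually have hundreds of dimensions. The sketch compares the offset from a country to its capital using cosine similarity.

```python
import math

# Made-up vectors purely for illustration; not produced by any real model.
vectors = {
    "india": [0.9, 0.1, 0.3],
    "new_delhi": [0.9, 0.8, 0.3],
    "china": [0.2, 0.1, 0.7],
    "beijing": [0.25, 0.8, 0.65],
}


def offset(source, target):
    """Vector pointing from the source word to the target word."""
    return [t - s for s, t in zip(vectors[source], vectors[target])]


def cosine(u, v):
    """Cosine similarity: 1 means same direction, 0 means unrelated directions."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


india_to_capital = offset("india", "new_delhi")
china_to_capital = offset("china", "beijing")
india_to_china = offset("india", "china")

# The two country-to-capital offsets point in nearly the same direction.
print(cosine(india_to_capital, china_to_capital))
# The country-to-country offset points somewhere else entirely.
print(cosine(india_to_capital, india_to_china))
```

With these toy numbers, the first similarity is close to 1 and the second is close to 0, which is the pattern the 'Country-Capital' example describes.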
To get started, we have created nine sample documents taken from the Pluralsight website. These are represented as sample1 to sample9 in the lines of code below. Finally, we have created a collection of these documents in the last line of code.
    sample1 = "Our board of directors boasts 11 seasoned technology and business leaders from Adobe, GSK, HGGC and more."
    sample2 = "Our executives lead by example and guide us to accomplish great things every day."
    sample3 = "Working at Pluralsight means being surrounded by smart, passionate people who inspire us to do our best work."
    sample4 = "A leadership team with vision."
    sample5 = "Courses on cloud, microservices, machine learning, security, Agile and more."
    sample6 = "Interactive courses and projects."
    sample7 = "Personalized course recommendations from Iris."
    sample8 = "We’re excited to announce that Pluralsight has ranked #9 on the Great Place to Work 2018, Best Medium Workplaces list!"
    sample9 = "Few of the job opportunities include Implementation Consultant - Analytics, Manager - assessment production, Chief Information Officer, Director of Communications."

    # compile documents
    compileddoc = [sample1, sample2, sample3, sample4, sample5, sample6, sample7, sample8, sample9]
Let us examine the first document by printing the first element of the collection, compileddoc[0], which gives the following output.

    Our board of directors boasts 11 seasoned technology and business leaders from Adobe, GSK, HGGC and more.
In subsequent sections of this guide, we will try to perform topic modeling on the corpus 'compileddoc'. As always, the first step is text preprocessing.
The first three lines of code below set the basic framework for cleaning the document. In the fourth to eighth lines, we define a function for cleaning the document. Finally, in the last line of code, we use the function to create the cleaned corpus called 'final_doc'.
    stop_words = set(stopwords.words('english'))  # named stop_words so the imported stopwords module is not overwritten
    exclude = set(string.punctuation)
    lemma = WordNetLemmatizer()

    def clean(document):
        stopwordremoval = " ".join([i for i in document.lower().split() if i not in stop_words])
        punctuationremoval = ''.join(ch for ch in stopwordremoval if ch not in exclude)
        normalized = " ".join(lemma.lemmatize(word) for word in punctuationremoval.split())
        return normalized

    final_doc = [clean(document).split() for document in compileddoc]
Let us now look at the first document - pre and post text cleaning - with the following code.
    print("Before text-cleaning:", compileddoc[0])

    print("After text-cleaning:", final_doc[0])

Output:

    Before text-cleaning: Our board of directors boasts 11 seasoned technology and business leaders from Adobe, GSK, HGGC and more.
    After text-cleaning: ['board', 'director', 'boast', '11', 'seasoned', 'technology', 'business', 'leader', 'adobe', 'gsk', 'hggc', 'more']
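Notice that 'more' survives the cleaning even though it is a common word. A likely reason is the order of the steps: stopwords are filtered while the token still reads 'more.' with its trailing period, and the period is stripped only afterwards. The sketch below illustrates the effect of the ordering with a hypothetical mini stopword set (STOP_WORDS) and two helper functions of our own naming; it is not the guide's 'clean' function.

```python
import string

# Hypothetical mini stopword set; the guide uses nltk's English list.
STOP_WORDS = {"and", "from", "our", "of", "more"}
# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def stopwords_then_punctuation(text):
    """Same order as the cleaning step above: filter stopwords, then strip punctuation."""
    kept = [w for w in text.lower().split() if w not in STOP_WORDS]
    return " ".join(kept).translate(PUNCT_TABLE).split()


def punctuation_then_stopwords(text):
    """Alternative order: strip punctuation first so 'more.' matches 'more'."""
    words = text.lower().translate(PUNCT_TABLE).split()
    return [w for w in words if w not in STOP_WORDS]


text = "Leaders from Adobe, GSK, HGGC and more."
print(stopwords_then_punctuation(text))  # ['leaders', 'adobe', 'gsk', 'hggc', 'more']
print(punctuation_then_stopwords(text))  # ['leaders', 'adobe', 'gsk', 'hggc']
```

Stripping punctuation first has its own trade-off: contractions such as "don't" become "dont" and would no longer match a stopword list that contains the apostrophe form.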
We are now ready to carry out topic modeling on the 'final_doc' corpus, using a powerful statistical method called Latent Dirichlet Allocation (LDA). LDA uses a generative approach to find texts that are similar. It is not a classification technique and does not require labels to infer the patterns. Instead, it is an unsupervised method that uses a probabilistic model to identify groups of topics.
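The 'generative approach' can be pictured as a recipe for writing a document: draw a mixture of topics for the document, then, for every word, pick a topic from that mixture and a word from that topic. The sketch below simulates this recipe with the standard library only. The two topics and their word probabilities are made up for illustration, and the function names are our own; training an LDA model works in the opposite direction, inferring topics from observed documents.

```python
import random

# Hypothetical topics: each maps words to probabilities (made up for illustration).
topics = {
    "courses": {"course": 0.5, "cloud": 0.3, "security": 0.2},
    "work": {"work": 0.5, "team": 0.3, "leader": 0.2},
}


def sample_dirichlet(alpha, rng):
    """Draw a probability vector from a Dirichlet distribution via gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha=(0.5, 0.5), seed=0):
    """Generate n_words words following the LDA-style recipe."""
    rng = random.Random(seed)
    names = list(topics)
    # 1. Draw this document's topic mixture.
    mixture = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic according to the mixture ...
        topic = rng.choices(names, weights=mixture)[0]
        # 3. ... then pick a word according to that topic's word distribution.
        vocabulary = topics[topic]
        word = rng.choices(list(vocabulary), weights=list(vocabulary.values()))[0]
        words.append(word)
    return mixture, words


mixture, words = generate_document(8)
print(mixture)
print(words)
```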
The first step is to convert the corpus into a matrix representation, as done in the following code.
The first line of code creates the term dictionary of the corpus, where every unique term is assigned an index. The second line converts the corpus into a Document-Term Matrix using the dictionary prepared above. Finally, with the document-term matrix ready, the third line stores a reference to the LDA model class, which we will call in the next step to build and train the model.
    dictionary = corpora.Dictionary(final_doc)

    DT_matrix = [dictionary.doc2bow(doc) for doc in final_doc]

    Lda_object = gensim.models.ldamodel.LdaModel
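To see what this matrix representation looks like, the sketch below builds a term dictionary and a sparse bag-of-words list by hand. It is a simplified stand-in and not gensim's implementation: the helper names 'build_vocabulary' and 'to_bow' are our own, and the ids gensim assigns may differ. The idea is the same, though: each document becomes a list of (term id, count) pairs, and terms that do not occur in a document are simply left out.

```python
from collections import Counter


def build_vocabulary(docs):
    """Assign an integer id to every unique token, in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Convert a tokenized document to sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


docs = [["course", "cloud", "course"], ["work", "cloud"]]
token2id = build_vocabulary(docs)
matrix = [to_bow(doc, token2id) for doc in docs]

print(token2id)  # {'course': 0, 'cloud': 1, 'work': 2}
print(matrix)    # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```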
Next, we train the LDA model on the document-term matrix. The first line of code below performs this task by calling 'Lda_object' with 'DT_matrix'. We also need to specify the number of topics and the dictionary. Since we have a small corpus of nine documents, we can limit the number of topics to two or three.
In the code below, we set the number of topics to 2. The second line prints the result. Because LDA training is randomized, the exact terms and weights may differ from run to run unless a fixed 'random_state' is passed to the model.
    lda_model_1 = Lda_object(DT_matrix, num_topics=2, id2word=dictionary)

    print(lda_model_1.print_topics(num_topics=2, num_words=5))

Output:

    [(0, '0.042*"course" + 0.031*"more" + 0.022*"agile" + 0.022*"cloud" + 0.022*"microservices"'), (1, '0.026*"work" + 0.025*"great" + 0.025*"best" + 0.022*"director" + 0.021*"u"')]
In the output above, each entry represents a topic, listing its top terms and their weights. The first topic (index 0) seems to be about the 'courses' offered by Pluralsight, while the second topic (index 1) seems to be about 'work'. The token 'u' is most likely an artifact of the lemmatizer reducing 'us' to 'u'.
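Each topic is printed as a single string of weight*"term" pieces joined by plus signs. If the terms and weights are needed as data, the string can be split apart, as in the rough sketch below. The helper 'parse_topic' is our own and simply assumes the format shown in the output above; gensim's LDA model also provides a 'show_topic' method that should return such pairs directly, which is the more robust route when a trained model is at hand.

```python
import re


def parse_topic(topic_string):
    """Split a string like '0.042*"course" + 0.031*"more"' into (term, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(term, float(weight)) for weight, term in pairs]


# First topic string copied from the output above.
topic_0 = '0.042*"course" + 0.031*"more" + 0.022*"agile" + 0.022*"cloud" + 0.022*"microservices"'

for term, weight in parse_topic(topic_0):
    print(term, weight)
```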
We can also change the number of topics and see how it changes the output. In the following code, we have selected three topics.
    lda_model_2 = Lda_object(DT_matrix, num_topics=3, id2word=dictionary)

    print(lda_model_2.print_topics(num_topics=3, num_words=5))

Output:

    [(0, '0.030*"u" + 0.027*"great" + 0.024*"work" + 0.023*"director" + 0.022*"day"'), (1, '0.061*"course" + 0.025*"more" + 0.025*"learning" + 0.025*"cloud" + 0.025*"security"'), (2, '0.033*"best" + 0.029*"work" + 0.022*"smart" + 0.022*"working" + 0.022*"surrounded"')]
The result is almost the same: the second topic (index 1) points to 'courses', while the first and third topics (indices 0 and 2) both resemble 'work'.
In this guide, you have learned about topic identification using the bag-of-words technique. You also got an introduction to LDA using 'gensim', a powerful open source NLP library.
The performance of topic models depends on the terms present in the corpus, represented as a document-term matrix. Since this matrix is sparse, reducing its dimensionality may improve model performance. However, since our corpus was small, we skipped this step, and the results above are reasonable for the purposes of illustration.
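One common way to reduce the dimensionality of the document-term matrix is to drop terms that appear in too few or too many documents. The sketch below shows the idea with the standard library only. The function name and the thresholds 'min_docs' and 'max_fraction' are hypothetical choices for illustration; gensim's Dictionary offers a 'filter_extremes' method with 'no_below' and 'no_above' parameters that serves a similar purpose.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_docs=2, max_fraction=0.5):
    """Keep only tokens that appear in at least min_docs documents
    and in no more than max_fraction of all documents."""
    # Count in how many documents each token occurs (not how often overall).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    limit = max_fraction * len(docs)
    keep = {t for t, df in doc_freq.items() if min_docs <= df <= limit}
    return [[t for t in doc if t in keep] for doc in docs]


docs = [
    ["course", "cloud", "more"],
    ["course", "project", "more"],
    ["work", "team", "more"],
    ["work", "best", "more"],
]
print(filter_by_document_frequency(docs))  # [['course'], ['course'], ['work'], ['work']]
```

Here 'more' is removed because it occurs in every document, and the terms that occur only once are removed because they say little about what documents have in common.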
To learn more about Natural Language Processing, please refer to the following guides: