In this guide, you will learn about an advanced Natural Language Processing technique called Named Entity Recognition, or 'NER'.
NER is an NLP task used to identify important named entities in a text, such as people, places, organizations, dates, or other categories. It can be used alone or alongside topic identification, and it adds significant semantic knowledge to the content, helping us understand the subject of a given text.
Let us start by loading the required libraries and modules.
```python
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from collections import Counter

# Download the required resources if using these modules for the first time
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('words')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
```
We will be using the following text for this guide:
```python
textexample = "Avengers: Endgame is a 2019 American superhero film based on the Marvel Comics superhero team the Avengers, produced by Marvel Studios and distributed by Walt Disney Studios Motion Pictures. The movie features an ensemble cast including Robert Downey Jr., Chris Evans, Mark Ruffalo, Chris Hemsworth, and others. (Source: wikipedia)."
print(textexample)
```
```
Avengers: Endgame is a 2019 American superhero film based on the Marvel Comics superhero team the Avengers, produced by Marvel Studios and distributed by Walt Disney Studios Motion Pictures. The movie features an ensemble cast including Robert Downey Jr., Chris Evans, Mark Ruffalo, Chris Hemsworth, and others. (Source: wikipedia).
```
The first step is to tokenize the text into sentences, which is done in the first line of code below. The second line performs word tokenization on the sentences, while the third line displays the tokenized sentences.
```python
sentences = nltk.sent_tokenize(textexample)
tokenized_sentence = [nltk.word_tokenize(sent) for sent in sentences]
tokenized_sentence
```
```
[['Avengers', ':', 'Endgame', 'is', 'a', '2019', 'American', 'superhero',
  'film', 'based', 'on', 'the', 'Marvel', 'Comics', 'superhero', 'team',
  'the', 'Avengers', ',', 'produced', 'by', 'Marvel', 'Studios', 'and',
  'distributed', 'by', 'Walt', 'Disney', 'Studios', 'Motion', 'Pictures', '.'],
 ['The', 'movie', 'features', 'an', 'ensemble', 'cast', 'including',
  'Robert', 'Downey', 'Jr.', ',', 'Chris', 'Evans', ',', 'Mark', 'Ruffalo',
  ',', 'Chris', 'Hemsworth', ',', 'and', 'others', '.'],
 ['(', 'Source', ':', 'wikipedia', ')', '.']]
```
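As an aside, the Counter class we imported earlier can be used to inspect token frequencies once the text has been tokenized. Below is a minimal sketch on a hardcoded copy of the first tokenized sentence above:

```python
from collections import Counter

# Tokens from the first sentence of the example text (copied from the output above)
tokens = ['Avengers', ':', 'Endgame', 'is', 'a', '2019', 'American',
          'superhero', 'film', 'based', 'on', 'the', 'Marvel', 'Comics',
          'superhero', 'team', 'the', 'Avengers', ',', 'produced', 'by',
          'Marvel', 'Studios', 'and', 'distributed', 'by', 'Walt', 'Disney',
          'Studios', 'Motion', 'Pictures', '.']

# Count how often each token appears
counts = Counter(tokens)
print(counts.most_common(3))
```

In practice you would pass a tokenized sentence straight from the previous step; the hardcoded list is only for illustration.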
Part-of-speech (POS) tagging, also called grammatical tagging, is the process of marking up a word in a text as corresponding to a particular part of speech, based on both its definition and its context. The line of code below takes the tokenized sentences and passes them to the 'nltk.pos_tag' function to generate the POS tags.
```python
pos_tagging_sentences = [nltk.pos_tag(sent) for sent in tokenized_sentence]
```
Let us combine these two steps into a function and analyze the output. The first four lines of code below create a function that tokenizes the text and performs POS tagging. The next line applies the function to our text, while the last line displays the output.
```python
def preprocess(text):
    text = nltk.word_tokenize(text)
    text = nltk.pos_tag(text)
    return text

processed_text = preprocess(textexample)
processed_text
```
```
[('Avengers', 'NNS'), (':', ':'), ('Endgame', 'NN'), ('is', 'VBZ'),
 ('a', 'DT'), ('2019', 'JJ'), ('American', 'JJ'), ('superhero', 'NN'),
 ('film', 'NN'), ('based', 'VBN'), ('on', 'IN'), ('the', 'DT'),
 ('Marvel', 'NNP'), ('Comics', 'NNP'), ('superhero', 'NN'), ('team', 'NN'),
 ('the', 'DT'), ('Avengers', 'NNPS'), (',', ','), ('produced', 'VBN'),
 ('by', 'IN'), ('Marvel', 'NNP'), ('Studios', 'NNP'), ('and', 'CC'),
 ('distributed', 'VBN'), ('by', 'IN'), ('Walt', 'NNP'), ('Disney', 'NNP'),
 ('Studios', 'NNP'), ('Motion', 'NNP'), ('Pictures', 'NNP'), ('.', '.'),
 ('The', 'DT'), ('movie', 'NN'), ('features', 'VBZ'), ('an', 'DT'),
 ('ensemble', 'JJ'), ('cast', 'NN'), ('including', 'VBG'),
 ('Robert', 'NNP'), ('Downey', 'NNP'), ('Jr.', 'NNP'), (',', ','),
 ('Chris', 'NNP'), ('Evans', 'NNP'), (',', ','), ('Mark', 'NNP'),
 ('Ruffalo', 'NNP'), (',', ','), ('Chris', 'NNP'), ('Hemsworth', 'NNP'),
 (',', ','), ('and', 'CC'), ('others', 'NNS'), ('.', '.'), ('(', '('),
 ('Source', 'NN'), (':', ':'), ('wikipedia', 'NN'), (')', ')'), ('.', '.')]
```
The output above shows that every token has been tagged with its part of speech. Some of the common tag abbreviations (from the Penn Treebank tag set, which nltk.pos_tag uses by default) are explained below:

- NN: noun, singular
- NNS: noun, plural
- NNP: proper noun, singular
- NNPS: proper noun, plural
- DT: determiner
- JJ: adjective
- VBZ: verb, third-person singular present
- VBN: verb, past participle
- VBG: verb, gerund or present participle
- IN: preposition or subordinating conjunction
- CC: coordinating conjunction
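Once the tags are available, they can be used directly, for example to pull out the proper nouns from the tagged pairs. Here is a minimal sketch using a hardcoded slice of the tagged output above:

```python
# A slice of the POS-tagged output shown above
tagged = [('Robert', 'NNP'), ('Downey', 'NNP'), ('Jr.', 'NNP'), (',', ','),
          ('Chris', 'NNP'), ('Evans', 'NNP'), (',', ','),
          ('and', 'CC'), ('others', 'NNS')]

# Keep only proper nouns: tags starting with 'NNP' (covers NNP and NNPS)
proper_nouns = [word for word, tag in tagged if tag.startswith('NNP')]
print(proper_nouns)
# → ['Robert', 'Downey', 'Jr.', 'Chris', 'Evans']
```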
Once we have completed the part-of-speech tagging, we will perform chunking. In simple terms, chunking adds more structure to the sentence on top of the tagging: the output groups words into units called 'chunks'.
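Before using NLTK's built-in named-entity chunker, it may help to see chunking in isolation. The sketch below uses nltk.RegexpParser with an illustrative noun-phrase grammar (the grammar rule here is our own assumption, not something from this guide) on a hand-tagged fragment:

```python
import nltk

# An illustrative grammar: a noun phrase (NP) is an optional determiner,
# any number of adjectives, then one or more nouns
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

# A hand-tagged fragment of the example sentence
tagged = [('The', 'DT'), ('movie', 'NN'), ('features', 'VBZ'),
          ('an', 'DT'), ('ensemble', 'JJ'), ('cast', 'NN')]

tree = chunker.parse(tagged)

# Collect the words inside each NP chunk
noun_phrases = [' '.join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees()
                if subtree.label() == 'NP']
print(noun_phrases)
# → ['The movie', 'an ensemble cast']
```

The named-entity chunker used next works on the same principle, but its chunks carry entity labels such as PERSON and ORGANIZATION instead of NP.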
We will now apply chunking to the processed text, which is done in the first line of code below. The remaining lines of code loop over the chunked output, and in our example, we will only look at nouns for the NER tagging.
```python
res_chunk = ne_chunk(processed_text)

for x in str(res_chunk).split('\n'):
    if '/NN' in x:
        print(x)
```
```
Avengers/NNS
Endgame/NN
superhero/NN
film/NN
(ORGANIZATION Marvel/NNP Comics/NNP)
superhero/NN
team/NN
(ORGANIZATION Avengers/NNPS)
(PERSON Marvel/NNP Studios/NNP)
(PERSON Walt/NNP Disney/NNP Studios/NNP)
Motion/NNP
Pictures/NNP
movie/NN
cast/NN
(PERSON Robert/NNP Downey/NNP Jr./NNP)
(PERSON Chris/NNP Evans/NNP)
(PERSON Mark/NNP Ruffalo/NNP)
(PERSON Chris/NNP Hemsworth/NNP)
others/NNS
(PERSON Source/NN)
wikipedia/NN
```
Let us explore the above output. We observe that the word tokens 'Endgame', 'film', and 'Source' are tagged as singular noun 'NN', while tokens like 'Avengers' and 'others' are tagged as plural noun 'NNS'. Also, note that the names of the actors, such as 'Robert' and 'Evans', have been tagged as proper noun 'NNP'. Finally, the chunker has grouped multi-word entities and assigned them labels such as ORGANIZATION and PERSON, although the labels are not perfect: 'Marvel Studios' and 'Walt Disney Studios', for example, are tagged as PERSON rather than ORGANIZATION.
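Rather than string-matching on the printed tree, the chunked result can also be traversed programmatically to collect (entity, label) pairs. Below is a minimal sketch; to keep it self-contained, it uses a small hand-built nltk.Tree standing in for part of the res_chunk output above, but the same loop works on res_chunk directly:

```python
from nltk import Tree

# A hand-built stand-in for part of the chunked tree shown above
chunked = Tree('S', [
    Tree('PERSON', [('Chris', 'NNP'), ('Evans', 'NNP')]),
    (',', ','),
    Tree('ORGANIZATION', [('Marvel', 'NNP'), ('Comics', 'NNP')]),
])

# Named entities appear as labeled subtrees; plain tokens appear as tuples
entities = [(' '.join(word for word, tag in subtree.leaves()), subtree.label())
            for subtree in chunked
            if isinstance(subtree, Tree)]
print(entities)
# → [('Chris Evans', 'PERSON'), ('Marvel Comics', 'ORGANIZATION')]
```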
In this guide, you have learned how to perform Named Entity Recognition using nltk. You learned about the three important stages needed for NER analysis: word tokenization, POS tagging, and chunking.
To learn more about Natural Language Processing with Python, please refer to the following guides: