Deepika Singh

Named Entity Recognition (NER)

  • Jul 10, 2019
  • 9 Min read
  • 4,496 Views

Introduction

In this guide, you will learn about an advanced Natural Language Processing technique called Named Entity Recognition, or 'NER'.

NER is an NLP task used to identify important named entities in text, such as people, places, organizations, and dates. It can be used alone or alongside topic identification, and it adds significant semantic knowledge to the content, helping us understand the subject of a given text.

Let us start with loading the required libraries and modules.

Loading the Required Libraries and Modules

import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from collections import Counter

# Download these resources if using the modules for the first time
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('words')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
python

We will be using the following text for this guide:

textexample = "Avengers: Endgame is a 2019 American superhero film based on the Marvel Comics superhero team the Avengers, produced by Marvel Studios and distributed by Walt Disney Studios Motion Pictures. The movie features an ensemble cast including Robert Downey Jr., Chris Evans, Mark Ruffalo, Chris Hemsworth, and others. (Source: wikipedia)."
print(textexample)
python

Output:

Avengers: Endgame is a 2019 American superhero film based on the Marvel Comics superhero team the Avengers, produced by Marvel Studios and distributed by Walt Disney Studios Motion Pictures. The movie features an ensemble cast including Robert Downey Jr., Chris Evans, Mark Ruffalo, Chris Hemsworth, and others. (Source: wikipedia).

Word Tokenization

The first step is to tokenize the text into sentences, which is done in the first line of code below. The second line performs word tokenization on those sentences, and the third line prints the result.

sentences = nltk.sent_tokenize(textexample)
tokenized_sentence = [nltk.word_tokenize(sent) for sent in sentences]
tokenized_sentence
python

Output:

    [['Avengers',
      ':',
      'Endgame',
      'is',
      'a',
      '2019',
      'American',
      'superhero',
      'film',
      'based',
      'on',
      'the',
      'Marvel',
      'Comics',
      'superhero',
      'team',
      'the',
      'Avengers',
      ',',
      'produced',
      'by',
      'Marvel',
      'Studios',
      'and',
      'distributed',
      'by',
      'Walt',
      'Disney',
      'Studios',
      'Motion',
      'Pictures',
      '.'],
     ['The',
      'movie',
      'features',
      'an',
      'ensemble',
      'cast',
      'including',
      'Robert',
      'Downey',
      'Jr.',
      ',',
      'Chris',
      'Evans',
      ',',
      'Mark',
      'Ruffalo',
      ',',
      'Chris',
      'Hemsworth',
      ',',
      'and',
      'others',
      '.'],
     ['(', 'Source', ':', 'wikipedia', ')', '.']]

Parts of Speech (POS) Tagging

Parts-of-speech tagging, also called grammatical tagging, is the process of marking up a word in a text as corresponding to a particular part of speech, based on both its definition and its context. The line of code below takes the tokenized text and passes it to the 'nltk.pos_tag' function to create its POS tagging.

pos_tagging_sentences = [nltk.pos_tag(sent) for sent in tokenized_sentence]
python

Let us combine these two steps into a function and analyze the output. The function below tokenizes the text and performs POS tagging; we then apply it to our text and print the output.

def preprocess(text):
    text = nltk.word_tokenize(text)
    text = nltk.pos_tag(text)
    return text

processed_text = preprocess(textexample)
processed_text
python

Output:

    [('Avengers', 'NNS'),
     (':', ':'),
     ('Endgame', 'NN'),
     ('is', 'VBZ'),
     ('a', 'DT'),
     ('2019', 'JJ'),
     ('American', 'JJ'),
     ('superhero', 'NN'),
     ('film', 'NN'),
     ('based', 'VBN'),
     ('on', 'IN'),
     ('the', 'DT'),
     ('Marvel', 'NNP'),
     ('Comics', 'NNP'),
     ('superhero', 'NN'),
     ('team', 'NN'),
     ('the', 'DT'),
     ('Avengers', 'NNPS'),
     (',', ','),
     ('produced', 'VBN'),
     ('by', 'IN'),
     ('Marvel', 'NNP'),
     ('Studios', 'NNP'),
     ('and', 'CC'),
     ('distributed', 'VBN'),
     ('by', 'IN'),
     ('Walt', 'NNP'),
     ('Disney', 'NNP'),
     ('Studios', 'NNP'),
     ('Motion', 'NNP'),
     ('Pictures', 'NNP'),
     ('.', '.'),
     ('The', 'DT'),
     ('movie', 'NN'),
     ('features', 'VBZ'),
     ('an', 'DT'),
     ('ensemble', 'JJ'),
     ('cast', 'NN'),
     ('including', 'VBG'),
     ('Robert', 'NNP'),
     ('Downey', 'NNP'),
     ('Jr.', 'NNP'),
     (',', ','),
     ('Chris', 'NNP'),
     ('Evans', 'NNP'),
     (',', ','),
     ('Mark', 'NNP'),
     ('Ruffalo', 'NNP'),
     (',', ','),
     ('Chris', 'NNP'),
     ('Hemsworth', 'NNP'),
     (',', ','),
     ('and', 'CC'),
     ('others', 'NNS'),
     ('.', '.'),
     ('(', '('),
     ('Source', 'NN'),
     (':', ':'),
     ('wikipedia', 'NN'),
     (')', ')'),
     ('.', '.')]

The output above shows that every token has been tagged with its part of speech. Some of the common abbreviations are explained below:

  • DT: determiner
  • IN: preposition/subordinating conjunction
  • JJ: adjective ('big')
  • JJR: adjective, comparative ('bigger')
  • JJS: adjective, superlative ('biggest')
  • LS: list marker
  • NN: noun, singular ('desk')
  • NNS: noun, plural ('desks')
  • NNP: proper noun, singular ('Harrison')
  • NNPS: proper noun, plural ('Americans')
  • PRP: personal pronoun ('I', 'he', 'she')
  • RB: adverb ('very', 'silently')
  • UH: interjection
  • VB: verb, base form ('take')
  • VBD: verb, past tense ('took')
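To make tagged output easier to read, the abbreviations above can be expanded programmatically. Below is a minimal sketch using a hand-built lookup table; the `TAG_GLOSSES` dictionary and `explain_tags` helper are our own illustrations, not part of NLTK, and the table covers only a few of the tags listed above.

```python
# Glosses for some common Penn Treebank tags (hand-built, not an NLTK API)
TAG_GLOSSES = {
    'DT': 'determiner',
    'IN': 'preposition/subordinating conjunction',
    'JJ': 'adjective',
    'NN': 'noun, singular',
    'NNS': 'noun, plural',
    'NNP': 'proper noun, singular',
    'NNPS': 'proper noun, plural',
    'VBZ': 'verb, 3rd person singular present',
    'VBN': 'verb, past participle',
    'CC': 'coordinating conjunction',
}

def explain_tags(tagged_tokens):
    """Pair each (token, tag) tuple with a human-readable gloss."""
    return [(token, tag, TAG_GLOSSES.get(tag, 'unknown'))
            for token, tag in tagged_tokens]

sample = [('Endgame', 'NN'), ('is', 'VBZ'), ('a', 'DT')]
for token, tag, gloss in explain_tags(sample):
    print(f'{token:10} {tag:5} {gloss}')
```

The same helper can be applied to `processed_text` from the previous step to annotate the full tagged sentence.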

Chunking

Once we have completed the parts-of-speech tagging, we will perform chunking. In simple terms, chunking adds structure to the sentence on top of the tagging, grouping words into units called 'chunks'.
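To see what a chunk is, consider a simple noun-phrase chunker defined with a regular-expression grammar over POS tags. The grammar below is a common illustrative pattern of our own choosing; note that `ne_chunk`, used next, relies on a trained classifier rather than a hand-written grammar like this.

```python
import nltk

# Grammar: an NP chunk is an optional determiner, any number of
# adjectives, and a noun. (Illustrative only.)
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)

tagged = [('the', 'DT'), ('American', 'JJ'), ('superhero', 'NN'),
          ('film', 'NN'), ('succeeded', 'VBD')]
tree = chunk_parser.parse(tagged)
print(tree)
```

Here 'the American superhero' and 'film' are grouped into NP chunks, while the verb is left outside any chunk.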

We will perform chunking on the processed text, which is done in the first line of code below. The remaining lines filter the chunked output; in our example, we will only look at nouns for the NER tagging.

res_chunk = ne_chunk(processed_text)

for x in str(res_chunk).split('\n'):
    if '/NN' in x:
        print(x)
python

Output:

      Avengers/NNS
      Endgame/NN
      superhero/NN
      film/NN
      (ORGANIZATION Marvel/NNP Comics/NNP)
      superhero/NN
      team/NN
      (ORGANIZATION Avengers/NNPS)
      (PERSON Marvel/NNP Studios/NNP)
      (PERSON Walt/NNP Disney/NNP Studios/NNP)
      Motion/NNP
      Pictures/NNP
      movie/NN
      cast/NN
      (PERSON Robert/NNP Downey/NNP Jr./NNP)
      (PERSON Chris/NNP Evans/NNP)
      (PERSON Mark/NNP Ruffalo/NNP)
      (PERSON Chris/NNP Hemsworth/NNP)
      others/NNS
      (PERSON Source/NN)
      wikipedia/NN

Let us explore the above output. We observe that the word tokens 'Endgame', 'film', and 'Source' are tagged as singular noun 'NN', while tokens like 'Avengers' and 'others' are tagged as plural noun 'NNS'. Also, note that the names of the actors 'Robert', 'Evans', etc., have been tagged as proper noun 'NNP'. Finally, observe that 'Marvel Studios' and 'Walt Disney Studios' were labeled PERSON rather than ORGANIZATION, a reminder that the off-the-shelf chunker is not perfect.
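Rather than string-matching on the printed tree, the chunk tree can also be walked directly to collect the entities and tally their types with the `Counter` we imported earlier. The `extract_entities` helper below is our own sketch, and the small hand-built `demo` tree mirrors part of the output above so the example runs without the NLTK model downloads; the same function can be applied to `res_chunk`.

```python
from collections import Counter
from nltk import Tree

def extract_entities(chunk_tree):
    """Collect (entity text, entity label) pairs from an ne_chunk-style tree."""
    entities = []
    for subtree in chunk_tree:
        # Named entities appear as subtrees whose label is the entity type;
        # plain (token, tag) tuples are skipped.
        if isinstance(subtree, Tree):
            text = ' '.join(token for token, tag in subtree.leaves())
            entities.append((text, subtree.label()))
    return entities

# Hand-built tree mirroring part of the output above
demo = Tree('S', [
    Tree('ORGANIZATION', [('Marvel', 'NNP'), ('Comics', 'NNP')]),
    ('and', 'CC'),
    Tree('PERSON', [('Chris', 'NNP'), ('Evans', 'NNP')]),
])

entities = extract_entities(demo)
print(entities)   # [('Marvel Comics', 'ORGANIZATION'), ('Chris Evans', 'PERSON')]
print(Counter(label for _, label in entities))
```

Calling `extract_entities(res_chunk)` on the tree from the previous step returns the PERSON and ORGANIZATION entities found in our text without any string matching.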

Conclusion

In this guide, you have learned how to perform Named Entity Recognition using nltk. You learned about the three important stages of Word Tokenization, POS Tagging, and Chunking that are needed to perform NER analysis.

To learn more about Natural Language Processing with Python, please refer to the following guides: