A classifier assigns data to categories for us: “this email is spam” or “this text is in English,” for example. To get good classifications, we need to train our classifier. We start with some data whose classification we already know; in machine learning, that known classification is called a label. The classifier then uses this data to infer the labels of new data, which is what’s called supervised learning. For our language classifier, we can train it on phrases in different languages and tell it the language of each phrase. To create a spam filter, we can take emails from our mailbox and tell the classifier, for each one, whether it is spam or ham (not spam). In supervised learning, the quality of the training data is probably the most important thing: the better the training data, the better the accuracy we can get from the classifier.
Text Classifier: Naive Bayes
The Naive Bayes classifier is a simple but powerful classifier, and it’s used a lot to classify text. Both examples above, language detection and spam detection, can be done with it. It’s based on Bayes’ theorem, which I won’t go deep into, but it is worth touching on why it’s called naive. The classifier rests on a quite strong assumption, the naive assumption: every word is independent of the others, and the order of the words is not considered at all. This assumption is wrong, especially for text. In the phrase “Michael is looking for John,” the word “for” clearly depends on “is looking,” and changing the order of the words to “John is looking for Michael” gives a different meaning. Yet the Naive Bayes classifier treats the two phrases in exactly the same way, both in training and in classification, because it only checks the frequency of the words and ignores their position. This wrong assumption is what makes Naive Bayes classifiers easy to develop, and despite it, they classify text quite well.
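To make the naive assumption concrete, here is a minimal sketch of the “bag of words” view a Naive Bayes classifier has of a phrase. The helper names bagOfWords and sameBag are my own, chosen for illustration; they don’t come from any library. Once the phrases are reduced to word counts, the two example phrases become indistinguishable.

```javascript
// Count how often each word occurs, ignoring order (a "bag of words").
function bagOfWords(text) {
  const counts = new Map();
  for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

// Two bags are equal when they hold the same words with the same counts.
function sameBag(a, b) {
  return a.size === b.size && [...a].every(([word, n]) => b.get(word) === n);
}

const first = bagOfWords("Michael is looking for John");
const second = bagOfWords("John is looking for Michael");

console.log(first);                  // word -> count, with no positions kept
console.log(sameBag(first, second)); // true: the word order is lost
```

Anything that works only from these counts cannot tell who is looking for whom, which is exactly the information the naive assumption throws away.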
NB Language Classifier
Let’s build a language classifier using Naive Bayes, starting with just two languages, French and Italian. Once you get the hang of how it works, we can easily extend the classifier to other languages. Let’s consider two phrases per language:
Just checking the word frequencies makes it pretty obvious that this is Italian. So, what have we done here? We classified the phrase based on the training phrases, whose languages we already know. By considering the frequency of words, we can infer the language, and this is exactly what the Naive Bayes classifier does. Now, let’s try a different phrase:
Although we’ve started with an Italian greeting, we can see that the language here is mainly French.
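The word-counting reasoning used in the two examples above can be sketched in a few lines of JavaScript. Note that the training phrases and test phrases here are my own illustrative picks, not the ones from this post, and that this is plain hit counting rather than a full Naive Bayes implementation.

```javascript
// Illustrative training phrases (assumptions for this sketch only).
const trainingData = {
  italian: ["ciao come stai oggi", "dove si trova la stazione"],
  french: ["bonjour comment allez vous", "où se trouve la gare"],
};

// Lowercase the text and split it on anything that is not a letter.
function tokenize(text) {
  return text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean);
}

// Build one word-frequency table per language.
function buildFrequencies(data) {
  const tables = {};
  for (const [language, phrases] of Object.entries(data)) {
    const counts = new Map();
    for (const phrase of phrases) {
      for (const word of tokenize(phrase)) {
        counts.set(word, (counts.get(word) || 0) + 1);
      }
    }
    tables[language] = counts;
  }
  return tables;
}

// Score a phrase by summing how often its words were seen in each language.
function countHits(text, tables) {
  const words = tokenize(text);
  const hits = {};
  for (const [language, counts] of Object.entries(tables)) {
    hits[language] = words.reduce((sum, w) => sum + (counts.get(w) || 0), 0);
  }
  return hits;
}

const tables = buildFrequencies(trainingData);
console.log(countHits("come stai", tables));               // { italian: 2, french: 0 }
console.log(countHits("ciao comment allez vous", tables)); // { italian: 1, french: 3 }
```

The second phrase opens with an Italian greeting, but three of its four words were only ever seen in French training phrases, so French wins on frequency.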
I’ve also written a simple language classifier with some training data. The main code is in language_classifier.js, which is simply a wrapper around the classifier library. To try it, download it, install the dependencies with npm install, and run it with npm run classify “the text we want to classify”. It doesn’t save the training, so every time you run it, it trains from scratch. Persistence of the training could be added using redis and the classifier library.

NOTE: I’ve included only 20 phrases per language, so if you want to use it in the real world, you should enrich the training data in training_data/language.txt.

So now let’s see how, with just a few lines of code and the help of a library, we can easily code our own language detection app.
Now that we have a classifier instance, let’s start training it with some French and Italian phrases. I’ve used this site to get random phrases in different languages, but you can also find the training data in the GitHub repo of the project above. You can also add new phrases in different languages by pulling them from articles on newspaper websites like lemonde.fr or ansa.it.
After going through some training phrases — at least 20 to 30 per language to ensure decent accuracy — we’re ready to classify our first phrase. We just need to pass the text we want to classify to the “classify” method:
It returns the label it associates with the phrase based on the training, which should be “italian.” One thing to remember is to use good training data: if you train the classifier with bad training data, you’ll get inaccurate classifications.
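If you’re curious what happens behind a train-then-classify interface like this, here is a minimal, self-contained sketch. TinyNaiveBayes is a hypothetical stand-in I wrote for illustration; it is not the classifier library’s API or its implementation, and the training phrases are again my own. It assumes a multinomial Naive Bayes with add-one smoothing, so that words never seen for a language don’t zero out the score.

```javascript
// Hypothetical stand-in for a Naive Bayes text classifier
// (an illustration only, not the `classifier` library's API).
class TinyNaiveBayes {
  constructor() {
    this.wordCounts = new Map(); // label -> Map(word -> count)
    this.totalWords = new Map(); // label -> number of words seen
    this.docCounts = new Map();  // label -> number of training phrases
    this.vocabulary = new Set(); // every distinct word seen in training
    this.totalDocs = 0;
  }

  static tokenize(text) {
    return text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean);
  }

  // Record the words of one labelled phrase.
  train(text, label) {
    if (!this.wordCounts.has(label)) {
      this.wordCounts.set(label, new Map());
      this.totalWords.set(label, 0);
      this.docCounts.set(label, 0);
    }
    const counts = this.wordCounts.get(label);
    for (const word of TinyNaiveBayes.tokenize(text)) {
      counts.set(word, (counts.get(word) || 0) + 1);
      this.totalWords.set(label, this.totalWords.get(label) + 1);
      this.vocabulary.add(word);
    }
    this.docCounts.set(label, this.docCounts.get(label) + 1);
    this.totalDocs += 1;
  }

  // Return the label with the highest log-probability for the text.
  classify(text) {
    const words = TinyNaiveBayes.tokenize(text);
    let best = null;
    let bestScore = -Infinity;
    for (const [label, counts] of this.wordCounts) {
      // log P(label) + sum of log P(word | label), with add-one smoothing
      let score = Math.log(this.docCounts.get(label) / this.totalDocs);
      const denominator = this.totalWords.get(label) + this.vocabulary.size;
      for (const word of words) {
        score += Math.log(((counts.get(word) || 0) + 1) / denominator);
      }
      if (score > bestScore) {
        best = label;
        bestScore = score;
      }
    }
    return best;
  }
}

const bayes = new TinyNaiveBayes();
bayes.train("ciao come stai oggi", "italian");
bayes.train("dove si trova la stazione", "italian");
bayes.train("bonjour comment allez vous", "french");
bayes.train("où se trouve la gare", "french");

console.log(bayes.classify("come stai")); // italian
```

Working with sums of logarithms instead of products of probabilities is a common design choice: it avoids numeric underflow on longer texts without changing which label scores highest.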
With the sample code and the instructions you’ve been given so far, try to classify the Italian phrase “non so quando andare al cinema” by running:
Does it classify your text as French? Not good — that’s not what we want. Now try to replace “al” with “il”:
This time it classifies the text as Italian, even though the modified phrase is no longer correct Italian. Why? It’s all down to stop words, which I’ll dig into more in a later blog post; they’re basically high-frequency words like “the,” “it,” “is,” and “not.” Stop words are the words we need to filter out before we train on or classify our text. You can check out a list of English stop words here.

The language classifier is a great place to start learning because each language has a different vocabulary, so it’s easy to infer the language based on word frequency. Creating a spam detector is trickier because all the text, both spam and ham, will be in the same language.

In my next blog post, we’ll go a bit deeper into the Naive Bayes classifier and learn how to better train classifiers using stop words and stemmers: everything you need to build a spam filter. Be sure to work through the examples I gave above, and if you run into any issues, something doesn’t work, or you get some strange results, please drop a comment below!