Building Features from Text Data

by Janani Ravi

This course covers extracting information from text documents and building classification models, using natural language processing techniques such as feature vectorization, locality-sensitive hashing, stopword removal, and lemmatization.

What you'll learn

From chatbots to machine-generated literature, some of the hottest applications of ML and AI these days are for data in textual form.

In this course, Building Features from Text Data, you will gain the ability to structure textual data in a manner ideal for use in ML models.

First, you will learn how to represent documents as feature vectors using one-hot encoding, frequency-based, and prediction-based techniques. You will see how to improve these representations based on the meaning, or semantics, of the document.
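As a rough illustration of the one-hot and frequency-based representations mentioned above, here is a minimal sketch in plain Python. The helper names (`tokenize`, `build_vocabulary`, `one_hot_vector`, `count_vector`) and the two sample sentences are made up for this example and are not taken from the course; whitespace tokenization is a simplifying assumption.

```python
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace.
    return text.lower().split()


def build_vocabulary(documents):
    # Map every distinct token to a fixed column index (alphabetical order).
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: index for index, tok in enumerate(tokens)}


def one_hot_vector(text, vocab):
    # 1 if the vocabulary word occurs in the document at all, else 0.
    present = set(tokenize(text))
    return [1 if tok in present else 0 for tok in vocab]


def count_vector(text, vocab):
    # Frequency-based variant: how many times each vocabulary word occurs.
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocab]


docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)  # columns: cat, dog, mat, on, sat, the

presence = one_hot_vector(docs[0], vocab)   # [1, 0, 1, 1, 1, 1]
frequency = count_vector(docs[0], vocab)    # [1, 0, 1, 1, 1, 2]
```

The two vectors differ only in the last column, where "the" occurs twice: the one-hot form records presence, while the frequency-based form keeps the count.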

Next, you will discover how to leverage various language modeling features such as stopword removal, frequency filtering, stemming and lemmatization, and part-of-speech tagging.
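Two of these steps, stopword removal and frequency filtering, can be sketched in a few lines of plain Python. The tiny `STOPWORDS` set, the `min_df` threshold, and the sample sentences below are illustrative assumptions only; a real pipeline would typically use a fuller stopword list from an NLP library.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = {"the", "a", "an", "on", "of", "and", "is"}


def remove_stopwords(tokens):
    # Drop very common words that carry little topical signal.
    return [tok for tok in tokens if tok not in STOPWORDS]


def filter_by_document_frequency(tokenized_docs, min_df=2):
    # Keep only tokens that appear in at least `min_df` documents.
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return [[tok for tok in doc if doc_freq[tok] >= min_df]
            for doc in tokenized_docs]


docs = ["the cat sat on the mat", "a dog sat on the mat", "the cat purred"]
tokenized = [remove_stopwords(doc.split()) for doc in docs]
filtered = filter_by_document_frequency(tokenized, min_df=2)
# tokenized: [['cat', 'sat', 'mat'], ['dog', 'sat', 'mat'], ['cat', 'purred']]
# filtered:  [['cat', 'sat', 'mat'], ['sat', 'mat'], ['cat']]
```

Here "dog" and "purred" each occur in only one document, so the frequency filter removes them, shrinking the vocabulary a model has to handle.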

Finally, you will see how locality-sensitive hashing can be used to reduce the dimensionality of documents while still keeping similar documents close together.
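One common family of locality-sensitive hashes uses random hyperplanes: each bit of the signature records which side of a random hyperplane the document vector falls on, so vectors pointing in similar directions tend to share most bits. The sketch below assumes that family for illustration; the course may use a different scheme, and the function names, the 8-bit signature size, and the seed are all hypothetical choices.

```python
import random


def random_hyperplanes(num_bits, dim, seed=0):
    # One random Gaussian normal vector per signature bit.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_bits)]


def lsh_signature(vector, hyperplanes):
    # Bit is 1 when the vector lies on the non-negative side of the plane.
    return tuple(
        1 if sum(v * p for v, p in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming_distance(sig_a, sig_b):
    # Number of differing bits; smaller suggests more similar directions.
    return sum(a != b for a, b in zip(sig_a, sig_b))


planes = random_hyperplanes(num_bits=8, dim=6)

doc_a = [1, 0, 1, 1, 1, 2]   # e.g. a word-count vector
doc_b = [2, 0, 2, 2, 2, 4]   # same direction as doc_a, twice the counts

sig_a = lsh_signature(doc_a, planes)
sig_b = lsh_signature(doc_b, planes)
```

Because `doc_b` is just `doc_a` scaled by two, both fall on the same side of every hyperplane and receive identical 8-bit signatures, even though the original vectors have six dimensions. Documents can then be compared or bucketed by these short signatures instead of their full vectors.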

You will round out the course by implementing a classification model on text documents using many of these modeling abstractions.

When you’re finished with this course, you will have the skills and knowledge to use documents and textual data in conceptually and practically sound ways and represent such data for use in machine learning models.

Table of contents

Course Overview
2 mins

About the author

A problem solver at heart, Janani has a master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework.
