Building Features from Text Data

This course covers how to extract information from text documents and build classification models on them, using natural language processing techniques such as feature vectorization, locality-sensitive hashing, stopword removal, and lemmatization.
Course info
Level
Advanced
Updated
Jun 28, 2019
Duration
2h 35m
Table of contents
Course Overview
Representing Text as Features for Machine Learning
Building Feature Vector Representations of Text
Simplifying Text Processing Using Natural Language Processing
Reducing Dimensions in Text Using Hashing
Applying Text Feature Extraction Techniques to Machine Learning
Description

From chatbots to machine-generated literature, some of the hottest applications of ML and AI these days are for data in textual form.

In this course, Building Features from Text Data, you will gain the ability to structure textual data in a manner ideal for use in ML models.

First, you will learn how to represent documents as feature vectors using one-hot encoding, frequency-based, and prediction-based techniques. You will see how to improve these representations based on the meaning, or semantics, of the document.
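
As a rough illustration of the one-hot and frequency-based representations mentioned above, here is a minimal sketch using only the Python standard library. The function names (`build_vocabulary`, `one_hot_vector`, `count_vector`) and the two sample sentences are made up for this example and are not taken from the course; prediction-based techniques such as word embeddings are not shown.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: index for index, tok in enumerate(tokens)}


def one_hot_vector(document, vocabulary):
    """Presence/absence encoding: 1 if the word occurs in the document, else 0."""
    present = set(document.lower().split())
    return [1 if tok in present else 0 for tok in vocabulary]


def count_vector(document, vocabulary):
    """Frequency-based encoding: how many times each vocabulary word occurs."""
    counts = Counter(document.lower().split())
    return [counts[tok] for tok in vocabulary]


# Hypothetical sample documents, chosen only to keep the vocabulary small.
docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)

print(vocab)                           # {'cat': 0, 'dog': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print(one_hot_vector(docs[0], vocab))  # [1, 0, 1, 1, 1, 1]
print(count_vector(docs[0], vocab))    # [1, 0, 1, 1, 1, 2]
```

Both vectors have one column per vocabulary word, so their length grows with the vocabulary. In practice a library vectorizer would normally replace these hand-written helpers.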

Next, you will discover how to leverage various language modeling features such as stopword removal, frequency filtering, stemming and lemmatization, and parts-of-speech tagging.
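
The preprocessing steps named above can be sketched in plain Python. This is a toy version under stated assumptions: the stopword list is a tiny hand-picked set, `naive_stem` is a crude suffix stripper rather than a real stemming algorithm such as Porter's, and all names are hypothetical. Lemmatization and parts-of-speech tagging need a dictionary or a trained model, so they are left out of this sketch.

```python
from collections import Counter

# Tiny illustrative stopword list; real lists contain a few hundred words.
STOPWORDS = {"the", "a", "an", "is", "are", "on", "and", "of", "to", "in"}


def remove_stopwords(tokens):
    """Drop very common words that carry little topical information."""
    return [tok for tok in tokens if tok not in STOPWORDS]


def naive_stem(token):
    """Toy stemmer: strip one common suffix if at least 3 characters remain."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def filter_by_frequency(token_lists, min_docs=2):
    """Keep only tokens that appear in at least `min_docs` documents."""
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    return [[tok for tok in tokens if doc_freq[tok] >= min_docs]
            for tokens in token_lists]


# Hypothetical sample documents.
docs = ["the cats are jumping on the mat", "a cat jumped on the sofa"]
processed = [[naive_stem(tok) for tok in remove_stopwords(doc.split())]
             for doc in docs]

print(processed)                       # [['cat', 'jump', 'mat'], ['cat', 'jump', 'sofa']]
print(filter_by_frequency(processed))  # [['cat', 'jump'], ['cat', 'jump']]
```

After these steps the two sentences share the same tokens, which is the point of normalizing text before vectorizing it.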

Finally, you will see how locality-sensitive hashing can be used to reduce the dimensionality of documents while still keeping similar documents close together.
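
To make the idea concrete, here is a minimal sketch of one common locality-sensitive hashing scheme, random hyperplane hashing for cosine similarity. The description does not say which LSH variant the course uses, so treat the choice of scheme, the function names, and the sample vectors as assumptions. Each document vector is reduced to a short bit signature, and vectors pointing in similar directions tend to agree on most bits.

```python
import random


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw `n_bits` random hyperplane normals, each of dimension `dim`."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming_distance(sig_a, sig_b):
    """Number of positions where two signatures disagree."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


# Hypothetical word-count vectors over a 6-word vocabulary.
doc_a = [3, 0, 1, 2, 0, 1]
doc_b = [6, 0, 2, 4, 0, 2]   # same word proportions as doc_a, twice as long
doc_c = [0, 5, 0, 0, 4, 0]   # shares no words with doc_a

planes = make_hyperplanes(n_bits=4, dim=6)
sig_a = lsh_signature(doc_a, planes)
sig_b = lsh_signature(doc_b, planes)
sig_c = lsh_signature(doc_c, planes)

print(hamming_distance(sig_a, sig_b))  # 0: same direction, so identical signatures
print(hamming_distance(sig_a, sig_c))  # depends on the random hyperplanes drawn
```

A real vocabulary has thousands of dimensions, so a signature of a few dozen bits is a large reduction. Using more bits makes the Hamming distance between signatures track the angle between the original vectors more closely.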

You will round out the course by implementing a classification model on text documents using many of these modeling abstractions.
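
The description does not name the classifier built in the course. As one plausible illustration, the sketch below trains a small multinomial Naive Bayes model with add-one smoothing on word counts, using only the standard library. The function names and the four training sentences are invented for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(text, model):
    """Return the class with the highest log-probability for `text`."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # class prior
        for tok in text.lower().split():
            if tok not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log((word_counts[label][tok] + 1)
                              / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data.
training = [
    ("great fun movie", "pos"),
    ("loved this great film", "pos"),
    ("boring slow movie", "neg"),
    ("hated this boring film", "neg"),
]
model = train_naive_bayes(training)

print(predict("great film", model))    # pos
print(predict("boring movie", model))  # neg
```

The classifier only ever sees word counts, so any of the earlier steps (stopword removal, stemming, hashing) can be applied to the text before training without changing this code.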

When you’re finished with this course, you will have the skills and knowledge to use documents and textual data in conceptually and practically sound ways and represent such data for use in machine learning models.

About the author

A problem solver at heart, Janani has a master's degree from Stanford and worked at Google for more than seven years. She was one of the original engineers on Google Docs and holds four patents for its real-time collaborative editing framework.

More from the author
Predictive Analytics with PyTorch
Intermediate
2h 31m
May 1, 2020
Implementing Bootstrap Methods in R
Advanced
2h 10m
May 1, 2020
Section Introduction Transcripts

Course Overview
(Music) Hi. My name is Janani Ravi, and welcome to this course on Building Features from Text Data. A little about myself. I have a master's degree in electrical engineering from Stanford and have worked at companies such as Microsoft, Google, and Flipkart. At Google, I was one of the first engineers working on real-time collaborative editing in Google Docs, and I hold four patents for its underlying technologies. I currently work on my own startup, Loonycorn, a studio for high-quality video content.

From chatbots to machine-generated literature, some of the hottest applications of ML and AI these days are for data in text form. In this course, you will gain the ability to structure textual data in a manner ideal for use in ML models. First, you will learn how to represent documents as feature vectors using one-hot encoding, frequency-based, and prediction-based techniques. You will see how to improve these representations based on the meaning, or semantics, of the document. Next, you will discover how to leverage various language modeling features such as stopword removal, frequency filtering, stemming and lemmatization, and parts-of-speech tagging. You will then see how locality-sensitive hashing can be used to reduce the dimensionality of documents while still keeping similar documents close together. You will round out the course by implementing a classification model on text documents using many of these modeling abstractions. When you're finished with this course, you will have the skills and knowledge to use documents and textual data in conceptually and practically sound ways and represent such data for use in machine learning models.