Indexing Data in Elasticsearch

This course explains the index distribution architecture of Elasticsearch, cluster configuration, shards and replicas, similarity models, advanced search, and mixed-language documents, all of which improve the performance of search queries.
Course info
Rating
(20)
Level
Intermediate
Updated
Mar 22, 2018
Duration
2h 46m
Description

Getting Elasticsearch up and running is very simple, but tuning it to have low latency and high performance for search queries requires a deep understanding of the index distribution architecture. In this course, Indexing Data in Elasticsearch, you will understand the structure of distributed indices and advanced search constructs such as similarity models, segment merging, suggesters, fuzzy searches and working with mixed-language documents. First, you will study why shard overallocation is a good thing and how you can configure your cluster to avoid the split-brain scenario. Then, you will see how indices can be configured to use different similarity models and how to use force merging of segments to improve the performance of large indices. Next, you will explore how to cache prudently and use advanced search features. Finally, you will learn to deal with different languages in the same document with the ICU plugin. At the end of this course, you will have a deep understanding of how indexing works in Elasticsearch and be comfortable with advanced query constructs.

About the author

A problem solver at heart, Janani has a master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework.

Section Introduction Transcripts

Course Overview
Hi, my name is Janani Ravi, and welcome to this course on Elasticsearch Indexing. I'll introduce myself first. I have a master's degree in electrical engineering from Stanford and have worked at companies such as Microsoft, Google, and Flipkart. At Google, I was one of the first engineers working on real-time collaborative editing in Google Docs, and I hold four patents for its underlying technologies. I currently work on my own startup, Loonycorn, a studio for high-quality video content. Getting Elasticsearch up and running is very simple. Tuning it to have low latency and high performance for search queries requires a deep understanding of the index distribution architecture. In this course, we'll study why shard overallocation is a good thing, and how you can configure your cluster to avoid the split-brain scenario. We'll then study how our indices can be configured to use different similarity models, which affect how our documents are scored. We'll see how we can use force merging of segments to improve the performance of large indices that have been around a long time. We'll also study how we can use caching prudently to improve query performance. Elasticsearch offers a number of advanced search features such as word, phrase, and context suggesters, fuzzy searches, and autocomplete. We'll cover examples of all of these. This course also covers the Elasticsearch functionality for dealing with different languages in the same document. We'll specifically cover installing and using the ICU plugin for Asian languages. At the end of this course, you should have a deep understanding of how indexing works in Elasticsearch, and be comfortable with advanced query constructs.

Introducing the Index Distribution Architecture
If you've implemented search in your product, chances are you used Elasticsearch. Elasticsearch is a schemaless, flexible, and very powerful search technology that is very popular today. Elasticsearch works extremely well out of the box and is very easy to get up and running, but in order to get the best search results possible, you need to tune your search further, and that's where this course will take you. We'll talk about how we can index data in Elasticsearch. Elasticsearch is fast because it runs on a distributed system. We'll start off with distributed cluster configuration, and see how we can configure the master and data nodes to avoid something called a split-brain scenario. There, when communications break down within the cluster, multiple nodes believe that they are the master. We'll see how Elasticsearch speeds up query performance and throughput by executing queries in parallel on shards and replicas. Search latency and performance can be further improved by searching only a subset of the underlying data. This can be done by indexing your documents in specific shards, and only routing search queries to those shards. This is possible when you know exactly where the document is available. We'll also see how you can specify query preferences to determine where the query should be executed: on the master node, on the local node, on the primary shard, and so on.
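To make the quorum and routing ideas concrete, here is a minimal Python sketch. It assumes a pre-7.0 cluster, where the master quorum is set by hand through `discovery.zen.minimum_master_nodes`. The helper names are hypothetical, and the CRC32 hash is only a stand-in for the hash Elasticsearch uses internally, so the shard numbers it produces will not match a real cluster. The `routing` and `preference` values are passed as query-string parameters on the search request.

```python
import zlib
from urllib.parse import urlencode


def minimum_master_nodes(master_eligible_nodes: int) -> int:
    """Quorum size: a strict majority of the master-eligible nodes.

    With this many votes required, two halves of a partitioned
    cluster cannot both elect a master.
    """
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def shard_for(routing_value: str, number_of_primary_shards: int) -> int:
    """Pick a shard as hash(routing) % number_of_primary_shards.

    Assumption: zlib.crc32 stands in for Elasticsearch's own hash.
    """
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % number_of_primary_shards


def search_path(index: str, routing=None, preference=None) -> str:
    """Build a search URL path that targets specific shards or nodes."""
    params = {}
    if routing is not None:
        params["routing"] = routing
    if preference is not None:
        params["preference"] = preference
    query = "?" + urlencode(params) if params else ""
    return f"/{index}/_search{query}"


if __name__ == "__main__":
    print(minimum_master_nodes(3))  # 2
    print(search_path("orders", routing="user42", preference="_local"))
```

Documents indexed with the same routing value always land on the same shard, which is why a search that supplies that value only needs to visit one shard instead of all of them.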

Executing Low-level Index Control
Hi, and welcome to this module where we'll see how we can exercise more granular, lower-level control over the documents that live in our indices. Specifically, we'll see how we can control the relevance and scoring of documents that match our search query by specifying similarity models. Elasticsearch makes available to us a number of built-in similarity models, which we can then customize further by tweaking their parameters. For a large corpus of search data that is constantly being updated, you'll find that over time your indices get very large and harder to manage. Your search performance also suffers. We'll see how you can control your index size and expunge deleted documents by merging low-level Apache Lucene segments. We'll also dive into the details of how caching works in Elasticsearch, and how you can speed up your queries by using request caching, as well as query caching.
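As a rough illustration of these two controls, the sketch below builds an index-creation body that attaches a tuned BM25 similarity to one field, and the URL path for a force merge request. The function names and the similarity name `tuned_bm25` are hypothetical. The mapping uses the typeless layout of Elasticsearch 7 and later, which is an assumption here; older versions nest `properties` under a type name. Nothing is sent over the network; the code only builds the request pieces.

```python
from urllib.parse import urlencode


def bm25_index_body(field: str, k1: float = 1.2, b: float = 0.75) -> dict:
    """Index-creation body with a custom BM25 similarity on one text field."""
    return {
        "settings": {
            "index": {
                "similarity": {
                    # "tuned_bm25" is a name made up for this sketch.
                    "tuned_bm25": {"type": "BM25", "k1": k1, "b": b}
                }
            }
        },
        # Assumption: typeless mappings (Elasticsearch 7 and later).
        "mappings": {
            "properties": {
                field: {"type": "text", "similarity": "tuned_bm25"}
            }
        },
    }


def force_merge_path(index: str, max_num_segments=None,
                     only_expunge_deletes: bool = False) -> str:
    """URL path for POST /<index>/_forcemerge with optional parameters."""
    params = {}
    if max_num_segments is not None:
        params["max_num_segments"] = max_num_segments
    if only_expunge_deletes:
        params["only_expunge_deletes"] = "true"
    query = "?" + urlencode(params) if params else ""
    return f"/{index}/_forcemerge{query}"


if __name__ == "__main__":
    print(bm25_index_body("title", k1=1.5, b=0.6))
    print(force_merge_path("logs", max_num_segments=1))
```

Merging down to a single segment tends to suit indices that are no longer being written to, while `only_expunge_deletes` limits the work to segments that carry deleted documents.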

Improving the User Search Experience
Hi, and welcome to this module where we'll focus on Improving the User Search Experience with a variety of different features that Elasticsearch offers. We'll first get an understanding of how term queries and match queries behave in Elasticsearch, and see how we can configure case-insensitive searches for term queries. Elasticsearch can also be configured as a suggester, where it offers suggestions for terms, phrases, and autocomplete, as well as suggestions that take context into account. We'll also see how we can configure our queries for fuzzy searches, where we want to find similar terms that are not an exact match. We'll also see how we can use Elasticsearch's autocomplete functionality. Autocomplete will allow us to match words even with a partial specification in our search query.
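To give a feel for what "similar but not exact" means, here is a small Python sketch. It builds the body of a `fuzzy` query and computes a plain Levenshtein edit distance between two terms. The function names are hypothetical, and the distance function is only an approximation of what the search engine does: by default Elasticsearch counts a swap of two adjacent characters as a single edit, whereas plain Levenshtein counts it as two.

```python
def fuzzy_query(field: str, value: str, fuzziness="AUTO") -> dict:
    """Request body for a fuzzy query on a single field."""
    return {
        "query": {
            "fuzzy": {
                field: {"value": value, "fuzziness": fuzziness}
            }
        }
    }


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: inserts, deletes, and substitutions."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(fuzzy_query("title", "elastik"))
    print(edit_distance("elastic", "elastik"))  # 1
```

A term that is one edit away from the indexed term, as in the example, would be within reach of a fuzzy query that allows a fuzziness of 1.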

Dealing with Human Languages
Hi, and welcome to this module where we'll see how we can work with documents in different languages that have been added to our search index. Depending on the location of the user who has performed the search, you might want to boost search results in certain languages. We'll first cover how you can accomplish that. We'll then see how you can work with mixed-language documents, where the entire document may not be in one specific language and the contents of different fields might be in different languages. You can set language analyzers on a per-field basis to deal with such situations. It's also possible for the same field across documents to hold content in different languages. Elasticsearch allows us to specify multiple languages in the same field. And finally, we'll see how you can install and use the ICU plugin for better analysis of Asian languages such as Chinese, Japanese, and Korean.
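The sketch below shows one way these ideas might look as an index-creation body, built in Python. All field names and the analyzer name `icu_text` are hypothetical. It assumes the typeless mapping layout of Elasticsearch 7 and later, and it assumes the ICU plugin has already been installed (typically with `bin/elasticsearch-plugin install analysis-icu`), since `icu_tokenizer` and `icu_folding` come from that plugin.

```python
def mixed_language_index_body() -> dict:
    """Index body with per-field analyzers and one multi-language field."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    # Custom analyzer built from ICU plugin components.
                    "icu_text": {
                        "type": "custom",
                        "tokenizer": "icu_tokenizer",
                        "filter": ["icu_folding"],
                    }
                }
            }
        },
        # Assumption: typeless mappings (Elasticsearch 7 and later).
        "mappings": {
            "properties": {
                # Different fields, each known to be in one language.
                "summary_en": {"type": "text", "analyzer": "english"},
                "summary_fr": {"type": "text", "analyzer": "french"},
                "summary_cjk": {"type": "text", "analyzer": "icu_text"},
                # One field whose language varies across documents:
                # index it several times, once per analyzer.
                "title": {
                    "type": "text",
                    "fields": {
                        "en": {"type": "text", "analyzer": "english"},
                        "fr": {"type": "text", "analyzer": "french"},
                    },
                },
            }
        },
    }


if __name__ == "__main__":
    import json
    print(json.dumps(mixed_language_index_body(), indent=2))
```

With the multi-field layout, a query can search `title.en` and `title.fr` together and boost whichever sub-field suits the user's language.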