Elasticsearch Text Analysis: How to Use Analyzers and Normalizers
July 7, 2021
Elasticsearch is a distributed search and analytics engine used for real-time data processing of several different data types. Elasticsearch has built-in processing for numerical, geospatial, and structured text values.
Unstructured text values have some built-in analytics capabilities, but custom text fields generally require custom analysis. Built-in text analysis uses analyzers provided by Elasticsearch, but customization is also possible.
Elasticsearch uses text analysis to convert unstructured text data into a searchable format. Analyzers and normalizers are user-configurable, so you can make sure that searches on custom, unstructured text fields return the results users expect.
What is text analysis, and why is it important?
Text analysis provided by Elasticsearch makes large data sets digestible and searchable. Search engines use text analysis to match your query to thousands of web pages that might meet your needs.
Typical cases of text search in Elasticsearch include building a search engine, tracking event data and metrics, visualizing text responses, and log analytics. Each of these could require unstructured text fields to be gathered logically to form the final data used.
Elasticsearch has become a go-to tool for log storage since microservices have become popular. It can serve as a central location to store logs from different functions so that an entire system can be analyzed together.
You can also plug Elasticsearch into other tools that already have metrics and analytics set up, so you do not need to run further analysis on your Elasticsearch query results yourself. Coralogix provides an Elastic API to ingest Elasticsearch data for this purpose.
What is an Elasticsearch Analyzer?
An analyzer in Elasticsearch is built from three kinds of components: character filters, a tokenizer, and token filters. Together, they convert the value of a text field into a searchable format. The text values can be single words, emails, or program logs.
A character filter will take the original text value and look at each character. It can add, remove or change characters in the string. These changes could be useful if you need to change characters between languages that have different alphabets.
Analyzers do not require character filters. You may also use multiple character filters in a single analyzer. Elasticsearch applies them in the order you specify.
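To make the ordering concrete, here is a minimal Python sketch of a character-filter chain. It does not call Elasticsearch; it only imitates the idea of a mapping-style character filter locally. The helper names (`mapping_char_filter`, `apply_char_filters`) and the two sample filters are assumptions made for illustration.

```python
from typing import Callable, Dict, List

CharFilter = Callable[[str], str]


def mapping_char_filter(mappings: Dict[str, str]) -> CharFilter:
    """Return a filter that replaces each key in `mappings` with its value."""
    def apply(text: str) -> str:
        for source, target in mappings.items():
            text = text.replace(source, target)
        return text
    return apply


def apply_char_filters(text: str, filters: List[CharFilter]) -> str:
    """Run the filters one after another, in list order."""
    for char_filter in filters:
        text = char_filter(text)
    return text


# Hypothetical filter 1: Arabic-Indic digits (U+0660..U+0669) to Latin digits.
digits = mapping_char_filter({chr(0x0660 + i): str(i) for i in range(10)})
# Hypothetical filter 2: spell out the ampersand.
ampersand = mapping_char_filter({"&": " and "})

filtered = apply_char_filters("Rock&Roll \u0662\u0660\u0662\u0661", [digits, ampersand])
print(filtered)
```

Because the filters run before tokenization, the tokenizer only ever sees the rewritten string.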
A token is a unit of text which is then used in searches. A tokenizer will take a stream of continuous text and break it up into tokens. Tokenizers will also track the order and position of each term in the text, start and end character offsets, and token type.
Position tracking is helpful for word proximity queries, and character offsets are used for highlighting. Token types indicate the data type of the token (alphanumeric, numerical, etc.).
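The metadata a tokenizer records can be sketched in a few lines of Python. This is a simplified imitation that splits on runs of word characters; it is not the algorithm used by any built-in Elasticsearch tokenizer, and the function name `simple_tokenize` is an assumption.

```python
import re
from typing import Dict, List, Union

Token = Dict[str, Union[str, int]]


def simple_tokenize(text: str) -> List[Token]:
    """Split on runs of word characters and record metadata for each token."""
    tokens: List[Token] = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        word = match.group()
        tokens.append({
            "token": word,
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
            "type": "<NUM>" if word.isdigit() else "<ALPHANUM>",
            "position": position,           # order of the token in the stream
        })
    return tokens


source_text = "Quick brown fox 42"
tokens = simple_tokenize(source_text)
for token in tokens:
    print(token)
```

The offsets point back into the original string, so `source_text[12:15]` recovers `fox` for highlighting, while the positions let a proximity query tell that `brown` and `fox` are adjacent.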
Elasticsearch provides many built-in tokenizers. These include different ways to split phrases into tokens, partial words, and keywords or patterns. See Elasticsearch’s web page for a complete list.
Elasticsearch analyzers are required to use a tokenizer. Each analyzer may have only a single tokenizer.
A token filter will take a stream of tokens from the tokenizer output. It will then modify the tokens in some specific way. For example, the token filter might lowercase all the letters in a token, delete tokens specified in the settings, or even add new tokens based on the existing patterns or tokens. See Elasticsearch’s web page for a complete list of built-in token filters.
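The three behaviors mentioned above (changing, deleting, and adding tokens) can be sketched locally in Python. This is an illustration only: the helper names are assumptions, and real Elasticsearch token filters also maintain positions and offsets, which this sketch ignores.

```python
from typing import Callable, Dict, List, Set

TokenFilter = Callable[[List[str]], List[str]]


def lowercase(tokens: List[str]) -> List[str]:
    """Modify tokens: lowercase every token."""
    return [token.lower() for token in tokens]


def stop(stopwords: Set[str]) -> TokenFilter:
    """Delete tokens: drop any token listed in `stopwords`."""
    def apply(tokens: List[str]) -> List[str]:
        return [token for token in tokens if token not in stopwords]
    return apply


def synonyms(table: Dict[str, List[str]]) -> TokenFilter:
    """Add tokens: emit extra tokens after any token found in `table`."""
    def apply(tokens: List[str]) -> List[str]:
        expanded: List[str] = []
        for token in tokens:
            expanded.append(token)
            expanded.extend(table.get(token, []))
        return expanded
    return apply


def apply_token_filters(tokens: List[str], filters: List[TokenFilter]) -> List[str]:
    """Run the filters in list order; each one sees the previous one's output."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


result = apply_token_filters(
    ["The", "QUICK", "Fox"],
    [lowercase, stop({"the"}), synonyms({"quick": ["fast"]})],
)
print(result)
```

Order matters here as well: the stop filter only matches `the` because the lowercase filter ran first.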
Analyzers do not require token filters. You could have no token filters or many token filters that provide different functionality.
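Putting the three parts together, a custom analyzer is declared in the index settings and then referenced from a field mapping. The sketch below builds such a request body as a Python dictionary. The analyzer name `my_custom_analyzer` and the field name `message` are hypothetical, and `check_custom_analyzer` is an illustrative helper that only restates the rules from this section; it is not part of any Elasticsearch client.

```python
import json

analyzer_index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {           # hypothetical name
                    "type": "custom",
                    "char_filter": ["html_strip"],            # zero or more
                    "tokenizer": "standard",                  # exactly one
                    "filter": ["lowercase", "asciifolding"],  # zero or more
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # hypothetical text field that uses the analyzer
            "message": {"type": "text", "analyzer": "my_custom_analyzer"}
        }
    },
}


def check_custom_analyzer(definition: dict) -> None:
    """Check the rules from the text: one tokenizer, optional filter lists."""
    if not isinstance(definition.get("tokenizer"), str):
        raise ValueError("a custom analyzer needs exactly one tokenizer")
    for key in ("char_filter", "filter"):
        if not isinstance(definition.get(key, []), list):
            raise ValueError(f"{key} must be a list (it may be empty or omitted)")


analyzer = analyzer_index_body["settings"]["analysis"]["analyzer"]["my_custom_analyzer"]
check_custom_analyzer(analyzer)
print(json.dumps(analyzer_index_body, indent=2))
```

The printed JSON is the kind of body you would send when creating the index.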
What is an Elasticsearch Normalizer?
A normalizer works similarly to an analyzer in that it transforms your text before it is indexed. The key difference is that a normalizer always emits exactly one token, while an analyzer can emit many. Because of this, normalizers do not use a tokenizer, and they are applied to keyword fields rather than text fields. They do use character filters and token filters but are limited to those that operate on a per-character basis.
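As a sketch, a custom normalizer is declared under the `normalizer` key of the analysis settings and attached to a keyword field. The names `my_normalizer` and `status` below are hypothetical. The `approximate_normalizer` function is only a rough local stand-in for the `lowercase` and `asciifolding` filters, written with Python's `unicodedata` module; the real filters handle more cases.

```python
import json
import unicodedata

normalizer_index_body = {
    "settings": {
        "analysis": {
            "normalizer": {
                "my_normalizer": {                 # hypothetical name
                    "type": "custom",
                    "char_filter": [],
                    "filter": ["lowercase", "asciifolding"],
                    # note: no "tokenizer" key; normalizers do not have one
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # hypothetical keyword field that uses the normalizer
            "status": {"type": "keyword", "normalizer": "my_normalizer"}
        }
    },
}


def approximate_normalizer(value: str) -> str:
    """Rough local imitation: strip accents, then lowercase. Always one output."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower()


normalized = approximate_normalizer("Crème BRÛLÉE")
print(normalized)
print(json.dumps(normalizer_index_body, indent=2))
```

The whole input stays together as one value, which is why a normalized keyword field still supports exact matching, sorting, and aggregations.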
What happens by default?
Elasticsearch applies no normalizer by default. To use one, define it in the index settings and reference it from the field's mapping, typically when you create the index.
Elasticsearch will apply the standard analyzer by default to all text fields. The standard analyzer uses grammar-based tokenization (the Unicode Text Segmentation algorithm) and lowercases the resulting tokens. Words in your text field are split wherever a word boundary, such as a space or carriage return, occurs.
Example of Elasticsearch Analyzers and Normalizers
Elasticsearch provides a convenient API to use in testing out your analyzers and normalizers. You can use this to quickly iterate through examples to find the right settings for your use case.
The request can define a custom analyzer or normalizer inline, or it can test one that is already configured on an index. The response shows which tokens are created with those settings. Some examples of analyzers and normalizers are described below to show what can be done in Elasticsearch.
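A minimal Python sketch of calling the `_analyze` endpoint with only the standard library might look like this. The base URL `http://localhost:9200`, the index name `logs`, and the field name `message` are assumptions, and the sketch leaves out authentication and TLS, which most real clusters require.

```python
import json
import urllib.request
from typing import Optional


def build_analyze_url(base_url: str, index: Optional[str] = None) -> str:
    """Return the cluster-wide or the index-specific _analyze URL."""
    base = base_url.rstrip("/")
    return f"{base}/{index}/_analyze" if index else f"{base}/_analyze"


def analyze(body: dict, base_url: str = "http://localhost:9200",
            index: Optional[str] = None) -> dict:
    """POST `body` to the Analyze API and return the parsed JSON response."""
    request = urllib.request.Request(
        build_analyze_url(base_url, index),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Test a built-in analyzer by name.
standard_request = {"analyzer": "standard", "text": "The 2 QUICK Brown-Foxes"}

# Test whatever analyzer is configured for a field of an existing index
# (send this one with index="logs"; both names are hypothetical).
field_request = {"field": "message", "text": "The 2 QUICK Brown-Foxes"}

print(build_analyze_url("http://localhost:9200"))
print(json.dumps(standard_request))
# Against a running cluster: analyze(standard_request)
# or: analyze(field_request, index="logs")
```

Each entry in the response's `tokens` list carries the token text together with its offsets, type, and position, so you can compare settings side by side.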
Use the HTML Strip character filter (html_strip) to convert HTML text into plain text. You can optionally combine this with the lowercase token filter so that casing is ignored in your search. Note that a normalizer is configured with only the filter and char_filter settings, and it outputs a single token.
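The original request for this example is not reproduced here, so the following is a sketch of what such Analyze API bodies could look like. One assumption to flag: whether `html_strip` is accepted in a request with no tokenizer may depend on the Elasticsearch version, so the single-token variant below uses the `keyword` tokenizer, which emits the whole input as one token. The sample text and the helper name are hypothetical.

```python
import json
from typing import Optional

html_text = "<p>Hello <b>WORLD</b></p>"  # hypothetical sample input


def build_html_request(text: str, tokenizer: Optional[str]) -> dict:
    """Build an Analyze API body that strips HTML and lowercases the result."""
    body = {
        "char_filter": ["html_strip"],
        "filter": ["lowercase"],
        "text": text,
    }
    if tokenizer is not None:
        body["tokenizer"] = tokenizer
    return body


# Single-token behavior: the keyword tokenizer keeps the input as one token.
single_token_request = build_html_request(html_text, tokenizer="keyword")

# Analyzer behavior: the standard tokenizer splits the stripped text into words.
multi_token_request = build_html_request(html_text, tokenizer="standard")

print(json.dumps(single_token_request, indent=2))
print(json.dumps(multi_token_request, indent=2))
```

Sending both bodies to `_analyze` and comparing the `tokens` arrays is a quick way to see the difference between one normalized value and a stream of word tokens.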
Use the built-in uax_url_email tokenizer to create a token with type <EMAIL>. This can be useful when searching large bodies of text for emails. When the text contains only an email address, a single email token is returned. The same token is also found when the email is surrounded by other text. In that case, multiple tokens are returned, one of them being the email.
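Here is a sketch of the two requests described above, plus a small helper for picking email tokens out of a response. The sample address is a placeholder, `email_tokens` is a hypothetical helper, and `sample_response` is hand-written and abbreviated to show the general shape of a response; it was not captured from a real cluster.

```python
from typing import List

email_only_request = {
    "tokenizer": "uax_url_email",
    "text": "john.smith@global-international.com",
}

embedded_request = {
    "tokenizer": "uax_url_email",
    "text": "Email me at john.smith@global-international.com",
}


def email_tokens(analyze_response: dict) -> List[str]:
    """Return the text of every token whose type is <EMAIL>."""
    return [
        token["token"]
        for token in analyze_response.get("tokens", [])
        if token.get("type") == "<EMAIL>"
    ]


# Hand-written, abbreviated illustration (offsets and positions omitted).
sample_response = {
    "tokens": [
        {"token": "Email", "type": "<ALPHANUM>"},
        {"token": "me", "type": "<ALPHANUM>"},
        {"token": "at", "type": "<ALPHANUM>"},
        {"token": "john.smith@global-international.com", "type": "<EMAIL>"},
    ]
}

print(email_tokens(sample_response))
```

Filtering on the token type is what makes this useful for scanning large text bodies: the surrounding words are still tokenized, but only the `<EMAIL>` tokens are kept.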
An email analyzer can also be built with a custom filter. The filter splits your data on specific characters common to email addresses. This principle can be applied to any data with common characters or patterns (such as log data). A token is created for each section of text divided by the characters that the filter's regex matches.
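The original configuration for this example is not available, so the sketch below swaps in a different but related technique: a custom analyzer built on the `pattern` tokenizer, which splits text wherever a regular expression matches. The names `email_parts` and `email_parts_analyzer`, the pattern itself, and the local helper `split_locally` are all assumptions. The helper uses Python's `re` module to preview the split; Elasticsearch uses Java regular expressions, which agree with Python for a simple character class like this one.

```python
import json
import re
from typing import List

EMAIL_SPLIT_PATTERN = r"[@.]"  # hypothetical: split on "@" and "."

pattern_index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "email_parts": {              # hypothetical tokenizer name
                    "type": "pattern",
                    "pattern": EMAIL_SPLIT_PATTERN,
                }
            },
            "analyzer": {
                "email_parts_analyzer": {     # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "email_parts",
                    "filter": ["lowercase"],
                }
            },
        }
    }
}


def split_locally(text: str, pattern: str = EMAIL_SPLIT_PATTERN) -> List[str]:
    """Local preview of the split: one lowercased part per separated section."""
    return [part.lower() for part in re.split(pattern, text) if part]


parts = split_locally("John.Smith@Example.com")
print(parts)
print(json.dumps(pattern_index_body, indent=2))
```

Swapping the pattern for the separators found in your log lines (for example `|` or `=`) applies the same idea to log data.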
Elasticsearch analyzers and normalizers are used to convert text into tokens that can be searched. Analyzers use a tokenizer to produce one or more tokens per text field. Normalizers use only character filters and token filters to produce a single token. Tokens produced can be controlled by setting requirements in your Elasticsearch mapping.
Elasticsearch indices become readily searchable when analyzers are appropriately used. A common use-case for Elasticsearch is to store and analyze console logs. Elasticsearch indices can be imported by tools like Coralogix’s log analytics platform. This platform can take logs stored in your index and convert them into user-readable and actionable insights that allow you to support your system better.