
How to Optimize Your Elasticsearch Queries Using Pagination

  • Joanna Boetzkes
  • March 24, 2021

Consider for a moment that you are building a webpage that displays data stored in Elasticsearch. You have so much information in your index that your API Gateway cannot handle it all at once. What you’ll need to do is paginate your results so that the client can have a predictable amount of data returned each time.

Before paginating your results with your client, you will need to know how to paginate data in your backend storage. Most data storage solutions include functions enabling users to sort, filter, and paginate data. Elasticsearch is no different.

Your requirements and data structure will be crucial in deciding which method best suits your needs. Elasticsearch provides three ways of paginating data that are each useful: 

  • From/Size Pagination
  • Search After Pagination
  • Scroll Pagination

Let’s look at how these different types of pagination work:

From/Size Pagination

This Elasticsearch query shows how you can get the third page of data using from/size pagination. Assuming every page holds 25 documents, this search skips the first 50 hits and returns the 51st through the 75th.

The example below uses Elasticsearch’s search API that will look at all documents in the index. You could also filter or sort the records to keep the results consistent for viewing.

GET http://localhost:XXXX/your-index-name/_search
{
    "size": 25,
    "from": 50,
    "query": {
        "match_all": {}
    }
}

 

The Simplest to Implement

The simplest method of pagination uses the from and size parameters available in Elasticsearch’s search API. By default, from is 0 and size is 10, meaning if you don’t specify otherwise, Elasticsearch will return only the first ten results from your index. 

Change the from and size input values to get different pages of data. The from variable represents which document will start the page, and the size variable describes how many documents your search will return.
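
As a minimal sketch of that arithmetic, the helper below converts a 1-based page number into the from and size values of a request body. The function name and the 1-based page convention are assumptions made for illustration; they are not part of Elasticsearch.

```python
def from_size_body(page, size=10, query=None):
    """Build a search body for a 1-based page number (hypothetical helper)."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {
        "from": (page - 1) * size,  # offset of the first hit on the page
        "size": size,               # number of hits per page
        "query": query or {"match_all": {}},
    }


# Third page of 25 documents, matching the request shown earlier.
body = from_size_body(page=3, size=25)
print(body["from"], body["size"])  # prints: 50 25
```

Keeping the page-to-offset conversion in one place avoids off-by-one mistakes when the client sends page numbers rather than offsets.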

Great for Small Data Sets

From/size pagination is not ideal for large data sets. By default, Elasticsearch rejects any request where from plus size exceeds 10,000, so this method works without changes only for the first 10,000 hits. You can raise this window through the index.max_result_window index setting if you choose.

However, this limit is a safeguard against degraded performance or even failures. Behind the scenes, Elasticsearch must load the requested documents and every document from the previous pages. These documents can span multiple shards in your Elasticsearch cluster, and each shard has to collect its own candidates before the results are merged. The deeper you page into your data set, the more memory and time each request needs.
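
To make the cost concrete, here is a rough sketch of the arithmetic. It assumes the default result window of 10,000 and that each shard collects up to from + size candidate hits before the coordinating node merges them. The function names are illustrative only.

```python
DEFAULT_MAX_RESULT_WINDOW = 10_000  # default value of index.max_result_window


def within_window(from_, size, max_window=DEFAULT_MAX_RESULT_WINDOW):
    """Return True if from + size stays inside the result window."""
    return from_ + size <= max_window


def hits_to_merge(from_, size, shards):
    """Upper bound on the hits merged for a single page.

    Each shard contributes up to from + size candidates, even though
    only `size` of them end up in the response.
    """
    return (from_ + size) * shards


print(hits_to_merge(50, 25, shards=5))     # prints: 375
print(hits_to_merge(9975, 25, shards=5))   # prints: 50000
print(within_window(9976, 25))             # prints: False
```

Both calls return 25 documents to the client, but the deep page asks the cluster to handle far more candidates than the shallow one.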

May Miss Returning Documents

If you have a constantly or unpredictably changing dataset, you do not want to use this Elasticsearch pagination method to return all data. When you update a document, Elasticsearch will reindex it, potentially causing a change in documents’ order.

The addition or removal of a document will also change which hits are adjacent in the index. Since from/size pagination relies on documents’ location in an index, the reordering will mean returning duplicates of some documents and missing others. If you are displaying data on your webpage, the pages won’t always contain the same data, leading to a poor user experience.
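
The effect can be reproduced without a cluster. The sketch below stands in for an index with a plain Python list of document IDs in sort order, which is a simplification, and shows how an insert or a delete between two page requests produces a duplicate or a gap.

```python
def fetch_page(sorted_ids, from_, size):
    """Mimic from/size: slice the current sorted result list by offset."""
    return sorted_ids[from_:from_ + size]


index = list(range(1, 11))          # document IDs 1..10, in sort order
page_one = fetch_page(index, 0, 5)  # [1, 2, 3, 4, 5]

# A document that sorts first is added between the two requests.
after_insert = [0] + index
dup_page = fetch_page(after_insert, 5, 5)   # [5, 6, 7, 8, 9]: 5 repeats

# Alternatively, document 1 is deleted between the two requests.
after_delete = [d for d in index if d != 1]
gap_page = fetch_page(after_delete, 5, 5)   # [7, 8, 9, 10]: 6 is skipped

print(dup_page, gap_page)
```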

Scroll Pagination

To perform scroll pagination, you must first perform a search using Elasticsearch’s search API. The first search includes a parameter indicating that a scroll will take place. The response will include a _scroll_id value along with the first page of search results.

POST http://localhost:9200/test-index-v1/_search?scroll=5m
{
    "size": 25,
    "query": {
        "match_all": {}
    }
}


Send subsequent calls to a different Elasticsearch API called scroll. The input only includes the scroll_id and an optional time to keep the scroll context alive. Each call returns the next page of data along with a _scroll_id to use in the following request, for as long as there are more pages to collect.

POST http://localhost:9200/_search/scroll
{
    "scroll": "5m",
    "scroll_id": "scroll_id string returned from search"
}


Preserves the Index State While You Search

The benefit of using scroll over from/size pagination is scroll’s ability to freeze or preserve the index in time for your search. The initial search response includes a _scroll_id. Subsequent calls to the scroll API use that identifier, so Elasticsearch returns only the document versions that existed when the scroll was initialized.

Freezing the index version fixes the issue seen in from/size pagination that can cause developers to miss data in their index.

Uses Significant Memory Resources

Background processes in Elasticsearch merge smaller segments into larger ones to keep the number of segments relatively small. Elasticsearch normally deletes the smaller segments once they have been merged and are no longer needed. Open scrolls block that deletion because the old segments are still used to return the scroll data. Developers need to ensure they have adequate free file handles and disk space to support using the scroll API.

Elasticsearch must track which documents exist to return your data’s correct version when you use the scroll API. Saved versions are stored in the search context and need heap space. Elasticsearch sets a maximum number of open scrolls to prevent issues from arising with too many open scrolls. However, you will still need to close scrolls manually or allow them to time out and be deleted automatically to preserve heap space.

Get All Documents From an Extensive Index

If you are doing some processing that requires you to get each document, the scroll API is an acceptable option. You will be able to loop through each document in your index and have confidence Elasticsearch will return all the existing documents.

Unlike from/size pagination, scroll pagination does not cap the number of documents, so you can retrieve more than 10,000. Set a short timeout for your scroll, or clear it manually once your processing is complete, to avoid memory issues.
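
A scroll loop with explicit cleanup might look like the sketch below. The three callables are hypothetical wrappers around the HTTP requests and are injected so the loop itself stays independent of any client library. The cleanup step assumes the clear scroll request, DELETE _search/scroll with the scroll_id in the body.

```python
def scroll_all(start_search, next_page, clear_scroll):
    """Yield every hit from a scroll, then release the search context.

    The three callables are hypothetical wrappers around HTTP calls:
      start_search()          -> response of POST <index>/_search?scroll=...
      next_page(scroll_id)    -> response of POST _search/scroll
      clear_scroll(scroll_id) -> sends DELETE _search/scroll
    """
    response = start_search()
    scroll_id = response["_scroll_id"]
    try:
        while response["hits"]["hits"]:
            yield from response["hits"]["hits"]
            response = next_page(scroll_id)
            # Always continue with the most recently returned ID.
            scroll_id = response["_scroll_id"]
    finally:
        # Free the context right away instead of waiting for the timeout.
        clear_scroll(scroll_id)
```

The loop stops at the first empty page, and the finally block clears the scroll even when processing fails partway through.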

Good Choice for AWS Elasticsearch Users

At the time of writing, AWS’s integration with Elasticsearch only supports up to version 7.9. The now-recommended approach, search_after combined with the point in time API, is a new feature available from Elasticsearch version 7.10.

Because the recommended Elasticsearch pagination method is not available to them, AWS users should use scroll instead, as it was the recommended method for earlier versions.

search_after Pagination

Two versions of search_after pagination exist. Both have similar requirements for the body of the search: the sort input is required. When you use a sort array in the input to the search API, each hit in the results will contain its sort values. Use the sort values of the last hit on a page as the search_after value of the subsequent query to get the next page of results.

The search below shows how to get the second group of 25 hits in an index, assuming every document has an integer field testInt that starts at 1 and increases by one with each document.

POST http://localhost:9200/test-index-v1/_search
{
    "size": 25,
    "query": {
        "match_all": {}
    },
    "sort": [
        {
            "testInt": "asc"
        }
    ],
    "search_after": [25]
}

 

As with from/size pagination, if the index is refreshed during your search, the order of documents may change, leading to inconsistent pages and possibly skipped documents. New in Elasticsearch version 7.10 is a point in time (PIT) API to avoid this issue.

To use it, first request a PIT using the command below. Then, add the ID returned in the response to the pit section of your search request.

 

POST http://localhost:9200/test-index-v1/_pit?keep_alive=1m  
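
As a sketch of the second step, the helper below builds a search body that carries the PIT. It assumes the PIT response exposes the identifier in an id field and that the search is then sent to /_search without an index name, since the PIT already identifies the index. The helper name and its defaults are hypothetical.

```python
def pit_search_body(pit_id, sort, size=25, search_after=None, keep_alive="1m"):
    """Build a search body that runs against a point in time (sketch)."""
    body = {
        "size": size,
        "query": {"match_all": {}},
        "sort": sort,
        # The PIT identifies the index, so the request goes to /_search
        # with no index name in the URL.
        "pit": {"id": pit_id, "keep_alive": keep_alive},
    }
    if search_after is not None:
        # Sort values of the last hit from the previous page.
        body["search_after"] = search_after
    return body


sort = [{"testInt": "asc"}]
first_page = pit_search_body("<id returned by the _pit request>", sort)
second_page = pit_search_body("<id returned by the _pit request>", sort,
                              search_after=[25])
```

When you are finished, closing the PIT with a DELETE request to /_pit that carries the same id releases its resources before the keep_alive expires.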

 

Use search_after Alone to Paginate Deeply

Use the search API with a sort input to paginate through indices, including those with more than 10,000 records. Use the sort response from the last hit as the search_after input to the next search API call. Elasticsearch will use the search_after input to find the following document in the index and return it first on the next page. 
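
The loop described above can be sketched as follows. The search argument is a hypothetical callable that sends a body to the search API and returns the parsed JSON response. The sketch also assumes the sort field is unique per document; if it is not, add a tiebreaker field to the sort so that no hits are skipped.

```python
def search_after_pages(search, sort, size=25):
    """Yield pages of hits, feeding each page's last sort values forward.

    `search` is a hypothetical callable that sends a body to the search
    API and returns the parsed JSON response.
    """
    body = {"size": size, "query": {"match_all": {}}, "sort": sort}
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            return
        yield hits
        # The "sort" array of the last hit marks where the next page starts.
        body = {**body, "search_after": hits[-1]["sort"]}
```

Because every request is anchored on sort values rather than an offset, a deep page costs about the same as the first one.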

Elasticsearch does not freeze the index with this command. If your data is changing during your search, you may miss documents as you paginate through your index, or your pages will not contain consistent data.

Can Preserve the Index State Using Point In Time API

The point in time (PIT) API is available as of Elasticsearch version 7.10. By default, search requests will execute against the most recent data in your index. The PIT API will create a view of the index at a given time, which developers can search.

The PIT API is functionally similar to scroll, but it is much more lightweight, making it preferable to use. Using PIT means your pages will be consistent even if your data changes while you search.

Preferred Method of Pagination

Elasticsearch recommends using search_after with the PIT API for paginated searches that involve more than 10,000 documents and need to preserve index state. Before version 7.10, the scroll API was the only method available that could do this. Be aware that, at the time of writing, AWS’s Elasticsearch implementation does not support Elasticsearch 7.10, so developers there will still need to use scroll for now.

Guidelines For Optimizing Elasticsearch Pagination

We have reviewed several limitations of the Elasticsearch pagination methods available. Which method you use depends on your requirements and your data.

  • Do you need to paginate over more than 10,000 documents? If so, you should use search_after or scroll.
  • Do you need to keep the page contents consistent? If so, use search_after with the point in time API. 
  • Do you need to keep the page contents consistent but do not have access to Elasticsearch version 7.10? If so, use the scroll API.
  • Do you need to support multiple searches at once? If so, you should avoid using scroll because of its high memory requirements.
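
The checklist above can be condensed into a small decision helper. This is an illustrative sketch only: the function name, arguments, and returned labels are assumptions, and it does not model the memory trade-off of running many scrolls at once.

```python
def choose_pagination(total_hits, needs_consistent_pages, has_pit_api):
    """Map the checklist above to a method name (illustrative only)."""
    if needs_consistent_pages:
        # The PIT API requires Elasticsearch 7.10 or later; fall back to scroll.
        return "search_after + PIT" if has_pit_api else "scroll"
    if total_hits > 10_000:
        # Past the default result window, from/size is rejected.
        return "search_after"
    return "from/size"


print(choose_pagination(500, False, True))      # prints: from/size
print(choose_pagination(50_000, True, False))   # prints: scroll
```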

Developers may apply these principles when using a SaaS API to Elasticsearch, such as Coralogix’s Elastic API. This service gives you direct access to your Elasticsearch data and also lets you integrate tools to analyze and troubleshoot it.
