10 Elasticsearch Configurations You Have to Get Right

Elasticsearch is an open source, distributed document store and search engine that stores and retrieves data structures. As a distributed tool, Elasticsearch is highly scalable and offers advanced search capabilities. All of this adds up to a tool which can support a multitude of critical business needs and use cases.
Below are ten key Elasticsearch configurations that are the most critical to get right when setting up and running your instance.
Ten Key Elasticsearch Configurations To Apply To Your Cluster:
1. Cluster Name
A node can only join a cluster when it shares its cluster.name with all the other nodes in the cluster. The default name is elasticsearch, but you should change it to an appropriate name that describes the purpose of the cluster. The cluster name is configured in the elasticsearch.yml file for each environment.
You should ensure that you don’t reuse the same cluster names in different environments, otherwise you might end up with nodes joining the wrong cluster. When you have a lot of nodes in your cluster, it is a good idea to keep the naming flags as consistent as possible.
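As a minimal sketch, the setting in elasticsearch.yml looks like this (the cluster name here is illustrative; pick one per environment):

```yaml
# elasticsearch.yml — example cluster name; use a different name per environment
cluster.name: logging-prod
```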
2. Node Name
Elasticsearch uses node.name as a human-readable identifier for a particular instance of Elasticsearch. This name is included in the response of many APIs. The node name defaults to the hostname of the machine when Elasticsearch starts. It is worth configuring a more meaningful name, which also has the advantage of persisting after the node restarts. In elasticsearch.yml you can set the node name to an environment variable, for example node.name: ${FOO}
By default, Elasticsearch will use the first seven characters of a randomly generated Universally Unique Identifier (UUID) as the node id. Note that the node id persists and does not change when a node restarts, and therefore the default node name will also not change.
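A minimal sketch of the setting in elasticsearch.yml (the name is illustrative):

```yaml
# elasticsearch.yml — example node name
node.name: prod-data-2
```

Alternatively, node.name: ${HOSTNAME} resolves the name from an environment variable set on the host at startup.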
3. Path Settings
For any installation, Elasticsearch writes data and logs to the respective data and logs subdirectories of $ES_HOME by default. If these important folders are left in their default locations, there is a high risk of them being deleted while upgrading Elasticsearch to a new version.
The path.data setting can be set to multiple paths, in which case all paths will be used to store data. The locations are provided as a comma-separated list of directories. Elasticsearch stores the node’s data across all provided paths but keeps each shard’s data on the same path.
Elasticsearch does not balance shards across a node’s data paths. It will not add shards to the node, even if the node’s other paths have available disk space. If you need additional disk space, it is recommended you add a new node rather than additional data paths.
In a production environment, it is strongly recommended you set the path.data and path.logs in elasticsearch.yml to locations outside of $ES_HOME.
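A minimal sketch of production path settings (the directories are illustrative; use locations outside of $ES_HOME):

```yaml
# elasticsearch.yml — example paths outside of $ES_HOME
path.data: /var/data/elasticsearch
path.logs: /var/log/elasticsearch
```

Multiple data paths can be given as a list, for example path.data: ["/mnt/data_1", "/mnt/data_2"], but remember that shards are not balanced across a node’s paths.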
4. Network Host
By default, Elasticsearch binds to loopback addresses only. To form a cluster with nodes on other servers, your node will need to bind to a non-loopback address.
More than one node can be started from the same $ES_HOME location on a single host. This setup can be useful for testing Elasticsearch’s ability to form clusters, but it is not a configuration recommended for production.
While there are many network settings, usually all you need to configure is the network.host property in your elasticsearch.yml file. This property sets both the bind host and the publish host at the same time: the bind host is the address where Elasticsearch listens for requests, and the publish host is the address Elasticsearch uses to communicate with other nodes.
The network.host setting also understands some special values such as _local_ and _site_, and modifiers like :ip4. For example, to bind to all IPv4 addresses on the local machine, update the network.host property in elasticsearch.yml to network.host: 0.0.0.0. Using the _local_ special value configures Elasticsearch to also listen on all loopback devices. This will allow you to use the Elasticsearch HTTP API locally, from each server, by sending requests to localhost.
When you provide a custom setting for network.host, Elasticsearch assumes that you are moving from development mode to production mode, and upgrades a number of system startup checks from warnings to exceptions.
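A minimal sketch of the setting in elasticsearch.yml (the values are examples):

```yaml
# elasticsearch.yml — bind to all IPv4 addresses on this machine (example)
network.host: 0.0.0.0
# or use special values, e.g. listen on site-local plus loopback addresses:
# network.host: [_site_, _local_]
```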
5. Discovery Settings
There are two important discovery settings that should be configured before going to production, so that nodes in the cluster can discover each other and elect a master node.
‘discovery.seed_hosts’ Setting
When you want to form a cluster with nodes on other hosts, use the static discovery.seed_hosts setting. This provides a list of master-eligible nodes in the cluster. Each value has the format host:port or host, where port defaults to transport.profiles.default.port. IPv6 addresses must be enclosed in square brackets. The default value is ["127.0.0.1", "[::1]"].
This setting accepts a YAML sequence or array of the addresses of all the master-eligible nodes in the cluster. Each address can be either an IP address or a hostname that resolves to one or more IP addresses via DNS.
‘cluster.initial_master_nodes’ Setting
When starting an Elasticsearch cluster for the very first time, use the cluster.initial_master_nodes setting. This defines a list of the node names or transport addresses of the initial set of master-eligible nodes in a brand-new cluster. By default this list is empty, meaning that this node expects to join a cluster that has already been bootstrapped. This setting is ignored once the cluster is formed. Do not use this setting when restarting a cluster or adding a new node to an existing cluster.
When you start an Elasticsearch cluster for the first time, a cluster bootstrapping step determines the set of master-eligible nodes whose votes are counted in the first election. Because auto-bootstrapping is inherently unsafe, when starting a new cluster in production mode, you must explicitly list the master-eligible nodes whose votes should be counted in the very first election. You set this list using the cluster.initial_master_nodes setting.
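A minimal sketch of both discovery settings in elasticsearch.yml (all addresses and node names are illustrative):

```yaml
# elasticsearch.yml — example seed hosts and initial master nodes
discovery.seed_hosts:
  - 192.168.1.10:9300
  - 192.168.1.11          # port defaults to the transport port
  - seeds.mydomain.com    # hostname resolved via DNS
cluster.initial_master_nodes:
  - master-node-a
  - master-node-b
  - master-node-c
```

Once the cluster has formed, cluster.initial_master_nodes is ignored; remove it before restarting nodes or adding new nodes to the existing cluster.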
6. Heap Size
By default, Elasticsearch tells the JVM to use a heap with a minimum and maximum size of 1 GB. When moving to production, it is important to configure heap size to ensure that Elasticsearch has enough heap available.
Elasticsearch will assign the entire heap specified in jvm.options via the Xms (minimum heap size) and Xmx (maximum heap size) settings. These two settings must be equal to each other.
The value for these settings depends on the amount of RAM available on your server.
Good rules of thumb are:
- The more heap available to Elasticsearch, the more memory it can use for caching. But note that too much heap can subject you to long garbage collection pauses
- Set Xms and Xmx to no more than 50% of your physical RAM, to ensure that there is enough physical RAM left for kernel file system caches. Elasticsearch requires memory for purposes other than the JVM heap and it is important to leave space for this
- Don’t set Xms and Xmx to above the cutoff that the JVM uses for compressed object pointers. The exact threshold varies but is near 32 GB
The more heap available to Elasticsearch, the more memory it can use for its internal caches, but the less memory it leaves available for the operating system to use for the filesystem cache. Also, larger heaps can cause longer garbage collection pauses.
It is very important to understand resource utilization during the testing process, because it allows you to configure not only your JVM heap space but also your CPU capacity, to reserve the proper amount of RAM for nodes, and to provision larger instances with potentially more nodes as you scale.
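As a sketch, on a server with 64 GB of RAM the rules of thumb above could be applied via an override file under config/jvm.options.d/ (the file name and values are illustrative):

```
# config/jvm.options.d/heap.options — Xms and Xmx must be equal;
# 30 GB stays under 50% of 64 GB RAM and below the compressed-oops cutoff
-Xms30g
-Xmx30g
```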
7. Heap Dump Path
By default, Elasticsearch configures the JVM to dump the heap on out of memory exceptions to the default data directory. If this path is not suitable for receiving heap dumps, you should modify the entry -XX:HeapDumpPath=… in jvm.options.
If you specify a directory, the JVM will generate a filename for the heap dump based on the PID of the running instance. If you specify a fixed filename instead of a directory, the file must not exist when the JVM needs to perform a heap dump on an out of memory exception, otherwise the heap dump will fail.
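A minimal sketch of the override in jvm.options (the path is illustrative):

```
# jvm.options — dump heaps to a dedicated directory (example path)
-XX:HeapDumpPath=/var/lib/elasticsearch/heapdumps
```

Because a directory is given here rather than a fixed filename, the JVM derives the dump’s filename from the PID of the running instance.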
8. GC Logging
By default, Elasticsearch enables GC logs. These are configured in ‘jvm.options’ and are written to the same location as the Elasticsearch logs. The default configuration rotates the logs every 64 MB and can consume up to 2 GB of disk space. Unless you change the default jvm.options file directly, the Elasticsearch default configuration is applied in addition to your own settings.
Internally, Elasticsearch has a JVM GC Monitor Service (JvmGcMonitorService) which monitors GC activity and logs a message when it detects a problem. Depending on the severity, the logs are written at different levels (DEBUG/INFO/WARN). In Elasticsearch 6.x and 7.x, two GC problems are logged: GC slowness, meaning a collection takes too long to execute, and GC overhead, meaning GC activity exceeds a certain percentage of a given window of time.
If you want to tune the garbage collector settings, you need to change the GC options. Elasticsearch warns you about this in the jvm.options file: ‘All the (GC) settings below are considered expert settings. Don’t tamper with them unless you understand what you are doing.‘ Depending on the distribution you used, there are different ways to change the options, either:
- overriding JVM options via JVM options files either from config/jvm.options or config/jvm.options.d/
- setting the JVM options via the ES_JAVA_OPTS environment variable
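As a sketch, redirecting the GC log via an override file (assuming a JDK 9+ JVM with unified logging; the path and rotation sizes are illustrative):

```
# config/jvm.options.d/gc.options — write GC logs to a custom location,
# rotating across 32 files of 64 MB each (example values)
-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
```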
9. JVM Fatal Error Log Setting
By default, Elasticsearch configures the JVM to write fatal error logs to the default logging directory. These are logs produced by the JVM when it encounters a fatal error, such as a segmentation fault.
If this path is not suitable for receiving logs, modify the -XX:ErrorFile=… entry in jvm.options. On the Linux, macOS, and Windows archive distributions, the logs directory is located under the root of the Elasticsearch installation. For the RPM and Debian packages, this directory is /var/log/elasticsearch.
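A minimal sketch of the override in jvm.options (the path is illustrative; %p is the JVM’s placeholder for the process ID):

```
# jvm.options — write fatal error logs to a custom location (example path)
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log
```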
10. Temporary Directory
By default, Elasticsearch uses a private temporary directory that the startup script creates immediately below the system temporary directory.
On some Linux distributions a system utility will clean files and directories from /tmp if they have not been recently accessed. This can lead to the private temporary directory being removed while Elasticsearch is running if features that require the temporary directory are not used for a long time. This causes problems if a feature that requires the temporary directory is subsequently used.
If you install Elasticsearch using the .deb or .rpm packages and run it under systemd then the private temporary directory that Elasticsearch uses is excluded from periodic cleanup.
However, if you intend to run the .tar.gz distribution on Linux for an extended period then you should consider creating a dedicated temporary directory for Elasticsearch that is not under a path that will have old files and directories cleaned from it. This directory should have permissions set so that only the user that Elasticsearch runs as can access it. Then set the $ES_TMPDIR environment variable to point to it before starting Elasticsearch.
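A minimal sketch of preparing such a directory (the location is illustrative; in practice the directory should be owned by the user Elasticsearch runs as):

```shell
# create a dedicated temp directory outside /tmp so cleanup utilities leave it alone
mkdir -p "$HOME/es-tmp"
# restrict access to the owning user only
chmod 700 "$HOME/es-tmp"
# point Elasticsearch at it before startup
export ES_TMPDIR="$HOME/es-tmp"
```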
Wrap-Up
As versatile, scalable and useful as Elasticsearch is, it’s essential that the infrastructure which hosts your cluster meets its needs, and that the cluster is sized correctly to support its data store and the volume of requests it must handle. Improperly sized infrastructure and misconfigurations can result in everything from sluggish performance to the entire cluster becoming unresponsive and crashing.
Monitoring your cluster or instance can help you ensure that it is appropriately sized and that it handles all data requests efficiently.