Running an ELK stack provides unrivaled benefits for your organization, but ELK issues will inevitably crop up. ELK is scalable and largely agnostic of internal infrastructure, making it a great asset for SMEs and enterprises alike. However, successfully deploying and running an ELK stack is not without its difficulties. To keep your ELK stack running at optimum performance, you need to familiarize yourself with some of the most common ELK issues.
An ELK stack deployed in a multi-system, multi-application environment will generate a huge amount of data. Log files vary in usefulness, but if you aren’t in a position to filter, analyze, and discard the non-critical logs, storage and its associated costs will become your most pressing ELK issue. If you deployed your ELK stack on-premises, then you need to consider the dangers of storing all your logs on traditional disk storage. On-premises storage is finite in its capacity. Infrastructure teams can be left in a constant state of stress, working to ensure there is sufficient capacity in their existing infrastructure to accommodate the ELK outputs. If your log files are mission-critical, then you’ll want to back them up, which requires more storage in a separate, isolated environment.
Cloud-based storage offers a high degree of flexibility for your log files, benefiting from the scalability of on-demand resources. Whilst this is likely to be much cheaper than traditional disk storage, you’ll still require your own in-house expertise to manage the underlying cloud infrastructure. This expertise will help to select the correct storage class for your logs.
Indices form the core of Elasticsearch and the ELK stack. By Elastic’s own admission, the concept of indices is a complex one, with indices being responsible for both data distribution and data separation. The way that you configure your environment, and whether it’s an indexing intensive environment, will impact your ELK stack’s overall search performance.
The interconnectivity of the ELK stack also means that, should you upgrade one component of the stack, writes to your indices are likely to be affected. This is a known problem with the recent Elasticsearch upgrade, which rendered indices created by earlier versions of Beats incompatible with Kibana. Major upgrades from earlier versions of Elasticsearch require you to reindex any indices created in 5.x that you plan to take forward and to delete those you do not need. Failure to do so will prevent Elasticsearch from starting and trigger many other unexpected ELK issues.
Index settings that affect performance can be tuned in numerous ways: shard allocation, throttling, and the size of the indexing buffer are but a few. It takes advanced insights that ELK doesn’t provide out-of-the-box to understand which mitigating action to take. Coralogix provides specialized tools like Reindex to not only keep your indices in line but also keep your overall ELK stack healthy.
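As a rough illustration of the reindexing step, the sketch below builds a request for Elasticsearch's `_reindex` API using only the Python standard library. The cluster address and both index names are hypothetical, and the request is constructed but deliberately not sent.

```python
import json
import urllib.request


def build_reindex_request(base_url, source_index, dest_index):
    """Build (but do not send) a POST request to the _reindex API."""
    body = {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_reindex",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical cluster address and index names, for illustration only.
request = build_reindex_request(
    "http://localhost:9200", "logs-old", "logs-old-reindexed"
)
# To run it against a real cluster: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it easy to review what will be copied before anything touches the cluster. A real migration would also need to create the destination index with the right mappings first.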
An ELK cluster needs specific networking rules. Because the ELK Stack is made up of several constituent parts (Elasticsearch, Logstash, Beats, and Kibana), you need to ensure that your network doesn’t get in the way of what your ELK Stack is trying to accomplish.
For example, if you are hosting Logstash on the ELK Stack server, client servers may time out or disconnect. You must ensure that routing rules are correct. If they are, you need to inspect firewall rules. This can require significant firewall configuration, venturing into the realms of network engineering.
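Before digging into routing tables and firewall rules, a quick reachability check from a client server can narrow things down. The following is a minimal sketch, assuming a plain TCP connection attempt is a fair first test. The hostname in the usage comment is hypothetical, and 5044 is only the port commonly used for the Logstash Beats input, so substitute whatever your pipeline actually listens on.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example usage (hypothetical host):
# can_connect("logstash.example.internal", 5044)
```

A `False` result does not say whether routing, a firewall, or a stopped service is to blame, only that the port cannot be reached from where the check ran.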
A common networking issue is that Logstash isn’t even running on the ELK Stack server. This is easy to spot, but then you have to go back to basics with Logstash and ELK. A similar recurring problem is an incorrectly configured subnet for your ELK cluster. This does not manifest itself in obvious ways, but it will cause recurring low-level issues.
Elasticsearch configuration is managed via a central configuration file, which manages general node settings as well as network settings. As we are now in the era of DevOps and Site Reliability Engineering, companies are increasingly losing the expertise to properly configure networking requirements such as hosts and ports.
Whilst these seem like both common and simple networking issues, the larger your systems, the more prevalent and frequent these issues are going to be and the more difficult they are to track down.
Well-balanced nodes are one of the secrets to a high-performing ELK Stack. Node balancing itself is a bit of a dark art. There are many aspects of the Stack to consider in your approach. It takes considerable expertise.
Whilst Elasticsearch will manage the location of all shards across the nodes in a cluster, you need to be mindful of the tasks you are assigning to your master node. This is a challenging problem, as nodes in the cluster must be assigned several key functions, such as health checks, routing table management, and shard management.
Elasticsearch will try to balance shard allocation to nodes based on current disk usage, although by default it will not allocate shards to data nodes with more than 85% disk usage. Whilst this is a helpful inbuilt balancing function that prevents disk usage from affecting overall cluster performance, the tuning of Elasticsearch’s disk-based shard allocation has a number of limitations that you need to both be aware of and stay on top of. Imbalanced nodes are a nefarious issue, and not something that a novice ELK user will often consider.
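To make the 85% rule concrete, here is a small sketch that flags data nodes that have crossed that default threshold (the `cluster.routing.allocation.disk.watermark.low` setting in Elasticsearch). The node names and usage figures are invented for illustration. In a real cluster you would take them from your own monitoring.

```python
LOW_WATERMARK_PERCENT = 85.0  # Elasticsearch's default low disk watermark


def nodes_over_watermark(disk_usage, threshold=LOW_WATERMARK_PERCENT):
    """Return the sorted names of nodes whose disk usage exceeds the threshold."""
    return sorted(name for name, used in disk_usage.items() if used > threshold)


# Hypothetical figures: percentage of disk used on each data node.
usage = {"data-1": 62.5, "data-2": 88.0, "data-3": 85.0, "data-4": 91.3}
flagged = nodes_over_watermark(usage)
```

Here `flagged` contains `data-2` and `data-4`, the nodes that would stop receiving new shards. `data-3` sits exactly at the threshold and is not flagged, since the comparison is "more than 85%".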
Applications in your system will produce millions of logs. Many of these are likely to be low-priority info or debug logs. These logs create clutter, and dealing with them appropriately is vital to ensure your ELK stack runs effectively. Even if your ELK stack handles the additional volume, your users will need to search through this irrelevant data.
This increases the time taken to track down a bug or gain new business insights. Your time is valuable. Of all of these issues, this is the one most likely to go undetected for years. It will impact productivity and increase the total cost of ownership for your logs.
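One way to picture the clean-up is a filter that drops low-priority events before they are stored. The sketch below assumes each log event is already parsed into a dictionary with a `level` field, and that info and debug levels count as noise. Both are assumptions for illustration, and in a real ELK deployment this filtering would usually live in the shipping pipeline rather than in a standalone script.

```python
LOW_PRIORITY_LEVELS = frozenset({"DEBUG", "INFO"})  # assumption: treated as noise


def drop_low_priority(events, low_priority=LOW_PRIORITY_LEVELS):
    """Return only the events whose level is not in the low-priority set.

    Events without a "level" field are kept, so nothing is silently lost.
    """
    return [
        event
        for event in events
        if str(event.get("level", "")).upper() not in low_priority
    ]


# Hypothetical sample events.
events = [
    {"level": "debug", "message": "cache miss for key user:42"},
    {"level": "ERROR", "message": "payment service timed out"},
    {"level": "info", "message": "request completed"},
    {"level": "WARN", "message": "disk usage at 80%"},
]
kept = drop_low_priority(events)
```

Only the error and warning events survive, so anyone querying the stored logs has half as many entries to search through in this small sample.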
The volume of logs that are produced by a system causes issues beyond that of storage and performance. If you’re seeking to use logs for true monitoring and troubleshooting, then the difficulty of discerning the value of logs and any relevant patterns is compounded by the volume of the logs themselves. Other blog posts have discussed the varying value of different types of logs, but as an ELK stack user, you have to question why you selected that tool in the first place. Scale is the enemy and you need a platform that will effortlessly scale with you.
Coralogix processes millions of logs every day. We can help you cut out the noise, stay fast, and gain actionable insights, whenever you need them.