Tracking Elasticsearch Metrics

Author: Satish Rakhonde | 9 min read | January 7, 2025

Elasticsearch is a powerful, distributed search and analytics engine based on the open-source Apache Lucene project. It’s designed to handle large volumes of data and provide fast search capabilities, making it a popular choice for various applications.

Elasticsearch is often used in scenarios like website search engines, logging and monitoring systems, data analysis, and more. It is commonly used in combination with other tools in the Elastic Stack (formerly known as the ELK Stack), which includes Logstash for data collection and transformation, and Kibana for data visualization.

Here are Elasticsearch’s key features:

Full-Text Search: It can search through large amounts of text and return relevant results quickly.
Distributed Architecture: It’s built to scale horizontally. This means you can add more servers to your cluster to handle larger amounts of data and higher query loads.
Near Real-Time Data: Newly indexed data becomes searchable after the next index refresh, which by default happens about once per second.
RESTful API: It exposes a RESTful interface, which makes it relatively easy to integrate with other systems and applications.
Analytics: Besides search, Elasticsearch also provides powerful aggregation capabilities, allowing users to analyze data and generate reports.
Schema-Free: It supports schema-free JSON documents, meaning you can index and search complex data structures without predefined schemas.
High Availability: It includes features for redundancy and fault tolerance, such as replication and automatic failover.

Elasticsearch Monitoring Architecture:

Monitoring Elasticsearch is critical for ensuring that your search and analytics engine operates efficiently and reliably.

Monitoring a cluster involves gathering data from its various components, including Elasticsearch nodes, Logstash nodes, Kibana instances, and Metricbeat.

Monitoring metrics are stored in Elasticsearch, allowing seamless data visualization through Kibana. By default, these metrics are saved in local indices, ensuring efficient and accessible storage.

Metricbeat enables you to efficiently collect and transmit data from Elasticsearch, Kibana, Logstash, and Beats directly to your monitoring cluster, bypassing the need to route it through your production cluster. The diagram here illustrates a standard monitoring setup, featuring separate production and monitoring clusters for streamlined performance and better organization.

Metricbeat Overview

Metricbeat is a lightweight, open-source data shipper designed to monitor servers by collecting system and service metrics at regular intervals. It sends this data to Elasticsearch or Logstash for analysis and visualization, providing invaluable insights into the performance and health of your infrastructure. With Metricbeat, you can easily track key metrics to ensure your systems and services are running smoothly and efficiently.

A Metricbeat module defines the basic logic for collecting data from a specific service like Kibana and Elasticsearch.

Metricbeat ships with a set of modules that fetch and structure the data. For example, the system module collects CPU, disk, and other host-level information.

When monitoring an Elasticsearch cluster, there are several important features and metrics we should focus on to ensure the cluster’s health, performance, and stability.

Here are the key areas to monitor in Elasticsearch:

Cluster Health

Node and Resource Utilization

Indexing and Search Performance

Network Monitoring

Alerting in Elasticsearch

By focusing on these features, we can maintain a healthy Elasticsearch environment, promptly address any issues, and ensure optimal performance and reliability.

Cluster Health Metrics

Cluster Status:
- Green: All primary and replica shards are active.
- Yellow: All primary shards are active, but some replicas are not allocated.
- Red: Some primary shards are not active, indicating a severe issue.
Number of Nodes:
- Monitor the number of nodes in your cluster to ensure that it matches your intended configuration and that nodes are not being dropped or added unexpectedly.
Shard Allocation:
- Active Shards: Check the number of active primary and replica shards.
- Unassigned Shards: Monitor for any unassigned shards, which may indicate issues with shard allocation.
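
The checks above can be sketched against the response of the cluster health API (GET /_cluster/health). This is a minimal illustration, not a definitive implementation: the cluster address, the `expected_nodes` value, and the helper names are assumptions made for the example, and a secured cluster would also need authentication and TLS settings.

```python
import json
import urllib.request

# Assumption: an unsecured test cluster reachable at this address.
ES_URL = "http://localhost:9200"


def fetch_cluster_health(base_url=ES_URL):
    """Call GET /_cluster/health and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=10) as resp:
        return json.load(resp)


def summarize_health(health, expected_nodes):
    """Return a list of warnings derived from a cluster health response."""
    warnings = []
    status = health["status"]
    if status == "red":
        warnings.append("status is red: some primary shards are not active")
    elif status == "yellow":
        warnings.append("status is yellow: some replica shards are not allocated")

    # Compare the live node count with the intended configuration.
    if health["number_of_nodes"] != expected_nodes:
        warnings.append(
            f"expected {expected_nodes} nodes, found {health['number_of_nodes']}"
        )

    # Unassigned shards may point to allocation problems.
    if health["unassigned_shards"] > 0:
        warnings.append(f"{health['unassigned_shards']} unassigned shards")
    return warnings
```

A call such as `summarize_health(fetch_cluster_health(), expected_nodes=3)` returns an empty list when the cluster is green, has the expected number of nodes, and has no unassigned shards.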

The screenshot below shows an overview of the ELK cluster, including the number of nodes, indices, and so on.

Elasticsearch Node Health Metrics

CPU Usage:
- Track CPU utilization to ensure nodes are not overloaded. High CPU usage can indicate heavy query loads or inefficient operations.
Memory Usage:
- Heap Memory Usage: Monitor JVM heap memory to prevent out-of-memory errors. High heap usage can affect performance due to frequent garbage collection (GC).
- Non-Heap Memory Usage: Track the memory the JVM uses outside the heap, such as class metadata, compiled code, and thread stacks.
Disk Space:
- Monitor disk usage on each node to avoid running out of space, which can cause failures in indexing and searching.
Disk I/O:
- Measure disk read and write operations to identify potential bottlenecks in data storage and retrieval.
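
As a rough sketch of how these node-level checks could be automated, the function below reads a parsed response from the node stats API (GET /_nodes/stats/jvm,os,fs) and flags nodes that cross a limit. The threshold values and the function name are hypothetical and chosen only for illustration; suitable limits depend on your workload.

```python
def node_resource_report(nodes_stats, heap_limit=85, cpu_limit=90, disk_free_limit=15.0):
    """Map each node name to a list of resource problems.

    `nodes_stats` is the parsed body of GET /_nodes/stats/jvm,os,fs.
    The default limits are illustrative assumptions, not recommendations.
    """
    report = {}
    for node in nodes_stats["nodes"].values():
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        cpu_pct = node["os"]["cpu"]["percent"]
        fs_total = node["fs"]["total"]
        free_pct = 100.0 * fs_total["available_in_bytes"] / fs_total["total_in_bytes"]

        problems = []
        if heap_pct >= heap_limit:
            problems.append(f"heap at {heap_pct}%")
        if cpu_pct >= cpu_limit:
            problems.append(f"CPU at {cpu_pct}%")
        if free_pct <= disk_free_limit:
            problems.append(f"only {free_pct:.1f}% disk free")
        report[node["name"]] = problems
    return report
```

Nodes that map to an empty list are within all three limits.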

Indexing & Query Performance Metrics

Indexing Rate:
- Indexing Throughput: Track the number of documents indexed per second. A sudden drop can indicate issues with the indexing pipeline.
- Indexing Latency: Monitor the time it takes to index documents. High latency delays when new data becomes searchable and often signals resource contention that can also slow searches.
Indexing Errors:
- Check for errors or failures during the indexing process to address issues promptly.

Query Latency:
- Search Latency: Measure the time it takes to execute search queries. High query latency can affect user experience.
Search Throughput:
- Queries Per Second: Track the number of queries processed per second to ensure that your cluster can handle the load.
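
Elasticsearch reports indexing and search activity as cumulative counters, so rates and average latencies have to be derived from the difference between two samples. The sketch below assumes each sample is the `indices` object of a single node taken from GET /_nodes/stats/indices; the function name and the sampling interval are assumptions for the example.

```python
def indexing_and_search_rates(before, after, interval_seconds):
    """Derive throughput and average latency from two counter samples.

    `before` and `after` are `indices` objects for the same node,
    sampled `interval_seconds` apart.
    """
    def delta(section, field):
        return after[section][field] - before[section][field]

    indexed = delta("indexing", "index_total")
    index_ms = delta("indexing", "index_time_in_millis")
    queries = delta("search", "query_total")
    query_ms = delta("search", "query_time_in_millis")

    return {
        "docs_indexed_per_sec": indexed / interval_seconds,
        # Guard against division by zero when nothing happened in the window.
        "avg_index_latency_ms": index_ms / indexed if indexed else 0.0,
        "queries_per_sec": queries / interval_seconds,
        "avg_query_latency_ms": query_ms / queries if queries else 0.0,
        "index_failures": delta("indexing", "index_failed"),
    }
```

Sampling at a fixed interval, for example every 60 seconds, and comparing the results over time makes a sudden drop in throughput or a rise in latency easy to spot.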

Garbage Collection (GC) Metrics

GC Time:
- Monitor the time spent in garbage collection processes. Frequent or lengthy GC pauses can affect cluster performance and responsiveness.
GC Frequency:
- Track how often garbage collection occurs to identify potential memory management issues.
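
GC time and frequency can be estimated in the same way, from two samples of the `jvm.gc.collectors` object returned by GET /_nodes/stats/jvm. The following is a minimal sketch; the function name is hypothetical, and the collector names (typically `young` and `old`) are taken from whatever the node reports.

```python
def gc_overhead(before, after, interval_seconds):
    """Compute GC frequency and share of wall-clock time per collector.

    `before` and `after` are `jvm.gc.collectors` objects for the same node,
    sampled `interval_seconds` apart.
    """
    result = {}
    for name, stats in after.items():
        prev = before[name]
        count = stats["collection_count"] - prev["collection_count"]
        millis = stats["collection_time_in_millis"] - prev["collection_time_in_millis"]
        result[name] = {
            "collections_per_min": 60.0 * count / interval_seconds,
            "percent_time_in_gc": 100.0 * millis / (interval_seconds * 1000.0),
        }
    return result
```

A steadily rising share of time spent in the old-generation collector is usually a stronger warning sign than frequent but short young-generation collections.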

Thread Pool Metrics

Thread Pool Queues:
- Queue Size: Monitor the size of thread pool queues to detect potential bottlenecks or delays in request processing.
Rejected Requests:
- Track the number of rejected requests to understand if thread pools are overloaded and causing performance issues.
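
A simple way to surface thread pool pressure is to scan the response of GET /_nodes/stats/thread_pool for pools with a long queue or any rejections. In this sketch the `queue_limit` default and the function name are assumptions; note that `rejected` is a cumulative count since the node started, so a non-zero value may reflect past rather than current overload.

```python
def overloaded_pools(nodes_stats, queue_limit=100):
    """List (node, pool, queue, rejected) tuples for pools that look overloaded.

    `nodes_stats` is the parsed body of GET /_nodes/stats/thread_pool.
    `queue_limit` is an illustrative threshold, not a recommended value.
    """
    findings = []
    for node in nodes_stats["nodes"].values():
        for pool, stats in node["thread_pool"].items():
            if stats["rejected"] > 0 or stats["queue"] >= queue_limit:
                findings.append((node["name"], pool, stats["queue"], stats["rejected"]))
    return sorted(findings)
```

To see whether rejections are still happening, compare the `rejected` value between two runs rather than relying on a single reading.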

Network Metrics

Network I/O:
- Monitor network traffic to ensure that data is being transmitted efficiently between nodes and to clients.
Network Latency:
- Measure the time it takes for data to travel across the network to detect network-related performance issues.
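
For network I/O between nodes, the transport counters in GET /_nodes/stats/transport can be turned into per-node throughput by comparing two samples. This sketch covers traffic volume only; it does not measure network latency, which would need a separate probe. The function name is an assumption for the example.

```python
def transport_throughput(before, after, interval_seconds):
    """Compute per-node transport throughput in bytes per second.

    `before` and `after` are parsed GET /_nodes/stats/transport bodies,
    sampled `interval_seconds` apart.
    """
    rates = {}
    for node_id, node in after["nodes"].items():
        prev = before["nodes"].get(node_id)
        if prev is None:
            continue  # node joined between samples, so there is no baseline
        rx = node["transport"]["rx_size_in_bytes"] - prev["transport"]["rx_size_in_bytes"]
        tx = node["transport"]["tx_size_in_bytes"] - prev["transport"]["tx_size_in_bytes"]
        rates[node["name"]] = {
            "rx_bytes_per_sec": rx / interval_seconds,
            "tx_bytes_per_sec": tx / interval_seconds,
        }
    return rates
```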

Resource Utilization Metrics

JVM Statistics:
- Heap Usage: Track heap memory usage and GC activity.

Cluster and Index Size

Index Size:
- Monitor the size of each index to ensure that no single index is growing too large and impacting performance.
Cluster Size:
- Track the overall size of the cluster data to plan for scaling and resource allocation.
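
Index and cluster size can be tracked from the cat indices API. The sketch below assumes the rows come from GET /_cat/indices?format=json&bytes=b, where `store.size` is reported as a byte count; the function name and the `top_n` parameter are hypothetical.

```python
def index_size_summary(cat_indices, top_n=3):
    """Summarize total stored bytes and the largest indices.

    `cat_indices` is the parsed JSON list from
    GET /_cat/indices?format=json&bytes=b.
    """
    sizes = []
    for row in cat_indices:
        size = row.get("store.size")
        if size is None:
            continue  # for example, closed indices report no store size
        sizes.append((row["index"], int(size)))

    # Largest indices first.
    sizes.sort(key=lambda item: item[1], reverse=True)
    return {
        "total_bytes": sum(size for _, size in sizes),
        "largest": sizes[:top_n],
    }
```

Recording `total_bytes` on a regular schedule gives a growth trend that can feed capacity planning.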

Tools and Solutions for Elasticsearch Monitoring:

Elastic Stack (Elasticsearch, Kibana, Beats): Integrated monitoring and alerting capabilities through Elasticsearch’s APIs and Kibana dashboards.
Open-Source Tools: Prometheus and Grafana can also be used for advanced metrics collection and visualization.
Commercial Solutions: Datadog, New Relic, and others offer Elasticsearch monitoring as part of their broader monitoring platforms.

By implementing a robust Elasticsearch monitoring system, you can proactively identify issues, optimize performance, and ensure the reliability of your Elasticsearch clusters. Regularly reviewing metrics and responding to alerts promptly helps maintain optimal cluster health and performance. Connect with our Elasticsearch experts to get more out of your data.

Authors

Suresh Hemke

Suresh brings over 7 years of experience as a NoSQL Database Administrator, focusing on large-scale NoSQL databases like MongoDB, Cassandra, and Elasticsearch. With a solid foundation in Linux systems and expertise in cloud platforms, including AWS, Azure and VMware, he excels at designing high-performance, scalable, and reliable database solutions. His skills also encompass performance tuning, system automation, and cloud-driven database management, utilizing cutting-edge tools and techniques to ensure seamless operations.

Satish Rakhonde

With experience spanning across various OS flavors and Cloud DB platforms, Satish brings a rich blend of technical expertise to the table. At Datavail, he has been instrumental in delivering top-notch NoSQL DBA services, catering to a diverse clientele.

Satish Rakhonde’s vast repertoire includes Installation, Configuration, Backup & Recovery, Performance Tuning, Capacity Planning, Security, Upgrades, Maintenance, Monitoring, Replica Set Configuration, Sharding, and designing large-scale databases. His strategic approach and meticulous planning have consistently resulted in optimized performance and enhanced security for clients’ databases.

As a holder of an M.Sc. Information Technology degree with first division, Satish continues to make strides in his career. Currently based in Mumbai, he serves as a Technical Manager at Datavail. With over 21 years of experience under his belt, Satish not only brings expertise but also a wealth of knowledge to Datavail. He holds certifications in RHEL, AWS, Azure, and is a Certified Kubernetes Administrator. His areas of specialization are HC Analysis, Best Practices, High Available Solutions, Automation, and Disaster Recovery Solutions.
