DAPE.work

Disk Watermark Exceeded

Severity: High
Elasticsearch Version: 7.17.0

Problem

Elasticsearch refuses to allocate shards because a disk watermark has been exceeded.

Root Cause

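Disk usage on one or more data nodes exceeded the high watermark (cluster.routing.allocation.disk.watermark.high, 90% by default). Elasticsearch stops allocating shards to those nodes to avoid exhausting disk space.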
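To confirm which thresholds are actually in effect, the cluster settings API can be queried with defaults included. The following is a minimal sketch, assuming the cluster is reachable at localhost:9200 without authentication; the ES_URL variable and both function names are hypothetical helpers, not part of Elasticsearch.

```shell
# Assumption: Elasticsearch is reachable here without authentication.
ES_URL="${ES_URL:-localhost:9200}"

# Keep only the lines that mention a disk watermark setting.
filter_watermarks() {
  grep 'disk\.watermark'
}

# Print the effective watermark settings, including built-in defaults.
show_watermarks() {
  curl -s -X GET "${ES_URL}/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | filter_watermarks
}
```

Running show_watermarks should list the low, high, and flood-stage values, whether they come from explicit settings or from the defaults.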

How to Detect

Symptoms

  • Elasticsearch cluster health shows red or yellow status
  • Shard allocation failures in logs
  • Cluster state indicates shards unassigned due to disk issues

Commands

curl -X GET 'localhost:9200/_cluster/health?pretty'
curl -X GET 'localhost:9200/_cat/shards?v'
curl -X GET 'localhost:9200/_cluster/allocation/explain?pretty'
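On a cluster with many shards, the full _cat/shards listing is hard to scan. A rough sketch of narrowing it to unassigned shards follows, assuming the same unauthenticated localhost:9200 endpoint; ES_URL and the function names are hypothetical helpers.

```shell
# Assumption: Elasticsearch is reachable here without authentication.
ES_URL="${ES_URL:-localhost:9200}"

# Keep the header row plus rows whose fourth column (state) is UNASSIGNED.
# Expects _cat/shards output with columns: index shard prirep state unassigned.reason
filter_unassigned() {
  awk 'NR == 1 || $4 == "UNASSIGNED"'
}

# Fetch shard states with the unassigned reason column and keep only unassigned shards.
list_unassigned_shards() {
  curl -s -X GET "${ES_URL}/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" | filter_unassigned
}
```

The reason column alone may not mention disk usage. The _cluster/allocation/explain call above remains the place to confirm that the disk threshold is what blocks a given shard.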

Remediation Steps

  1. Identify nodes with high disk usage using '_cat/allocation' API
  2. Free disk space by deleting unnecessary data or snapshots
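  3. Temporarily raise the high watermark above its current value (default 90%) while keeping it below the flood-stage watermark (default 95%), for example to 93%: curl -X PUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{"persistent": {"cluster.routing.allocation.disk.watermark.high": "93%"}}'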
  4. Reroute shards away from overutilized nodes: curl -X POST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '{"commands": [{"move": {"index": "<index_name>", "shard": <shard_number>, "from_node": "<node_name>", "to_node": "<target_node>"}}]}'
  5. Monitor shard reallocation and disk usage
  6. Once disk space is freed, restore the watermark setting to its default by setting it to null
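Steps 1 and 6 can be sketched as small shell helpers. This is an illustration under assumptions, not a definitive procedure: it assumes an unauthenticated cluster at localhost:9200, and ES_URL, the function names, and the threshold of 90 are hypothetical choices.

```shell
# Assumption: Elasticsearch is reachable here without authentication.
ES_URL="${ES_URL:-localhost:9200}"

# Print rows whose first column (disk.percent) is a number at or above the limit.
# Rows without a numeric percentage, such as the UNASSIGNED row, are skipped.
nodes_over_threshold() {
  awk -v limit="$1" '$1 ~ /^[0-9]+$/ && $1 + 0 >= limit + 0 { print }'
}

# Step 1: query per-node disk usage and report nodes at or above the threshold (default 90).
find_full_nodes() {
  curl -s -X GET "${ES_URL}/_cat/allocation?h=disk.percent,node" | nodes_over_threshold "${1:-90}"
}

# Step 6: setting the value to null removes the override so the built-in default applies again.
reset_high_watermark() {
  curl -s -X PUT "${ES_URL}/_cluster/settings" -H 'Content-Type: application/json' \
    -d '{"persistent": {"cluster.routing.allocation.disk.watermark.high": null}}'
}
```

Calling find_full_nodes 85 would report nodes at or above 85% disk usage. Rerunning it during step 5 gives a quick view of whether usage is dropping.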

Prevention

  • Implement regular disk usage monitoring and alerts
  • Configure appropriate disk watermarks based on node capacity
  • Schedule routine cleanup of old indices and snapshots
  • Use data lifecycle management policies to automate data retention
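As one possible shape for automated retention, an index lifecycle management (ILM) policy with a delete phase can remove indices after a fixed age. The sketch below assumes an unauthenticated cluster at localhost:9200; ES_URL, the function names, the policy name, and the retention period are all hypothetical placeholders.

```shell
# Assumption: Elasticsearch is reachable here without authentication.
ES_URL="${ES_URL:-localhost:9200}"

# Build an ILM policy body whose delete phase starts after the given number of days.
ilm_delete_policy_body() {
  printf '{"policy": {"phases": {"delete": {"min_age": "%sd", "actions": {"delete": {}}}}}}' "$1"
}

# Create or update the policy. $1 is the policy name, $2 the retention in days.
put_retention_policy() {
  curl -s -X PUT "${ES_URL}/_ilm/policy/$1" -H 'Content-Type: application/json' \
    -d "$(ilm_delete_policy_body "$2")"
}
```

For example, put_retention_policy logs-retention 30 would register a policy that deletes indices 30 days after creation or rollover. The policy only takes effect for indices that reference it through the index.lifecycle.name setting, typically set in an index template.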

Production Example

curl -X PUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{"persistent": {"cluster.routing.allocation.disk.watermark.high": "93%"}}'