Fix etcd "open wal error: wal: file not found"

— ny_wk

If an etcd pod is crash-looping with "open wal error: wal: file not found", your cluster's brain just lost its memory. etcd is the key-value store that holds all Kubernetes/OpenShift cluster state, and the WAL (write-ahead log) is how it guarantees that data survives a restart. When etcd can't find its WAL, it refuses to start. Here's how to think about it and recover.

What the error actually means

etcd writes every change to the WAL before applying it, so it can replay the log and stay consistent after a crash. "wal: file not found" means etcd looked in its data directory (the WAL lives under member/wal) and the WAL files it expected aren't there, or aren't readable. The two usual causes:

  • On-disk data corruption — a bad shutdown, full disk, or storage fault damaged the data directory.
  • An invalid restore — a snapshot restore was done incorrectly, leaving the data dir inconsistent or pointing at the wrong path.
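
Before deciding which cause you're dealing with, it helps to see what is actually on disk. Here is a minimal Python sketch that reports whether a data directory contains any WAL segment files. The function name inspect_wal is hypothetical, and the check only lists files; it does not validate their contents.

```python
from pathlib import Path


def inspect_wal(data_dir):
    """Report whether an etcd data directory contains WAL segment files."""
    # etcd keeps its WAL segments under <data-dir>/member/wal.
    wal_dir = Path(data_dir) / "member" / "wal"
    if not wal_dir.is_dir():
        return {"wal_dir_exists": False, "wal_files": []}
    # WAL segments end in .wal; sort the names for stable output.
    wal_files = sorted(p.name for p in wal_dir.glob("*.wal"))
    return {"wal_dir_exists": True, "wal_files": wal_files}
```

Call it as inspect_wal("/var/lib/etcd"), where the path is only a common default and may differ on your nodes. An empty wal_files list, or a missing member/wal directory, matches the symptom described above.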

Recovery, step by step

1. Confirm which member is broken

In a multi-node control plane, identify the failing etcd member (the crashing pod's node). If the other members are healthy, you can rebuild just the bad one from the quorum.
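
Whether "rebuild just the bad one" is an option comes down to majority arithmetic. The sketch below assumes you have already counted healthy members yourself (for example with etcdctl endpoint health --cluster); the function names are hypothetical.

```python
def quorum_size(total_members):
    """etcd needs a strict majority of members to make progress."""
    return total_members // 2 + 1


def can_rebuild_from_quorum(total_members, healthy_members):
    """True if the healthy members alone still form a majority."""
    return healthy_members >= quorum_size(total_members)
```

With three members and one broken, can_rebuild_from_quorum(3, 2) is True, so the bad member can be rebuilt from its peers. With two of three broken, it is False and you are in snapshot-restore territory.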

2. Check the disk first

Make sure the node isn't out of space and the storage is healthy. Restoring onto a still-broken disk just reproduces the failure.
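
A quick way to check free space from Python is the standard library's shutil.disk_usage. This is a rough sketch: the 20% threshold is an arbitrary assumption, not an etcd recommendation, and it says nothing about I/O health or filesystem errors, which still need their own checks.

```python
import shutil


def disk_headroom(path, min_free_fraction=0.2):
    """Return free space on the filesystem holding `path` and a pass/fail flag."""
    usage = shutil.disk_usage(path)
    # Guard against a zero-sized filesystem to avoid dividing by zero.
    free_fraction = usage.free / usage.total if usage.total else 0.0
    return {
        "free_bytes": usage.free,
        "free_fraction": round(free_fraction, 3),
        "ok": free_fraction >= min_free_fraction,
    }
```

Run it against the directory that holds the etcd data, for example disk_headroom("/var/lib/etcd") if that is where your data dir lives.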

3. Restore from a snapshot (the reliable fix)

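If you have an etcd snapshot, restore it to a fresh data directory with etcdctl snapshot restore (etcd 3.5 and later move this command to etcdutl snapshot restore and deprecate the etcdctl form), then point etcd at that new directory. Never restore on top of the corrupt dir: restore to a clean path and swap. A restore gives the member a new identity, so in a multi-member cluster every member must be restored from the same snapshot file; to repair one member while the others are healthy, use step 4 instead.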
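
The "fresh directory" rule is easy to enforce in a small wrapper. The sketch below only builds the command line and refuses a target path that already exists; it does not run anything. The function name and example paths are hypothetical, and the default tool is etcdutl on the assumption of etcd 3.5 or later (pass tool="etcdctl" on older releases).

```python
from pathlib import Path


def build_restore_command(snapshot_file, new_data_dir, tool="etcdutl"):
    """Build a snapshot restore command that targets a fresh data directory."""
    target = Path(new_data_dir)
    # Refuse to restore over anything that already exists, corrupt or not.
    if target.exists():
        raise FileExistsError(f"{target} already exists; restore to a fresh path")
    return [tool, "snapshot", "restore", str(snapshot_file), "--data-dir", str(target)]
```

For example, build_restore_command("/backup/snapshot.db", "/var/lib/etcd-restored") returns an argument list you can hand to subprocess.run. A multi-member restore also needs flags such as --name, --initial-cluster, and --initial-advertise-peer-urls, which this sketch leaves out.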

4. Or re-add the member from a healthy quorum

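With a healthy majority still running, remove the broken member with etcdctl member remove, wipe its data directory, and re-add it with etcdctl member add. Start it with the initial cluster state set to existing so it joins the running cluster and resyncs the full state from the leader.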
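
Spelled out as commands, the remove and re-add steps look roughly like this. The sketch only generates the etcdctl argument lists, in order; the member ID, name, and URLs in the usage note are made-up placeholders, and TLS flags are omitted. Wiping the data directory happens on the broken node between the two commands and is not part of this helper.

```python
def member_replace_plan(member_id, member_name, peer_url, endpoints):
    """Return the etcdctl commands to remove and re-add one member, in order."""
    # Point etcdctl at the healthy members, not the broken one.
    base = ["etcdctl", "--endpoints", ",".join(endpoints)]
    return [
        base + ["member", "remove", member_id],
        base + ["member", "add", member_name, "--peer-urls", peer_url],
    ]
```

A call such as member_replace_plan("8e9e05c52164694d", "etcd-2", "https://10.0.0.12:2380", ["https://10.0.0.10:2379", "https://10.0.0.11:2379"]) yields two commands. Take the real member ID from etcdctl member list. Managed platforms may wrap these steps in their own tooling, so check your distribution's procedure first.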

Prevent it next time

  • Take regular etcd snapshots; they are your only safety net if quorum is ever lost.
  • Watch disk space and I/O on control-plane nodes; etcd hates a full or slow disk.
  • Shut down cleanly and avoid hard-killing etcd or yanking storage.
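
For the snapshot habit, a scheduled job needs two things: a timestamped file name and a retention rule. Here is one possible sketch. The etcd-<timestamp>.db naming scheme and the keep=7 default are assumptions for illustration; the helpers build the etcdctl snapshot save command and pick prune candidates, but do not execute or delete anything.

```python
from datetime import datetime, timezone
from pathlib import Path


def snapshot_save_command(backup_dir, now=None):
    """Build an `etcdctl snapshot save` command with a UTC-timestamped file name."""
    now = now or datetime.now(timezone.utc)
    name = now.strftime("etcd-%Y%m%dT%H%M%SZ.db")
    return ["etcdctl", "snapshot", "save", str(Path(backup_dir) / name)]


def snapshots_to_prune(backup_dir, keep=7):
    """Return the oldest snapshot files beyond the newest `keep`."""
    # The timestamped names sort lexicographically in chronological order.
    files = sorted(Path(backup_dir).glob("etcd-*.db"))
    return files[:-keep] if keep > 0 else files
```

Run the save command from cron or a Kubernetes CronJob, then delete whatever snapshots_to_prune returns. Copying snapshots off the control-plane node is worth adding, since a storage fault on that node would otherwise take the backups with it.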

Key takeaways

  • The error means etcd can't find/read its write-ahead log — usually disk corruption or a botched restore.
  • Fix by restoring from a snapshot to a clean data dir, or re-adding the member from a healthy quorum.
  • Always restore to a fresh directory, never over the corrupt one.
  • Regular snapshots + disk monitoring prevent repeats.

Frequently asked questions

Will I lose cluster data?

Not if a healthy etcd quorum still exists (the bad member resyncs from it) or you have a recent snapshot. With only a snapshot, you lose any changes made after it was taken. With neither, the cluster state may not be recoverable.

What is the WAL in etcd?

The write-ahead log. etcd records every change there before applying it, so it can recover consistently after a crash.

Can I just delete the etcd data directory?

Only for a single broken member in a healthy multi-node cluster, so it resyncs. Never on the last/only member without a snapshot.

How do I avoid this?

Schedule etcd snapshots, keep disks healthy and roomy, and shut nodes down gracefully.

etcd problems feel scary because it's the source of truth — but with a snapshot and a clean restore path, this one is recoverable.