Backing Up and Restoring OpenShift: A Practical Guide

— ny_wk

An OpenShift cluster is only as safe as its last good backup. When a master fails or etcd data becomes corrupted, you don't want to be improvising a recovery for the first time. Here's a clear, practical model for what to back up and how to restore it.

What you actually need to back up

Three things let you rebuild a cluster:

etcd — the source of truth for all cluster state (projects, deployments, secrets, everything). This is the most important backup.
Node and master configuration files — the node config, certificates, and any custom settings each machine needs.
The scheduled cron jobs that run the backups, so the process stays repeatable. Spread these jobs across nodes so that losing one machine does not stop the backups.

A sane backup routine

Automate it with scheduled jobs that run nightly:

Take an etcd snapshot from the data directory, then copy the snapshot to a remote system (never keep your only backup on the same machine).
Copy each node's config files (node config, cron entries, certs) to the same remote store.
Compress and timestamp each backup, and rotate old ones so you keep a useful history without filling the disk.
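
The compress, timestamp, and rotate steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the function names, the backup-<timestamp>.tar.gz naming scheme, and the keep count are all assumptions made for the example.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Optional


def make_archive(source_dir: Path, backup_dir: Path, now: Optional[datetime] = None) -> Path:
    """Write source_dir into a gzip tarball whose name carries a UTC timestamp."""
    now = now or datetime.now(timezone.utc)  # callers should pass UTC times
    backup_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme; the timestamp sorts chronologically as text.
    archive = backup_dir / f"backup-{now.strftime('%Y%m%dT%H%M%SZ')}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source_dir, arcname=source_dir.name)
    return archive


def rotate(backup_dir: Path, keep: int) -> List[Path]:
    """Delete all but the newest `keep` archives and return the deleted paths."""
    archives = sorted(backup_dir.glob("backup-*.tar.gz"))
    stale = archives[:-keep] if keep > 0 else archives
    for path in stale:
        path.unlink()
    return stale
```

A nightly job might call make_archive on the collected files and then rotate(backup_dir, keep=14) to hold two weeks of history. Pick the retention count to match your own disk budget.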

Distributing these jobs across nodes means that losing one machine does not cost you the ability to recover.

Restoring after a failure

1. Restore etcd first

etcd is the foundation. Restore the most recent healthy snapshot to a clean data directory (never on top of a corrupt one), then bring etcd back up and confirm the members form a healthy quorum.
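
The "clean data directory" rule is easy to enforce in a wrapper. The sketch below assumes etcd v3 tooling, where etcdctl snapshot restore takes a snapshot file and a --data-dir flag. The function name is hypothetical, and the function only builds the command instead of running it. A real multi-member restore also needs member-specific flags, and older releases may need different steps, so follow the procedure documented for your OpenShift version.

```python
from pathlib import Path
from typing import List


def build_restore_command(snapshot: Path, data_dir: Path) -> List[str]:
    """Return an etcdctl restore command, refusing to reuse an existing directory."""
    if not snapshot.is_file():
        raise FileNotFoundError(f"snapshot not found: {snapshot}")
    # Guard: the target must be a brand-new path, never the old (possibly corrupt) one.
    if data_dir.exists():
        raise FileExistsError(f"refusing to restore into existing directory: {data_dir}")
    return ["etcdctl", "snapshot", "restore", str(snapshot), "--data-dir", str(data_dir)]
```

Returning the command as a list keeps the guard testable on a machine with no etcd installed. You can pass the result to subprocess.run once you have reviewed it.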

2. Restore configuration

Put each node's saved config files back in place and restart the relevant services so masters and nodes rejoin with their correct identity and certificates.
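
One cautious way to put configs back is to unpack the archive into a staging directory first, compare it with what is on the node, and only then copy files into place. The sketch below shows that staging step. The function name and the staging approach are assumptions for illustration, not part of any OpenShift tooling.

```python
import tarfile
from pathlib import Path
from typing import List


def stage_config(archive: Path, staging_dir: Path) -> List[Path]:
    """Unpack a config archive into a new staging directory for review."""
    staging_dir.mkdir(parents=True)  # fails if the directory already exists
    root = staging_dir.resolve()
    with tarfile.open(archive, "r:*") as tar:
        # Reject absolute paths and ../ entries that would escape the staging area.
        for member in tar.getmembers():
            target = (root / member.name).resolve()
            if target != root and root not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        # Newer Python versions support filter="data" for safer extraction.
        extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(root, **extra)
    return sorted(p for p in root.rglob("*") if p.is_file())
```

The returned list gives you the exact files to diff against the live node before you copy anything over and restart services.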

3. Verify the cluster

Check that nodes report Ready, core components are running, and a test workload schedules and runs. Only then call the restore complete.
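
The node check can be scripted instead of eyeballed. As a sketch, the function below takes the JSON text you would get from oc get nodes -o json and lists every node whose Ready condition is not "True". The function name is made up for this example, and it covers only the node check, not core components or the test workload.

```python
import json
from typing import List


def not_ready_nodes(node_list_json: str) -> List[str]:
    """Return the names of nodes whose Ready condition is not "True"."""
    failing = []
    for node in json.loads(node_list_json).get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            failing.append(node["metadata"]["name"])
    return failing
```

An empty result means every node reported Ready. Anything else is the list of machines to look at before you sign off on the restore.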

Key takeaways

Back up etcd (cluster state), node/master config files, and the backup cron jobs themselves.
Always copy backups to a remote system and distribute jobs across nodes for HA.
Restore etcd first, to a clean data directory, then configs, then verify.
Practice the restore before you need it — untested backups are just hope.

Frequently asked questions

What's the most critical thing to back up?

etcd — it holds all cluster state. Without it, you're rebuilding from scratch.

How often should I back up etcd?

At least nightly, and before any risky change (upgrades, major reconfiguration). The more frequent the snapshots, the less cluster state you can lose between the last backup and a failure.
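
A schedule only helps if it is actually running, so it can be worth alerting when the newest backup is older than the interval you chose. Here is a minimal sketch, assuming archives are named backup-<UTC timestamp>.tar.gz. That naming scheme, the function names, and the 24-hour default are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional

# Hypothetical naming scheme, e.g. backup-20240101T020000Z.tar.gz
NAME_FORMAT = "backup-%Y%m%dT%H%M%SZ.tar.gz"


def newest_backup_time(names: Iterable[str]) -> Optional[datetime]:
    """Parse timestamps out of archive names and return the most recent one."""
    stamps = []
    for name in names:
        try:
            parsed = datetime.strptime(name, NAME_FORMAT)
        except ValueError:
            continue  # skip files that do not follow the naming scheme
        stamps.append(parsed.replace(tzinfo=timezone.utc))
    return max(stamps, default=None)


def backup_is_stale(names: Iterable[str], now: datetime,
                    max_age: timedelta = timedelta(hours=24)) -> bool:
    """True when there is no backup at all or the newest one is older than max_age."""
    newest = newest_backup_time(names)
    return newest is None or now - newest > max_age
```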

Can I restore etcd over the existing data directory?

No. Always restore to a fresh directory and switch etcd to it. Restoring on top of corrupt data risks carrying the corruption into the restored member.

Why store backups remotely?

If the backup lives only on the failed machine, it dies with it. Remote copies are the whole point of a backup.
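
A remote copy is only useful if it arrived intact, so it helps to verify it. The sketch below copies a snapshot into a directory and compares SHA-256 checksums. It assumes the remote store is reachable as a mounted path (for example a network share). Both function names are illustrative, and a transfer over scp or object storage would need a different copy step.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large snapshots do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(snapshot: Path, remote_dir: Path) -> Path:
    """Copy snapshot into remote_dir and confirm the copy matches the original."""
    remote_dir.mkdir(parents=True, exist_ok=True)
    copied = Path(shutil.copy2(snapshot, remote_dir / snapshot.name))
    if sha256_of(copied) != sha256_of(snapshot):
        copied.unlink()  # do not leave a corrupt copy that looks like a backup
        raise OSError(f"checksum mismatch after copying {snapshot.name}")
    return copied
```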

Good OpenShift recovery isn't heroics — it's a boring, tested routine: snapshot etcd, save configs, ship them off-box, and rehearse the restore.