Nagios Interview Questions

— ny_wk

Nagios interview questions almost always probe how the monitoring engine actually behaves: how checks are scheduled, what each configuration file does, the difference between SOFT and HARD states, and how remote checks run through NRPE. This guide turns the most common questions into clear, corrected answers you can use to prepare for a Linux system administration or DevOps interview.

Rather than dumping raw questions, the sections below group the topics the way an interviewer thinks about them (architecture, configuration, check logic, state and flap detection, and distributed monitoring) so you understand the why, not just the one-line answer. Where the older study material was inaccurate, this version fixes it. Note that Nagios Core remains widely deployed, but many shops have migrated to forks and successors such as Icinga 2 and Naemon, or to entirely different stacks like Prometheus; showing that awareness in an interview signals modern context.

Nagios architecture and core concepts

What is Nagios and how does it work?

Nagios is an open-source infrastructure monitoring system that watches hosts, services, and network resources, then alerts you when something breaks or recovers. The Nagios Core daemon runs on a monitoring server and acts like a scheduler: at defined intervals it executes small programs called plugins, which probe a target (for example, ping a host, query HTTP, or check disk usage) and return a status.

Each plugin returns an exit code that Nagios interprets: 0 = OK, 1 = WARNING, 2 = CRITICAL, and 3 = UNKNOWN. Based on that result and its state logic, Nagios may send notifications (email, SMS, chat), run event handlers to take automated corrective action, and update the web interface (CGIs) that operators view. Results can also be pushed into Nagios from external systems as passive checks.
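
The exit-code contract described above can be sketched in a few lines of Python. This is a minimal illustration, not a real plugin: the script name check_demo.py, the "DEMO" label, and the "higher value is worse" threshold logic are all assumptions made for the example.

```python
# Plugin exit codes and the status each one maps to.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a measured value to an exit code (assumes higher value = worse)."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def main(argv):
    # Hypothetical usage: check_demo.py <value> <warn> <crit>
    try:
        value, warn, crit = (float(arg) for arg in argv[1:4])
    except ValueError:
        # Missing or non-numeric arguments: report UNKNOWN rather than guess.
        print("DEMO UNKNOWN - usage: check_demo.py <value> <warn> <crit>")
        return UNKNOWN
    code = classify(value, warn, crit)
    # One line of human-readable output accompanies the exit code.
    print(f"DEMO {STATUS_NAMES[code]} - value is {value}")
    return code
```

A script built this way would finish with sys.exit(main(sys.argv)) so that the return value becomes the process exit code Nagios reads.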

What are objects in Nagios?

Objects are the building blocks of your monitoring configuration — everything involved in the monitoring and notification logic. The main object types are:

Host — a physical or virtual device (server, router, switch, printer).
Service — something measured on a host (CPU load, disk space, a running process, an HTTP endpoint).
Host group / Service group — logical collections that simplify configuration and the status views.
Contact / Contact group — the people who get notified, and how.
Command — defines exactly what program or script Nagios runs for a check or a notification.
Time period — controls when checks run and when notifications are allowed.
Notification escalation — escalates alerts to higher tiers if a problem persists.
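
To make the object types less abstract, the sketch below renders a host and a service in the "define <type> { ... }" layout used by object files. The helper render_object is hypothetical, and the host name, address, and command names are example values; real definitions usually carry more directives or inherit them from a template.

```python
def render_object(object_type, directives):
    """Render one 'define <type> { ... }' block from a dict of directives."""
    width = max(len(name) for name in directives)
    lines = [f"define {object_type} {{"]
    for name, value in directives.items():
        # Pad directive names so the values line up in a column.
        lines.append(f"    {name.ljust(width)}  {value}")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical host and service; every value here is an example only.
host = render_object("host", {
    "host_name": "web01",
    "address": "192.0.2.10",
    "check_command": "check-host-alive",
    "max_check_attempts": 3,
})
service = render_object("service", {
    "host_name": "web01",
    "service_description": "HTTP",
    "check_command": "check_http",
    "max_check_attempts": 3,
})
print(host)
print(service)
```

Note how the service points back at its host through host_name: that link is what lets Nagios relate a service problem to the state of the host it runs on.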

What are plugins and how do you use them?

Plugins are standalone executables or scripts (Perl, Python, Bash, C) that perform the actual checking. Nagios itself does not know how to test a web server or a disk — it delegates that to a plugin and reads the result. Almost every standard plugin documents its options with -h or --help. For example:

Find the plugin directory (commonly /usr/local/nagios/libexec/ or /usr/lib/nagios/plugins/).
Run the plugin directly to see usage: ./check_http --help
Test it by hand before wiring it into a service: ./check_http -H www.example.com
Read the exit code to confirm behavior: echo $?

Testing a plugin on the command line first is the single most useful debugging habit — if it does not work in the shell, it will not work inside Nagios.
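
The same manual test can be scripted. The sketch below runs a command, reads its exit code, and maps it to a status name. The helper run_plugin is hypothetical, mapping out-of-range codes to UNKNOWN is a simplifying assumption, and a small child Python process stands in for a real plugin so the example runs without Nagios installed.

```python
import subprocess
import sys

STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_plugin(argv, timeout=10):
    """Run a plugin command and return (status_name, first_output_line)."""
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    # Assumption for this sketch: any code outside 0-3 is reported as UNKNOWN.
    status = STATUS_NAMES.get(result.returncode, "UNKNOWN")
    lines = result.stdout.splitlines()
    first_line = lines[0] if lines else ""
    return status, first_line


# Stand-in for a real plugin: prints one line and exits with code 2 (CRITICAL).
fake_plugin = [
    sys.executable,
    "-c",
    "import sys; print('DISK CRITICAL - 3% free'); sys.exit(2)",
]
print(run_plugin(fake_plugin))
```

Swapping fake_plugin for a real command line such as ["./check_http", "-H", "www.example.com"] gives the scripted equivalent of running the plugin and then echo $?.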

Nagios configuration files explained

Configuration is one of the most asked-about areas in Nagios interview questions, because misconfiguration is a common cause of monitoring failures in practice. There are four kinds of configuration file.

The main, object, resource, and CGI files

File	Typical path	Purpose
Main configuration file	/usr/local/nagios/etc/nagios.cfg	Top-level settings; references all object and resource files.
Object definition files	/usr/local/nagios/etc/objects/*.cfg	Define hosts, services, groups, contacts, commands, time periods.
Resource file	/usr/local/nagios/etc/resource.cfg	Holds $USERn$ macros (paths, usernames, passwords); kept private.
CGI configuration file	/usr/local/nagios/etc/cgi.cfg	Controls the web interface and authorization.

The resource file is important to call out: it stores sensitive values such as credentials and directory paths in the $USER1$, $USER2$, and further $USERn$ macros. The CGIs never read it, so you can lock it down with restrictive permissions such as chmod 600 or 660. You can load multiple resource files by adding several resource_file directives; Nagios processes them all.
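
As a rough illustration of how a $USERn$ macro ends up in a command line, the sketch below parses resource-file text and substitutes the values. The function names are hypothetical and the substitution is a simplification of what Nagios does; it handles only $USERn$ macros.

```python
import re


def parse_resource_file(text):
    """Collect $USERn$ definitions from resource-file text into a dict."""
    macros = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if sep and re.fullmatch(r"\$USER\d+\$", name.strip()):
            macros[name.strip()] = value.strip()
    return macros


def expand_user_macros(command_line, macros):
    """Replace each known $USERn$ in a command line; leave unknown ones as-is."""
    return re.sub(
        r"\$USER\d+\$",
        lambda match: macros.get(match.group(0), match.group(0)),
        command_line,
    )


# Example resource-file content; the path matches a common source install.
resource_text = """
# Path to the plugins
$USER1$=/usr/local/nagios/libexec
"""
macros = parse_resource_file(resource_text)
print(expand_user_macros("$USER1$/check_ping -H 192.0.2.10", macros))
```

Because command definitions only mention $USER1$, the real path (or a password held in another $USERn$ macro) stays out of the object files that the web interface can read.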

How are object files referenced from the main config?

Inside nagios.cfg you point Nagios at your object definitions two ways:

cfg_file=/usr/local/nagios/etc/objects/commands.cfg — load a single named file.
cfg_dir=/usr/local/nagios/etc/conf.d — load every .cfg file in a directory (handy for scaling).
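
A quick way to audit which object sources a main config pulls in is to scan it for these two directives. The sketch below does that on an inline excerpt; find_object_sources is a hypothetical helper, and it assumes simple key=value lines with # comments.

```python
def find_object_sources(main_config_text):
    """Return (cfg_files, cfg_dirs) named in nagios.cfg-style text."""
    cfg_files, cfg_dirs = [], []
    for line in main_config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "cfg_file":
            cfg_files.append(value)
        elif key == "cfg_dir":
            cfg_dirs.append(value)
    return cfg_files, cfg_dirs


sample = """
# Excerpt in the style of nagios.cfg
cfg_file=/usr/local/nagios/etc/objects/commands.cfg
cfg_dir=/usr/local/nagios/etc/conf.d
resource_file=/usr/local/nagios/etc/resource.cfg
"""
files, dirs = find_object_sources(sample)
print(files, dirs)
```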

What are the key runtime files Nagios uses?

Expect to be asked what files Nagios writes at runtime and where. The important ones:

Log file — log_file=/usr/local/nagios/var/nagios.log: the current event log.
Status file — status_file=/usr/local/nagios/var/status.dat: current status for the CGIs; deleted on stop, recreated on start.
State retention file — state_retention_file=/usr/local/nagios/var/retention.dat: preserves status, comments, and downtime across restarts (requires retain_state_information=1).
Object cache / precache — object_cache_file and precached_object_file: parsed copies of object data for faster startup.
External command file — command_file=/usr/local/nagios/var/rw/nagios.cmd: a named pipe (FIFO) Nagios reads commands from; created at start, removed at shutdown.
Lock file — lock_file=/usr/local/nagios/var/nagios.lock: holds the daemon PID when run with -d.
Log archive path — log_archive_path=/usr/local/nagios/var/archives/: where rotated logs land.
Check result path — check_result_path=/usr/local/nagios/var/spool/checkresults: scratch space for check results before processing (dedicated; do not store other files there).
Performance data files — host_perfdata_file and service_perfdata_file: write perfdata after each check when process_performance_data=1 is enabled.
Debug file — debug_file=/usr/local/nagios/var/nagios.debug: controlled by debug_level and debug_verbosity.
Temp path — temp_path=/tmp: scratch space; old files should be cleaned periodically.

One correction worth knowing: the external command file is a named pipe, not a regular file. If a stale nagios.cmd already exists at startup, Nagios will refuse to start cleanly — a classic gotcha.
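
Whether a given path really is a named pipe can be checked from a script. The sketch below uses only the Python standard library; is_named_pipe is a hypothetical helper name.

```python
import os
import stat


def is_named_pipe(path):
    """Return True if path exists and is a FIFO (named pipe)."""
    try:
        return stat.S_ISFIFO(os.stat(path).st_mode)
    except FileNotFoundError:
        return False
```

Calling is_named_pipe("/usr/local/nagios/var/rw/nagios.cmd") while the daemon is running should return True on a default source install; a regular file at that path suggests something other than Nagios created it.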

Check execution: active vs passive checks

What is the execute host/service checks option?

The directives execute_host_checks and execute_service_checks decide whether Nagios actively runs checks at all on startup. Set to 1 (default) it runs them; set to 0 it stays in a passive "sleep" mode and only accepts results pushed in. This is used for backup monitoring servers and distributed setups. Important caveat: if state retention is on and use_retained_program_state is enabled, Nagios ignores the configured value at restart and uses the last retained setting — you then change it via an external command or the web UI.

What is the difference between active and passive checks?

This is one of the most frequently asked Nagios interview questions, so be precise:

Aspect	Active check	Passive check
Initiated by	The Nagios daemon	An external application or process
Schedule	Regular intervals or on-demand	Whenever the external source decides
Typical use	Ping, HTTP, disk, CPU polling	SNMP traps, security alerts, async events
Behind a firewall	Hard to reach	Works well (results pushed in)

Active checks run on the schedule defined by check_interval (while the host or service is OK or in a HARD problem state) or retry_interval (while it is in a SOFT problem state and being rechecked), plus on demand when Nagios needs fresh data, for example when determining host reachability through parent/child relationships.

Passive checks flow differently: an external program checks something, writes the result to the external command file, and on its next read Nagios queues that result. A periodic check result reaper processes the queue — the very same queue used for active results — so passive and active results are handled identically once they arrive.
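
The line an external program writes for a passive service result uses the PROCESS_SERVICE_CHECK_RESULT external command. The sketch below only builds that line; the helper name, host name, and service description are example values.

```python
import time


def passive_service_result(host, service, return_code, output, now=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT line for the external command file."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0, 1, 2, or 3")
    timestamp = int(time.time() if now is None else now)
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}\n"
    )


# Hypothetical host and service names; a fixed timestamp keeps the output stable.
line = passive_service_result(
    "web01", "Backup Job", 2, "Backup failed: exit 1", now=1700000000
)
print(line, end="")
```

Submitting the result is then a matter of opening the external command file for writing and writing this one line; the host and service must already be defined, with passive checks accepted for them.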

What are external commands and when are they processed?

External commands let other applications (including the CGIs) change Nagios behavior at runtime: disabling notifications, scheduling downtime, forcing an immediate check, or adding comments. They are written to the external command file in the format:

[timestamp] command_id;command_arguments

Nagios processes them at intervals set by command_check_interval, and additionally immediately after event handlers run, so handler-submitted commands take effect quickly.
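
Reading the format in the other direction helps when auditing what other tools submit. The sketch below splits a command line into its parts; parse_external_command is a hypothetical helper, and splitting on every semicolon is a simplification because a final free-text argument may itself contain semicolons.

```python
import re

# "[timestamp] COMMAND_ID" optionally followed by ";arg1;arg2;..."
COMMAND_RE = re.compile(r"^\[(\d+)\] ([A-Z0-9_]+)(?:;(.*))?$")


def parse_external_command(line):
    """Split '[timestamp] COMMAND_ID;arg1;arg2' into (timestamp, id, args)."""
    match = COMMAND_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an external command line: {line!r}")
    timestamp, command_id, arg_text = match.groups()
    # Naive split: fine for simple commands, too coarse for free-text arguments.
    args = arg_text.split(";") if arg_text else []
    return int(timestamp), command_id, args


forced = parse_external_command(
    "[1700000000] SCHEDULE_FORCED_SVC_CHECK;web01;HTTP;1700000060"
)
quiet = parse_external_command("[1700000000] DISABLE_NOTIFICATIONS")
print(forced)
print(quiet)
```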

Nagios state types, stalking, and flap detection

What are SOFT and HARD states in Nagios?

A check's current condition has two parts: the status (OK, WARNING, CRITICAL, UP, DOWN, UNREACHABLE) and the state type (SOFT or HARD). The state type controls when notifications fire — a crucial distinction interviewers love.

SOFT state — a non-OK result has occurred but the check has not yet been retried max_check_attempts times. Nagios runs event handlers and (optionally) logs it, but does not notify contacts. This window is for proactively fixing a blip before it becomes real.
HARD state — the problem has persisted through max_check_attempts, or the host or service moved from one HARD error state to another (for example WARNING→CRITICAL), or a service check returns non-OK while its host is DOWN or UNREACHABLE. HARD state changes are logged, run event handlers, and notify contacts.

Event handler scripts can read the $HOSTSTATETYPE$ / $SERVICESTATETYPE$ macros ("SOFT" or "HARD") to decide whether to act. A passive host check is treated as HARD unless passive_host_checks_are_soft is enabled.
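
The SOFT/HARD rules are easier to remember after tracing a sequence of results through them. The sketch below is a simplified model of a single service, written for illustration: it ignores host state, passive checks, and flapping, and the function name and starting state (OK, HARD) are assumptions.

```python
def track_state_types(results, max_check_attempts):
    """Yield (status, state_type, attempt) for each check result in order."""
    status, state_type, attempt = "OK", "HARD", 1  # assumed starting point
    for new_status in results:
        if new_status == "OK":
            if status != "OK" and state_type == "SOFT":
                # Recovered before the retries ran out: a SOFT recovery.
                state_type, attempt = "SOFT", attempt + 1
            else:
                # Steady OK, or recovery from a HARD problem.
                state_type, attempt = "HARD", 1
        elif status == "OK":
            # First non-OK result after OK starts the retry window.
            attempt = 1
            state_type = "HARD" if max_check_attempts == 1 else "SOFT"
        elif state_type == "SOFT":
            # Still retrying; the last allowed attempt makes the problem HARD.
            attempt += 1
            if attempt >= max_check_attempts:
                state_type = "HARD"
        # Otherwise: already in a HARD problem state, which stays HARD.
        status = new_status
        yield status, state_type, attempt


sequence = ["CRITICAL", "WARNING", "CRITICAL", "WARNING", "OK", "UNKNOWN", "OK", "OK"]
for row in track_state_types(sequence, max_check_attempts=3):
    print(row)
```

In this model, notifications would go out only on the rows where the state type is HARD and the status differs from the previous HARD status; the SOFT rows are the window in which an event handler can try a fix.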

What is state stalking?

Stalking is purely a logging feature. Normally Nagios logs a check result only when the host or service changes state. With stalking enabled for a host or service, Nagios also logs a result whenever the plugin output changes — even if the state stayed the same. That extra detail is valuable for later forensic analysis of intermittent problems.
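
The logging rule above reduces to a two-part condition. The sketch below is a simplified model with a hypothetical function name; it treats stalking as a single on/off flag and ignores the separate settings that control logging of retries.

```python
def should_log(prev_state, prev_output, state, output, stalking_enabled):
    """Decide whether a check result is logged under the rule described above."""
    if state != prev_state:
        return True  # state changes are logged with or without stalking
    # Same state: only stalking makes a change in plugin output worth logging.
    return stalking_enabled and output != prev_output
```

For example, a RAID check that stays WARNING while its output changes from "1 drive degraded" to "2 drives degraded" would be logged only with stalking enabled.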

How does flap detection work?

A host or service is flapping when it changes state too frequently, producing a storm of problem/recovery notifications — often a sign of thresholds set too tight or a genuinely unstable service. Nagios detects this as follows:

It stores the results of the last 21 checks (allowing up to 20 transitions between them).
It examines those results to find where state transitions occurred.
It computes a percent state change — 0% means the state never changed, 100% means it changed on every check — with recent transitions weighted more heavily.
It compares that value to the configured low and high flapping thresholds.
Flapping starts when the value first exceeds the high threshold, and stops when it drops below the low threshold.

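When flapping starts, Nagios sends a single "flapping start" notification, then suppresses further problem and recovery notifications and adds a comment so operators understand why alerts went quiet. Enable it globally with enable_flap_detection=1; the per-object flap_detection_enabled directive controls it for individual hosts and services.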
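
The calculation described in the steps above can be sketched as follows. This is an illustrative model rather than the Nagios source: the linear weighting from 0.75 (oldest transition) to 1.25 (newest) is an assumption chosen so that a change on every check comes out at 100%, and the threshold values in the usage lines are examples.

```python
def percent_state_change(history, low_weight=0.75, high_weight=1.25):
    """Weighted percent state change over check states listed oldest first."""
    slots = len(history) - 1  # possible transitions between consecutive results
    if slots < 1:
        return 0.0
    weighted = 0.0
    for i in range(slots):
        if history[i] != history[i + 1]:
            # Weight rises linearly from low_weight (oldest) to high_weight (newest).
            position = i / (slots - 1) if slots > 1 else 1.0
            weighted += low_weight + position * (high_weight - low_weight)
    return weighted * 100.0 / slots


def update_flapping(is_flapping, percent, low_threshold, high_threshold):
    """Apply the two-threshold rule: start above high, stop below low."""
    if not is_flapping and percent > high_threshold:
        return True
    if is_flapping and percent < low_threshold:
        return False
    return is_flapping  # between the thresholds nothing changes


steady = ["OK"] * 21
alternating = ["OK" if i % 2 == 0 else "CRITICAL" for i in range(21)]
print(percent_state_change(steady))
print(update_flapping(False, percent_state_change(alternating), 5.0, 20.0))
```

The gap between the two thresholds is what keeps a service from toggling in and out of the flapping state when its percent state change hovers around a single cutoff.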

Distributed monitoring and remote checks

What is distributed monitoring in Nagios?

Distributed monitoring spreads the checking workload across multiple Nagios servers. Distributed servers actively check the services for a "cluster" (any arbitrary group of hosts), then submit those results as passive checks to a central server. The distributed server is usually a bare-bones install — no web UI or notifications required — while the central server aggregates results, sends notifications, and only performs active checks itself in emergencies (for example, if a distributed server goes silent). This scales monitoring across WAN segments and firewalled networks.

What is NRPE and how does it work?

NRPE (Nagios Remote Plugin Executor) lets the monitoring server run plugins on a remote Linux/Unix host to check local resources — CPU load, memory, disk — that are not exposed to the network. It has two pieces:

check_nrpe — the plugin that runs on the central monitoring server.
NRPE daemon — the agent running on the remote host.

The flow: Nagios runs check_nrpe and names the remote command to run; check_nrpe connects to the NRPE daemon over an (optionally SSL/TLS-protected) connection; the daemon executes the corresponding local plugin; and the result travels back through check_nrpe to Nagios. A typical manual test looks like:

./check_nrpe -H 10.0.0.20 -c check_disk

Be aware NRPE has limits on returned output length and you must register each command in the remote nrpe.cfg — common interview follow-ups. Alternatives include NSClient++ for Windows and check_by_ssh when you prefer SSH over a dedicated agent.
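
Command registration in nrpe.cfg uses one command[name]=command line entry per check, and the name is what check_nrpe passes with -c. The sketch below lists the registered names from config text; parse_nrpe_commands is a hypothetical helper, and the plugin paths and thresholds in the sample are placeholders.

```python
import re

COMMAND_LINE_RE = re.compile(r"^command\[([^\]]+)\]\s*=\s*(.+)$")


def parse_nrpe_commands(config_text):
    """Map each command name in nrpe.cfg-style text to its command line."""
    commands = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = COMMAND_LINE_RE.match(line)
        if match:
            commands[match.group(1)] = match.group(2).strip()
    return commands


# Example entries; plugin paths and thresholds are placeholders.
sample = """
# command[<name>]=<command line run on the remote host>
command[check_disk]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
allowed_hosts=127.0.0.1,10.0.0.5
"""
commands = parse_nrpe_commands(sample)
print(sorted(commands))
```

If check_nrpe is called with a -c name that is missing from this mapping on the remote host, the daemon has nothing to run, which is a common cause of failed remote checks after adding a new service on the server side only.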

What is NDOUtils and what are its components?

NDOUtils (Nagios Data Output Utilities) stores Nagios configuration and event data in a database — historically MySQL — for faster retrieval and richer reporting. Each Nagios process is treated as a uniquely named instance, which keeps data clean across standalone, distributed, redundant, and failover setups. Its four components are:

NDOMOD — a Nagios event broker module loaded into the daemon that exports config and runtime data to a file, Unix socket, or TCP socket.
LOG2NDO — imports historical Nagios/NetSaint log files into the database via the NDO2DB daemon.
FILE2SOCK — reads from a file or STDIN and writes the raw data to a Unix or TCP socket.
NDO2DB — the daemon that consumes data from NDOMOD/LOG2NDO and stores it in the database, spawning a process per connected client.

Modern context to mention: NDOUtils is legacy; current deployments more often use database backends like those in Icinga 2 (IDO / Icinga DB) or push metrics to time-series stores, so cite NDOUtils as the classic answer while noting the modern alternative.

Key Takeaways

Plugins do the work, the daemon schedules it — Nagios reads plugin exit codes (0/1/2/3) and acts on them.
Know your config files — nagios.cfg (main), object files, resource.cfg (secrets via $USERn$), and cgi.cfg (web UI).
SOFT vs HARD is the heart of notifications — contacts are only notified on HARD states, after max_check_attempts.
Active vs passive — active checks are run by Nagios on a schedule; passive results are pushed in via the external command file.
Remote and scaled monitoring — NRPE runs plugins on remote hosts; distributed servers feed a central server via passive checks.

Frequently Asked Questions

Is Nagios still relevant in 2026?

Yes, for many enterprises, but it is no longer the default everywhere. Nagios Core is mature and stable, while teams increasingly choose forks and successors such as Naemon and Icinga 2, or metrics-first stacks like Prometheus with Alertmanager and Grafana. Knowing Nagios fundamentals still transfers directly to those tools.

What exit codes do Nagios plugins return?

Four standard codes: 0 OK, 1 WARNING, 2 CRITICAL, and 3 UNKNOWN. Nagios maps these to host/service states and uses them with state-type logic to decide on notifications and event handlers.

What is the difference between NRPE and SNMP for monitoring?

NRPE installs an agent on the remote host and runs full Nagios plugins locally, giving deep access to OS-level metrics. SNMP relies on the agent already built into most network gear (routers, switches), so it suits devices where you cannot install extra software. Choose NRPE for servers you control and SNMP for appliances and infrastructure.

How do I troubleshoot a Nagios check that always shows UNKNOWN?

Run the plugin by hand as the nagios user, check the exit code with echo $?, verify the command definition and arguments in your object files, and confirm file permissions on the plugin and any resource paths. Most UNKNOWN results come from a plugin that errors out or is missing required arguments.
