Nagios Interview Questions

— ny_wk

Nagios interview questions almost always start with the same core themes: what Nagios is, how its scheduling daemon works, the difference between active and passive checks, SOFT vs HARD states, and how remote hosts are monitored with NRPE. This guide walks through the questions interviewers actually ask, with corrected, accurate answers you can explain confidently in a system administrator or DevOps interview.

Below, the answers are organized by topic rather than dumped as a flat list, so you build a real mental model of how Nagios Core monitors infrastructure. Where the underlying technology is legacy, the modern equivalent is called out so you do not get caught off guard.

Nagios fundamentals: what it is and how it works

What is Nagios and how does it work?

Nagios is an open-source IT infrastructure monitoring system for hosts, services, and networks. Nagios Core runs on a server as a daemon (typically the nagios process), and it behaves like a scheduler: it periodically runs small programs called plugins that test the state of a host or service and return a result.

A plugin might ping a router, query CPU load, check whether a web server returns HTTP 200, or confirm a disk is not full. Nagios records each result, compares it to the previous state, and decides what to do next: log the event, run an event handler, or notify a contact by email or SMS. Operators view current status through the Nagios web interface, which is served by the CGIs (Nagios Core 4 ships a refreshed front end on top of them).

A key clarification interviewers like: Nagios does not contain the monitoring logic for each protocol itself. The intelligence lives in the plugins. Nagios simply schedules them, interprets their exit codes, and acts on state changes.

What are plugins, and what do their exit codes mean?

Plugins are compiled executables or scripts (Perl, Python, shell, etc.) that can run from the command line to check a host or service. They are the heart of Nagios. The standard bundles are Nagios Plugins and the community Monitoring Plugins project (the original nagios-plugins team under a new name), both of which ship checks such as check_http, check_ping, check_disk, and check_load.

Nagios determines status from the plugin's exit code, not its text output:

Exit code	Service state	Host state (typical)
0	OK	UP
1	WARNING	UP or DOWN (interpreted)
2	CRITICAL	DOWN/UNREACHABLE
3	UNKNOWN	DOWN/UNREACHABLE

To discover any plugin's options, run it with -h or --help:

./check_http --help
./check_http -H www.example.com -p 443 --ssl

The first line of a plugin's stdout becomes the status text shown in the UI; anything after a pipe character (|) is treated as performance data for graphing.
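The exit-code and output conventions above can be sketched in a few lines of Python. This is a minimal illustration, not code from any shipped plugin: the function names, the 80/90 thresholds, and the disk_used label are all hypothetical. Only the 0 to 3 exit codes and the "status text | performance data" layout come from the conventions described here.

```python
import shutil

# Exit codes Nagios interprets (see the table above).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_usage(percent_used, warn=80.0, crit=90.0):
    """Return (exit_code, output_line) for a disk-usage percentage.

    warn and crit are illustrative defaults, not values from a real plugin.
    """
    if percent_used >= crit:
        code = CRITICAL
    elif percent_used >= warn:
        code = WARNING
    else:
        code = OK
    # Status text first, then a pipe, then performance data
    # in label=value;warn;crit form.
    line = (f"DISK {STATE_NAMES[code]} - {percent_used:.1f}% used"
            f" | disk_used={percent_used:.1f}%;{warn:.0f};{crit:.0f}")
    return code, line


def check_path(path="/"):
    """Measure real usage for path; report UNKNOWN if it cannot be read."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    return evaluate_usage(100.0 * usage.used / usage.total)
```

To act as a real plugin, a script built on this would print the returned line and pass the returned code to sys.exit(), since Nagios reads the state from the process exit status.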

Nagios configuration files: structure and locations

What is the main configuration file and where does it live?

The master file is nagios.cfg, normally found at /usr/local/nagios/etc/nagios.cfg on a source build. It defines global behavior and points to every other file Nagios reads. The three configuration layers an interviewer expects you to name are:

Resource file (resource.cfg): stores sensitive values such as paths, usernames, and passwords using $USERn$ macros. The CGIs never read it, so you can lock it down with 600 or 660 permissions.
Object definition files: where you define what to monitor and how: hosts, services, hostgroups, contacts, contactgroups, commands, and time periods.
CGI configuration file (cgi.cfg): controls the web interface, authentication, and which users may view or command which objects. It also references the main config so the CGIs know where object definitions live.

How are objects loaded into Nagios?

Inside nagios.cfg you tell Nagios where object definitions are with two directives:

cfg_file=/usr/local/nagios/etc/objects/hosts.cfg loads a single file.
cfg_dir=/usr/local/nagios/etc/conf.d recursively loads every .cfg file in a directory, which scales better for large environments.
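To preview which files a cfg_dir directive would pull in, a small helper can walk the directory the same way the description above implies. This is only an approximation of the behavior described, not Nagios's own loader, and find_cfg_files is a hypothetical name.

```python
import os


def find_cfg_files(cfg_dir):
    """Return every *.cfg file under cfg_dir, including subdirectories."""
    found = []
    for root, _dirs, files in os.walk(cfg_dir):
        for name in files:
            # Only files with a .cfg extension are treated as object config.
            if name.endswith(".cfg"):
                found.append(os.path.join(root, name))
    return sorted(found)
```

Running it against a conf.d tree before a reload is a quick way to spot a leftover .cfg file (for example a backup copy) that would otherwise be loaded silently.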

Always validate before reloading with nagios -v /usr/local/nagios/etc/nagios.cfg. This pre-flight check catches typos and broken references before they take the daemon down.

Which runtime files should you know by name?

Beyond the config files, Nagios writes and reads several runtime files. Interviewers often ask you to map a filename to its purpose:

Directive	Purpose
`log_file`	Main current log (e.g. `/usr/local/nagios/var/nagios.log`).
`log_archive_path`	Destination for rotated log files.
`status_file`	Current status, comments, and downtime (`status.dat`); read by the CGIs. Recreated each time Nagios starts.
`state_retention_file`	Saves state, downtime, and comments at shutdown (`retention.dat`) so they survive a restart; requires `retain_state_information=1`.
`object_cache_file`	Cached copy of object definitions the CGIs read.
`command_file`	The external command pipe (FIFO) Nagios reads commands from.
`lock_file`	Holds the PID when Nagios runs as a daemon (`-d`).
`check_result_path`	Spool directory where check results queue before processing.
`host_perfdata_file` / `service_perfdata_file`	Where performance data is written when `process_performance_data=1`.
`debug_file`	Debug output, controlled by `debug_level` and `debug_verbosity`.

One important correction to a common misconception: the external command file is a named pipe (FIFO), not a regular file. It is created when Nagios starts and removed on shutdown. If a stale file is left behind, Nagios may refuse to start.
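A quick way to tell a live FIFO from a stale regular file is to inspect the file type. The sketch below assumes a Unix-like host; command_file_status is a hypothetical helper, and the path you pass in should be whatever your command_file directive points to.

```python
import os
import stat


def command_file_status(path):
    """Classify a command_file path as 'fifo', 'missing', or 'not-a-fifo'."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    # S_ISFIFO is true only for named pipes, which is what Nagios creates.
    return "fifo" if stat.S_ISFIFO(mode) else "not-a-fifo"
```

A result of "not-a-fifo" while the daemon is stopped suggests something wrote a regular file at that path, which is worth cleaning up before the next start.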

Nagios checks: active vs passive

What is the difference between active and passive checks?

This is one of the most common Nagios interview questions, so be precise. The defining difference is who initiates the check:

Active checks are initiated and executed by the Nagios daemon. Nagios runs a plugin, passes it the parameters, the plugin tests the host or service, and returns a result that Nagios processes.
Passive checks are initiated and performed by an external application. That application submits the result to Nagios by writing to the external command file; Nagios only processes it.

Active checks run on a schedule or on demand. Scheduled intervals depend on state: an object that is OK/UP or in a HARD problem state is checked at check_interval, while one in a SOFT problem state is rechecked at the (usually shorter) retry_interval. On-demand checks happen when Nagios needs fresh data immediately, for example to determine the reachability of a parent host before judging its children.

Passive checks shine for services that are asynchronous (SNMP traps, security alerts, batch job results) or sit behind a firewall where the monitoring server cannot reach them. The workflow is: an external app checks the resource, writes the result to the command file, Nagios reads the file and queues the result, and a check result reaper event processes the queue. Crucially, active and passive results share the same processing queue, so notifications, logging, and event handlers behave identically regardless of origin.
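The "writes the result to the command file" step boils down to producing one correctly formatted line. The sketch below builds a PROCESS_SERVICE_CHECK_RESULT command; the function name and the host and service values in the usage note are made up for illustration.

```python
import time


def passive_service_result(host, service, return_code, output, now=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT line for the command file."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0, 1, 2, or 3")
    timestamp = int(time.time() if now is None else now)
    # A newline would end the command early, so flatten the plugin output.
    output = " ".join(str(output).splitlines())
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}\n")
```

For example, a backup script on a hypothetical host web01 could submit passive_service_result("web01", "Nightly Backup", 0, "Backup OK") by writing that string to the command file. The host and service description must match existing object definitions, or Nagios will ignore the result.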

What does the host/service check execution option do?

The execute_host_checks and execute_service_checks directives control whether Nagios actively runs checks when it starts. Set to 1 (the default), Nagios schedules and runs active checks. Set to 0, it goes into a passive-only "sleep" mode in which it still accepts passive results; this is the typical setting on a backup monitoring server in a distributed or failover setup. Note the gotcha: if state retention is on and use_retained_program_state is enabled, Nagios uses the last saved value on restart and ignores the config file, so you must change it via an external command or the web UI.

State logic: SOFT vs HARD states, stalking, and flapping

Explain SOFT and HARD states.

The current condition of any object has two parts: its status (OK, WARNING, CRITICAL, UP, DOWN, UNREACHABLE) and its state type (SOFT or HARD). State type controls when event handlers fire and when notifications are sent, which is why it matters so much.

SOFT state: a non-OK/non-UP result has occurred but the check has not yet been retried max_check_attempts times. Nagios is still confirming the problem. During a SOFT state it logs (only if log_service_retries/log_host_retries is on) and runs event handlers, but it does not notify contacts. This window is your chance to auto-remediate before alerting humans.
HARD state: the problem persisted through all max_check_attempts retries, or the object moved from one HARD error state to another (WARNING to CRITICAL), or it recovered from a HARD problem state. A recovery from a SOFT problem is only a SOFT recovery. Only in a HARD state does Nagios notify contacts.

Event handler scripts can read the $HOSTSTATETYPE$ or $SERVICESTATETYPE$ macro, which is set to SOFT or HARD, so corrective actions only run at the right moment.
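The SOFT/HARD rules are easier to remember after tracing a sequence of results by hand. The following is a deliberately simplified model for services, written for illustration only: it ignores host state, passive checks, and renotification intervals, and classify_checks is a hypothetical name.

```python
def classify_checks(results, max_check_attempts=3):
    """Return (status, state_type, notify) for each result in order.

    Assumes the service starts in a HARD OK state.
    """
    timeline = []
    attempt = 1
    prev_status, prev_type = "OK", "HARD"
    last_hard_status = "OK"
    for status in results:
        if status == "OK":
            # Recovering from a SOFT problem is a SOFT recovery.
            soft_recovery = prev_status != "OK" and prev_type == "SOFT"
            state_type = "SOFT" if soft_recovery else "HARD"
            attempt = 1
        elif prev_status != "OK" and prev_type == "HARD":
            # Already in a HARD problem, e.g. WARNING -> CRITICAL: stays HARD.
            state_type = "HARD"
        else:
            # First failure starts at attempt 1; later retries count up.
            attempt = 1 if prev_status == "OK" else attempt + 1
            state_type = "HARD" if attempt >= max_check_attempts else "SOFT"
        # Simplification: notify only when the HARD status actually changes.
        notify = state_type == "HARD" and status != last_hard_status
        if state_type == "HARD":
            last_hard_status = status
        timeline.append((status, state_type, notify))
        prev_status, prev_type = status, state_type
    return timeline
```

With max_check_attempts set to 3, three CRITICAL results in a row produce SOFT, SOFT, HARD, and only the third one notifies. A single CRITICAL followed by OK never leaves the SOFT state, so nobody is paged.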

What is state stalking?

State stalking is purely for logging. Normally Nagios only logs a check when the state actually changes. With stalking enabled for chosen states of a host or service, Nagios logs the result whenever the plugin's output text differs from the previous check, even if the state stayed the same. It is a diagnostic aid for spotting changing detail (a fluctuating error message, for example) during later log analysis. It does not trigger extra notifications.

How does flap detection work?

Flapping happens when a host or service changes state too often, producing a storm of problem and recovery notifications. It usually points to thresholds set too tight, a genuinely unstable service, or a real intermittent network fault. Nagios detects it as follows:

It stores the results of the last 21 checks of the object.
It examines that history to find where state transitions occurred (up to 20 possible transitions across 21 samples).
It computes a percent state change value, a measure of volatility. A service that never changes scores 0%; one that changes on every check scores 100%.
It compares that value to the low_flap_threshold and high_flap_threshold.
The object is flagged as started flapping when the percentage first exceeds the high threshold, and stopped flapping when it drops below the low threshold.

While flapping, Nagios suppresses normal notifications and logs the condition so you are not buried in noise.
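The detection steps above can be sketched as a small calculation. This version is unweighted, which is a simplification: Nagios itself gives recent state changes more weight than older ones. The function names are hypothetical, the thresholds are passed in rather than assumed, and the exact handling of values that land precisely on a threshold is a guess.

```python
MAX_SAMPLES = 21       # results kept per object
MAX_TRANSITIONS = 20   # possible state changes across 21 samples


def percent_state_change(history):
    """Unweighted percent state change over the most recent 21 results."""
    recent = list(history)[-MAX_SAMPLES:]
    changes = sum(1 for a, b in zip(recent, recent[1:]) if a != b)
    return 100.0 * changes / MAX_TRANSITIONS


def update_flapping(is_flapping, percent, low_threshold, high_threshold):
    """Apply the two-threshold rule and return the new flapping flag."""
    if not is_flapping and percent > high_threshold:
        return True    # started flapping
    if is_flapping and percent < low_threshold:
        return False   # stopped flapping
    return is_flapping  # between thresholds: keep the previous verdict
```

The gap between the two thresholds is what keeps the flapping flag itself from flapping: a value that sits between them leaves the previous verdict unchanged.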

Remote monitoring: NRPE, external commands, and distributed setups

What is NRPE and how does it work?

NRPE (Nagios Remote Plugin Executor) lets Nagios run plugins on remote Linux/Unix machines so it can monitor "local" resources (CPU load, memory, disk, running processes) that are not exposed over the network. It has two parts:

The check_nrpe plugin on the central monitoring server.
The NRPE daemon running on each remote host (commonly via xinetd or systemd).

The flow is: Nagios runs check_nrpe and tells it which command to execute; check_nrpe connects to the remote NRPE daemon over a TLS-protected TCP connection (default port 5666); the daemon runs the matching local plugin defined in nrpe.cfg; and the result travels back through check_nrpe to Nagios. Security note for modern deployments: older NRPE relied on weak anonymous SSL ciphers, so use a recent NRPE release, restrict allowed_hosts, and prefer TLS with proper certificates. Alternatives worth mentioning are check_by_ssh and agent-based options like NSClient++ for Windows.

What are external commands and when are they processed?

External commands let other applications, including the CGIs, change Nagios at runtime: disabling notifications, scheduling downtime, forcing an immediate check, acknowledging a problem, or adding a comment. Applications submit them by writing to the command file (the FIFO), in the format [time] command_id;command_arguments, where time is in Unix time_t format.

Nagios processes the command file at intervals set by command_check_interval, and additionally immediately after an event handler runs, so an event handler that submits a command takes effect without delay.
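The command format is simple enough to generate with a small helper. The sketch below is one possible way to do it; format_external_command and submit are hypothetical names, and the stream argument is assumed to be the command file opened for writing (or any file-like object when testing).

```python
import time


def format_external_command(name, *args, now=None):
    """Render '[time] command_id;arg1;arg2...' with a trailing newline."""
    timestamp = int(time.time() if now is None else now)
    fields = ";".join([name, *(str(a) for a in args)])
    return f"[{timestamp}] {fields}\n"


def submit(stream, name, *args, now=None):
    """Write one command to an already-open command file stream."""
    stream.write(format_external_command(name, *args, now=now))
    # Flush so the line reaches the pipe immediately.
    stream.flush()
```

For instance, format_external_command("DISABLE_NOTIFICATIONS") renders a command with no arguments, while SCHEDULE_FORCED_SVC_CHECK takes a host name, a service description, and a check time as its arguments.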

Explain distributed monitoring in Nagios.

Distributed monitoring spreads the workload across servers. One or more distributed (poller) servers actively check the services for a cluster of hosts, often a network segment behind its own firewall or across a WAN. A distributed server is usually a bare-bones install: it does not need the web UI, notifications, or event handlers; it just runs checks and forwards the results.

The central server mostly listens: it receives passive check results submitted by the distributed servers (historically via OCSP/OCHP commands and an addon such as NSCA, or today via tools like NRDP). The central server runs active checks only as a fallback, typically when freshness checking finds that passive results have stopped arriving. This design reduces load and lets you monitor isolated network zones from one console.

Object types and the legacy NDOUTILS addon

What are objects in Nagios?

Objects are every element involved in monitoring and notification logic. Know all of them:

Hosts: physical or virtual devices such as servers, routers, switches, and printers.
Services: things you check on a host, such as CPU load, disk usage, a daemon, or uptime.
Host groups / service groups: collections that simplify configuration and the dashboard view.
Contacts / contact groups: people who get notified and how to reach them.
Commands: the actual program or script Nagios runs for checks and notifications.
Time periods: when monitoring and notifications are allowed (e.g. business hours).
Notification escalations: rules that escalate alerts to higher tiers after repeated unacknowledged problems.

What is NDOUTILS and what are its components?

NDOUTILS (Nagios Data Output Utilities) stores Nagios configuration and event data in a relational database (MySQL is the supported backend) for faster retrieval and reporting. Each Nagios process, standalone or part of a distributed/failover setup, is an instance identified by a unique name to keep data separate. Its four components are:

NDOMOD: a Nagios event broker module (ndomod.o) loaded into the daemon that exports config and runtime event data to a file, a Unix domain socket, or a TCP socket.
LOG2NDO: imports historical Nagios/NetSaint log files into the database via the NDO2DB daemon.
FILE2SOCK: reads from a file or STDIN and forwards the data unchanged to a Unix or TCP socket.
NDO2DB: the daemon that receives output from NDOMOD/LOG2NDO and writes it to the database, handling multiple simultaneous clients.

Modern equivalent: NDOUTILS is now legacy. Current Nagios environments more often use the database backend tied to Nagios XI, or pivot to broader open-source stacks such as Icinga 2 (a ground-up rewrite that succeeded Icinga 1, which itself began as a Nagios fork; it has a native API and the IDO/Icinga DB backend) or Prometheus with Grafana for metrics-first monitoring. Mentioning these shows interviewers you understand where the ecosystem has moved.

Verification: confirm your Nagios setup is healthy

After any configuration change, validate before reloading so you never restart into a broken state:

Validate the config: /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg (expect Total Warnings: 0 and Total Errors: 0).
Reload cleanly: systemctl reload nagios (or service nagios reload).
Confirm the daemon is running (systemctl status nagios) and that the web UI shows current host/service status.
Test a single plugin by hand, e.g. ./check_nrpe -H 10.0.0.5 -c check_load, to confirm remote checks return data.
Watch tail -f /usr/local/nagios/var/nagios.log for processed results and notifications.
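If the validation step is scripted, for example in a deployment pipeline, the summary lines can be checked automatically. The sketch below only parses text that has already been captured from nagios -v; it is one possible approach, and both function names are hypothetical.

```python
import re


def parse_verify_summary(output):
    """Extract (warnings, errors) from captured `nagios -v` output.

    Returns None if the summary lines are not present.
    """
    warnings = re.search(r"Total Warnings:\s*(\d+)", output)
    errors = re.search(r"Total Errors:\s*(\d+)", output)
    if not (warnings and errors):
        return None
    return int(warnings.group(1)), int(errors.group(1))


def safe_to_reload(output):
    """True only when the summary reports zero warnings and zero errors."""
    return parse_verify_summary(output) == (0, 0)
```

Treating a missing summary as unsafe is deliberate: if the validator crashed or printed something unexpected, the reload should not proceed.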

Common pitfalls to mention in an interview

Confusing status (OK/WARNING/CRITICAL) with state type (SOFT/HARD); notifications only fire on HARD states.
Forgetting that the command file is a FIFO, not a regular file.
Assuming notifications go out during SOFT states; they do not.
Believing Nagios contains protocol logic; the logic lives in the plugins, and Nagios only reads their exit codes.
Treating NRPE's old SSL as secure; modern deployments must harden it or use alternatives.

Key Takeaways

Nagios is a scheduler that runs plugins and acts on their numeric exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
Active checks are initiated by Nagios; passive checks are submitted by external apps through the command file FIFO.
SOFT states run event handlers only; HARD states (after max_check_attempts) are when contacts get notified.
NRPE (port 5666) executes plugins on remote hosts to monitor local resources that are not exposed over the network.
NDOUTILS is legacy; modern stacks favor Nagios XI, Icinga 2, or Prometheus and Grafana.

Frequently Asked Questions

What are the default Nagios exit codes?

Plugins return 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. Nagios reads only the exit code to set the state and uses the first line of stdout as the status message.

What is the difference between Nagios Core and Nagios XI?

Nagios Core is the free, open-source engine you configure with text files. Nagios XI is the commercial product built on Core that adds a polished web UI, configuration wizards, dashboards, reporting, and a database backend, aimed at teams that want less manual setup.

What port does NRPE use?

NRPE listens on TCP port 5666 by default. The central server's check_nrpe plugin connects to that port over a TLS-protected link to run plugins on the remote host.

Is Nagios still relevant compared to Prometheus?

Yes, for check-based, status-and-alert monitoring of hosts and services, especially with NRPE agents and a mature plugin ecosystem. Prometheus is metrics-and-time-series first with a pull model and PromQL, and pairs with Grafana and Alertmanager. Many shops run both, or migrate to Icinga 2 for a modernized engine that still runs Nagios-style plugins.
