Veritas Cluster Interview Questions

— ny_wk

Veritas Cluster Server (VCS) interview questions for L2 and L3 roles cluster around six themes: architecture (LLT, GAB, HAD), service groups and resources, daily administration commands, split-brain prevention through I/O fencing, node add/remove operations, and troubleshooting faulted resources. This guide organizes the most frequently asked VCS interview questions into clear topic sections with corrected, production-accurate answers you can actually use on the job.

Veritas Cluster Server (now part of Veritas InfoScale Availability, formerly a Symantec product) connects multiple independent servers into a single high-availability framework. When a node or a monitored application fails, surviving nodes take predefined actions to restart the service elsewhere. The questions below assume Solaris, AIX, or Linux deployments; the concepts and CLI commands are the same across platforms, while device names and startup scripts differ.

VCS architecture: LLT, GAB and HAD

Almost every Veritas Cluster interview opens with the three core components. Get these crisp and the rest of the conversation flows.

What are LLT, GAB and HAD, and what does each do?

LLT (Low Latency Transport) is a kernel-level, non-routable Layer-2 protocol that carries all cluster traffic over the private interconnects. It sends and receives heartbeats and load-balances inter-node traffic across every configured link. If a link fails, traffic moves to the surviving links.
GAB (Group Membership Services / Atomic Broadcast) sits on top of LLT. It maintains cluster membership by tracking the heartbeats LLT delivers, and its atomic broadcast guarantees every node receives the same configuration and state changes in the same order.
HAD (High Availability Daemon) is the VCS engine itself. It runs on every node, holds the cluster configuration and state, and manages the agents that control resources. HAD is monitored by a companion process, hashadow, which restarts HAD if it dies.

How many LLT links are supported, and how many nodes per cluster?

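LLT supports up to 8 network links (high and low priority combined). A single VCS cluster supports up to 32 nodes in the classic releases most interview sets assume; later releases raise that limit, so confirm the figure against the release notes for your version. You should always configure at least two interconnects so the loss of one link does not put the cluster into jeopardy.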

What is a heartbeat in VCS?

A heartbeat is a small broadcast packet a node sends to announce it is alive. By default each node sends two heartbeat packets per second per interface. GAB consumes these heartbeats to decide cluster membership. When heartbeats stop arriving from a peer, GAB marks that node as down.

What are the GAB ports and what runs on each?

GAB multiplexes different cluster services onto named ports. Knowing the common ones signals real hands-on experience.

Port	Service
a	GAB driver itself
b	I/O fencing (data integrity)
d	ODM (Oracle Disk Manager)
f	CFS (Cluster File System)
h	VCS / HAD (high availability daemon)
o	VCSMM driver (Oracle/VCS membership)
v / w	CVM (Cluster Volume Manager) and vxconfigd

Verify membership with gabconfig -a. In a healthy two-node cluster you expect to see port a and port h each showing both node IDs in the membership list.
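
The membership check above can be scripted. What follows is a minimal sketch, not a supported tool: it parses text in the general shape of gabconfig -a output and confirms that ports a and h both list the expected node IDs. The sample text, its gen values, and the function names port_membership and check_ports are illustrative assumptions.

```shell
# Sample text in the general shape of `gabconfig -a` output for a two-node
# cluster (node IDs 0 and 1). The gen values are made up for illustration.
sample_gab_output='GAB Port Memberships
===============================================================
Port a gen   a36e0003 membership 01
Port h gen   fd570002 membership 01'

# port_membership PORT: read gabconfig-style text on stdin and print the
# membership string for PORT (prints nothing if the port is absent).
port_membership() {
    awk -v port="$1" '$1 == "Port" && $2 == port { print $6 }'
}

# check_ports EXPECTED: succeed only if ports a and h both show EXPECTED.
check_ports() {
    text=$(cat)
    for p in a h; do
        got=$(printf '%s\n' "$text" | port_membership "$p")
        if [ "$got" != "$1" ]; then
            echo "port $p: expected membership $1, got '${got:-none}'"
            return 1
        fi
    done
    echo "ports a and h OK (membership $1)"
}

printf '%s\n' "$sample_gab_output" | check_ports 01
```

On a live node the same function would read real output, for example gabconfig -a | check_ports 01. The sketch only handles the simple membership strings of small clusters.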

Service groups, resources and agents

The second big block of VCS interview questions covers the objects you administer every day.

What is a service group?

A service group is a virtual container that lets VCS manage an application as a single unit. It holds all the resources an application needs (storage, network, the application process) plus the dependency links between them. Service-group attributes such as SystemList and AutoStartList define where it can run and where it starts.

What are the service group types?

Failover — online on only one node at a time; VCS migrates it on fault or on request. This is the most common type.
Parallel — online on several nodes simultaneously, used by applications like Oracle RAC.
Hybrid — behaves as failover within a system zone and parallel across zones, used in replicated data clusters.

What is a resource, and what are the resource types?

A resource maps to a hardware or software component (a disk group, a mount, an IP, an application). Each resource has a unique name and belongs to a service group. There are three operational categories:

On-Off — VCS can bring it online and take it offline (most resources, e.g. a disk group).
On-Only — VCS can start it but never stops it (e.g. an NFS daemon shared by many groups).
Persistent — VCS can only monitor it, never online or offline it (e.g. a NIC). Its Operations attribute is None; if a critical persistent resource faults, the group fails over like any other.

What is an agent?

An agent is a multi-threaded process that implements the logic to manage one resource type. There is one agent per resource type per node, and that single agent manages every resource of that type. Agents have entry points (online, offline, monitor, clean) and periodically report each resource's state back to HAD.

What is the difference between critical and non-critical resources?

If a critical resource faults, the entire service group fails over. If a non-critical resource faults, the group keeps running and does not fail over. The Critical attribute (1 or 0) controls this per resource.

Everyday VCS administration commands

Interviewers love rapid-fire command questions. These are the commands that come up most, with the corrected syntax.

Status, configuration and verification

Cluster status summary: hastatus -sum
Resource state: hares -state <resource>
Main config file: main.cf, located in /etc/VRTSvcs/conf/config
Verify main.cf syntax: hacf -verify /etc/VRTSvcs/conf/config
Engine log: /var/VRTSvcs/log/engine_A.log

Making the configuration writable or read-only

The configuration is read-only by default. To change it you make it writable, edit, then dump it back to disk:

Make it writable: haconf -makerw
Save and lock it again: haconf -dump -makero
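
As a sketch of that writable/edit/dump cycle, the snippet below wraps one attribute change (the Critical flag described earlier) between the two haconf calls. The run wrapper, the DRY_RUN variable, the function name set_critical, and the resource name app_ip are hypothetical; with DRY_RUN left at 1 the commands are only printed, so the sequence can be reviewed away from a cluster.

```shell
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 on a
# real cluster node to execute them. This wrapper is a hypothetical helper.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# set_critical RESOURCE VALUE: open the configuration, change one attribute,
# then dump it to disk and return it to read-only.
set_critical() {
    run haconf -makerw &&
    run hares -modify "$1" Critical "$2" &&
    run haconf -dump -makero
}

set_critical app_ip 0   # app_ip is a placeholder resource name
```

Because the steps are chained with &&, a failed modify leaves the configuration writable; on a real cluster you would still need to run haconf -dump -makero afterwards.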

Service group operations (note the corrected commands)

A common trap: some widely circulated answer sets label the offline command as "online." The correct forms are:

Bring a group online on a node: hagrp -online <group> -sys <node>
Take a group offline on a node: hagrp -offline <group> -sys <node>
Switch a group to another node: hagrp -switch <group> -to <node>
Bring a group online on the node chosen by its failover policy: hagrp -online <group> -any
Freeze (temporary): hagrp -freeze <group>
Freeze across reboots: hagrp -freeze <group> -persistent (the configuration must be writable)
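
The two freeze forms differ in one practical way: a persistent freeze is stored in the configuration, so it needs the configuration opened and saved around it, while a temporary freeze does not. The sketch below illustrates that under stated assumptions: the run wrapper, DRY_RUN, the function name freeze_group, and the group name websg are hypothetical, and commands are only printed by default.

```shell
# Hypothetical dry-run wrapper: prints commands unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# freeze_group GROUP [persistent]
# A temporary freeze is a single command. A persistent freeze is written to
# the configuration, so it is bracketed by makerw and dump/makero.
freeze_group() {
    if [ "${2:-}" = persistent ]; then
        run haconf -makerw &&
        run hagrp -freeze "$1" -persistent &&
        run haconf -dump -makero
    else
        run hagrp -freeze "$1"
    fi
}

freeze_group websg              # websg is a placeholder group name
freeze_group websg persistent
```

The matching hagrp -unfreeze call after maintenance follows the same pattern.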

Resource operations

Display all attributes of a resource: hares -display <resource>
List resource dependencies: hares -dep
Enable / disable a resource: hares -modify <resource> Enabled 1 or ... Enabled 0
Clear a faulted resource (after fixing the root cause): hares -clear <resource> -sys <node>
Probe a resource to refresh its state: hares -probe <resource> -sys <node>

Starting and stopping the stack

Start HAD on a node: hastart
Start / stop LLT: lltconfig -c / lltconfig -U
Start / stop GAB: gabconfig -c -n <seed> / gabconfig -U (stop LLT only after GAB)

How do you check VCS licensing?

On current VCS and InfoScale releases, report installed licenses with vxlicrep. The very old vxlicense -p command appears in legacy notes but has been superseded. If the cluster uses keyless licensing, vxkeyless display shows the product level that is set.

Stopping VCS without stopping applications

This is a classic L3 Veritas Cluster interview question because it tests whether you understand the difference between stopping the engine and stopping the workload.

The hastop variants and exactly what each does

hastop -local — takes service groups offline and stops HAD on the local node.
hastop -local -evacuate — migrates the local node's groups to other nodes, then stops HAD locally.
hastop -local -force — stops HAD but leaves the applications running on the local node.
hastop -all — takes all groups offline and stops HAD everywhere.
hastop -all -force — stops HAD on every node but leaves all applications running. This is how you shut down the cluster engine cluster-wide while keeping services up, typically for maintenance or upgrades.
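
One way to keep the variants straight is to think in terms of two choices: scope (one node or all) and what should happen to the applications. The lookup below is only a memory aid built from the list above; the function name hastop_for and its argument words are made up for illustration, and it prints the command instead of running it.

```shell
# hastop_for SCOPE APPS: print the hastop variant for a maintenance intent.
#   SCOPE: local | all
#   APPS:  stop     - take service groups offline
#          keep     - leave applications running (only the engine stops)
#          evacuate - move groups to other nodes first (local scope only)
hastop_for() {
    case "$1:$2" in
        local:stop)     echo "hastop -local" ;;
        local:evacuate) echo "hastop -local -evacuate" ;;
        local:keep)     echo "hastop -local -force" ;;
        all:stop)       echo "hastop -all" ;;
        all:keep)       echo "hastop -all -force" ;;
        *) echo "unsupported combination: $1 $2" >&2; return 1 ;;
    esac
}

hastop_for all keep    # prints: hastop -all -force
```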

Communication failures: jeopardy and split brain

Understanding why VCS uses fencing starts with these two failure states.

What is jeopardy membership?

When a node is down to a single working heartbeat link, GAB can no longer tell whether losing that last link would mean a network partition or a node failure. That node forms a regular membership with peers it still has multiple links to, and a special jeopardy membership with the peer it can only reach over the last link. While in jeopardy, VCS will not fail over a group on a system fault (to avoid the risk of starting it twice), although failover on a resource fault or by operator request still works. Fix the broken link and GAB clears the jeopardy state automatically.

What is a split-brain condition?

Split brain happens when all interconnect links fail at once. The cluster fractures into sub-clusters, each believing the others are dead. Each sub-cluster may try to import the same disk groups and start the same service groups, leading to simultaneous storage access and data corruption. Preventing this is the whole reason I/O fencing exists.
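
The two states come down to how many links still carry heartbeats from a peer. The toy classifier below restates that rule; it is a simplification for interview recall, not how GAB is implemented, and the function name membership_state is hypothetical.

```shell
# membership_state LINKS_UP: classify a peer by the number of LLT links
# that still deliver heartbeats from it.
membership_state() {
    if [ "$1" -ge 2 ]; then
        echo "regular"
    elif [ "$1" -eq 1 ]; then
        echo "jeopardy"
    else
        echo "no heartbeat: peer failed or partitioned (split-brain risk)"
    fi
}

for links in 2 1 0; do
    printf '%s link(s) up -> %s\n' "$links" "$(membership_state "$links")"
done
```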

I/O fencing and coordinator disks

What is I/O fencing and how does it prevent split brain?

I/O fencing guarantees data integrity during a communication breakdown. VCS uses SCSI-3 Persistent Group Reservations (PGR) so that only members of the active cluster can write to shared storage; non-members are blocked, so even a live but isolated node cannot corrupt data. During a partition, the sub-clusters race to grab the coordination points. The winner ejects the loser's keys and fences it off; the losing node panics.

What is a coordinator disk?

Coordinator disks are dedicated LUNs (classically three, an odd number to guarantee a majority winner) reserved purely for fencing arbitration. They store no application data. A node must win control of a majority of coordination points before it is allowed to fence peers from the data disks. Modern deployments often use a coordination point server (CP server) instead of, or alongside, coordinator disks.
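
The majority rule is simple arithmetic, and working it through shows why an odd count matters. The sketch below is a model of the rule only, with hypothetical function names majority and race_result; it does not talk to any fencing driver.

```shell
# majority TOTAL: smallest count that is more than half of TOTAL.
majority() {
    echo $(( $1 / 2 + 1 ))
}

# race_result TOTAL WON: "survive" if this sub-cluster won a majority of the
# coordination points, otherwise "panic".
race_result() {
    if [ "$2" -ge "$(majority "$1")" ]; then
        echo "survive"
    else
        echo "panic"
    fi
}

echo "majority of 3 points: $(majority 3)"
echo "sub-cluster A won 2 of 3: $(race_result 3 2)"
echo "sub-cluster B won 1 of 3: $(race_result 3 1)"
```

With an even count the model shows the problem directly: race_result 4 2 is panic for both halves of a 2/2 split, so nobody survives. Three points always produce exactly one winner.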

Adding and removing a cluster node

L3 candidates are expected to walk through node lifecycle operations end to end.

How do you add a node to an existing cluster?

Prepare hardware: connect the new node's private LLT interconnects and the shared storage.
Install software: install the VCS/InfoScale packages and apply the license on the new node.
Configure LLT and GAB: create /etc/llthosts, /etc/llttab and /etc/gabtab on the new node and update them on the existing nodes so node IDs and the seed count are consistent.
Register the node from an existing node: haconf -makerw, then hasys -add <newnode>.
Copy the configuration if needed: scp /etc/VRTSvcs/conf/config/main.cf newnode:/etc/VRTSvcs/conf/config/, then haconf -dump -makero.
Start VCS on the new node: hastart.
Verify: run gabconfig -a on each node and confirm port a and port h now include the new node.
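
For the LLT and GAB step, the sketch below generates the three files into a scratch directory so they can be reviewed before anything is copied to /etc. It assumes a Linux-style llttab in which the device field is the interface name; the device field differs by platform, and the host names, cluster ID 100, interface names, and the function name write_llt_gab_files are all placeholders.

```shell
# write_llt_gab_files DIR NODE CLUSTER_ID NIC1 NIC2 HOST...
# Writes llthosts, llttab and gabtab for NODE into DIR (a scratch directory,
# not /etc). HOST... is the full member list in node-ID order.
write_llt_gab_files() (
    dir=$1 node=$2 cluster_id=$3 nic1=$4 nic2=$5
    shift 5

    # llthosts: one "<node-id> <hostname>" line per cluster member.
    id=0
    : > "$dir/llthosts"
    for host in "$@"; do
        echo "$id $host" >> "$dir/llthosts"
        id=$((id + 1))
    done

    # llttab: local node name, cluster ID, one link line per interconnect.
    # Assumed Linux-style device field; adjust for Solaris or AIX.
    {
        echo "set-node $node"
        echo "set-cluster $cluster_id"
        echo "link $nic1 $nic1 - ether - -"
        echo "link $nic2 $nic2 - ether - -"
    } > "$dir/llttab"

    # gabtab: seed count set to the number of cluster members.
    echo "/sbin/gabconfig -c -n$#" > "$dir/gabtab"
)

scratch=$(mktemp -d)
write_llt_gab_files "$scratch" node3 100 eth1 eth2 node1 node2 node3
cat "$scratch/llthosts" "$scratch/llttab" "$scratch/gabtab"
```

Generating llthosts and gabtab from one member list is a simple way to keep node IDs and the seed count consistent across nodes, which is the point of that step.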

How do you remove a node from a cluster?

Back up the config: cp /etc/VRTSvcs/conf/config/main.cf main.cf.orig.
Check status: hastatus -summary, then switch any online groups off the leaving node with hagrp -switch <group> -to <node>.
Make config writable (haconf -makerw) and stop VCS on the leaving node: hastop -sys <node>.
Remove the node from each group's SystemList: hagrp -modify <group> SystemList -delete <node>.
Delete the node: hasys -delete <node>, then haconf -dump -makero.
Update /etc/llthosts, /etc/llttab and /etc/gabtab on the remaining nodes.
On the leaving node, unconfigure GAB and LLT: gabconfig -U then lltconfig -U, disable their startup, and remove the VCS packages.
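
The cluster-side part of the removal (everything run from a surviving node) can be laid out as a dry-run script. This is a sketch under assumptions: the run wrapper, DRY_RUN, the function name remove_node, and the node and group names are hypothetical, and every listed group is assumed to be online on the leaving node and to name it in SystemList. Editing the LLT/GAB files and cleaning up the leaving node itself are left out.

```shell
# Hypothetical dry-run wrapper: prints commands unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

# remove_node LEAVING TARGET GROUP...
# Run from a node that stays in the cluster. TARGET receives the groups
# currently online on LEAVING.
remove_node() (
    leaving=$1 target=$2
    shift 2
    run cp /etc/VRTSvcs/conf/config/main.cf main.cf.orig
    for g in "$@"; do
        run hagrp -switch "$g" -to "$target"
    done
    run haconf -makerw
    run hastop -sys "$leaving"
    for g in "$@"; do
        run hagrp -modify "$g" SystemList -delete "$leaving"
    done
    run hasys -delete "$leaving"
    run haconf -dump -makero
)

remove_node node3 node1 websg dbsg   # placeholder node and group names
```

Reading the printed sequence aloud is a reasonable way to rehearse the answer; on a real cluster you would check hastatus -summary between steps instead of running them back to back.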

Troubleshooting questions in a VCS interview

How do you clear a faulted resource?

First fix the underlying problem. For non-persistent resources, clear the fault and re-probe: hares -clear <resource> -sys <node> followed by hares -probe <resource> -sys <node>. For persistent resources, simply wait for the next OfflineMonitorInterval (default 300 seconds) and the agent will pick up the restored state.

How do you handle a resource stuck in ADMIN_WAIT?

If a group's ManageFaults attribute is NONE, VCS takes no automatic action on a fault and parks the resource in ADMIN_WAIT pending an administrator. Once you have corrected the resource outside VCS, clear the state without faulting the group with hagrp -clearadminwait <group> -sys <node>; if the resource is still not in the expected state, VCS returns it to ADMIN_WAIT. To mark it OFFLINE|FAULTED instead, use hagrp -clearadminwait -fault <group> -sys <node>.

What is flushing a service group and when is it needed?

Flushing clears internal wait states when agents appear hung waiting for resources to go online or offline. It stops VCS from continuing to attempt the online/offline: hagrp -flush <group> -sys <node>.

What is GAB seeding and when is manual seeding required?

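/etc/gabtab defines the minimum number of nodes that must be communicating before GAB seeds the cluster and VCS can start (e.g. gabconfig -c -n 2). That threshold is the seed. If you must start a cluster with fewer nodes than the seed (for example while one node is down for maintenance), first confirm the missing nodes are really down and not just unreachable, then seed manually on one running node with gabconfig -c -x; the other running nodes seed automatically once they see a seeded member.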
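
The seed decision is a comparison between the -n value in gabtab and the number of nodes that can currently see each other. The sketch below models that comparison; the function names seed_count and seeding_action are hypothetical and nothing here calls gabconfig.

```shell
# seed_count GABTAB_LINE: extract N from the -n option of a gabtab command
# line; accepts both "-n 2" and "-n2".
seed_count() {
    printf '%s\n' "$1" | sed -n 's/.*-n *\([0-9][0-9]*\).*/\1/p'
}

# seeding_action SEED VISIBLE: report whether GAB would seed on its own
# when VISIBLE nodes are communicating.
seeding_action() {
    if [ "$2" -ge "$1" ]; then
        echo "auto-seed"
    else
        echo "manual seed needed"
    fi
}

seed=$(seed_count "/sbin/gabconfig -c -n 2")
echo "seed=$seed, 2 nodes up: $(seeding_action "$seed" 2)"
echo "seed=$seed, 1 node up: $(seeding_action "$seed" 1)"
```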

Can different VCS versions run in the same cluster?

No. Mixed VCS versions, and even mixed patch levels, are not supported as a steady state in a running cluster. The only supported exception is the brief mixed window during a documented rolling upgrade, so plan upgrades that way and apply patches consistently across nodes.

What are the VCS user privilege levels?

Cluster Administrator — full privileges.
Cluster Operator — all cluster, group and resource operations.
Cluster Guest — read-only (the default for new users).
Group Administrator — all operations on a specific group except deleting it.
Group Operator — online/offline and freeze/unfreeze a specific group.

What are the failover policies?

The FailOverPolicy attribute chooses the target node: Priority (lowest priority number wins, the default), RoundRobin (node with the fewest active groups), and Load (node with the greatest available capacity). Set it with hagrp -modify <group> FailOverPolicy Load.
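
To make the Priority policy concrete, the sketch below picks the running system with the lowest priority number. The "name=priority" input notation is a simplification invented for this example (it is not main.cf syntax), and the function name priority_target and the system names are hypothetical.

```shell
# priority_target "SYS=PRIO ..." "RUNNING_SYS ...": print the running system
# with the lowest priority number, modelling FailOverPolicy = Priority.
priority_target() {
    for pair in $1; do
        sys=${pair%%=*}     # text before the "="
        prio=${pair#*=}     # text after the "="
        for up in $2; do
            [ "$sys" = "$up" ] && echo "$prio $sys"
        done
    done | sort -n | head -n 1 | cut -d' ' -f2
}

# sysA has faulted, so only sysB and sysC are candidates.
priority_target "sysA=0 sysB=1 sysC=2" "sysB sysC"
```

RoundRobin and Load would replace the sort key with the count of active groups or the available capacity; the selection step stays the same.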

Note on the platform and modern equivalents

Many older interview sets reference Solaris-specific steps (modunload, /etc/rc2.d/S70llt startup links) and a web-based Cluster Management Console on ports 8181/8443. That console was deprecated; current administration uses the CLI and the Veritas InfoScale Operations Manager (VIOM) GUI. The Java console (/opt/VRTSvcs/bin/hagui) also belongs to older releases. The CLI commands above remain valid across VCS and InfoScale Availability, which is the supported product line today.

Key Takeaways

LLT carries heartbeats and traffic, GAB manages membership and atomic broadcast, HAD is the engine monitored by hashadow — memorize this trio cold.
VCS supports up to 8 LLT links and 32 nodes; always run at least two interconnects to avoid jeopardy.
Split brain (all links lost) risks data corruption; I/O fencing with SCSI-3 PGR and an odd number of coordination points prevents it.
Know the hastop variants: -all -force stops the engine cluster-wide while leaving applications running.
Always fix the root cause before hares -clear, and remember haconf -makerw / haconf -dump -makero bracket every config change.

Frequently Asked Questions

What is the difference between switchover and failover in VCS?

A switchover is a planned, orderly move: VCS cleanly stops the application and its resources on one node and starts them on another, usually for maintenance. A failover is the unplanned version triggered by a fault or lost heartbeat, where an orderly shutdown on the original node may not be possible, so services are simply brought up on a surviving node.

Where is the VCS main configuration file and how do I verify it?

The main configuration file is main.cf in /etc/VRTSvcs/conf/config. Verify its syntax with hacf -verify /etc/VRTSvcs/conf/config before starting the cluster, and always make a backup copy before editing.

How do I check VCS cluster status quickly?

Use hastatus -sum for a one-screen summary of every node, service group and resource. For live GAB membership use gabconfig -a, and for detailed LLT link state use lltstat -nvv.

Is Veritas Cluster Server still used today?

Yes. The technology continues under the name Veritas InfoScale Availability (the former Symantec/Veritas Cluster Server), and the LLT/GAB/HAD architecture and CLI commands covered here remain current, which is exactly why these questions still appear in interviews.
