OpenShift Join New Node to Cluster

— ny_wk

Ever felt the pinch of an overloaded OpenShift cluster? When your applications demand more muscle, or you're gearing up for a major deployment, adding new nodes becomes not just a necessity but a strategic move. Scaling your cluster isn't just about throwing hardware at the problem; it's about intelligent, systematic expansion to ensure your applications run smoothly and efficiently. This comprehensive guide will walk you through the nitty-gritty of how to join a new node to an OpenShift cluster, transforming a bare metal server or VM into a fully integrated OpenShift worker, ready to host your containers.

For anyone managing an OpenShift environment, integrating a new node can seem like a daunting task with multiple steps across storage, networking, and OpenShift configurations. But trust me, once you understand the underlying principles and follow a structured approach, it's quite manageable. Think of it like this: you're preparing a new team member for a critical project – they need their desk set up, tools installed, and proper access rights before they can start contributing. Similarly, a new OpenShift node needs meticulous preparation to ensure it integrates seamlessly and performs optimally.

Laying the Groundwork: Prerequisites and Initial Node Setup

Before we even think about touching OpenShift commands, the new node itself needs a solid foundation. This is where most of the initial configuration happens. Assume your new VM or physical server is already provisioned and has a compatible operating system installed, typically RHEL or CentOS, depending on your OpenShift version. Compatibility is key here, my friend! An OS release that your OpenShift version doesn't support can cause all sorts of headaches later on with Docker or CRI-O.

Essential Node Preparation: Disk Management and SSH Access

First up, storage. OpenShift nodes, especially worker nodes, can be quite hungry for disk space, particularly in the /var/ and /var/log/ directories. Docker images, container layers, and extensive logging can quickly consume available space. You might encounter two main scenarios for storage expansion:

Scenario 1: Extending Existing Logical Volumes (LVM)

If your node already uses LVM and has free space within its volume group, extending existing logical volumes is straightforward. We'll target /var/ and /var/log/ as they are critical areas for OpenShift operations. Remember, you need to become root first for all these operations:

sudo su -

Now, let's extend /var/log/ by 10GB and /var/ by 80GB. Adjust these values based on your specific requirements:

lvextend -L +10G /dev/vg1/lv_var_log
xfs_growfs /dev/vg1/lv_var_log

lvextend -L +80G /dev/vg1/lv_var
xfs_growfs /dev/vg1/lv_var

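Here, lvextend increases the size of the logical volume, and xfs_growfs (assuming an XFS filesystem, which is common on RHEL/CentOS) grows the mounted filesystem into the newly available space. Don't forget the xfs_growfs step; without it the logical volume is bigger but df still reports the old size. If your xfsprogs version rejects the device path, pass the mount point instead (for example, xfs_growfs /var/log).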
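
To make sure the filesystem resize is never skipped, the two steps can be wrapped in one helper. This is only a minimal sketch: the function name extend_lv, the DRY_RUN switch, and addressing the filesystem by its mount point are my assumptions, not part of the original procedure.

```shell
#!/usr/bin/env bash
# Sketch: grow a logical volume, then grow the XFS filesystem on it.
# extend_lv and DRY_RUN are hypothetical names used for illustration.

extend_lv() {
    local lv_path="$1" size="$2" mount_point="$3"
    local run=""
    # With DRY_RUN=1 the commands are printed instead of executed.
    [ "${DRY_RUN:-0}" = "1" ] && run="echo"

    # Stop here if the volume group has no room; never resize on a failed extend.
    $run lvextend -L "+${size}" "$lv_path" || return 1
    $run xfs_growfs "$mount_point"
}

# Example (as root): extend_lv /dev/vg1/lv_var_log 10G /var/log
```

Running it first with DRY_RUN=1 prints the two commands so you can review them before touching the disks.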

Scenario 2: Adding a New Hard Disk Drive (HDD)

Sometimes, simply extending existing LVs isn't enough, or you might not have free space in the current VG. In such cases, adding a new physical disk (or a virtual disk in a VM environment) is the way to go. If you're on a VMware VM, configure the new disk as Thick Provision Eager Zeroed. This type pre-allocates and zeroes out the entire disk at creation time, so you avoid the first-write penalty that lazily zeroed disks pay later.

After adding the disk to your VM, it might not immediately show up in lsblk or fdisk -l. You'll need to rescan your SCSI bus to detect it. This is a common situation, especially in virtualized environments:

# Rescan every SCSI host ("- - -" is a wildcard for all channels, targets, and LUNs)
for host in /sys/class/scsi_host/host*; do echo "- - -" > "${host}/scan"; done

# Rescan existing SCSI devices so size changes are picked up too
for dev in /sys/class/scsi_device/*; do echo 1 > "${dev}/device/rescan"; done

These commands effectively tell the kernel to look for new or changed SCSI devices. Once detected, your new disk will likely appear as /dev/sdb (or sdc, etc.). Always verify with lsblk.

Next, we need to partition the new disk. We'll use parted for this, which is great for GPT partition tables (the modern standard):

parted /dev/sdb
# (parted) mklabel gpt
# (parted) mkpart primary 2048s -1
# (parted) quit

This sequence creates a GPT partition table and a single primary partition spanning almost the entire disk (from 2048 sectors to the very end, -1). After partitioning, create a physical volume, extend your existing volume group (vg1 in this case) with the new physical volume, and then extend your logical volumes:

pvcreate /dev/sdb1 && \
vgextend vg1 /dev/sdb1 && \
lvextend -L +10G /dev/vg1/lv_var_log && \
xfs_growfs /dev/vg1/lv_var_log && \
lvextend -l +100%FREE /dev/vg1/lv_var && \
xfs_growfs /dev/vg1/lv_var

This command chain is quite powerful! It creates a Physical Volume (PV) on /dev/sdb1, adds it to vg1, then extends lv_var_log by 10GB, and finally extends lv_var to consume *all* remaining free space in vg1. Don't forget to resize the filesystems after each lvextend.
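
If you'd rather not type into the interactive parted prompt, the partitioning and LVM steps above can be scripted with parted's -s (script) mode. Treat this as a sketch under assumptions: the helper names run and prepare_new_disk and the DRY_RUN switch are mine, the end position is written as 100% rather than -1, and the partition is assumed to be named by appending 1 to the disk (true for /dev/sdX, not for NVMe devices).

```shell
#!/usr/bin/env bash
# Sketch: partition a new disk and add it to an existing volume group.
# run, prepare_new_disk and DRY_RUN are hypothetical names.

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

prepare_new_disk() {
    local disk="$1" vg="$2"
    local part="${disk}1"   # assumption: /dev/sdb -> /dev/sdb1

    # GPT label plus one partition from sector 2048 to the end of the disk.
    run parted -s "$disk" mklabel gpt mkpart primary 2048s 100% || return 1
    run pvcreate "$part" || return 1
    run vgextend "$vg" "$part"
}

# Example (as root, after checking lsblk): prepare_new_disk /dev/sdb vg1
```

Always do a DRY_RUN=1 pass and confirm the device name with lsblk first; mklabel wipes the partition table of whatever disk you point it at.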

Secure and Automated Access: SSH Configuration

For Ansible to manage this node efficiently during the OpenShift scale-up, it needs passwordless root SSH access from your OpenShift bootstrap/Ansible control node. While PermitRootLogin without-password is convenient for automation, remember it's a security consideration. In production, you might prefer using a dedicated unprivileged user with sudo access and SSH keys. However, for the purpose of this guide and initial setup, we'll stick with root and key-based login:

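# Match the directive even if it is commented out; leave other comment lines untouched
sed -i -E 's/^#?[[:space:]]*PermitRootLogin[[:space:]].*/PermitRootLogin without-password/' /etc/ssh/sshd_config
# Validate the config before restarting so a typo can't lock you out
sshd -t && systemctl restart sshd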

This modifies the SSH daemon configuration to allow root login using SSH keys, without a password. Then, you need to add the public SSH key from your OpenShift bootstrap node (the one running Ansible) to the new node's authorized_keys:

mkdir -p ~/.ssh && chmod 700 ~/.ssh
touch ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys
echo '<serverAbcs1_ssh_key>' >> ~/.ssh/authorized_keys

Replace <serverAbcs1_ssh_key> with the actual public key string. The chmod commands are crucial for security; SSH won't accept keys if the permissions are too permissive. This setup allows Ansible to connect as root and execute commands remotely.
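
If you prepare nodes more than once, appending with echo will pile up duplicate keys. Here is a small sketch of an idempotent version of the same steps; the function name add_authorized_key and its optional home-directory argument are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: add a public key to authorized_keys only if it is not there yet.
# add_authorized_key is a hypothetical helper name.

add_authorized_key() {
    local pubkey="$1" home_dir="${2:-$HOME}"
    local ssh_dir="${home_dir}/.ssh"
    local auth_file="${ssh_dir}/authorized_keys"

    # Same permissions as the manual steps: 700 on the directory, 600 on the file.
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
    touch "$auth_file" && chmod 600 "$auth_file" || return 1

    # -x matches the whole line, -F treats the key as a fixed string.
    grep -qxF -- "$pubkey" "$auth_file" || printf '%s\n' "$pubkey" >> "$auth_file"
}

# Example (as root on the new node): add_authorized_key "ssh-rsa AAAA... root@serverAbcs1"
```

Calling it twice with the same key leaves a single entry in the file.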

System Updates and OpenShift-Specific Tweaks

Keeping your system updated is crucial. It prevents compatibility issues, especially with container runtimes like Docker, and addresses security vulnerabilities:

yum update -y

This command updates all installed packages to their latest versions. It's a good practice to run this early to ensure a stable base.

Next, install essential Python libraries and other utilities that OpenShift Ansible playbooks often rely on:

yum install -y pyOpenSSL python-rhsm-certificates jq python-configparser rng-tools python2-passlib

These packages provide various functionalities: pyOpenSSL for SSL/TLS, python-rhsm-certificates for RHEL subscriptions, jq for JSON parsing, python-configparser for configuration handling, rng-tools for entropy generation, and python2-passlib for password hashing.

OpenShift has some specific requirements and recommendations for node configuration to ensure stability and avoid common pitfalls. These tweaks help prevent issues related to DNS resolution, container scanning, and network management:

  • DNS Resolution for Docker: OpenShift and Docker both need to read and rewrite /etc/resolv.conf. If a hardening script has set the immutable flag on the file, those rewrites fail, so remove the flag:
    chattr -i /etc/resolv.conf

    This ensures that Docker can manage the file without permission errors. For more details on OpenShift DNS, you can refer to our article on OpenShift DNS Resolution Deep Dive.

  • SSH UseDNS: Disabling DNS lookups for SSH connections can speed up login times and prevent delays if your DNS servers are slow or unreachable:
    sed -i 's/^#\?\(UseDNS\).*$/\1 no/' /etc/ssh/sshd_config && systemctl restart sshd

    This sets UseDNS no in sshd_config, ensuring SSH doesn't perform reverse DNS lookups on incoming connections.

  • Removing unowned_files Cron Job: Some security lockdown features, like cron jobs scanning for unowned files, can interfere with Docker containers. These often flag container filesystems as "unowned," causing issues. It's best to remove it:
    /bin/rm -f /etc/cron.daily/unowned_files
  • NetworkManager Service: OpenShift 3.x nodes depend on NetworkManager, because its dispatcher script is what points the node's dnsmasq at your upstream DNS servers. Ensuring it's enabled and running is vital for name resolution within the cluster:
    systemctl enable NetworkManager && systemctl start NetworkManager
  • Entropy Generation (RNGD): Running many pods can deplete the system's entropy pool, impacting security-sensitive operations (like SSL/TLS handshakes) and causing delays. rngd helps maintain a healthy entropy pool:
    systemctl enable rngd && systemctl start rngd
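
The sshd_config edits in this guide (PermitRootLogin, UseDNS) follow the same pattern, so they can share one helper. This is a sketch, not the original method: set_sshd_option is a hypothetical name, and it assumes GNU sed and grep as shipped on RHEL/CentOS.

```shell
#!/usr/bin/env bash
# Sketch: set "Key value" in sshd_config whether the key is commented out,
# already set, or missing. set_sshd_option is a hypothetical helper name.

set_sshd_option() {
    local key="$1" value="$2" file="${3:-/etc/ssh/sshd_config}"

    if grep -qE "^#?[[:space:]]*${key}[[:space:]]" "$file"; then
        # Rewrite the existing (possibly commented) directive in place.
        sed -i -E "s|^#?[[:space:]]*${key}[[:space:]].*|${key} ${value}|" "$file"
    else
        # Directive not present at all: append it.
        printf '%s %s\n' "$key" "$value" >> "$file"
    fi
}

# Example (as root):
#   set_sshd_option UseDNS no && sshd -t && systemctl restart sshd
```

Because the pattern is anchored to the start of the line, free-text comments that merely mention the option are left alone.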

Advanced Networking: NIC Bonding and VLAN Configuration

For high-performance environments or specific use cases (like CUDAML), NIC bonding is critical for redundancy and increased throughput. We'll set up an LACP (Link Aggregation Control Protocol) bond (mode 4) with two network interfaces (em1 and em2) and then configure a VLAN interface on top of it. This ensures your node has robust and high-bandwidth network connectivity.

First, create the bond interface ifcfg-bond0:

cat > /etc/sysconfig/network-scripts/ifcfg-bond0 << EOF
DEVICE=bond0
TYPE=Bond
NAME=bond0
BONDING_MASTER=yes
BOOTPROTO=none
ONBOOT="yes"
IPV6INIT="no"
NM_CONTROLLED=no
BONDING_OPTS="mode=4 miimon=100 lacp_rate=1"
EOF

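Here, mode=4 specifies LACP (802.3ad), miimon=100 checks link state every 100 ms, and lacp_rate=1 requests fast LACP packets (one per second). NM_CONTROLLED=no keeps NetworkManager from touching this interface, as we are managing it manually. Be aware of the trade-off: on OpenShift 3.x, the NetworkManager dispatcher script that generates the node's dnsmasq upstream configuration only runs for interfaces NetworkManager controls, so with NM_CONTROLLED=no you may have to create that DNS configuration by hand (the troubleshooting section covers this).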

Next, transfer IP configuration from an existing interface (e.g., em1) to the bond interface and add the ZONE information:

grep -E '^(IPADDR|GATEWAY|NETMASK)=' /etc/sysconfig/network-scripts/ifcfg-em1 >> /etc/sysconfig/network-scripts/ifcfg-bond0
echo 'ZONE=public' >> /etc/sysconfig/network-scripts/ifcfg-bond0

Then, configure the physical interfaces (em1 and em2) as slaves to the bond:

cat > /etc/sysconfig/network-scripts/ifcfg-em1 << EOF
NAME=em1
DEVICE=em1
BOOTPROTO=none
ONBOOT=yes
NM_CONTROLLED=no
IPV6INIT=no
MASTER=bond0
SLAVE=yes
EOF

cat > /etc/sysconfig/network-scripts/ifcfg-em2 << EOF
NAME=em2
DEVICE=em2
BOOTPROTO=none
ONBOOT=yes
NM_CONTROLLED=no
IPV6INIT=no
MASTER=bond0
SLAVE=yes
EOF

If you require VLAN tagging, create a VLAN sub-interface (e.g., bond0.160):

cat > /etc/sysconfig/network-scripts/ifcfg-bond0.160 << EOF
DEVICE=bond0.160
NAME=bond0.160
BOOTPROTO=none
IPADDR="10.17.160.148"
NETMASK="255.255.252.0"
ONBOOT="yes"
VLAN=yes
IPV6INIT="no"
NM_CONTROLLED=no
ZONE=public
EOF

Finally, load the 802.1q kernel module (for VLANs) and restart the network service to apply changes:

modprobe 8021q
systemctl restart network

After these steps, your node should have robust network connectivity, ready for OpenShift.
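
Before moving on, it's worth confirming that the bond really came up in LACP mode with both slaves active. The kernel exposes this in /proc/net/bonding/bond0; the sketch below condenses that file into a few lines. The function name bond_summary and the output format are my own assumptions.

```shell
#!/usr/bin/env bash
# Sketch: summarize a bonding status file (default: /proc/net/bonding/bond0).
# bond_summary is a hypothetical helper name.

bond_summary() {
    local status_file="${1:-/proc/net/bonding/bond0}"
    awk -F': ' '
        /^Bonding Mode:/    { print "mode=" $2 }
        /^Slave Interface:/ { slave = $2 }
        /^MII Status:/ {
            if (slave != "") {
                # MII Status line that belongs to the slave seen just before it.
                print slave "=" $2
                slave = ""
            } else if (!seen_bond) {
                # First MII Status line in the file describes the bond itself.
                print "bond=" $2
                seen_bond = 1
            }
        }
    ' "$status_file"
}

# Example: bond_summary
```

For a healthy setup you'd expect the 802.3ad mode line plus "up" for the bond, em1, and em2. If a slave shows "down", check the cabling and the switch-side LACP configuration.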

Integrating the Node into Your OpenShift Cluster: The Ansible Way

Now that our new node is meticulously prepared, it's time to introduce it to the OpenShift cluster. This is where Ansible, the automation workhorse, steps in. OpenShift 3.x relied heavily on Ansible playbooks for installation and scaling, making it super efficient to add nodes.

Updating the Ansible Inventory

You'll need to SSH into your OpenShift bootstrap or Ansible control server, where your OpenShift-Ansible playbooks reside. This is usually the same server you used to install OpenShift initially.

ssh serverAbcs01.xyz.domain
sudo su -
cd ~/ansible

Locate your Ansible inventory file (e.g., hosts-3.11). This file defines all your cluster hosts and their roles. Add your new node's Fully Qualified Domain Name (FQDN) under the [new_nodes] group, which is the group the scale-up playbook acts on, and make sure new_nodes is also listed under [OSEv3:children] so the group is part of the cluster definition.

vi hosts-3.11

Add the new node's details. It's good practice to assign labels directly in the inventory, which OpenShift uses for scheduling workloads:

# Example addition to your hosts file:
[new_nodes]
mynewnode.yourdomain.com openshift_node_labels='{"node-role.kubernetes.io/worker":"true", "region":"east", "app":"general"}' openshift_hostname=mynewnode.yourdomain.com

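The openshift_node_labels are crucial for effective workload placement using Kubernetes Affinity and Anti-Affinity rules. Define them carefully to match your cluster's labeling strategy. One caveat: openshift_node_labels is the pre-3.10 inventory variable. From 3.10 onward, openshift-ansible assigns labels through node groups (openshift_node_group_name, for example node-config-compute), so check which mechanism your playbook version honors.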
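
A quick sanity check of the inventory can save a failed playbook run. The sketch below only verifies two things: that new_nodes appears under [OSEv3:children] and that [new_nodes] contains at least one host. The function name check_scaleup_inventory is hypothetical, and the check assumes a plain INI-style inventory.

```shell
#!/usr/bin/env bash
# Sketch: verify the inventory is wired up for a scale-up run.
# check_scaleup_inventory is a hypothetical helper name.

check_scaleup_inventory() {
    local inventory="$1"
    awk '
        /^\[/                { section = $0; next }   # remember current group
        /^[[:space:]]*(#|$)/ { next }                 # skip comments and blanks
        section == "[OSEv3:children]" && $1 == "new_nodes" { child = 1 }
        section == "[new_nodes]"                           { hosts++ }
        END {
            if (!child) print "missing: new_nodes under [OSEv3:children]"
            if (!hosts) print "missing: at least one host under [new_nodes]"
            exit ((child && hosts) ? 0 : 1)
        }
    ' "$inventory"
}

# Example: check_scaleup_inventory hosts-3.11 && echo "inventory looks ready"
```

It prints what is missing and returns a non-zero status, so you can chain it in front of the ansible-playbook command.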

Executing the Scale-Up Playbook

With the inventory updated, it's time to unleash the Ansible playbook. The openshift-node/scaleup.yml playbook is specifically designed for adding new nodes to an existing OpenShift cluster. This playbook handles all the heavy lifting: installing Docker/CRI-O, configuring Kubelet, setting up the CNI network plugin, and registering the node with the OpenShift master(s).

ansible-playbook -i hosts-3.11 ~/openshift-ansible-3.11/playbooks/openshift-node/scaleup.yml

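The -i flag specifies your inventory file. Let Ansible do its magic. This process can take a while, depending on the number of nodes, network speed, and package installation times. Grab a chai, take a break! Ansible will provision, configure, and integrate your new node. Once the playbook finishes successfully, move the host from [new_nodes] to [nodes] in the inventory so that later runs don't treat it as a new node again.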

Verification and Troubleshooting: Ensuring Your Node is "Ready"

After the Ansible playbook completes, it's crucial to verify that the new node has successfully joined the cluster and is in a "Ready" state. This is where we check if all that hard work has paid off.

Verifying Node Status

SSH to one of your OpenShift master nodes to use the oc (OpenShift Client) command-line tool:

ssh serverAbcm01.xyz.domain
sudo su -
oc get nodes

You should see your new node listed. Ideally, its status should be "Ready". If it shows up as "NotReady" or is missing, something needs troubleshooting. Another useful command is to describe the node:

oc describe node <your_new_node_fqdn>

This command provides a wealth of information about the node's health, allocated resources, conditions, and events, which are invaluable for debugging.
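
A node usually takes a minute or two to flip to "Ready", so polling beats re-running oc get nodes by hand. Here's a minimal sketch. The function names node_status and wait_for_ready and the retry defaults are assumptions, and it treats only an exact "Ready" status as success (so "Ready,SchedulingDisabled" keeps waiting).

```shell
#!/usr/bin/env bash
# Sketch: poll until a node reports Ready.
# node_status and wait_for_ready are hypothetical helper names.

node_status() {
    # Reads rows in "oc get nodes --no-headers" format on stdin and prints
    # the STATUS column for the requested node; fails if the node is absent.
    awk -v node="$1" '$1 == node { print $2; found = 1 } END { exit !found }'
}

wait_for_ready() {
    local node="$1" attempts="${2:-30}" delay="${3:-10}"
    local i=0 status
    while [ "$i" -lt "$attempts" ]; do
        status="$(oc get nodes --no-headers | node_status "$node")"
        [ "$status" = "Ready" ] && return 0
        i=$((i + 1))
        sleep "$delay"
    done
    echo "node ${node} not Ready after ${attempts} checks (last status: ${status:-missing})" >&2
    return 1
}

# Example (on a master): wait_for_ready mynewnode.yourdomain.com
```

If it times out, oc describe node on the same FQDN is the next stop.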

Troubleshooting "Not Ready" Nodes

A common issue is a node showing "Not Ready" with an error message like: Error 'NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized'. This typically points to a networking or DNS problem within the node or how OpenShift's CNI is configured.

One primary culprit for this is incorrect DNS resolution on the node, especially for the internal OpenShift services. OpenShift uses its own internal DNS, and nodes need to correctly resolve internal cluster names. You might need to explicitly configure the node's DNS resolution to point to your cluster's internal DNS server:

cat > /etc/origin/node/resolv.conf << EOF
nameserver 192.168.20.10
EOF

Replace 192.168.20.10 with the IP address of the DNS server your nodes should query for cluster names (typically a dedicated DNS server). This file is used by the OpenShift node services for DNS resolution.

Additionally, for environments using dnsmasq, ensure it's configured to forward requests to the correct upstream DNS server:

cat > /etc/dnsmasq.d/origin-upstream-dns.conf << EOF
server=192.168.20.10
EOF
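
Since the same server address now lives in two files, a mismatch between them is an easy mistake to make. The sketch below compares them. The function name dns_servers_match is hypothetical, and it assumes the simple server=IP form shown above (not dnsmasq's server=/domain/IP syntax).

```shell
#!/usr/bin/env bash
# Sketch: check that resolv.conf and the dnsmasq upstream file agree.
# dns_servers_match is a hypothetical helper name.

dns_servers_match() {
    local resolv="$1" dnsmasq_conf="$2"
    local from_resolv from_dnsmasq

    # Collect the unique server addresses from each file.
    from_resolv="$(awk '$1 == "nameserver" { print $2 }' "$resolv" | sort -u)"
    from_dnsmasq="$(sed -n 's/^server=//p' "$dnsmasq_conf" | sort -u)"

    # Succeed only if at least one server is set and both lists are identical.
    [ -n "$from_resolv" ] && [ "$from_resolv" = "$from_dnsmasq" ]
}

# Example:
#   dns_servers_match /etc/origin/node/resolv.conf \
#       /etc/dnsmasq.d/origin-upstream-dns.conf || echo "DNS files disagree"
```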

After making these DNS changes, you might need to restart the relevant OpenShift services or even reboot the node, depending on the specific issue and your OpenShift version. On OpenShift 3.x the kubelet runs inside the node service (atomic-openshift-node on OpenShift Container Platform, origin-node on OKD), so restart that unit along with dnsmasq and the container runtime:

systemctl restart dnsmasq
systemctl restart docker # or crio if the node uses CRI-O
systemctl restart atomic-openshift-node # origin-node on OKD

Always check the logs of these services (journalctl -xeu atomic-openshift-node, journalctl -xeu docker) for more detailed error messages if the issue persists.

Best Practices and Post-Integration Steps (Important Points)

Adding a node is not just about getting it "Ready"; it's about ensuring it contributes effectively and securely to your cluster's health and performance. Here are some essential best practices and next steps:

  • Affinity Rules: Revisit your workload's pod affinity and anti-affinity rules. The new node's labels (e.g., region, app, node-role.kubernetes.io/worker) can be leveraged to intelligently schedule pods, ensuring high availability and optimal resource utilization.
  • Resource Management: Monitor the new node's resource utilization (CPU, memory, disk I/O, network) closely. OpenShift's metrics and monitoring tools like Prometheus and Grafana are your best friends here. Adjust resource quotas and limits if necessary to prevent resource starvation.
  • Security Patches: Integrate the new node into your existing patching and vulnerability management cycles. Regular updates are critical for maintaining a secure and stable environment.
  • Backup Strategy: Ensure your cluster's backup strategy now includes the configuration and data associated with the new node.
  • Documentation: Update your infrastructure documentation to reflect the new node's details, including its IP, FQDN, hardware specifications, and any unique configurations.
  • Capacity Planning: Use the addition of this node as an opportunity to review your overall capacity planning. Are you adding enough nodes? Do you have a scalable strategy for future growth?
  • Automate More: If you find yourself repeatedly performing manual steps, explore further automation with Ansible or other Infrastructure as Code (IaC) tools. This reduces human error and speeds up future scaling operations.

Integrating a new node into an OpenShift cluster is a multi-stage process that requires careful planning and execution. From preparing the underlying operating system and storage to configuring networking and leveraging Ansible for orchestration, each step is crucial. By following this detailed guide, you can confidently expand your OpenShift cluster, enhancing its capacity, resilience, and performance. Remember, a well-prepared node is a happy node, and a happy node means happy applications!

Key Takeaways

  • Thorough Node Preparation is Key: Before OpenShift integration, ensure disk space is sufficient, SSH access is configured, and all necessary OS-level tweaks are applied.
  • LVM and Disk Expansion: Learn to extend existing logical volumes or add new physical/virtual disks using tools like lvextend, xfs_growfs, and parted.
  • OpenShift-Specific Tweaks: Crucial configurations like chattr -i /etc/resolv.conf, UseDNS no, and enabling NetworkManager and rngd prevent common OpenShift operational issues.
  • Ansible for Automation: OpenShift's openshift-ansible playbooks are indispensable for adding nodes, handling package installations, and configuring services automatically.
  • Verification and Troubleshooting: Always verify node status with oc get nodes and be prepared to troubleshoot common issues like NetworkPluginNotReady by checking DNS and service logs.

Frequently Asked Questions

Why do I need to extend /var and /var/log when adding an OpenShift node?

/var/ is where Docker (or CRI-O) stores container images, layers, and volumes. As you deploy more applications and pull various images, this directory can grow very large. /var/log/ stores system and application logs. OpenShift nodes generate extensive logs, especially under heavy load. Insufficient space in these partitions can lead to disk full errors, node instability, and application failures.

What is NIC bonding and why is it used in OpenShift nodes?

NIC bonding (Network Interface Card bonding) combines multiple physical network interfaces into a single logical interface. It's used to provide network redundancy (if one NIC fails, the bond keeps functioning) and increased throughput (load balancing network traffic across multiple NICs). In OpenShift, especially for high-traffic worker nodes or those with specific performance requirements (like CUDAML), bonding ensures reliable and high-bandwidth network connectivity for inter-pod communication and external traffic.

What does the error NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized mean?

This error indicates that the Container Network Interface (CNI) plugin, which is responsible for networking between pods, hasn't been properly initialized on the node. This often happens due to issues with the node's network configuration, DNS resolution, or problems with the CNI plugin's daemon (e.g., the openshift-sdn and Open vSwitch pods on OpenShift 3.x, or OVN-Kubernetes on 4.x) failing to start correctly. Troubleshooting usually involves checking network service status, DNS settings (especially /etc/origin/node/resolv.conf), and relevant service logs (e.g., the node service and docker).

Can I add a node to OpenShift without using Ansible?

While technically possible to manually configure all components (Docker/CRI-O, Kubelet, CNI, certificates, etc.), it's extremely complex, error-prone, and not recommended, especially for OpenShift 3.x. The OpenShift-Ansible playbooks are designed to automate this intricate process, ensuring consistency, correct configuration, and proper integration with the cluster's control plane. For OpenShift 4.x, node management is primarily handled by the Machine API and Operators, moving even further away from manual configurations.

If you're looking for a more visual walkthrough, or just want to see these commands in action, check out the original video on @explorenystream.