How to troubleshoot network connectivity problems
Network connectivity is a key requirement in today’s computing world. As any good system administrator knows, planning for failure is part of the job. And yet, regardless of how much redundancy you build into your setup, there’s always the potential for unexpected issues. You must know the tools and procedures that help you resolve those unpleasant surprises.
For many sysadmins, troubleshooting is the fun part of the job (my measure of “fun,” anyway), and one day at work, I had the opportunity to put my troubleshooting abilities to the test. On a server in my lab, I noticed entries in the logs revealing intermittent connectivity. I was surprised because I’d implemented a fair amount of redundancy in my setup, so I decided to investigate.
Using bonding or teaming configurations, you can configure a Red Hat Enterprise Linux (RHEL) server to use multiple networking switch ports for added performance and redundancy. Depending on the networking switch’s capabilities, various configurations are possible. Assuming the switch supports 802.3ad link aggregation groups (LAGs), you can logically bundle multiple switch ports and multiple server network interface cards (NICs) into a single bonding or teaming device on the RHEL server.
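If a team device is already up, a quick way to confirm that the switch actually negotiated the 802.3ad LAG is to look for the LACP aggregator information in the teamd state. This is only a minimal check, assuming a team device named team1 (the name used later in this article):
$ sudo teamdctl team1 state view | grep -iE 'aggregator|selected'
Each member port should report an aggregator ID and show as selected.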
The configuration
Here’s a visual of what my server network looks like. Refer back to this during the troubleshooting steps. Specifically, notice the team1 and team10 NIC teaming configurations. The server uses team1 for data connectivity and team10 for storage connectivity. If you wish to reproduce the setup, you can find the configuration scripts at the end of the article.
[Diagram: the server's network layout, showing the team1 and team10 NIC teams (Marc Skinner, CC BY-SA 4.0)]
First, I validated the configuration using these commands:
$ sudo nmcli con
$ sudo teamdctl team1 state view
$ sudo teamdctl team10 state view
$ sudo teamnl team1 ports
$ sudo teamnl team10 ports
If both NICs in one of the teaming devices have issues, network pings stop. But what if one of the NICs has a problem and the other doesn’t? In that case, things continue to function, which is exactly the point of a teaming configuration. So how can you tell there’s an issue? What do you look for? Assuming you graph network bandwidth with a tool like Grafana, would your capacity graph show something of interest? Would it show half the capacity being used? Would you even notice?
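If you graph bandwidth per team rather than per port, a degraded team can hide in plain sight. Here is a rough sketch of a per-port check, using the interface names from this article (adjust them for your system): read each member’s transmit counter from sysfs, wait ten seconds, and compare. If one port barely moves while the other carries all the traffic, the team is running on one leg.
#!/bin/bash
# Compare per-port transmit rates for the team1 members over a 10-second window
NICS="enp2s0 enp9s0"
declare -A before
for nic in $NICS; do
    before[$nic]=$(cat /sys/class/net/$nic/statistics/tx_bytes)
done
sleep 10
for nic in $NICS; do
    after=$(cat /sys/class/net/$nic/statistics/tx_bytes)
    echo "$nic: $(( (after - ${before[$nic]}) / 10 )) bytes/sec"
done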
Display network card state with ethtool
It’s a good idea to monitor the network card link state. Depending on your monitoring software, you may or may not have that capability. RHEL has a few ways to check the link state, including ethtool and ip.
Here is an example of ethtool:
$ sudo ethtool enp9s0
Settings for enp9s0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supported pause frame use: No
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised pause frame use: No
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: Unknown!
Duplex: Unknown! (255)
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
MDI-X: Unknown (auto)
Supports Wake-on: pumbg
Wake-on: g
Current message level: 0x00000007 (7)
drv probe link
Link detected: no
And here is another example, this time using the ip command:
$ sudo ip link show enp9s0
4: enp9s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master team1 state DOWN mode DEFAULT group default qlen 1000
link/ether 68:05:ca:36:42:34 brd ff:ff:ff:ff:ff:ff
You can see Link detected: no and state DOWN in the output of the two commands above. But what if my monitoring software checks for a down link state only every five minutes, or even every three? Would it catch a NIC whose connectivity drops every 15 to 20 seconds, or at some other random interval?
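One rough way to catch a link that flaps faster than your monitoring interval is to poll the kernel’s operational state directly. This is only a sketch (substitute your own interface name), but it prints a timestamped line every time the state changes:
#!/bin/bash
# Print a timestamped line whenever enp9s0 changes operational state
last=""
while true; do
    state=$(cat /sys/class/net/enp9s0/operstate)
    if [ "$state" != "$last" ]; then
        echo "$(date '+%F %T') enp9s0 is $state"
        last=$state
    fi
    sleep 1
done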
Check link state consistency
How can you determine whether the connection is intermittent? You can look at the down count for enp9s0. As you can see from the output of this teamdctl command, enp9s0 has a down count of 8241, so the link has gone up and down a lot.
$ sudo teamdctl team1 state view
setup:
runner: lacp
ports:
enp2s0
link watches:
link summary: up
instance[link_watch_0]:
name: ethtool
link: up
down count: 0
runner:
aggregator ID: 2, Selected
selected: yes
state: current
enp9s0
link watches:
link summary: down
instance[link_watch_0]:
name: ethtool
link: down
down count: 8241
[...]
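If you just want the per-port link status and down counts without the rest of the state output, a quick filter like this works (assuming the port names used in this article):
$ sudo teamdctl team1 state view | grep -E 'enp|link:|down count'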
Where else can you look for this information? What about log files? Is the system logging any of this?
$ sudo grep team1 /var/log/messages* | head -n9
messages:Mar 28 18:02:29 bk1 teamd_team1[1810]: enp9s0: Changed port state: "disabled" -> "expired"
messages:Mar 28 18:02:29 bk1 teamd_team1[1810]: enp9s0: ethtool-link went up.
messages:Mar 28 18:02:31 bk1 teamd_team1[1810]: enp9s0: Changed port state: "expired" -> "disabled"
messages:Mar 28 18:02:31 bk1 teamd_team1[1810]: enp9s0: ethtool-link went down.
messages:Mar 28 18:02:36 bk1 teamd_team1[1810]: enp9s0: Changed port state: "disabled" -> "expired"
messages:Mar 28 18:02:36 bk1 teamd_team1[1810]: enp9s0: ethtool-link went up.
messages:Mar 28 18:02:38 bk1 teamd_team1[1810]: enp9s0: Changed port state: "expired" ->
The network link status is flapping every two to three seconds. This is a lab server, and it doesn’t yet have all of its monitoring set up and configured, so nothing alerted me.
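On systems that log to the systemd journal, you can also pull these teamd events by syslog identifier instead of grepping /var/log/messages; for example, matching the teamd_team1 tag shown in the log lines above:
$ sudo journalctl -t teamd_team1 --since "1 hour ago"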
I stumbled onto the problem because I have a habit of running dmesg -T whenever I log into a system. dmesg stands for diagnostic messages, and the command prints the kernel’s message buffer. The -T option prints a human-readable timestamp for each event. dmesg messages are also logged and written to disk in a log file for safekeeping.
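For example, filtering the kernel ring buffer for the suspect NIC quickly surfaces repeated link messages (the interface name is the one from this article):
$ sudo dmesg -T | grep -i enp9s0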
Solve the problem
Here’s a picture of the source of the problem: A slightly stretched networking cable was causing the port to go up and down many times a minute.
[Photo: the slightly stretched patch cable that caused the flapping port (Marc Skinner, CC BY-SA 4.0)]
Here are a few takeaways:
- Don’t assume fault-tolerant configurations don’t need to be monitored.
- Understand where error messages go and what keywords should be monitored.
- Spot check error logs for terms like “error,” “warning,” and “fail” (see the example after this list). Are you missing something?
- Manually fail components configured for redundancy to see what happens. What gets logged? How does the failure affect the system and performance? Know ahead of time what you should be looking for.
- Set up monitoring and alerting for everything important.
- When an issue happens, do a root cause analysis to understand the cause better.
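For the log spot check mentioned above, something as simple as this is a reasonable starting point; adjust the log path and keywords for your environment:
$ sudo grep -icE 'error|warn|fail' /var/log/messages
$ sudo grep -iE 'error|warn|fail' /var/log/messages | tail -n 20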
With my issue, looking through my logs pointed to the date when I physically moved my rack, which is on wheels. I remember rolling it out to do some maintenance on another server, which is probably when the cable got stretched too far. A quick patch cable replacement, and I was back in business.
Reference: Configuration scripts
Below are two NetworkManager nmcli scripts that configure two dual-port NICs into teaming devices using LACP. The first, a dual-port 1Gb NIC, becomes team1; the other, a dual-port 10Gb NIC, becomes team10.
Here is the team1 configuration script:
#!/bin/bash
# Member NICs and static IP for the team1 (data) interface
NIC1=enp2s0
NIC2=enp9s0
HOSTIP=192.168.33.50
# Create the team device and configure the LACP runner with ethtool link watching
nmcli con add type team ifname team1 con-name team1
nmcli connection modify team1 team.config '{"runner": {"name": "lacp", "active": true, "fast_rate": true, "tx_hash": ["ipv4","tcp","udp"]}, "link_watch": {"name": "ethtool"}, "tx_balancer": { "name": "basic"}}'
# Add both NICs as ports of the team
nmcli con add type ethernet con-name team1-$NIC1 ifname $NIC1 master team1
nmcli con add type ethernet con-name team1-$NIC2 ifname $NIC2 master team1
# Static IPv4 addressing and DNS (the + prefix appends the second DNS server)
nmcli connection modify team1 ipv4.addresses $HOSTIP/24
nmcli connection modify team1 ipv4.gateway 192.168.33.1
nmcli connection modify team1 ipv4.dns 192.168.33.31
nmcli connection modify team1 +ipv4.dns 192.168.33.32
nmcli connection modify team1 ipv4.method static
nmcli connection modify team1 ipv4.dns-options rotate,timeout:1
nmcli connection modify team1 ipv4.dns-search "i.skinnerlabs.com"
# Autoconnect at boot; IPv6 disabled
nmcli connection modify team1 connection.autoconnect "yes"
nmcli connection modify team1 ipv6.method ignore
The team10 configuration script looks like this:
#!/bin/bash
# Member NICs and static IP for the team10 (storage) interface
NIC1=enp6s0f0
NIC2=enp6s0f1
HOSTIP=192.168.126.50
# Create the team device and configure the LACP runner with ethtool link watching
nmcli con add type team ifname team10 con-name team10
nmcli connection modify team10 team.config '{"runner": {"name": "lacp", "active": true, "fast_rate": true, "tx_hash": ["ipv4","tcp","udp"]}, "link_watch": {"name": "ethtool"}, "tx_balancer": { "name": "basic"}}'
# Add both NICs as ports of the team
nmcli con add type ethernet con-name team10-$NIC1 ifname $NIC1 master team10
nmcli con add type ethernet con-name team10-$NIC2 ifname $NIC2 master team10
# Static IPv4 addressing, autoconnect at boot; IPv6 disabled
nmcli connection modify team10 ipv4.addresses $HOSTIP/24
nmcli connection modify team10 ipv4.method static
nmcli connection modify team10 connection.autoconnect "yes"
nmcli connection modify team10 ipv6.method ignore
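After running the scripts, activate the connections with nmcli (and bring up the port connections too if they don't autoconnect):
$ sudo nmcli con up team1
$ sudo nmcli con up team10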