Troubleshooting Increased Network Round Trip Time (NRTT)

Network Round Trip Time can be defined by the following equation:

NRTT = S_Delay + Q_Delay + R/SW_Delay + D_Delay + P_Delay

Where:

S_Delay
Serialization Delay – (Frame size in bytes * 8) / (Access Rate in bits per second)

Q_Delay
Queue Delay – dependent on utilization and S_Delay

R/SW_Delay
Routing/Switch Delay – typically no greater than 1 ms per hop

D_Delay
Distance Delay – propagation delay due to distance traveled. Typically 5 µs/km for fiber, 5.56 µs/km for copper, and 3.3 µs/km for satellite

P_Delay
Protocol Delay – delay added by transmission or higher-level protocols, for example CSMA/CD for shared Ethernet
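
As a worked illustration of two of the terms above, the following minimal sketch computes S_Delay and D_Delay in Python. The function names and the sample values (a 1500-byte frame, a 1.544 Mbps T1 access rate, a 1000 km fiber path) are assumptions chosen for illustration, and each result covers a single traversal of the link.

```python
# Per-kilometer propagation delays quoted in this article, in microseconds.
PROPAGATION_US_PER_KM = {"fiber": 5.0, "copper": 5.56, "satellite": 3.3}


def serialization_delay_ms(frame_size_bytes, access_rate_bps):
    """S_Delay = (frame size * 8) / access rate, returned in milliseconds."""
    return frame_size_bytes * 8 / access_rate_bps * 1000


def distance_delay_ms(distance_km, medium="fiber"):
    """D_Delay for one traversal of distance_km, returned in milliseconds."""
    return distance_km * PROPAGATION_US_PER_KM[medium] / 1000


# Assumed sample values: 1500-byte frame on a T1, 1000 km of fiber.
s_delay = serialization_delay_ms(1500, 1_544_000)
d_delay = distance_delay_ms(1000, "fiber")
print(f"S_Delay = {s_delay:.2f} ms, D_Delay = {d_delay:.2f} ms")
```

With these sample values the sketch reports an S_Delay of about 7.77 ms and a D_Delay of 5.00 ms, which shows why a failover to a lower-bandwidth or longer path is visible in NRTT.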

Generally, an increase in the NRTT associated with an application is caused by an increase in one or more of the variables listed above. The typical causes are listed below in order of how often they occur:

Increase in Q_Delay caused by increased utilization of a circuit, which deepens network queues
Increase in D_Delay because of carrier or enterprise failover to a protected/redundant path that is longer in distance
Increase in R/SW_Delay because of network errors
Increase in S_Delay because of enterprise failover to a redundant path with lower bandwidth

To determine if a network issue is limited to a single remote site, compare spikes in the NRTT of the affected remote site with the NRTT of other sites that are comparable in terms of distance from the server, bandwidth, and user count. If the NRTT increases or spikes at multiple sites at the same point in time, the issue could be carrier related or might be caused by instability in the routing protocol.
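
The site comparison described above can be sketched in Python. This is only an illustration: the data layout (a dictionary of timestamp-to-NRTT samples per site), the site names, and the rule that a spike is any sample above twice the site median are all assumptions, not behavior of the management console.

```python
from statistics import median


def spike_times(samples, factor=2.0):
    """Return the timestamps whose NRTT exceeds factor * the site median."""
    baseline = median(samples.values())
    return {t for t, nrtt in samples.items() if nrtt > factor * baseline}


def shared_spikes(sites, factor=2.0):
    """Map each timestamp to the sites that spiked then (two or more sites only)."""
    by_time = {}
    for site, samples in sites.items():
        for t in spike_times(samples, factor):
            by_time.setdefault(t, []).append(site)
    return {t: sorted(names) for t, names in by_time.items() if len(names) > 1}


# Hypothetical NRTT samples in milliseconds for three comparable sites.
sites = {
    "site_a": {"09:00": 40, "09:05": 42, "09:10": 130, "09:15": 41},
    "site_b": {"09:00": 38, "09:05": 39, "09:10": 120, "09:15": 40},
    "site_c": {"09:00": 45, "09:05": 44, "09:10": 46, "09:15": 150},
}
print(shared_spikes(sites))
```

A timestamp that appears in the result is a candidate for a carrier or routing problem, while a spike that only one site shows points back at that site.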

Increase in NRTT and Observations Count

An increase in both the NRTT and the Observations count is a strong indicator that a performance problem stems from an application's utilization of the network. You can reinforce this indicator by correlating it with other corresponding data points to build a complete finding that the problem source is the network.

If both the NRTT and number of Observations peak at the same point in time as the observed performance issue, review the following data sets for the same point in time:

Components – [Retransmission Delay] Check whether the Retransmission Delay increased. Retransmissions indicate that network queues are filling faster than they can be emptied, which causes packet drops and the related TCP retransmissions.
Sessions – [Connection Setup Time, TCP/IP Sessions] Check whether there is a concurrent increase in Network Connection Setup Time. This increase indicates that the three-way TCP handshake is being delayed on the network because other pre-established sessions have increased the queue depth. Also check whether the number of TCP/IP sessions increased significantly (by more than 10%). Additional TCP sessions and the accompanying application data require more bandwidth.
Traffic – [Data Volume, Data Rate] Check whether the Data Volumes or Data Rates increased. Higher volumes of data on the network increase queue depth and the related delay. Abnormal increases in data volume that coincide with increases in NRTT indicate a network that is having difficulty keeping up with demand.
QoS – [Users] Check whether there is a significant increase in the number of users. Increases in network utilization typically coincide with increases in the number of users. The user count at which the NRTT begins to degrade can serve as a proactive trigger point for upgrading network bandwidth at other similar sites.
Statistics – [Response Time Composition: Standard Deviation, Network Round Trip Time Percentiles] Check whether the standard deviation or the percentiles for NRTT increased. This increase indicates inconsistent and sporadic network performance, as evidenced by more "outlying" data points (that is, points at significantly varying distances from the average), and is a strong indicator of network-based issues.
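
Two of the checks above lend themselves to a short calculation: the greater-than-10% test for TCP/IP sessions and the comparison of NRTT standard deviation. The Python sketch below is a hedged illustration with made-up sample numbers and hypothetical function names; the management console performs its own calculations, which may differ.

```python
from statistics import mean, pstdev


def pct_change(before, after):
    """Percentage change from a baseline value to an incident value."""
    return (after - before) / before * 100


def session_increase_is_significant(before, after, threshold_pct=10.0):
    """Apply the 'greater than 10%' rule of thumb from the Sessions check."""
    return pct_change(before, after) > threshold_pct


def spread(nrtt_ms):
    """Return the mean and population standard deviation of NRTT samples."""
    return mean(nrtt_ms), pstdev(nrtt_ms)


# Hypothetical NRTT samples (ms) before and during a reported incident.
baseline = [40, 42, 41, 39, 40]
incident = [40, 95, 41, 180, 44]

print(session_increase_is_significant(200, 230))  # 15% more sessions
print(spread(baseline)[1] < spread(incident)[1])  # wider spread during incident
```

A wider spread during the incident window corresponds to the "outlying" data points that the Statistics check looks for.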

Increase in NRTT and Decrease in Observations Count

An increase in the NRTT while the Observations count decreases can indicate either of two very different events:

Another application on the network is responsible for the increase in NRTT. This application might, or might not, be monitored by the management console.
The application service has become unreliable because of instability on the network (link failures, routing divergence, STP divergence, carrier failover, and other errors). This might result in sporadic loss of service or ultimately a complete service outage.
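
The way this article reads the NRTT and the Observations count together can be summarized as a small decision function. This Python sketch is an assumption-laden simplification: the function name and return strings are hypothetical, and it compares only two values per metric rather than a full time series.

```python
def classify(nrtt_before, nrtt_after, obs_before, obs_after):
    """Summarize what a change in NRTT and Observations count suggests."""
    if nrtt_after <= nrtt_before:
        return "no NRTT increase"
    if obs_after > obs_before:
        # NRTT up, Observations up: the application itself is loading the network.
        return "application utilization of the network"
    if obs_after < obs_before:
        # NRTT up, Observations down: two possible events, to be told apart manually.
        return "another application or network instability"
    return "inconclusive"


print(classify(40, 120, 1000, 1400))
print(classify(40, 120, 1000, 600))
```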

If the NRTT spikes while the Observations count dips at the same point in time as the observed performance issue, complete the following actions to determine which event mentioned above caused the degradation in performance.

Determine the Cause of a Degradation in Performance

Determine if another application is active on the network:

Click the Engineering page.
Select the following Settings from any Response Time view:
- Application – All
- Server – All
- Network Set – the relevant aggregation, such as the remote site aggregation
On the Response Time Composition view, click the blue hypertext Application link to see all applications that are being monitored by the management console on this server. If another application appears in the resulting Performance Map, complete these steps again for that application to determine if it is the source of the performance issue. The key indicator is an increase in the number of observations at the same time that the performance issue was reported.
If no other application appears in the resulting Performance Map, use the back arrow to go back to the main Engineering page. Select Trends from the horizontal menu and check whether this performance event shows a pattern over the past weeks and month. If a pattern appears, use historic NetFlow data, IP accounting, or protocol analyzer data to determine if an application was saturating the network at the times in question. If no historical data is available, you can manually review the network at the projected recurring time and date to identify the problem source application. Examples of applications that create issues for the primary application on a network are:
- Replication or backup data between remote sites
- Large file transfers between servers
- Users streaming large amounts of data on subrate WAN links (< T1)
- Anti-virus upgrades across subrate WAN links

Determine if the network has become unreliable:

Check the switch port facing the server and the server NIC to ensure that both are set to the correct duplex and speed settings (see the following table) and are free from errors.

Server           Switch           Result
Auto             Auto             Full duplex, auto speed
Auto             Manual           Half duplex, manual speed
Manual           Auto             Half duplex, manual speed
Manual – Full    Manual – Full    Full duplex, manual speed (assumes same speed is set on both ends)

Determine if the network has become unreliable because of configuration issues or network errors by setting up thresholds in the management console to launch an investigation when NRTT exceeds accepted values. The management console automatically gathers relevant information from routers and switches.
Review router and switch logs, interface errors, and change records to discover any events that might be affecting the stability of the network, such as routing divergence and circuit errors.
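
The duplex and speed table in the first step above can be expressed as a simple lookup, which may be useful when checking many server and switch port pairs. The Python sketch below is an illustration only: the setting labels and function name are hypothetical, and it assumes that the first and last table rows describe both ends being set identically.

```python
# Outcomes from the duplex/speed table, keyed by (server setting, switch setting).
# Assumption: the "Auto" and "Manual - Full" rows apply to both ends.
RESULTS = {
    ("auto", "auto"): "Full duplex, auto speed",
    ("auto", "manual"): "Half duplex, manual speed",
    ("manual", "auto"): "Half duplex, manual speed",
    ("manual-full", "manual-full"): "Full duplex, manual speed",
}


def duplex_result(server, switch):
    """Return the expected link result for a server/switch setting pair."""
    key = (server.lower(), switch.lower())
    return RESULTS.get(key, "not covered by the table")


print(duplex_result("Auto", "Manual"))
print(duplex_result("Manual-Full", "Manual-Full"))
```

A mismatched pair that resolves to half duplex is worth checking first, because the table shows it as the outcome whenever only one end is set to Auto.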