Where and where not to use QoS


Some recent QoS debates got me thinking, so I'd like to offer some of my thoughts about QoS in the hope that you'll find them useful.

Perhaps the starting point here is that many sites do not appear to be doing QoS at all. Or perhaps that's just the sample of sites I've visited recently. Either way, understanding what QoS can accomplish for you should help you decide whether you need it.

Some network circumstances require QoS. Specifically, it is needed when bandwidth cannot be over-provisioned and there is potential for congestion.

Here are some common scenarios where QoS can be beneficial:

- Upstream aggregation: If you have a lot of switch ports sending traffic upstream, say 48 x 1 Gbps ports running at 50% utilization, you may have 24 Gbps of data attempting to escape on one or two 10 Gbps uplinks. Something's got to give! QoS allows you to choose winners and losers – more specifically, which traffic receives priority and which is discarded or slightly delayed when there is congestion.

- Downstream de-aggregation: If you have traffic blasting down a 10 Gbps link that must exit via a 1 Gbps link, such as to an end device, any surplus traffic must be delayed or discarded. TCP will throttle its flows in response to the drops. However, you will most likely want to protect your VoIP and video from drops and queuing delays. (I'd use the comparison of a superhighway exit ramp here, but surplus cars don't magically vanish!)

- Shaping: If traffic enters your WAN/MAN router at 1 Gbps and exits on a 1 Gbps port while you're paying for, say, 500 Mbps of capacity, this is analogous to the downstream case. You can send more than 500 Mbps, but the provider will most likely enforce the agreed bandwidth, so they'll choose what to drop. If they support QoS, they will at least pay heed to your priorities, as indicated by your DSCP markings. QoS shaping lets you slow your output to the contracted rate, queuing and dropping according to your preferences.
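The shaping scenario can be sketched in IOS-style MQC configuration. This is only an illustrative sketch: the class names, percentages, and interface are placeholders, and exact syntax varies by platform and software version.

```
! Parent policy shapes to the contracted 500 Mbps; the child policy
! decides which traffic gets priority within the shaped rate.
class-map match-any VOICE
 match dscp ef
class-map match-any VIDEO
 match dscp af41
!
policy-map CHILD-QUEUING
 class VOICE
  priority percent 20          ! low-latency queue for voice
 class VIDEO
  bandwidth percent 30         ! guaranteed share; can borrow idle bandwidth
 class class-default
  bandwidth percent 50
!
policy-map SHAPE-TO-CONTRACT
 class class-default
  shape average 500000000      ! 500 Mbps contracted rate, in bps
  service-policy CHILD-QUEUING
!
interface GigabitEthernet0/1
 service-policy output SHAPE-TO-CONTRACT
```

The point of the hierarchy: the shaper creates the congestion point at the contracted rate, so that your queuing policy, not the provider's policer, decides what gets delayed or dropped.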

And here’s one example when QoS can’t help:

No-QoS in the Provider: Assume you have a low-cost L2 or L3 WAN/MAN provider linking your remote locations to the main site. The various customers' flows will inevitably merge at several places in the provider network. If the total outgoing traffic on any internal provider interface exceeds the available bandwidth, traffic will be dropped at random, regardless of the DSCP bits you set. Your priorities are not their priorities; they don't have any. In other words, your priority traffic may be discarded because another customer is generating a lot of traffic.


The good news is that L2 switch-based providers may have 40 or 100 Gbps switch interconnections and run at aggregate customer utilization levels where drops are uncommon, even if one or two customers are generating abnormally high volumes of traffic. Even so, you may have good days and bad days. How much over-provisioning is a small or low-cost provider likely to maintain, considering that it affects their profitability? How quickly can they add bandwidth when their reporting system (if they have one) shows that a link is consistently "hot"?

That brings up an important point: QoS is application-quality insurance against the bad days. If you deal with a provider that does not offer QoS, you may save money and find that things work fine most of the time. However, you and they have less control over what happens in the three scenarios listed above.

Loss-Tolerant Traffic

Another way to think about QoS is in terms of drop-tolerant traffic – TCP, specifically. Drops prompt the TCP sender to slow down (via unacknowledged packets). So part of what we do with QoS is protect "fragile" traffic (my term for it), such as VoIP and video, and then distribute the remaining bandwidth across drop-tolerant classes.

Yes, TCP-based video exists. I'd expect any retransmissions to result in a brief "freeze" of the video display. TCP video programs may buffer for a few seconds before displaying content to allow for retransmission. Multiple lost packets might still cause issues. This might explain some hospital ultrasound difficulties we spent considerable time trying to resolve. (We discontinued the troubleshooting because it took too long, given the large number of devices in the path. As a temporary remedy, we found that the video app worked better without QoS, and the switches will be replaced shortly.)

When building QoS, I advocate dividing outgoing bandwidth by percentages, which express the ratio of bandwidth each class receives. That way, when there is any free bandwidth, other classes can use it. Adding per-class shaping or policing statements caps that class's traffic, which can result in idle (wasted!) bandwidth. I like to shape only when the contracted rate is less than the line rate, such as 2 Gbps on a 10 Gbps link.
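To make the percentage point concrete, here is a hedged IOS-style sketch; the class names and numbers are illustrative, and the class-maps are assumed to be defined elsewhere.

```
! Percentage-based allocation: guarantees apply only under congestion,
! so any class may use idle bandwidth when the link is quiet.
policy-map WAN-OUT
 class TRANSACTIONAL
  bandwidth percent 40
 class class-default
  bandwidth percent 60
!
! By contrast, a hard policer caps the class even when the link is idle:
!  class TRANSACTIONAL
!   police cir 100000000 conform-action transmit exceed-action drop
```

The percentage form shares unused capacity; the policed form turns the same number into a ceiling that wastes idle bandwidth.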

A "BULK" class is frequently used in QoS setups. (To use the original name.) The notion is that some traffic, such as file transfers, can be allotted, say, 1% of the capacity. That is, when other applications are sending, their traffic takes priority; when there is spare bandwidth, the bulk traffic is allowed to transmit. For example, you might classify backups as BULK but dedicate, say, 10% of the bandwidth to ensure they complete within 24 hours (based on experimentation, and recognizing that backup traffic generally grows over time, so backups will take longer to complete). Also, be extremely cautious about replication traffic. (An unscheduled DB replication by the server crew might ruin your day!) When any single flow can consume a considerable portion of the available bandwidth, planning is required.
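A sketch of the BULK idea in IOS-style syntax; the backup server address, ACL name, and percentage here are hypothetical:

```
! Classify traffic from an assumed backup server into BULK and
! guarantee it 10%; it can still use idle bandwidth off-hours.
ip access-list extended BACKUP-TRAFFIC
 permit ip host 10.1.1.50 any
!
class-map match-any BULK
 match access-group name BACKUP-TRAFFIC
!
policy-map WAN-EDGE-OUT
 class BULK
  bandwidth percent 10
```

Because this is a bandwidth guarantee rather than a policer, nightly backups can still soak up the whole link when nothing else is running.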

Real World Scenarios

A medical system: remote clinics have MAN/WAN connections of less than 1 Gbps (cost, availability). When radiology images are sent back to the main site, VoIP (IP phone) calls get crushed. Under contention, packets are lost, degrading voice quality and slowing the image file transfers. QoS can favor the voice traffic over the image transfers. In short, QoS can get traffic out onto the WAN in good shape.

If the WAN provider does not offer QoS, the radiology traffic may need to be shaped or policed to limit its consumption of bandwidth in the provider network. This ties back to the merging-flows situation described above.
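One hedged sketch of policing the radiology traffic at the WAN edge; the ACL name, class name, and 300 Mbps rate are assumptions for illustration only:

```
! Cap radiology image transfers so they can't starve other traffic
! inside a provider network that ignores our DSCP markings.
class-map match-any RADIOLOGY
 match access-group name RADIOLOGY-SERVERS   ! ACL matching the imaging servers
!
policy-map WAN-EDGE-OUT
 class RADIOLOGY
  police cir 300000000 conform-action transmit exceed-action drop
```

Note the tradeoff: a hard policer protects other traffic, but it caps the uploads even when the provider path happens to be idle.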

Limiting radiology traffic isn't ideal, since it's difficult to forecast and regulate what happens in the provider network when simultaneous radiology uploads from various sites are in flight. If you cap the bandwidth, you may delay radiology uploads from a given site when they didn't need to be delayed. In such instances, doctors can get quite agitated and outspoken!

Conclusion: There isn’t much you can do to make up for a carrier that doesn’t supply QoS.

Working with QoS Complexity

It's common for me to hear that configuring Cisco QoS is a pain. I'm not inclined to argue: deploying QoS can take a significant amount of time and attention. The commands also differ among Cisco platforms, though this has improved in recent years. (Except for Nexus QoS, which I regard as a rather strange QoS CLI, evidently tied directly to the hardware capabilities.)

One answer: obtain a license for Cisco Prime or DNAC and use it to automate QoS deployment. That is a lot easier.

Alternatively, if you have a large deployment, templatize your QoS configuration.

Most sites do not appear to be using DNAC yet, perhaps because their switches are not yet due for replacement, or because of COVID. Another reason is failure to recognize the importance of QoS and the work involved in establishing and sustaining it.

QoS Implementation Design

Another aspect of QoS is taking a systematic design approach, since we need to "classify and mark" (C&M) inbound traffic in order to leverage the markings upstream.

The optimal location for C&M is the campus access switch, so that VoIP prioritization can be applied as traffic moves upstream. The WAN edge is frequently where we need "fancier" QoS. So one possibility is to build WAN router QoS first (C&M inbound, fancy QoS out to the WAN) and then retrofit the campus, or to presume the campus has plenty of bandwidth (which I don't recommend).
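A minimal sketch of access-layer classify-and-mark in IOS-style syntax. The interface range and the trust-EF-from-phones choice are illustrative; real deployments typically match far more classes (video, signaling, management, and so on).

```
! Trust EF arriving from IP phones; re-mark everything else to best effort
! so that user devices can't self-promote their traffic.
class-map match-any VOICE
 match dscp ef
!
policy-map ACCESS-EDGE-IN
 class VOICE
  set dscp ef
 class class-default
  set dscp default
!
interface range GigabitEthernet1/0/1 - 48
 service-policy input ACCESS-EDGE-IN
```

Once marked at the access edge, upstream devices only need queuing policies that act on the DSCP values, not full re-classification.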

Another topic of debate is data center QoS. There is generally plenty of bandwidth available. But there are some significant flows as well.

WiFi, VPN, and so on are all distinct areas and concerns. When it comes to wireless, having the on-wire CAPWAP or other tunnel header capped at a low DSCP value can completely destroy, for example, guest wireless video. I’m not a fan of that technique, but that’s the norm, and you don’t want to read my tirade about it.

And, most recently, there's QoS and VMware. The most important thing I know here is that you want to ensure that your call manager, ISE, and other critical applications get plenty of CPU cycles (via shares) and interface bandwidth from the server chassis. (A word of caution: if the VMware admin notices that your VM isn't using its full share, he or she may cut the share amount to allow additional VMs on the VMware host. Sluggish application response is a symptom of this.)

I'm going to set aside certain other QoS elements of NSX for the time being. If you can configure QoS and DSCP markings per VM, that's fantastic. However, I'm fine with performing server-side C&M on the data center access switch, which also gives network-side verifiability and consistency. Simpler!

From an Operations Perspective

I'll keep this brief. From an operations standpoint, you need to verify that the configuration you planned was correctly deployed. Engineers deploying it WILL get bored, and operator mistakes will creep in. In addition, when troubleshooting QoS, my first question is generally, "Are we sure the QoS config hasn't changed?" Config drift and compliance failures are both unpleasant.
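For the verification point, the usual IOS starting points are the per-policy counters and the running config; the interface names here are placeholders.

```
! Per-class counters: are packets matching, queuing, and dropping
! the way the design intended?
show policy-map interface GigabitEthernet0/1

! Spot-check for config drift against your deployment template
show running-config | section policy-map
show running-config interface GigabitEthernet0/1
```

If the class counters on a supposedly busy class sit at zero, the classification is wrong or the policy never got applied, both common deployment mistakes.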

I'll admit that my hands-on experience with DNAC QoS is currently limited, but my peers are enthusiastic about it. Because the payoff can be so significant, I'll stand by the preceding recommendation: it has the potential to save a significant amount of time and money, and to let you do QoS without deep expertise, among other benefits.

Conclusion

I hope the information above has given you some tools for thinking about QoS and where it matters most.

To me, the most important point is to strongly prefer percentages, bandwidth sharing, and relative priority.

When you start putting in Mbps or Gbps values for policing or shaping, you're creating a Not To Exceed situation in which that class of traffic cannot use the free bandwidth, even at night when there is no competing traffic.
