The Failure, part 3: Diggin’ the Kernel

This post is the fifth in a series that covers gratuitous ARP and its relation to OpenStack. In the previous posts, we discussed what gratuitous ARP is and how it’s implemented in OpenStack Neutron L3 agent. We also introduced the Failure, covered its initial triaging, and looked at traffic captured during failed test runs.

All previous attempts to figure out what went wrong with the Failure hasn’t succeeded. In this post we will look closer at the Linux kernel code to learn about ARP table states, and to see if we can find any quirks in the kernel ARP layer that could explain the observations.

It is advised that you make yourself comfortable with previous posts in the series before proceeding with reading this “episode”.

Diggin’ the Kernel

Some of you may know that in my previous life, I was a Linux kernel developer (nothing fancy, mostly enabling embedded hardware). Though this short experience made myself more or less comfortable with reading the code, I figured I could use some help from the vast pool of Red Hat kernel developers. So I reached to Lance Richardson who, I was told, could help me figure out what’s going on with the Failure. And indeed, his help was enormous. In next several days, we discussed the kernel code on IRC, were digging old kernel mailing list archives, built and tested a bunch of custom kernels with local modifications to its networking layer. Here is what we’ve found.

Gratuitous ARP with arp_accept

Since RHEL7 kernel is quite old (3.10.0-514.22.1.el7 at the time of writing), we decided to start our search by looking at patches in Linus master branch and see if there were any that could be of relevance to the Failure, and that were not backported yet into RHEL7 kernel. The primary files of interest in the kernel source tree were net/ipv4/arp.c (the ARP layer) and net/core/neighbour.c (the neighbour layer which is an abstract representation of address-to-MAC mappings used for IPv4 as well as for IPv6).

Digging through the master branch history, the very first patch that drew our attention was: “ipv4: arp: update neighbour address when a gratuitous arp is received…” What the patch does is it forces override of an existing ARP table entry when a gratuitous ARP packet is received irrespective of whether it was received in locktime time interval. It is effective only when arp_accept is enabled, which is not the default. Anyway, that ringed some bell, and also suggested that maybe we dealt with a timing issue. The patch assumed arp_accept enabled and temporarily disabled the locktime behavior for gratuitous ARP packets, so let’s have a closer look at those two sysctl knobs.

Here is the documentation for the arp_accept sysctl knob. The setting controls whether the kernel should populate its ARP table with new entries on receiving ARP packets if the IP addresses are not registered in the table yet. Enabling the setting may be useful if you want to “warm up” the ARP table on system startup without waiting for the node to send its very first datagram to an IP address. The idea is that the kernel will listen for any ARP packets flying by, and it will create new table entries for previously unseen IP addresses. The default for the option is 0 (meaning off), and that’s for a reason. Enabling the feature may have unexpected consequences because the size of the kernel ARP table is limited, and in large network segments it may happen that the kernel will overflow the table with irrelevant entries due to the “warming up”. If that ever happens, the kernel may then start dropping some table entries that may still be useful. If that happens, you can see slowdown for some upper layer protocol connections for the time needed to restore the needed ARP entries using a round-trip of ARP probe packets. Long story short, the arp_accept setting is not for everyone.

As for locktime, there seems to be no in-tree documentation for the sysctl parameter, so the best source of information is probably arp(7). Quoting: “[locktime is] the minimum number of jiffies to keep an ARP entry in the cache. This prevents ARP cache thrashing if there is more than one potential mapping (generally due to network misconfiguration). Defaults to 1 second.” What it means is that if an ARP packet arrives during a 1 second interval since the previous ARP packet, it will be ignored. This is helpful when using ARP proxies where multiple network endpoints can reply to the same ARP REQUEST. In this case, you may want to ignore those replies that arrive later (to avoid so called ARP thrashing, as well as to stick to the node that is allegedly quicker/closer to the node).

With the above mentioned kernel patch, and arp_accept set to 1, the kernel should always update its ARP table if a gratuitous ARP packet is received, even if the entry is still in the locktime time interval.

Though arp_accept is not applicable for everyone, it was still worth exploring. I backported the patch into RHEL7 kernel, rebooted the tempest node, enabled arp_accept for eth2, and rerun the tests. Result? Same failures. So why hasn’t it worked?

Code inspection of neigh_update hasn’t revealed anything interesting. Everything suggested that override was still false. It took me awhile, but then it struck me: the code to determine whether an ARP packet is gratuitous considered frames of Request type, but not Reply. And Neutron L3 agent sends Replies (note the -A option instead of -U using in the arping command line)!
Here is how arping(8) defines those options:

-A     The same as -U, but ARP REPLY packets used instead of ARP REQUEST.
-U     Unsolicited ARP mode to update neighbours' ARP caches.  No replies are expected.

The next step was clear: let’s try to switch all Neutron L3 agents to gratuitous ARP requests and see if it helps. So I applied a one-liner to all controller nodes, restarted neutron-l3-agent services, and repeated the test run. It passed. I even passed multiple times in a row, first time in a long time I was banging my head over the issue!

OK, now I had a workaround. To pass tests, all I needed was:

  • Get a kernel that includes the patch (officially released as 3.14 on Mar 30, 2014);
  • Enable arp_accept for the external (eth2) interface;
  • Restart neutron-l3-agent services with the one-liner included.

But does it make sense that the kernel accepts gratuitous REQUESTs but not REPLYs? Is there anything in RFCs defining ARP that would suggest REPLYs are a different beast? Let’s have a look.

As we’ve learned in the very first post in the series, gratuitous ARP packets are defined in RFC 2002. Let’s quote the definition here in full.

-  A Gratuitous ARP [23] is an ARP packet sent by a node in order to
 spontaneously cause other nodes to update an entry in their ARP
 cache.  A gratuitous ARP MAY use either an ARP Request or an ARP
 Reply packet.  In either case, the ARP Sender Protocol Address
 and ARP Target Protocol Address are both set to the IP address
 of the cache entry to be updated, and the ARP Sender Hardware
 Address is set to the link-layer address to which this cache
 entry should be updated.  When using an ARP Reply packet, the
 Target Hardware Address is also set to the link-layer address to
 which this cache entry should be updated (this field is not used
 in an ARP Request packet).

So clearly both gratuitous ARP “flavors”, REQUEST and REPLY, are defined by the standard. There should be no excuse for the kernel to handle valid gratuitous REPLY packets in any other way than REQUESTs. To fix the wrongdoing, I posted a patch that makes the kernel to honor gratuitous REPLYs the same way as it does REQUESTs. (The patch is now merged in netdev master.)

Even though the kernel fix landed and will probably be part of the next 4.12 release, OpenStack Neutron still needs to deal with the situation somehow for the sake of older kernel, so it probably makes sense to issue REQUESTs from Neutron L3 agents to help those who rely on arp_accept while using an official kernel release. The only question is, should we issue both REQUESTs and REPLYs, or just REQUESTs? For Linux network peers, REQUESTs work just fine, but is there a risk that some other networking software stack honors REPLYs but not REQUESTs?.. To stay on safe side, we decided to issue both.

Anyhow, we discussed before that arp_accept is not for everyone, and there is a good reason why it’s not enabled by default. OpenStack should work irrespective of the sysctl knob value set on other network hosts, that’s why the patches mentioned above couldn’t be considered a final solution.

Besides, arp_accept only disabled locktime mechanism for gratuitous ARP packets, but we haven’t seen any ARP packets before the first gratuitous packet arrived. So why hasn’t the kernel honored it anyway?

Locktime gets in the way of gratuitous ARP packets

As we’ve already mentioned, without arp_accept enforcement, neigh_update hasn’t touched the MAC address for the corresponding ARP table entry. Code inspection suggested that the only case when it could happen was if arp_process passes flags=0 and not flags=NEIGH_UPDATE_F_OVERRIDE into neigh_update. And in the RHEL 7.3 kernel, the only possibility for that to happen is when all three gratuitous ARP replies would arrive in locktime time interval.

But they are sent with a 1-second interval between them, and the default locktime value is 1 second too, so at least the last, or even the second packet in the 3-set should have affected the kernel. Why hasn’t it?..

Let’s look again at how we determine whether an update arrived during the locktime:

override = time_after(jiffies, n->updated + n->parms->locktime);

What the quoted code does is it checks whether an ARP packet is received in [n->updated; n->updated + n->params->locktime] time interval, where n->params->locktime = 100. And what does n->updated represent?

If override is false, call to neigh_update will bail out without updating the MAC address or the ARP entry state. But see what it does just before bailing out: it sets n->updated to current time!

So what happens is that the first gratuitous ARP in the 3-series arrives in locktime interval; it calls neigh_update with flags=0 that updates n->updated and bails out. By moving n->updated forward, it also effectively moves forward the locktime window without actually handling a single frame that would justify that! The next time the second gratuitous ARP packet arrives, it’s again in the locktime window, so it again calls neigh_update with flags=0, which again moves the locktime window forward, and bails out. The exact same story happens for the third gratuitous ARP packet we send from Neutron L3 agent.

So where are we at the end of the scenario? The ARP entry never changed its MAC address to reflect what we advertised with gratuitous ARP replies, and the kernel networking stack is not aware of the change.

This moving window business didn’t seem right to me. There is a reason for locktime, but its effect should not take longer than its value (1 second), and this was clearly not what we’ve seen. So I poked the kernel a bit more and came up with a patch that avoids updating n->updated if neither entry state nor its MAC address would change on neigh_update. With the patch applied to RHEL kernel, I was able to pass previously failing test runs without setting arp_accept to 1. Great, seems like now I had a proper fix!

(The patch is merged, and will be part of the next kernel release. And in case you care, here is the bug for RHEL kernel to fix the unfortunate scenario.)

But why would kernel even handle gratuitous ARP differently for existing ARP table entries depending on arp_accept value? The sysctl setting was initially designed to only control behavior when an ARP packet for a previously unseen IP address was processed. So why the difference? In all honesty, there is no reason. It’s just a bug that sneaked into the kernel in the past. We figured it makes sense to fix it while we are at it, so I posted another kernel patch (this required some reshuffling and code optimization, hence the patch series). With the patch applied, all gratuitous ARP packets will now always update existing entries. (And yes, the patch is also merged and will be part of the next release.)

Of course, a careful reader may wonder why locktime even considers entry state transitions and not just actual ARP packets that are received on the wire, gratuitous or not. That’s a fair question, and I believe that the answer here is, “that’s another kernel bug”. That being said, brief look at the kernel code suggests that it won’t be too easy to make it work the way it should. It would require major rework of kernel neigh subsystem to make it track state transitions independently of MAC/IP transitions. I figured I better leave it to later (also known as never).

How do ARP table states work?

So at this point it seems like I have a set of solutions for the problem.

But one may ask, why has the very first gratuitous ARP packet arrived during locktime? If we look at the captured traffic, we don’t see any ARP packets before the gratuitous ARP burst.

But it turns out that you can get n->updated bumped even without a single ARP packet received! But how?

The thing is, the neighbour.c state machine will update the timestamp not just when a new ARP packet arrives (which happens through arp_process calling to neigh_update), but also when an entry transitions between states, and state transitions may be triggered from inside the kernel itself.

As we already mentioned before, the failing ARP entry cycles through STALE – DELAY – REACHABLE states. So how do entries transition to DELAY?

arp-transition-bad

As it turned out, the DELAY state is used when an existing STALE entry is consumed by an upper layer protocol. Though it’s STALE, it still can be used to connect to the IP address. What kernel does is, on the first upper layer protocol packet sent using a STALE entry, the entry is switched to DELAY, and a timer is set for +delay_first_probe_time in the future (5 seconds by default). When the timer is fired, the kernel then checks whether any upper layer protocol confirmed the entry as reachable. If it is confirmed, the kernel merely switches the state of the entry to REACHABLE; if it’s not confirmed, the kernel issues an ARP probe and updates its table with the result.

Since we haven’t seen a single probe sent during the failing test run, the working hypothesis became – the entry was always confirmed in those 5 seconds before the ARP probe, so the kernel never needed to send a single packet to refresh the entry.

But what is this confirmation anyway?

ARP confirmation

The thing is, it’s probably not very effective to immediately drop existing ARP entries when they become STALE. In most cases, those IP-to-MAC mappings are still valid even after the aging time: it’s not too often that IP addresses move from one device to another. So it would be not ideal if we would need to repeat ARP learning process each time an entry expires (each minute by default). It would be even worse if we would need to pause all other connections to an IP address whenever an ARP entry currently in use becomes STALE, to wait until the ARP table is updated. Since upper layer protocols (TCP, UDP, SCTP, …) may already successfully communicate with the IP address, we can use their knowledge about host availability and avoid unneeded probes, connectivity flips and pauses.

For that matter, Linux kernel has the confirmation mechanism. A lot of upper layer protocols support it, among those are TCP, UDP, and SCTP. Here is a TCP example. Whenever the confirmation aware protocol sees an incoming datagram from the MAC address using the IP address, it confirms the mapping to ARP layer, which then bails out of ARP probing and silently moves the entry to REACHABLE whenever the delay timer fires up.

Scary revelations

And what is this dst that is confirmed by calling to dst_confirm? It’s a pointer to a struct dst_entry. This structure defines a single cached routing entry. I won’t describe in details what it is, and how it’s different from struct fib_info that is an uncached routing entry (better explained in other sources).

What’s important for us to understand is that the entry may be not unique for a particular IP address. As long as outgoing packets take the same routing path, they may share the same dst_entry. And all the traffic directed to the same network subnet reuses the same routing path.

Which means that all the traffic directed from a controller node to the tempest node using any floating IP may potentially “confirm” any of ARP entries that belong to other IP addresses from the same range!

Since tempest tests are executed in parallel, and a lot of them send packets using a confirmation aware upper layer protocol (specifically, TCP for SSH sessions), ARP entries can effectively live throughout the whole test case run cycling through STALE – DELAY – REACHABLE states without issuing a single probe OR receiving any matching traffic for the IP/MAC pair.

And that’s how old MAC addresses proliferate. First we ignore all gratuitous ARP replies because of locktime; then we erroneously confirm wrong ARP entries.

And finally, my Red Hat kernel fellows pointed me to the following patch series that landed in 4.11 (released on May 1, 2017, just while I was investigating the failure). The description of the series and individual patches really hits the nail, for it talks about how dst_entry is shared between sockets, and how we can mistakenly confirm wrong ARP entries because of that.

I tried the patch series out. I reverted all my other kernel patches, then cherry-picked the series (something that is not particularly easy considering RHEL is still largely at 3.10, so for reference I posted the result on github), rebooted the system, and retried tests.

They passed. They passed over and over.

Of course, the node still missed all gratuitous ARP replies because of moving locktime window, but at least the kernel later realized that the entry is broken and requires a new ARP probe, which was correctly issued by the kernel, at which point the reply to the probe healed the cache and allowed tests to pass.

Great, now I had another alternative fix, and it was already part of an official kernel release!

The only problem with backporting the series is that the patches touch some data structures considered part of kernel ABI, so just putting the patches from the series into the package as-is triggered a legit KABI breakage during RPM build. For testing purposes, it was enough to disable the check but if we were going to backport the series into RHEL7, we needed some tweaks to the patches to retain binary compatibility (which is one of RHEL long term support guarantees).

For the reference, I opened a bug against RHEL7 kernel to deal with bogus ARP confirmations. At the time of writing, we hope to see it fixed in RHEL 7.4.

And that’s where we are. A set of kernel bugs combined – some new, some old, some recently fixed – produced the test breakage. If any of those bugs were not present in the environment, tests would have a chance to pass. Only the combination of small kernel mistakes and major screw-ups hit us hard enough to dig deep into the kernel.

In the next final post of the series, we will summarize all the workarounds and solutions that were found as part of the investigation of the Failure. We will also look at a potential OpenStack Neutron only workaround for the Failure that is based on our fresh understanding of the failing scenario.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s