The Failure, part 4: Summary

This post is the sixth, and the last, in a series that covers gratuitous ARP and its relation to OpenStack. In the previous posts, we discussed what gratuitous ARP is and how it’s implemented in OpenStack Neutron L3 agent. We also introduced the Failure, covered its initial triaging, looked at traffic captured during failed test runs, and dug through Linux kernel code only to find multiple bugs in its ARP layer.

In this final post we will summarize what we have learned about the Failure. In addition, we will also discuss a possible fix that could help alleviate the impact while we are waiting for new kernels.

It is advised that you make yourself comfortable with previous posts in the series before proceeding with reading this last “episode”. Otherwise you lose all the fun of war stories.

Summary

In the previous post of the series, we finally learned what went wrong with the Failure, and we figured that OpenStack is not at fault. Instead, it is the Linux kernel that ignores all gratuitous ARP packets Neutron L3 agent eagerly sends its way; and it is the same Linux kernel that spins bogus ARP entries through a perpetual STALE – DELAY – REACHABLE state change loop without a way out.

And that’s where we are at. We have the following alternatives to tackle the test failure problem:

  • Backport the patch series that fixes the spurious ARP confirmation issue (part of 4.11); or

  • Include the patch that avoids touching n->updated if an ARP packet is ignored; or

  • Include the patch that forces override for all gratuitous ARP packets irrespective of arp_accept.

(Of course, ideally all of those pieces would find their way into your setup.)

All of those solutions currently require kernel backports, at least for RHEL7. In the meantime, could we do something on the Neutron side alone? At first sight, it sounds like a challenge. But when you think about it, we can use our knowledge of the failure mode to come up with a workaround that may work in most cases.

We know that the issue would not show up in tests if just one of the gratuitous ARP replies sent by Neutron were honored by the kernel, and that they are all ignored because they arrive during the locktime window.

We know that the default value for locktime is 1 second, and the reason all three gratuitous ARP packets are ignored is that each of them lands in the moving locktime window, which happens because of the way we issue those packets using the arping tool. The default interval between gratuitous ARP packets issued by the tool is 1 second, but if we could make it longer, the kernel could escape the moving locktime window loop and honor one of the packets sent in the burst.
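The effect of the packet interval can be sketched with a small simulation (a simplified model of the buggy behaviour described in the previous post, where an ignored packet still bumps n->updated; not kernel code, and HZ=100 is an assumption of the sketch):

```python
HZ = 100  # jiffies per second; an assumption for this sketch

def honored_garps(arrival_times_sec, updated=0, locktime=1 * HZ):
    """Return which gratuitous ARPs escape the (moving) locktime window."""
    honored = []
    for t in arrival_times_sec:
        now = int(t * HZ)
        if now > updated + locktime:  # outside the window: honored
            honored.append(t)
        # buggy kernels bump n->updated even for ignored packets,
        # which keeps moving the window forward
        updated = now
    return honored

# Three packets 1 second apart, the first arriving mid-window: all ignored.
print(honored_garps([0.5, 1.5, 2.5]))  # -> []
# The same burst spread 2 seconds apart escapes the moving window.
print(honored_garps([0.5, 2.5, 4.5]))  # -> [2.5, 4.5]
```

Even with the kernel bug in place, a longer interval lets at least one packet land outside the window.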

From looking at arping(8) and its code, it doesn’t seem to support picking an alternative interval with any of its command line options (I have sent a pull request to add the feature, but it will take time until it gets into distributions). If we want to spread gratuitous updates in time using arping, we may need to call the tool multiple times from inside Neutron L3 agent and maintain the time interval between packets ourselves.

Here is the Neutron patch to use this approach. This would of course work only with hosts that don’t change the locktime sysctl setting from its default value. Moreover, it’s very brittle and may still not give us a 100% test pass guarantee.
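A minimal sketch of how an agent could spread the packets in time (a hypothetical helper, not the actual Neutron patch; -A, -I, and -c are standard iputils arping options):

```python
import subprocess
import time

def send_spread_garps(device, ip_address, count=3, interval=2.0,
                      run=subprocess.call):
    """Send `count` gratuitous ARP replies, `interval` seconds apart,
    invoking arping once per packet instead of once per burst."""
    for i in range(count):
        # -A: gratuitous ARP REPLY, -I: interface, -c 1: a single packet
        run(['arping', '-A', '-I', device, '-c', '1', ip_address])
        if i < count - 1:
            time.sleep(interval)

# Dry run: collect the commands instead of executing them.
sent = []
send_spread_garps('eth2', '10.0.0.221', interval=0, run=sent.append)
print(len(sent))  # -> 3
```

The injectable `run` parameter is only there to make the sketch demonstrable without actually sending packets.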

The chance of success can of course be elevated by setting arp_accept on all nodes. The good news is that at least some OpenStack installers already do it. For example, see this patch for Fuel. While the original reasoning behind the patch is twisted (gratuitous ARP packets are accepted irrespective of arp_accept; it’s just that locktime may get in their way with bad timing), the change itself is still helpful for overcoming the limitations of released kernels. To achieve the same effect for TripleO, I posted a similar patch.

Finally, note that while all the workarounds listed above may somewhat help with passing tempest tests, without the patch series tackling the very real spurious confirmation issue, you still risk getting your ARP table broken without a way for it to self-heal.

The best advice I can give is: make sure you use a kernel that includes the patch series (all official kernels starting from 4.11 do). If you consume a kernel from a distribution, talk to your vendor to get the patches included. Without them, you risk your service availability. (To note, while I was digging into this test-only issue, we got reports from the field that some ARP entries were staying invalid for hours after a failover.)

And finally… it’s not always OpenStack’s fault. Platform matters.

The Failure, part 3: Diggin’ the Kernel

This post is the fifth in a series that covers gratuitous ARP and its relation to OpenStack. In the previous posts, we discussed what gratuitous ARP is and how it’s implemented in OpenStack Neutron L3 agent. We also introduced the Failure, covered its initial triaging, and looked at traffic captured during failed test runs.

None of our previous attempts to figure out what went wrong with the Failure had succeeded. In this post we will look closer at the Linux kernel code to learn about ARP table states, and to see if we can find any quirks in the kernel ARP layer that could explain the observations.

It is advised that you make yourself comfortable with previous posts in the series before proceeding with reading this “episode”.

Diggin’ the Kernel

Some of you may know that in my previous life, I was a Linux kernel developer (nothing fancy, mostly enabling embedded hardware). Though this short experience made me more or less comfortable with reading the code, I figured I could use some help from the vast pool of Red Hat kernel developers. So I reached out to Lance Richardson who, I was told, could help me figure out what was going on with the Failure. And indeed, his help was enormous. Over the next several days, we discussed the kernel code on IRC, dug through old kernel mailing list archives, and built and tested a bunch of custom kernels with local modifications to the networking layer. Here is what we found.

Gratuitous ARP with arp_accept

Since the RHEL7 kernel is quite old (3.10.0-514.22.1.el7 at the time of writing), we decided to start our search by looking at patches in Linus’ master branch to see if there were any that could be of relevance to the Failure and that had not yet been backported into the RHEL7 kernel. The primary files of interest in the kernel source tree were net/ipv4/arp.c (the ARP layer) and net/core/neighbour.c (the neighbour layer, which is an abstract representation of address-to-MAC mappings used for IPv4 as well as IPv6).

Digging through the master branch history, the very first patch that drew our attention was: “ipv4: arp: update neighbour address when a gratuitous arp is received…” What the patch does is force override of an existing ARP table entry when a gratuitous ARP packet is received, irrespective of whether it arrived within the locktime interval. It is effective only when arp_accept is enabled, which is not the default. Anyway, that rang a bell, and also suggested that maybe we were dealing with a timing issue. The patch assumes arp_accept is enabled and effectively disables the locktime behavior for gratuitous ARP packets, so let’s have a closer look at those two sysctl knobs.

Here is the documentation for the arp_accept sysctl knob. The setting controls whether the kernel should populate its ARP table with new entries on receiving ARP packets for IP addresses not yet registered in the table. Enabling the setting may be useful if you want to “warm up” the ARP table on system startup without waiting for the node to send its very first datagram to an IP address. The idea is that the kernel will listen for any ARP packets flying by, and it will create new table entries for previously unseen IP addresses. The default for the option is 0 (meaning off), and that’s for a reason: the size of the kernel ARP table is limited, and in large network segments the kernel could overflow the table with irrelevant entries due to the “warming up”, forcing it to drop table entries that may still be useful. If that happens, you may see some upper layer protocol connections slow down for the time needed to restore the needed ARP entries with a round-trip of ARP probe packets. Long story short, the arp_accept setting is not for everyone.

As for locktime, there seems to be no in-tree documentation for the sysctl parameter, so the best source of information is probably arp(7). Quoting: “[locktime is] the minimum number of jiffies to keep an ARP entry in the cache. This prevents ARP cache thrashing if there is more than one potential mapping (generally due to network misconfiguration). Defaults to 1 second.” What it means is that if an ARP packet arrives within a 1-second interval since the previous ARP packet, it will be ignored. This is helpful when using ARP proxies, where multiple network endpoints can reply to the same ARP REQUEST. In this case, you may want to ignore replies that arrive later (to avoid so-called ARP thrashing, and to stick with the endpoint that is allegedly quicker or closer to the node).

With the above mentioned kernel patch, and arp_accept set to 1, the kernel should always update its ARP table if a gratuitous ARP packet is received, even if the entry is still in the locktime time interval.

Though arp_accept is not applicable for everyone, it was still worth exploring. I backported the patch into the RHEL7 kernel, rebooted the tempest node, enabled arp_accept for eth2, and reran the tests. Result? Same failures. So why hadn’t it worked?

Code inspection of neigh_update didn’t reveal anything interesting. Everything suggested that override was still false. It took me a while, but then it struck me: the code that determines whether an ARP packet is gratuitous considered frames of Request type, but not Reply. And Neutron L3 agent sends Replies (note the -A option instead of -U used in the arping command line)!
Here is how arping(8) defines those options:

-A     The same as -U, but ARP REPLY packets used instead of ARP REQUEST.
-U     Unsolicited ARP mode to update neighbours' ARP caches.  No replies are expected.

The next step was clear: let’s try to switch all Neutron L3 agents to gratuitous ARP requests and see if it helps. So I applied a one-liner to all controller nodes, restarted neutron-l3-agent services, and repeated the test run. It passed. It even passed multiple times in a row, for the first time since I had started banging my head against the issue!

OK, now I had a workaround. To pass tests, all I needed was:

  • Get a kernel that includes the patch (officially released as 3.14 on Mar 30, 2014);
  • Enable arp_accept for the external (eth2) interface;
  • Restart neutron-l3-agent services with the one-liner included.

But does it make sense that the kernel accepts gratuitous REQUESTs but not REPLYs? Is there anything in RFCs defining ARP that would suggest REPLYs are a different beast? Let’s have a look.

As we’ve learned in the very first post in the series, gratuitous ARP packets are defined in RFC 2002. Let’s quote the definition here in full.

-  A Gratuitous ARP [23] is an ARP packet sent by a node in order to
 spontaneously cause other nodes to update an entry in their ARP
 cache.  A gratuitous ARP MAY use either an ARP Request or an ARP
 Reply packet.  In either case, the ARP Sender Protocol Address
 and ARP Target Protocol Address are both set to the IP address
 of the cache entry to be updated, and the ARP Sender Hardware
 Address is set to the link-layer address to which this cache
 entry should be updated.  When using an ARP Reply packet, the
 Target Hardware Address is also set to the link-layer address to
 which this cache entry should be updated (this field is not used
 in an ARP Request packet).

So clearly both gratuitous ARP “flavors”, REQUEST and REPLY, are defined by the standard. There should be no excuse for the kernel to handle valid gratuitous REPLY packets any differently from REQUESTs. To fix the wrongdoing, I posted a patch that makes the kernel honor gratuitous REPLYs the same way it does REQUESTs. (The patch is now merged in netdev master.)
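To make the RFC definition concrete, here is an illustrative sketch that builds both gratuitous ARP flavors as raw Ethernet frames with Python’s struct module (the addresses are made up; the field layout follows the standard ARP packet format):

```python
import struct

def gratuitous_arp(ip, mac, reply=False):
    """Build an Ethernet frame carrying a gratuitous ARP for ip/mac."""
    mac_b = bytes.fromhex(mac.replace(':', ''))
    ip_b = bytes(int(o) for o in ip.split('.'))
    # Ethernet header: broadcast destination, our MAC, EtherType 0x0806 (ARP)
    eth = b'\xff' * 6 + mac_b + struct.pack('!H', 0x0806)
    # htype=1 (Ethernet), ptype=0x0800 (IPv4), hlen=6, plen=4,
    # oper=2 (REPLY) or 1 (REQUEST)
    arp = struct.pack('!HHBBH', 1, 0x0800, 6, 4, 2 if reply else 1)
    arp += mac_b + ip_b          # sender hardware and protocol addresses
    # Target hardware address: the advertised MAC in a REPLY; unused
    # (zeroed here) in a REQUEST. Target IP always equals sender IP.
    arp += (mac_b if reply else b'\x00' * 6) + ip_b
    return eth + arp

req = gratuitous_arp('10.0.0.221', 'fa:16:3e:bd:c1:97')
rep = gratuitous_arp('10.0.0.221', 'fa:16:3e:bd:c1:97', reply=True)
# In both flavors, sender and target protocol addresses are identical:
print(req[28:32] == req[38:42], rep[28:32] == rep[38:42])  # -> True True
```

The only difference between the two frames is the operation code and the target hardware address field, exactly as the RFC text describes.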

Even though the kernel fix landed and will probably be part of the next 4.12 release, OpenStack Neutron still needs to deal with the situation somehow for the sake of older kernels, so it probably makes sense to issue REQUESTs from Neutron L3 agents to help those who rely on arp_accept while using an official kernel release. The only question is, should we issue both REQUESTs and REPLYs, or just REQUESTs? For Linux network peers, REQUESTs work just fine, but there is a risk that some other networking software stack honors REPLYs but not REQUESTs. To stay on the safe side, we decided to issue both.

Anyhow, we discussed before that arp_accept is not for everyone, and there is a good reason why it’s not enabled by default. OpenStack should work irrespective of the sysctl knob value set on other network hosts; that’s why the patches mentioned above couldn’t be considered a final solution.

Besides, arp_accept only disables the locktime mechanism for gratuitous ARP packets, and we haven’t seen any ARP packets arrive before the first gratuitous one. So why didn’t the kernel honor it anyway?

Locktime gets in the way of gratuitous ARP packets

As we’ve already mentioned, without arp_accept enforcement, neigh_update didn’t touch the MAC address of the corresponding ARP table entry. Code inspection suggested that the only way this could happen was if arp_process passed flags=0 rather than flags=NEIGH_UPDATE_F_OVERRIDE into neigh_update. And in the RHEL 7.3 kernel, the only way for that to happen is when all three gratuitous ARP replies arrive within the locktime interval.

But they are sent with a 1-second interval between them, and the default locktime value is 1 second too, so at least the last, or even the second, packet in the three-packet burst should have affected the kernel. Why didn’t it?

Let’s look again at how we determine whether an update arrived during the locktime:

override = time_after(jiffies, n->updated + n->parms->locktime);

What the quoted code does is check whether an ARP packet was received within the [n->updated; n->updated + n->parms->locktime] time interval, where n->parms->locktime defaults to 1 second. And what does n->updated represent?

If override is false, the call to neigh_update will bail out without updating the MAC address or the ARP entry state. But see what it does just before bailing out: it sets n->updated to the current time!

So what happens is that the first gratuitous ARP in the three-packet series arrives within the locktime interval; it calls neigh_update with flags=0, which updates n->updated and bails out. By moving n->updated forward, it also effectively moves the locktime window forward without actually handling a single frame that would justify it! When the second gratuitous ARP packet arrives, it’s again in the locktime window, so it again calls neigh_update with flags=0, which again moves the locktime window forward, and bails out. The exact same story plays out for the third gratuitous ARP packet we send from Neutron L3 agent.
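The scenario above can be modelled in a few lines (a simplified sketch, not kernel code; HZ=100 and the 1-second default locktime are assumed):

```python
HZ = 100
LOCKTIME = 1 * HZ  # the default: 1 second, in jiffies

def accepted_garps(arrivals, bump_when_ignored, updated=0):
    """Count gratuitous ARPs that get to override the entry.

    bump_when_ignored=True models the bug: neigh_update refreshes
    n->updated even when it bails out without changing anything.
    """
    accepted = 0
    for now in arrivals:  # arrival times in jiffies
        if now > updated + LOCKTIME:  # mirrors time_after(jiffies, ...)
            accepted += 1
            updated = now             # a legitimate update
        elif bump_when_ignored:
            updated = now             # the bug: the window moves forward
    return accepted

burst = [50, 150, 250]  # three packets, 1 second apart
print(accepted_garps(burst, bump_when_ignored=True))   # -> 0 (all ignored)
print(accepted_garps(burst, bump_when_ignored=False))  # -> 1 (window expires)
```

With the bump removed, the original window expires on schedule and at least one packet of the burst gets through.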

So where are we at the end of the scenario? The ARP entry never changed its MAC address to reflect what we advertised with gratuitous ARP replies, and the kernel networking stack is not aware of the change.

This moving window business didn’t seem right to me. There is a reason for locktime, but its effect should not last longer than its value (1 second), and that was clearly not what we saw. So I poked the kernel a bit more and came up with a patch that avoids updating n->updated if neither the entry state nor its MAC address would change on neigh_update. With the patch applied to the RHEL kernel, I was able to pass previously failing test runs without setting arp_accept to 1. Great, it seemed like now I had a proper fix!

(The patch is merged, and will be part of the next kernel release. And in case you care, here is the bug for RHEL kernel to fix the unfortunate scenario.)

But why would the kernel even handle gratuitous ARP differently for existing ARP table entries depending on the arp_accept value? The sysctl setting was initially designed to control only the behavior when an ARP packet for a previously unseen IP address is processed. So why the difference? In all honesty, there is no reason. It’s just a bug that sneaked into the kernel in the past. We figured it made sense to fix it while we were at it, so I posted another kernel patch (this required some reshuffling and code optimization, hence the patch series). With the patch applied, gratuitous ARP packets will now always update existing entries. (And yes, the patch is also merged and will be part of the next release.)

Of course, a careful reader may wonder why locktime even considers entry state transitions and not just actual ARP packets received on the wire, gratuitous or not. That’s a fair question, and I believe the answer is, “that’s another kernel bug”. That being said, a brief look at the kernel code suggests that it won’t be easy to make it work the way it should: it would require a major rework of the kernel neigh subsystem to track state transitions independently of MAC/IP changes. I figured I’d better leave it for later (also known as never).

How do ARP table states work?

So at this point it seemed like I had a set of solutions for the problem.

But one may ask: why did the very first gratuitous ARP packet arrive during locktime? If we look at the captured traffic, we don’t see any ARP packets before the gratuitous ARP burst.

It turns out that n->updated can be bumped even without a single ARP packet being received! But how?

The thing is, the neighbour.c state machine updates the timestamp not just when a new ARP packet arrives (which happens through arp_process calling into neigh_update), but also when an entry transitions between states, and state transitions may be triggered from inside the kernel itself.

As we already mentioned, the failing ARP entry cycles through the STALE – DELAY – REACHABLE states. So how do entries transition to DELAY?

[Figure arp-transition-bad: state transitions of the failing ARP entry]

As it turned out, the DELAY state is used when an existing STALE entry is consumed by an upper layer protocol. Though it’s STALE, it can still be used to connect to the IP address. What the kernel does is, on the first upper layer protocol packet sent using a STALE entry, switch the entry to DELAY and set a timer for +delay_first_probe_time in the future (5 seconds by default). When the timer fires, the kernel checks whether any upper layer protocol has confirmed the entry as reachable. If it is confirmed, the kernel merely switches the state of the entry to REACHABLE; if it’s not confirmed, the kernel issues an ARP probe and updates its table with the result.
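The logic above can be condensed into a toy decision function (a rough model of the described behaviour, not kernel code):

```python
DELAY_FIRST_PROBE_TIME = 5  # seconds; the kernel default

def delay_timer_fires(confirmed_since_delay):
    """Decide what a DELAY entry becomes once the +5s timer expires."""
    if confirmed_since_delay:
        # an upper layer protocol vouched for the mapping: no probe needed
        return 'REACHABLE'
    # nobody confirmed the mapping, so ask the network
    return 'PROBE'

# Failing runs: something keeps confirming the entry before the timer.
print(delay_timer_fires(confirmed_since_delay=True))   # -> REACHABLE
# Healthy runs: no confirmation, so the kernel issues an ARP probe.
print(delay_timer_fires(confirmed_since_delay=False))  # -> PROBE
```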

Since we hadn’t seen a single probe sent during the failing test run, the working hypothesis became: the entry was always confirmed within those 5 seconds before the ARP probe, so the kernel never needed to send a single packet to refresh the entry.

But what is this confirmation anyway?

ARP confirmation

The thing is, it’s probably not very effective to immediately drop existing ARP entries once they become STALE. In most cases, those IP-to-MAC mappings are still valid even after the aging time: it’s not too often that IP addresses move from one device to another. So it would not be ideal if we had to repeat the ARP learning process each time an entry expires (every minute by default). It would be even worse if we had to pause all other connections to an IP address whenever an ARP entry currently in use becomes STALE, waiting until the ARP table is updated. Since upper layer protocols (TCP, UDP, SCTP, …) may already be successfully communicating with the IP address, we can use their knowledge about host availability and avoid unneeded probes, connectivity flips, and pauses.

For that reason, the Linux kernel has a confirmation mechanism. A lot of upper layer protocols support it, among them TCP, UDP, and SCTP. Here is a TCP example. Whenever a confirmation-aware protocol sees an incoming datagram from the MAC address using the IP address, it confirms the mapping to the ARP layer, which then bails out of ARP probing and silently moves the entry to REACHABLE whenever the delay timer fires.

Scary revelations

And what is this dst that is confirmed by calling dst_confirm? It’s a pointer to a struct dst_entry. This structure defines a single cached routing entry. I won’t describe in detail what it is, or how it’s different from struct fib_info, which is an uncached routing entry (better explained in other sources).

What’s important for us to understand is that the entry may not be unique to a particular IP address. As long as outgoing packets take the same routing path, they may share the same dst_entry. And all the traffic directed to the same network subnet reuses the same routing path.

Which means that all the traffic directed from a controller node to the tempest node using any floating IP may potentially “confirm” any of ARP entries that belong to other IP addresses from the same range!

Since tempest tests are executed in parallel, and a lot of them send packets using a confirmation-aware upper layer protocol (specifically, TCP for SSH sessions), ARP entries can effectively live throughout the whole test case run cycling through the STALE – DELAY – REACHABLE states without issuing a single probe OR receiving any matching traffic for the IP/MAC pair.
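Here is a toy model of that effect (illustrative only; the class and variable names are made up):

```python
class DstEntry:
    """A cached routing path, shared by all flows to the same subnet."""
    def __init__(self):
        self.confirmed = False

class Neighbour:
    """A minimal stand-in for an ARP entry tied to a shared dst."""
    def __init__(self, ip, dst):
        self.ip, self.dst = ip, dst

    def delay_timer_fires(self):
        # pre-4.11 model: the confirmation is read from the shared dst,
        # so traffic to ANY address behind the same route counts
        return 'REACHABLE' if self.dst.confirmed else 'PROBE'

subnet_dst = DstEntry()                        # one dst for the floating IP range
broken = Neighbour('10.0.0.221', subnet_dst)   # entry holding an outdated MAC
healthy = Neighbour('10.0.0.223', subnet_dst)  # a parallel test's entry

# An SSH session to 10.0.0.223 confirms the shared dst...
subnet_dst.confirmed = True
# ...and the broken 10.0.0.221 entry dodges the probe that would heal it:
print(broken.delay_timer_fires())  # -> REACHABLE
```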

And that’s how old MAC addresses proliferate. First we ignore all gratuitous ARP replies because of locktime; then we erroneously confirm wrong ARP entries.

And finally, my Red Hat kernel fellows pointed me to the following patch series that landed in 4.11 (released on May 1, 2017, just as I was investigating the failure). The description of the series and its individual patches really hits the nail on the head, for it talks about how dst_entry is shared between sockets, and how we can mistakenly confirm wrong ARP entries because of that.

I tried the patch series out. I reverted all my other kernel patches, then cherry-picked the series (something that is not particularly easy considering RHEL is still largely at 3.10, so for reference I posted the result on github), rebooted the system, and retried tests.

They passed. They passed over and over.

Of course, the node still missed all gratuitous ARP replies because of the moving locktime window, but at least the kernel later realized that the entry was broken and required a new ARP probe, which it correctly issued, at which point the reply to the probe healed the cache and allowed the tests to pass.

Great, now I had another alternative fix, and it was already part of an official kernel release!

The only problem with backporting the series is that the patches touch some data structures considered part of the kernel ABI, so putting the patches into the package as-is triggered a legit KABI breakage during the RPM build. For testing purposes, it was enough to disable the check, but if we were going to backport the series into RHEL7, we would need some tweaks to the patches to retain binary compatibility (which is one of RHEL’s long-term support guarantees).

For reference, I opened a bug against the RHEL7 kernel to deal with bogus ARP confirmations. At the time of writing, we hope to see it fixed in RHEL 7.4.

And that’s where we are. A set of kernel bugs combined – some new, some old, some recently fixed – produced the test breakage. Had any one of those bugs been absent from the environment, the tests would have had a chance to pass. Only the combination of small kernel mistakes and major screw-ups hit us hard enough to make us dig deep into the kernel.

In the next, and final, post of the series, we will summarize all the workarounds and solutions found during the investigation of the Failure. We will also look at a potential OpenStack Neutron-only workaround for the Failure based on our fresh understanding of the failing scenario.

The Failure, part 2: Beyond OpenStack

This post is the fourth in a series that covers gratuitous ARP and its relation to OpenStack. In the previous posts, we discussed what gratuitous ARP is and how it’s implemented in OpenStack Neutron L3 agent. We also set the stage for the Failure, and dismissed easier causes of it.

This post will continue discussion of the Failure beyond OpenStack components. It is advised that you make yourself comfortable with previous posts in the series before proceeding with reading this “episode”.

To recap, a bunch of tempest connectivity tests were failing in an RH-OSP 11 environment when connecting to a floating IP address. Quick log triage suggested that internal connectivity for affected instances worked fine, and that Neutron L3 agent called the arping tool at the right time. Still, the “undercloud” node (the node executing tempest) was getting timeout failures.

Deployment

Before diving deeper, let’s make a step back and explore the deployment more closely. How are those floating IP addresses even supposed to work in this particular setup? Is the tempest node on the same L2 network segment with the floating IP range, or is it connected to it via a L3 router?

The failing jobs deploy Red Hat OpenStack using Director, also known as TripleO. It’s not a very easy task to deploy a cloud using bare TripleO, and for this reason there are several tools that prepare development and testing environments. One of them, tripleo-quickstart, is more popular among upstream developers, while another, Infrared, is more popular among CI engineers. The failing jobs were all deployed with Infrared.

Infrared is a very powerful tool. I won’t get into details, but to give you a taste of it: it supports multiple compute providers, allowing you to deploy the cloud in libvirt or on top of another OpenStack cloud. It can deploy both RH-OSP and RDO onto provisioned nodes. It can use different installers (TripleO, packstack, …). It can also execute tempest tests for you, collect logs, and configure SSH tunnels to provisioned nodes to access them directly… Lots of other cool features come with it. You can find more details about the tool in its documentation.

As I said, the failing jobs were all deployed using Infrared on top of a powerful remote libvirt hypervisor where a bunch of nodes with distinct roles were created:

  • a single “undercloud” node that is used to provision the actual “overcloud” multinode setup (this node also runs tempest);
  • three “overcloud” controllers all hosting Neutron L3 agents;
  • a single “overcloud” compute node hosting Nova instances.

All the nodes were running as KVM guests on the same hypervisor, connected to each other with multiple isolated libvirt networks, each carrying a distinct type of traffic. After Infrared deployed the “undercloud” and “overcloud” nodes, it also executed the “neutron” set of tests, which contains both api and scenario tests from both the tempest and neutron trees.

As I already mentioned, the “undercloud” node is the one that also executes tempest tests. This node is connected to an external network that hosts all floating IP addresses for the preconfigured public network, with eth2 of the node directly plugged into it. The virtual interface is in turn plugged into the external Linux kernel bridge on the hypervisor, where the external (eth2) interfaces of all controllers are plugged too.

What it means for us is that the tempest node is on the same network segment as all floating IP addresses and the gateway ports of all Neutron routers. There is no router between the floating IP network and the tempest node. Whenever a tempest test case attempts to establish an SSH connection to a floating IP address, it first consults the local ARP table to possibly find an appropriate IP-to-MAC mapping there; if the mapping is missing, it uses the regular ARP procedure to retrieve it.

Now that we understand the deployment, let’s jump into the rabbit hole.

Traffic capture

The initial debugging steps on the OpenStack Neutron side hadn’t revealed anything useful. We identified that Neutron L3 agent correctly called arping with the right arguments. So in the end, maybe it’s not OpenStack’s fault?

The first thing to check was whether arping actually sent the gratuitous ARP packets, and whether they reached the tempest node, at which point we could expect the “undercloud” kernel to honor them and update its ARP table. I figured it was easier to only capture external (eth2) traffic on the tempest node. I expected to see those gratuitous ARP packets there, which would mean that the kernel received (and processed) them.

Once I got my hands on hardware capable of standing up the needed TripleO installation, I quickly deployed the needed topology using Infrared.

Of course, it’s a lot easier to capture the traffic on the tempest node and analyze it later in Wireshark. So that’s what I did.

$ sudo tcpdump -i eth2 -w external.pcap

I also decided it might be worth capturing the ARP table state during a failed test run.

$ while true; do date >> arptable; ip neigh >> arptable; sleep 1; done

And finally, I executed the tests. After 40 minutes of waiting for results, one of the tests failed with the expected SSH timeout error. Good, time to load the .pcap into Wireshark.

The failing IP address was 10.0.0.221, so I used the following expression to filter the relevant traffic:

ip.addr == 10.0.0.221 or arp.dst.proto_ipv4 == 10.0.0.221

The result looked like this:

[Figure wireshark-garp-failure: Wireshark capture from the failing test run]

Here, we see the following: first, an SSH session is started (frame 181869); it of course initially fails (frame 181947) because the gratuitous ARP packets start to arrive later (frames 181910, 182023, 182110). But then, for some reason, subsequent TCP packets are still sent using the old 52:54:00:ac:fb:f4 destination MAC address. It seemed like the arriving gratuitous ARP packets were completely ignored by the kernel. More importantly, the node continued sending TCP packets to 10.0.0.221 using the old MAC address even past the expected aging time (60 seconds), and it never issued a single ARP REQUEST packet to update its cache throughout the whole test case execution (!). Eventually, after ~5 minutes of knocking on the wrong door, the test case failed with an SSH timeout.

How could it happen? The kernel is supposed to honor the new MAC address right after it receives an ARP packet!

Now, if we compare the traffic dump to one from a successful test case run, we see a capture that looks more like the one below (in this case, the IP address of interest is 10.0.0.223):

[Figure wireshark-garp-success: Wireshark capture from a successful test run]

Here we see TCP retransmissions of SSH, port 22, packets (frames 74573 and 74678), which failed to be delivered because the kernel didn’t know the new fa:16:3e:bd:c1:97 MAC address just yet. Later, we see a burst of gratuitous ARP packets sent from the router serving 10.0.0.223, advertising the new MAC address (frames 74715, 74801, and 74804). Though it doesn’t immediately prove that it was the gratuitous ARP packets that healed the connectivity, it’s clear that the tempest node quickly learned about the new MAC address and continued with its SSH session (frames 74805 and onward).

One thing that I noticed while looking at multiple traffic captures from different test runs is that whenever a test failed, it always failed on a reused floating IP address. Those would show up in Wireshark with the following warning message:

[Figure duplicate: Wireshark warning about duplicate use of an IP address]

State transitions

Then maybe there was a difference in ARP entry state between successful and failing runs? Looking at the ARP table state dumps I collected during a failing run, the following could be said:

  • Before the start of the failed test case, the corresponding ARP entry was in STALE state.
  • Around the time when gratuitous ARP packets were received and the first TCP packet was sent to the failing IP address, the entry transitioned to DELAY state.
  • After 5 seconds, it transitioned to REACHABLE without changing its MAC address. No ARP REQUEST packets were issued in between.
  • The same STALE – DELAY – REACHABLE transitions happened for the affected IP address over and over. The tempest node didn't issue a single ARP REQUEST during the whole test case execution. Nor did it receive any new traffic that would use the old MAC address.

arp-transition-bad

If we compare this to ARP entries in "good" runs, we see that there they also started in STALE state and then transitioned to DELAY, but after 5 seconds, instead of moving to REACHABLE, they switched to the PROBE state. The node then issued a single ARP REQUEST (visible in the captured traffic dump), quickly received a reply from the correct Neutron router, updated the ARP entry with the new MAC address, and only then finally transitioned to REACHABLE. At that point the connectivity healed itself.

arp-transition-good
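To make these transitions easier to reason about, here is a toy Python model of the decision the kernel makes when the DELAY timer fires (a deliberate simplification, not the actual logic from net/core/neighbour.c): an entry that was "confirmed" recently enough jumps straight back to REACHABLE, while an unconfirmed one enters PROBE, which is the only state that actually emits an ARP REQUEST.

```python
# States of a kernel neighbour (ARP) cache entry relevant to this story.
STALE, DELAY, REACHABLE, PROBE = 'STALE', 'DELAY', 'REACHABLE', 'PROBE'

def on_delay_timer(recently_confirmed):
    """Toy model: decide where a DELAY-state entry goes when its timer fires.

    A simplification of kernel behaviour: a recently "confirmed" entry
    returns to REACHABLE without any probing; otherwise the entry enters
    PROBE, where an actual ARP REQUEST is finally sent on the wire.
    """
    return REACHABLE if recently_confirmed else PROBE

# "Bad" runs: something keeps confirming the stale entry, so the node
# loops STALE -> DELAY -> REACHABLE forever and never probes.
assert on_delay_timer(recently_confirmed=True) == REACHABLE

# "Good" runs: nothing confirms the entry, so it enters PROBE, an ARP
# REQUEST goes out, and the new MAC address is learned.
assert on_delay_timer(recently_confirmed=False) == PROBE
```

The model suggests the question that matters: what kept "confirming" the bogus entry in the failing runs?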

What made the node behave differently in those two seemingly identical cases? Why didn't it issue an ARP probe in the first case? What are those ARP table states anyway? I figured it was time to put my kernel hat on and read some Linux code.

In the next post in the series, I am going to cover my journey into kernel ARP layer. Stay tuned.

 

The Failure, part 1: It’s OpenStack fault

I’m not actually good at computers I’m just bad at giving up. // mjg59

This post is the third in a series that covers gratuitous ARP and its relation to OpenStack. In the previous posts, we discussed what gratuitous ARP is, and how it’s implemented in OpenStack Neutron L3 agent.

This post is short and merely sets the stage for a story that started a month ago inside the Red Hat Networking team and that ultimately led me to learn how ARP actually works in OpenStack and in Linux. In the next posts, I will expand on the story and discuss possible solutions to the problem I introduce below. (The plan is to post all four remaining pieces over the next four days.)

But let’s not get dragged too far.

The stage

Failure mode

It was two months before the next shiny Ocata-based release of Red Hat OpenStack Platform (RH-OSP 11). Now that the team's focus had shifted from upstream development to polishing the product, we started looking more closely at downstream CI jobs. As usual with new releases, there were several failures in our tempest jobs. For most of them we figured out a possible culprit and landed fixes. For all of them, except The Failure.

Well, it was not a single test case failing, but rather a whole class of them. In the affected test jobs, we execute all tempest tests, both api and scenario, and what we noticed is that a lot of scenario test cases were failing on connectivity checks when using a floating IP (but, importantly, never a fixed IP). Only positive connectivity checks were failing (meaning cases where connectivity was expected but failed; never cases where a lack of connectivity was expected).

There are two types of connectivity checks in tempest: a ping check and an SSH check. The former sends ICMP (or ICMPv6) datagrams to the IP address under test for up to 120 seconds and expects a single reply, while the latter establishes an SSH session to the IP address and waits for successful authentication.
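The SSH check can be approximated with a short Python sketch (illustrative only; tempest's real implementation uses paramiko and also performs authentication, which is omitted here):

```python
import socket
import time

def check_ssh_reachable(host, port=22, timeout=120.0, interval=1.0):
    """Keep trying to open a TCP connection to the SSH port until a
    deadline passes; a rough stand-in for tempest's SSH connectivity check."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful TCP handshake is where SSH authentication would start.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # refused or timed out; retry until deadline
    return False
```

When the node's ARP entry points at a stale MAC address, every connection attempt here silently times out, and after the deadline the check fails, which is exactly the kind of timeout shown in the log excerpts below.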

In the failing jobs, whenever a ping check failed, the following could be seen in the tempest log file:

2017-05-06 00:02:32.563 3467 ERROR tempest.scenario.manager Traceback (most recent call last):
2017-05-06 00:02:32.563 3467 ERROR tempest.scenario.manager   File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 624, in check_public_network_connectivity
2017-05-06 00:02:32.563 3467 ERROR tempest.scenario.manager     mtu=mtu)
2017-05-06 00:02:32.563 3467 ERROR tempest.scenario.manager   File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 607, in check_vm_connectivity
2017-05-06 00:02:32.563 3467 ERROR tempest.scenario.manager     self.fail(msg)
2017-05-06 00:02:32.563 3467 ERROR tempest.scenario.manager   File "/usr/lib/python2.7/site-packages/unittest2/case.py", line 666, in fail
2017-05-06 00:02:32.563 3467 ERROR tempest.scenario.manager     raise self.failureException(msg)

When it was a SSH check that failed, then the error looked a bit different:

2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh Traceback (most recent call last):
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh   File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 107, in _get_ssh_connection
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh     sock=proxy_chan)
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh   File "/usr/lib/python2.7/site-packages/paramiko/client.py", line 305, in connect
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh     retry_on_signal(lambda: sock.connect(addr))
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh   File "/usr/lib/python2.7/site-packages/paramiko/util.py", line 269, in retry_on_signal
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh     return function()
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh   File "/usr/lib/python2.7/site-packages/paramiko/client.py", line 305, in
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh     retry_on_signal(lambda: sock.connect(addr))
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh   File "/usr/lib64/python2.7/socket.py", line 224, in meth
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh     return getattr(self._sock,name)(*args)
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh timeout: timed out
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh
2017-05-05 21:29:41.997 4456 ERROR tempest.scenario.manager [-] (TestGettingAddress:test_multi_prefix_dhcpv6_stateless) Initializing SSH connection to 10.0.0.214 failed. Error: Connection to the 10.0.0.214 via SSH timed out.

Any single occasionally failing test case didn't pose a high risk to job results, but once you aggregated the failures from all affected test cases, the chance of successfully passing a whole test run became abysmal, around 10%, which was clearly not ideal.

So I figured I'd have a look at it, naively assuming it would take a day or two to find the root cause, fix it, and move on with my life. Boy, was I wrong! It took me a month to get to the bottom of it (though, in all honesty, most of that time was spent trying to set up an environment that would consistently reproduce the issue).

First steps

Initially, I noticed that failures happened most often in our L3 HA jobs, so I focused on one of those. Reading through the tempest, neutron-server, neutron-openvswitch-agent, and neutron-l3-agent log files didn't reveal much.

When looking at a particular test failure, we could see that the instance that was carrying the failing floating IP successfully booted and received a DHCPv4 lease for its fixed IP, as seen in its console log that tempest gladly dumps on connectivity check failures:

Starting network...
udhcpc (v1.20.1) started
Sending discover...
Sending select for 10.100.0.11...
Lease of 10.100.0.11 obtained, lease time 86400

To cross-check, we could also find the relevant lease allocation event in the system journal:

May 06 06:02:57 controller-1.localdomain dnsmasq-dhcp[655233]: DHCPDISCOVER(tap7876deff-b8) fa:16:3e:cc:2c:30
May 06 06:02:57 controller-1.localdomain dnsmasq-dhcp[655233]: DHCPOFFER(tap7876deff-b8) 10.100.0.11 fa:16:3e:cc:2c:30

The tempest log file clearly suggested that the failure was not due to a misbehaving SSH key pair for the failing instance. Had the public SSH key not been deployed to the instance, we would have seen authentication failures, not SSH timeouts. Nor did we see any SSH timeouts for tests that established SSH sessions using internal fixed IP addresses of instances.

It all suggested that internal ("tenant") network connectivity worked fine, and that the problem was probably isolated somewhere in the Neutron L3 agent. But looking into the Neutron L3 agent and neutron-server logs didn't surface any problem either: we could easily find the relevant arping calls in the Neutron L3 agent log (for legacy L3 routers) or the system journal (for Neutron L3 HA routers).

Of course, you never believe that a legit fault may lie in software outside your immediate expertise. It's never the compiler's or another development tool's fault; 99% of whatever you hit each day is probably your own fault. And in my particular case it was, obviously, OpenStack Neutron that was guilty. So it took me a while to start looking elsewhere.

But after two weeks of unproductive code and log reading, and of adding debug statements to Neutron code, it was finally time to move on to unknown places.

In the next post of the series, we will explore what I uncovered once I shifted my accusatory gaze towards the platform residing below OpenStack.

Gratuitous ARP for OpenStack Neutron

This post is the second in a series that covers gratuitous ARP and its relation to OpenStack. In the previous post, we discussed what gratuitous ARP is, and how it helps in IT operations. We also briefly touched on how it’s used in OpenStack Neutron L3 agent.

This post will expand on gratuitous ARP packets’ usage in OpenStack Neutron L3 agent. We will also dive deep into how they are implemented.

Gratuitous ARP in OpenStack Neutron

Usage

In the previous post, we already briefly touched on where gratuitous ARP is used in OpenStack Neutron.

To recap, the primary use of gratuitous ARP in the OpenStack Neutron L3 agent is to update network peers about the new location of a "floating" IP address ("elastic" in AWS-speak) when it's disassociated from one port and then associated with another port with a different MAC address. Without issuing a gratuitous ARP on the new association, it could take significant time before a reused floating IP address mapping is updated as a result of the "aging" process.

Gratuitous ARP is also used by the L3 agent to implement HA for Neutron routers. Whenever a new HA router instance becomes "master", it adds the IP addresses managed by Neutron to its interfaces and issues a set of gratuitous ARP packets into the attached networks to advertise the new location. Network peers then update their ARP tables with the new MAC addresses from those packets and thus don't need to wait for old entries to expire before connectivity is restored. The switchover to the new router instance is then a lot smoother.

Implementation

There are two distinct implementations for gratuitous ARP in OpenStack Neutron, one for each distinct router deployment mode: legacy and HA. The difference comes primarily from the fact that legacy router data plane is fully realized by the L3 agent, while HA routers “outsource” IP address management to keepalived daemon spawned by the agent. (The third deployment mode – DVR – is largely covered by those two, where specific implementation depends on whether DVR routers are also HA or not; for this reason I won’t mention DVR going forward).

Let’s consider each distinct deployment mode separately, starting with legacy.

Legacy routers

Legacy mode is what once was the only mode supported by OpenStack Neutron. In this mode, the L3 agent itself implements the whole data plane, creating network namespaces for routers, creating ports, plugging them into the external br-ex bridge, and adding fixed and floating IP addresses to router ports. Besides that, the agent also issues gratuitous ARP packets into attached networks when a new IP address is added to one of its ports. This is to update network peers about the new mappings. Peers may use those unsolicited updates either to update any existing ARP entries with a new MAC address, or to “warm up” their tables with IP-to-MAC mappings even before the very first IP datagram is issued to the router IP address (this is something that Linux kernel does when arp_accept sysctl setting is enabled for the receiving interface).

When the L3 agent sends gratuitous ARP packets for an IP address, this is what you can find in the agent log file:

2017-04-28 20:53:11.264 14176 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-726095be-5916-489b-be05-860e2f19d556', 'ip', '-4', 'addr', 'add', '10.1.0.1/26', 'scope', 'global', 'dev', 'qr-864545b9-5f', 'brd', '10.1.0.63'] execute_rootwrap_daemon /opt/stack/new/neutron/neutron/agent/linux/utils.py:108

And then later:

2017-04-28 20:53:11.425 14176 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-726095be-5916-489b-be05-860e2f19d556', 'arping', '-A', '-I', 'qr-864545b9-5f', '-c', '3', '-w', '4.5', '10.1.0.1'] execute_rootwrap_daemon /opt/stack/new/neutron/neutron/agent/linux/utils.py:108

As you have probably figured out, the first snippet shows the agent adding a new IPv4 address 10.1.0.1 to an internal router qr-864545b9-5f port, and the second snippet is where the agent sends gratuitous ARP packets advertising the new IP address into the network to which the qr-864545b9-5f port is attached (this is achieved by calling the arping tool from the iputils package with the right arguments).

Let’s have a look at each of the arguments passed into arping tool.

  • The very first option is -A, used to issue gratuitous (broadcast) ARP packets. Without this option, the tool would send unicast ARP REQUEST packets for the IP address and wait for a REPLY (the unicast mode can be useful to check whether any other host in the network carries the same IP address, or to sanity-check an existing IP-to-MAC mapping). The packets sent with -A are of REPLY type. (If we used -U instead, the tool would send REQUEST packets.)
  • The next option is -I, and it specifies the interface to issue the packets on.
  • The -c option defines the number of ARP packets to issue into the network. There is always a 1-second interval between packets, so with -c 3 the tool issues three packets over a two-second span.
  • The next option is -w 4.5, and it means the tool will exit after 4.5 seconds (effectively 4 seconds, because the tool doesn't honor the fractional part of the argument) regardless of progress. Normally the tool exits after two seconds anyway, but if the interface used to send packets disappears while the tool is running, it could otherwise block indefinitely, never able to successfully send all three packets. The option guarantees that the thread running the tool eventually makes progress.
  • The very last argument is the IP address to advertise. A single port may carry multiple IPv4 addresses, so it’s crucial to define which of those addresses should be advertised.
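Putting the arguments together, the command line Neutron runs could be composed like this (a hypothetical helper written for illustration; Neutron's actual code differs in structure):

```python
def build_garp_cmd(namespace, device, ip_address, count=3, wait=4.5):
    """Compose the arping invocation described above (illustrative only)."""
    return [
        'ip', 'netns', 'exec', namespace,  # run inside the router namespace
        'arping',
        '-A',                              # gratuitous (broadcast) ARP REPLY
        '-I', device,                      # interface to send on
        '-c', str(count),                  # number of packets to send
        '-w', str(wait),                   # hard time limit, in seconds
        ip_address,                        # the address to advertise
    ]

cmd = build_garp_cmd('qrouter-726095be-5916-489b-be05-860e2f19d556',
                     'qr-864545b9-5f', '10.1.0.1')
```

The resulting list matches the command visible in the agent log snippet above.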

HA routers

HA support is a relatively new addition to OpenStack Neutron routers. To use HA for Neutron routers, one should configure the Neutron API controller to expose the l3-ha API extension, at which point users are able to create highly available routers.

For those routers, the data plane is managed both by the L3 agent and by the keepalived daemon that the agent spawns for every HA router it manages. The agent first prepares the router namespace, its ports, and the rules for NAT translation; it then defers to the keepalived daemon, which manages IP addresses on the ports. To that end, the agent generates a configuration file listing all managed IP addresses and passes it to keepalived. The daemon then starts, negotiates with the other keepalived processes implementing the HA router which instance is going to be "master" (VRRP is used for this), and, if it is indeed "master", triggers the state transition machinery, which, among other things, adds the managed IP addresses specified in the configuration file to the appropriate router ports. It also sends gratuitous ARP packets into the network to update peers about the location of those IP addresses. If you then inspect your system log, you may find the following messages there:

May  2 13:19:47 host-192-168-24-12 Keepalived[307081]: Starting Keepalived v1.2.13 (07/01,2016)
May  2 13:19:47 host-192-168-24-12 Keepalived[307082]: Starting VRRP child process, pid=307083
May  2 13:19:47 host-192-168-24-12 Keepalived_vrrp[307083]: Netlink reflector reports IP 169.254.192.6 added
May  2 13:19:47 host-192-168-24-12 Keepalived_vrrp[307083]: Netlink reflector reports IP fe80::f816:3eff:fe5f:d44b added
May  2 13:19:47 host-192-168-24-12 Keepalived_vrrp[307083]: Registering Kernel netlink reflector
May  2 13:19:47 host-192-168-24-12 Keepalived_vrrp[307083]: Registering Kernel netlink command channel
May  2 13:19:47 host-192-168-24-12 Keepalived_vrrp[307083]: Registering gratuitous ARP shared channel
May  2 13:19:47 host-192-168-24-12 Keepalived_vrrp[307083]: Opening file '/var/lib/neutron/ha_confs/b7fece4b-ea95-4eb6-b7b8-dc060325d1bc/keepalived.conf'.
May  2 13:19:47 host-192-168-24-12 Keepalived_vrrp[307083]: Configuration is using : 64829 Bytes
May  2 13:19:47 host-192-168-24-12 Keepalived_vrrp[307083]: Using LinkWatch kernel netlink reflector...
May  2 13:19:47 host-192-168-24-12 Keepalived_vrrp[307083]: VRRP_Instance(VR_1) Entering BACKUP STATE
May  2 13:19:47 host-192-168-24-12 Keepalived_vrrp[307083]: VRRP_Instance(VR_1) removing protocol Virtual Routes
May  2 13:19:47 host-192-168-24-12 Keepalived_vrrp[307083]: VRRP sockpool: [ifindex(16), proto(112), unicast(0), fd(10,11)]
May  2 13:19:54 host-192-168-24-12 Keepalived_vrrp[307083]: VRRP_Instance(VR_1) Transition to MASTER STATE
May  2 13:19:56 host-192-168-24-12 Keepalived_vrrp[307083]: VRRP_Instance(VR_1) Entering MASTER STATE
May  2 13:19:56 host-192-168-24-12 Keepalived_vrrp[307083]: VRRP_Instance(VR_1) setting protocol VIPs.
May  2 13:19:56 host-192-168-24-12 Keepalived_vrrp[307083]: VRRP_Instance(VR_1) setting protocol E-VIPs.
May  2 13:19:56 host-192-168-24-12 Keepalived_vrrp[307083]: VRRP_Instance(VR_1) setting protocol Virtual Routes
May  2 13:19:56 host-192-168-24-12 Keepalived_vrrp[307083]: VRRP_Instance(VR_1) Sending gratuitous ARPs on ha-e09aa535-6f for 169.254.0.1
May  2 13:19:56 host-192-168-24-12 Keepalived_vrrp[307083]: VRRP_Instance(VR_1) Sending gratuitous ARPs on qg-6cf347df-28 for 10.0.0.219
May  2 13:19:56 host-192-168-24-12 Keepalived_vrrp[307083]: VRRP_Instance(VR_1) Sending gratuitous ARPs on qr-3ee577eb-4f for 10.100.0.1
May  2 13:19:56 host-192-168-24-12 Keepalived_vrrp[307083]: VRRP_Instance(VR_1) Sending Unsolicited Neighbour Adverts on qr-3ee577eb-4f for fe80::f816:3eff:fe9a:c17
May  2 13:19:56 host-192-168-24-12 Keepalived_vrrp[307083]: VRRP_Instance(VR_1) Sending Unsolicited Neighbour Adverts on qg-6cf347df-28 for fe80::f816:3eff:fec7:861a
May  2 13:19:56 host-192-168-24-12 Keepalived_vrrp[307083]: Netlink reflector reports IP fe80::f816:3eff:fe9a:c17 added
May  2 13:19:56 host-192-168-24-12 Keepalived_vrrp[307083]: Netlink reflector reports IP fe80::f816:3eff:fec7:861a added

Here we can see keepalived transitioning to master state and immediately issuing gratuitous updates after the VIP addresses are set on the managed interfaces. (A careful reader will also notice that it issues something called Unsolicited Neighbour Adverts, a similar mechanism for IPv6 addresses, but I won't go there.)

It would seem like it’s good for the job. Sadly, the reality is uglier than one could hope.

WTF#1: HA router reload doesn’t issue gratuitous ARP packets

As we’ve learned during our testing of the HA feature, sometimes keepalived forgot to send gratuitous ARP packets. It always happened when an existing keepalived instance was asked to reload its configuration file because some Neutron API operations triggered router updates that affected the file contents. An example of an update could be e.g. adding a new floating IP address to a port, or disassociating one. In this case, Neutron L3 agent would generate a new configuration file and then send SIGHUP signal to the running keepalived instance, hoping that it will catch the changes, converge the data plane to latest configuration, and finally issue gratuitous ARP updates. It did not.

Investigation, largely carried out by John Schwarz, uncovered that it was not an issue with the latest keepalived releases, but with the one from RHEL7 repositories. Bisecting releases, we found that the very first keepalived release not exhibiting the buggy behavior was 1.2.20. Popular distributions (RHEL7, Ubuntu Xenial) were still shipping older versions of the daemon (1.2.13 for RHEL7 and 1.2.19 for Xenial).

Though the issue was technically in keepalived, we needed to adapt OpenStack to the buggy releases shipped with the platforms we support. The first option considered was simply restarting keepalived in full, which would correctly trigger the gratuitous ARP machinery. The problem with this approach was that a full restart temporarily stops the VRRP thread that sends master health checks, and with unfortunate timing, this sometimes results in an unnecessary "master" to "backup" flip, an operation that is both computationally costly and disruptive to the data plane.

Since we couldn’t just upgrade keepalived, it meant that Neutron L3 agent would need to play some role in issuing gratuitous ARP packets, not relying on the daemon to do the right job. For this matter, Neutron patch was introduced. What the patch does is it calls to arping tool whenever a new IPv4 address is added to an interface managed by keepalived. A new address added indicates that VRRP negotiation resulted in the locally running keepalived instance transitioning to “master”; or it means a new floating IP address was added in the configuration file just reloaded by the daemon. At this point it makes sense to advertise the newly added addresses on the wire using gratuitous ARP, something that in an ideal world keepalived would do for us.

We already had the neutron-keepalived-state-change helper daemon running inside HA router network namespaces that monitors router interfaces for new IP addresses to detect transitions between keepalived states and then sends the information back to neutron-server. To avoid introducing a new daemon just to issue gratuitous ARP packets, we figured it’s easier to reuse the existing one.

Of course, issuing gratuitous ARP packets from outside of keepalived introduced some complications.

For one, the whole setup became slightly racy. For example, what happens when keepalived decides to forfeit its mastership in the middle of neutron-keepalived-state-change sending gratuitous ARP packets? Would we continue sending those packets into the network even after keepalived removed the VIP addresses from its interfaces? Thanks to the net.ipv4.ip_nonlocal_bind sysctl knob, it shouldn't be a concern. Its default value (0) means that userspace tools (including arping) can't send an ARP packet for an IPv4 address that is not configured on an interface. If we hit the race, the worst that can happen is that arping hangs, failing to send more gratuitous ARP packets into the network and logging the "bind: Cannot assign requested address" error message on its stderr. Since we set a hard time limit for the tool's execution (remember the -w 4.5 CLI argument discussed above), it should be fine. To stay on the safe side, we simply set the sysctl knob to 0 inside each new router namespace, overriding whatever custom value the platform may have for the setting.
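In shell terms, the override described above amounts to a single sysctl write per router namespace (the namespace name below is illustrative):

```shell
# Forbid binding to non-local IPv4 addresses inside the router namespace,
# so a racing arping fails fast once keepalived has removed the VIP.
ip netns exec qrouter-726095be-5916-489b-be05-860e2f19d556 \
    sysctl -w net.ipv4.ip_nonlocal_bind=0
```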

There are still two complications with that though.

First, as it turned out, the ip_nonlocal_bind knob was set to 1 in DVR fip namespaces, and for a reason. So we needed to make sure it was set to 0 in all router namespaces except fip ones. Another issue we surfaced was specific to the RHEL7 kernel, where the ip_nonlocal_bind knob was not network namespace aware, so changing it in one namespace affected all other routers. It was fixed in later RHEL7 kernels; in the meantime, we could only hope that no one would ever host both DVR fip and HA qrouter namespaces on the same node, for they would clash.

WTF#2: keepalived forfeits mastership on multiple SIGHUPs sent in quick succession

Not strictly related to gratuitous ARP, but since it's also about the SIGHUP handler, I figured I would mention this issue here too.

Some testing revealed that when multiple HA router updates arrived at the Neutron L3 agent in quick succession, keepalived sometimes forfeited its mastership, flipping to "backup" for no apparent reason, with the consequent network disruption lasting until a new keepalived "master" instance was elected.

Further investigation, also led by John Schwarz, revealed that this always happened when multiple SIGHUP signals were sent to keepalived, irrespective of whether there were any changes to its configuration files.

It was clearly a bug in the daemon, but at this point we were used to working around its quirks, so it didn't take long to come up with a special signal throttler for keepalived. What it does is introduce 3-second delays between consecutive SIGHUP signals sent to keepalived instances. Why 3 seconds? No particular reason, except that it worked (anything below 2 seconds didn't), and it seemed like a good idea to give keepalived a chance to send at least a single VRRP health check message between reload requests, so we made the delay slightly longer than the default health check interval, which is 2 seconds for Neutron.
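The throttler boils down to remembering when the last SIGHUP was delivered and sleeping out the remainder of the 3-second window. A minimal sketch (not Neutron's actual code; the clock and sleep hooks exist only to make the sketch testable):

```python
import time

class SighupThrottler:
    """Ensure at least `delay` seconds pass between signals to keepalived."""

    def __init__(self, delay=3.0, clock=time.monotonic, sleep=time.sleep):
        self._delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last_sent = None

    def throttle(self, send_sighup):
        """Call `send_sighup`, delaying it if the previous one was too recent."""
        if self._last_sent is not None:
            elapsed = self._clock() - self._last_sent
            if elapsed < self._delay:
                self._sleep(self._delay - elapsed)
        send_sighup()
        self._last_sent = self._clock()
```

In real use, `send_sighup` would be whatever delivers the signal to the keepalived process (e.g. a `kill -HUP <pid>` invocation through the agent's rootwrap machinery).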

Reading logs

So how do I know that an HA router actually sent gratuitous ARP packets without having access to a live machine? Let's say all I have is the log files of the Neutron services.

For those packets that are sent by keepalived itself, it logs a message per advertised IP address in syslog, as seen in a snippet provided earlier.

As for the packets issued by the neutron-keepalived-state-change daemon, the corresponding messages were originally logged to a file located in a directory that also contained other files needed for the router, including the keepalived configuration and state files. The problem here is that once an HA router is unscheduled from an L3 agent, the agent stops keepalived and cleans up both the router namespace and all files used by the router, including the log files of neutron-keepalived-state-change. This means that after the router is gone, you can't get your hands on the daemon's log file. You are left in the dark as to whether it even called arping.

To facilitate post-cleanup debugging, in the Pike release cycle we made the daemon log to syslog in addition to its own log file. With the patch, we can now see the daemon's messages in the system journal, including those corresponding to arping execution.

Apr 28 20:56:00 ubuntu-xenial-rax-ord-8650506 neutron-keepalived-state-change[20945]: 2017-04-28 20:56:00.338 20945 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'ip', 'netns', 'exec', 'qrouter-433765a8-f084-4fbd-9aea-447835c32b09@testceeee6ac', 'arping', '-A', '-I', 'qg-c317683_6ac', '-c', '3', '-w', '4.5', '10.0.0.215'] create_process /opt/stack/new/neutron/neutron/agent/linux/utils.py:92
Apr 28 20:56:00 ubuntu-xenial-rax-ord-8650506 sudo[24549]:    stack : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/sbin/ip netns exec qrouter-433765a8-f084-4fbd-9aea-447835c32b09@testceeee6ac arping -A -I qg-c317683_6ac -c 3 -w 4.5 10.0.0.215
Apr 28 20:56:02 ubuntu-xenial-rax-ord-8650506 neutron-keepalived-state-change[20945]: 2017-04-28 20:56:02.430 20945 DEBUG neutron.agent.linux.utils [-] Exit code: 0 execute /opt/stack/new/neutron/neutron/agent/linux/utils.py:153

Now, whenever you have a doubt whether gratuitous ARP packets were sent by a Neutron HA router, just inspect syslog. You should find the relevant messages there, either from keepalived itself or from neutron-keepalived-state-change calling arping.

In the next post of the series, we will start looking at a particular ARP-related failure the Red Hat Networking Team hit recently in the RH-OSP 11 (Ocata) CI environment, one that, I figured, could be of general interest to Neutron developers and operators.

Introduction into gratuitous ARP

The world is a jungle in general, and the networking game contributes many animals. // RFC 826

This post is the very first in a series that will cover gratuitous ARP and its relation to OpenStack. There will be six posts in the series. My plan is to post them all within the span of the next two weeks. You can find the list of "episodes" already published below.

This post will start the series with a discussion of what gratuitous ARP is, and how it helps in IT operations. Later posts will touch on how it applies to OpenStack and Linux kernel, and also discuss several issues that you may want to be aware of before building your OpenStack operations on the protocol.

Let’s dive in.

ARP (Address Resolution Protocol)

Motivation

ARP is one of the most widely used protocols in modern networks. Its history goes back to the early 80s, to the times of DARPA-backed internetworking experiments. RFC 826, which first defined the protocol, is dated November 1982 (making it 35 years old at the time of writing). Despite its age, it's still the backbone of local IPv4 network connectivity. Even in 2017 (the year I draft this post), it's still very hard to find an IPv6-only network node, especially outside cloud environments. But that's IPv4, so how does ARP fit into the picture?

To understand the goal of ARP, let's first look at how network nodes are connected. The general model can be described as a set of hosts, each having one or more Network Interface Controller (NIC) cards connected to a common data link fabric. This fabric comes in different flavors (Ethernet, IEEE 802.11 aka WiFi, or FireWire). Irrespective of the particular flavor, all of them provide similar capabilities. One of the features expected from all of them is some form of endpoint addressing, ideally globally unique, so that network hosts connected to a shared medium can distinguish each other and address transferred data to specific peers. Ethernet and IEEE 802.11 are probably the most popular data link layers in the world, and since they are largely identical in terms of NIC addressing, in the following discussion we will assume an Ethernet fabric unless explicitly said otherwise.

For Ethernet, each NIC card produced in the world gets a unique 48-bit hardware address, allocated by its vendor under IEEE supervision, which guarantees that no hardware address is ever allocated to two NIC cards. Uniqueness ensures that whichever hardware you plug into your network, it will never clash in address space with any other card attached to the same network. An example of an EUI-48 address, in the commonly used notation, would be f4:5c:89:89:cd:54. These addresses are widely known as MAC addresses, and so I will also use this term moving forward.

All of which means that your NIC already has a unique address, so why do you even need IP addresses? Sadly, people are bad at memorizing 48 randomized bits, so an easier scheme would be handy. Another problem is that whenever your NIC dies and you replace it with a new one, the new card will have a different unique address, so you would need to advertise the new MAC address to all the network peers that may need to reach your host.

And so engineers went looking for a better scheme to address network hosts. One of the successful alternative addressing proposals was IPv4. In this scheme, addresses are defined to be 32 bits long. Still a lot, but the crucial point is that now you can pick the addresses for your NIC cards. With that freedom, you can pick the same bit prefix for all your hosts, distinguish them by a shorter number of trailing bits, memorize just those unique bits, and configure your networking software to use the same prefix for communication with other hosts. Then, whenever you want to address a host, you pass the unique trailing bits assigned to that host into your networking stack and let it produce the resulting address by prepending the common prefix.
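The "memorize only the trailing bits" idea can be demonstrated with Python's standard ipaddress module (the 192.0.2.0/24 documentation range plays the role of the shared prefix here):

```python
import ipaddress

# All hosts share the 24-bit prefix 192.0.2.0/24; each host is then
# identified by only the 8 trailing bits.
network = ipaddress.ip_network('192.0.2.0/24')

host_bits = 11               # the short number a person actually memorizes
address = network[host_bits]  # the stack prepends the common prefix

print(address)
```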

The only problem with this approach is that now you have two address schemes: MAC addresses and IP addresses, with no established mapping between them. Of course, in small networks, you could maintain static IP-to-MAC mappings in sync on every host, but that is error prone and doesn’t scale well.

And that’s where ARP enters the stage. Instead of maintaining static mappings across hosts, the protocol allows hosts to disseminate the mapping information dynamically on the wire.

Quoting the abstract of RFC 826:

Presented here is a protocol that allows dynamic distribution of the information needed to build tables to translate an address A in protocol P’s address space into a 48.bit Ethernet address.

And that’s exactly what we need.

While the abstract and even the RFC title talk about Ethernet, the mechanism proved so successful that it was later extended to other data link layers, e.g. FireWire.

Basics

The protocol introduces both the ARP packet format and the protocol state machine. Sadly, the RFC doesn’t contain a visual diagram of ARP packets, but we can consult the protocol’s Wikipedia page.

The RFC describes an address translation (ARP) table kept on each host, storing IP-to-MAC mappings. It also defines two operations: a REQUEST and a REPLY. Whenever a host wants to contact an IP address for which there is no mapping in its local ARP table, the host sends a REQUEST ARP packet to the broadcast MAC address, asking the question “Who has the IP address?” It’s then expected that the host carrying the IP address will send a REPLY ARP packet back with its own MAC address set in the “Sender hardware address” field. The original host will then update its ARP table with the new IP-to-MAC mapping and use the newly learned value as the destination MAC address for all further communication with the IP address.
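The 28-byte ARP payload described by RFC 826 can be serialized with a few lines of Python; this is a minimal sketch (the addresses are illustrative, and the surrounding Ethernet frame header is omitted):

```python
import struct

def build_arp_request(sender_mac: bytes, sender_ip: bytes, target_ip: bytes) -> bytes:
    """Serialize an ARP REQUEST ("who has target_ip?") per RFC 826."""
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # HTYPE: Ethernet
        0x0800,       # PTYPE: IPv4
        6,            # HLEN: hardware address length in bytes
        4,            # PLEN: protocol address length in bytes
        1,            # OPER: 1 = REQUEST, 2 = REPLY
        sender_mac,   # SHA: sender hardware address
        sender_ip,    # SPA: sender protocol address
        b"\x00" * 6,  # THA: unknown yet, zeroed in a request
        target_ip,    # TPA: the address we are asking about
    )

pkt = build_arp_request(
    bytes.fromhex("f45c8989cd54"),  # example MAC from earlier
    bytes([192, 168, 10, 1]),
    bytes([192, 168, 10, 2]),
)
print(len(pkt))  # 28
```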

One thing to clarify before we move forward: all of this assumes both interacting hosts are on the same layer-2 network segment, with no IP gateway (router) in between. If the hosts are located in different segments, the connection between them is established through a router. In this case, a host wanting to communicate with a host in another segment determines that fact by inspecting its IP routing table. Since the destination IP address does not belong to the local network IP prefix, the host instead sends the data to the default router IP address. (Of course, at this point the host may also discover that its ARP table doesn’t contain an entry for the gateway IP address yet, in which case it will use ARP to learn the router’s MAC address.)
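The routing decision above boils down to a prefix check; here is a simplified sketch using the ipaddress module, with a made-up local network and gateway:

```python
import ipaddress

# Illustrative local network and default gateway.
local_net = ipaddress.ip_network("192.168.10.0/24")
default_gw = ipaddress.ip_address("192.168.10.1")

def next_hop(dest: str) -> ipaddress.IPv4Address:
    """Pick the IP address whose MAC address must be resolved via ARP."""
    dest_ip = ipaddress.ip_address(dest)
    if dest_ip in local_net:
        return dest_ip   # same segment: ARP for the destination itself
    return default_gw    # different segment: ARP for the router instead

print(next_hop("192.168.10.7"))  # 192.168.10.7
print(next_hop("8.8.8.8"))       # 192.168.10.1
```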

ARP table invalidation

One interesting aspect of the original RFC is that it doesn’t define a mechanism to update existing ARP table entries with new MAC addresses. Back in 1982, it was probably widely assumed that mobile IP stations roaming across network segments, changing the devices they use to connect to the outside world on the fly (think of how your smartphone seamlessly switches from WiFi to LTE), were not a realistic use case. But even then, the “Related issue” section of the document captured some ideas on how this could be implemented if needed.

One suggestion was for every host to define an “aging time” for its ARP entries. If a peer host is detected as unreachable (probably because there was no incoming traffic using both the MAC and IP addresses stored in the ARP table), the originating host could remove the corresponding ARP entry from its table once the entry has “aged”. This mechanism is indeed used in most modern ARP implementations, with 60 seconds being the common default on Linux systems (it can be overridden via the gc_stale_time sysctl setting).
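The aging idea can be modeled with a toy table (a deliberately simplified sketch; the real kernel neighbor cache has more states and only garbage-collects entries periodically):

```python
GC_STALE_TIME = 60.0  # seconds, mirroring Linux's default gc_stale_time

class ArpTable:
    """Toy ARP table that expires entries not confirmed within the aging window."""

    def __init__(self):
        self._entries = {}  # ip -> (mac, last_confirmed_timestamp)

    def learn(self, ip, mac, now):
        self._entries[ip] = (mac, now)

    def lookup(self, ip, now):
        entry = self._entries.get(ip)
        if entry is None:
            return None
        mac, confirmed = entry
        if now - confirmed > GC_STALE_TIME:
            del self._entries[ip]  # aged out: a fresh ARP exchange is needed
            return None
        return mac

table = ArpTable()
table.learn("192.168.10.2", "f4:5c:89:89:cd:54", now=0.0)
print(table.lookup("192.168.10.2", now=30.0))   # f4:5c:89:89:cd:54
print(table.lookup("192.168.10.2", now=120.0))  # None
```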

It means that your connectivity to a roaming IP host will heal itself after a minute of downtime. While that’s great, some use cases would benefit from hosts reacting more rapidly to network changes.

And that’s where gratuitous ARP comes into play.

Gratuitous ARP

Gratuitous ARP is an ARP packet that was never asked for (hence its alternative name: unsolicited ARP). RFC 826, in its “Related issue” section, mentions an algorithm to update existing ARP table entries in the network based on unsolicited ARP packets. But it was only RFC 2002, “IP Mobility Support” from 1996, that made it part of a standard and introduced the very term “gratuitous ARP”.

RFC 2002 discusses protocol enhancements for IP networks that allow IP devices to roam across networks without introducing significant connectivity delays or disruptions. Among other things, it defines the algorithm to be used to update existing ARP table entries with new MAC addresses. For this purpose, it adopts the proposal from RFC 826: a host broadcasts a gratuitous ARP packet into the network, and its peers update their tables with the new MAC address it carries, restoring connectivity even before old ARP entries expire.
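What makes an ARP packet “gratuitous” is that the sender announces its own address: the sender and target protocol addresses are equal, and the packet is broadcast rather than addressed to any particular peer. A minimal sketch of such a payload (addresses are illustrative; implementations vary in whether they send it as an op-2 REPLY or a broadcast op-1 REQUEST):

```python
import struct

def build_gratuitous_arp(mac: bytes, ip: bytes) -> bytes:
    """A gratuitous ARP announcement: SPA == TPA, saying "this IP is now at this MAC"."""
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1, 0x0800, 6, 4,
        2,            # OPER: sent here as a REPLY (op 2); a REQUEST (op 1) is also common
        mac, ip,      # SHA / SPA: the announcing host's own addresses
        b"\xff" * 6,  # THA: some implementations broadcast, others zero this field
        ip,           # TPA equals SPA: the defining property of a gratuitous ARP
    )

pkt = build_gratuitous_arp(bytes.fromhex("f45c8989cd54"), bytes([192, 168, 10, 5]))
spa, tpa = pkt[14:18], pkt[24:28]
print(spa == tpa)  # True
```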

There are two main use cases for gratuitous ARP. One is to quickly switch between multiple devices on the same host. Another is to move services exposed through an IP address from one host to another transparently to network peers.

This last scenario may happen either as part of a planned action by an Ops team managing a service, or triggered by a self-healing mechanism used to guarantee availability of services in case of software or network failures. One popular piece of software that fails over IP addresses between hosts is keepalived, which uses the VRRP protocol to negotiate which node should carry the IP addresses it manages.

In OpenStack Neutron, gratuitous ARP is how floating IP addresses roam between ports; it also helps with failing over IP addresses between HA router instances.

In the next post, I will expand on how OpenStack Neutron uses and implements gratuitous ARP.