This post is the fourth in a series that covers gratuitous ARP and its relation to OpenStack. In the previous posts, we discussed what gratuitous ARP is and how it's implemented in the OpenStack Neutron L3 agent. We also set the stage for the Failure and ruled out the more obvious causes of it.
This post continues the discussion of the Failure beyond OpenStack components. I advise you to get comfortable with the previous posts in the series before proceeding with this "episode".
To recap: a bunch of tempest connectivity tests were failing in an RH-OSP 11 environment when connecting to a floating IP address. Quick log triage suggested that internal connectivity for the affected instances worked fine, and that the Neutron L3 agent invoked the arping tool at the right time. Still, the "undercloud" node (the node executing tempest) got timeout failures.
Before diving deeper, let's take a step back and explore the deployment more closely. How are those floating IP addresses even supposed to work in this particular setup? Is the tempest node on the same L2 network segment as the floating IP range, or is it connected to it via an L3 router?
The failing jobs deploy Red Hat OpenStack using Director, also known as TripleO. Deploying a cloud with bare TripleO is not an easy task, so several tools exist that prepare development and testing environments. One of them is tripleo-quickstart, more popular among upstream developers; another is Infrared, more popular among CI engineers. The failing jobs were all deployed with Infrared.
Infrared is a very powerful tool. I won't get into details, but to give you a taste of it: it supports multiple compute providers, allowing you to deploy the cloud in libvirt or on top of another OpenStack cloud. It can deploy both RH-OSP and RDO onto provisioned nodes. It can use different installers (TripleO, packstack, …). It can also execute tempest tests for you, collect logs, and configure SSH tunnels to provisioned nodes so you can access them directly. Lots of other cool features come with it. You can find more details about the tool in its documentation.
As I said, the failing jobs were all deployed using Infrared on top of a powerful remote libvirt hypervisor where a bunch of nodes with distinct roles were created:
- a single “undercloud” node that is used to provision the actual “overcloud” multinode setup (this node also runs tempest);
- three “overcloud” controllers all hosting Neutron L3 agents;
- a single “overcloud” compute node hosting Nova instances.
All the nodes were running as KVM guests on the same hypervisor, connected to each other with multiple isolated libvirt networks, each carrying a distinct type of traffic. After Infrared deployed the "undercloud" and "overcloud" nodes, it also executed the "neutron" set of tests, which contains both api and scenario tests from both the tempest and neutron trees.
As I already mentioned, the "undercloud" node is the one that also executes the tempest tests. This node is connected to an external network that hosts all floating IP addresses for the preconfigured public network, with eth2 of the node directly plugged into it. The virtual interface is in turn plugged into the external Linux kernel bridge on the hypervisor, where the external (eth2) interfaces of all controllers are plugged too.
What this means for us is that the tempest node is on the same network segment as all floating IP addresses and gateway ports of all Neutron routers. There is no router between the floating IP network and the tempest node. Whenever a tempest test case attempts to establish an SSH connection to a floating IP address, it first consults the local ARP table for an appropriate IP-to-MAC mapping; if the mapping is missing, the kernel uses the regular ARP procedure to retrieve it.
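The lookup the tempest node performs can be sketched from a shell. This is only an illustration: eth2 and 10.0.0.221 are this deployment's names, and on another host the device and address will differ.

```shell
# If an entry for the target is cached (and fresh), no ARP exchange happens;
# otherwise the kernel broadcasts an ARP REQUEST before the first TCP packet.
ip neigh show dev eth2 | grep 10.0.0.221 || echo "no cached entry, kernel will ARP"
```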
Now that we understand the deployment, let’s jump into the rabbit hole.
The initial debugging steps on the OpenStack Neutron side hadn't revealed anything useful. We identified that the Neutron L3 agent correctly invoked arping with the right arguments. So maybe, in the end, it wasn't OpenStack's fault?
The first thing to check was whether arping actually sent gratuitous ARP packets, and whether they reached the tempest node, at which point we could expect the "undercloud" kernel to honor them and update its ARP table. I figured it would be easier to capture only the external (eth2) traffic on the tempest node. I expected to see the gratuitous ARP packets there, which would mean the kernel received (and processed) them.
Once I got my hands on hardware capable of hosting the needed TripleO installation, I quickly deployed the topology using Infrared.
Of course, it's a lot easier to capture the traffic on the tempest node and analyze it later in Wireshark. So that's what I did.
$ sudo tcpdump -i eth2 -w external.pcap
I also decided it might be worth capturing the ARP table state during a failed test run.
$ while true; do date >> arptable; ip neigh >> arptable; sleep 1; done
And finally, I executed the tests. After 40 minutes of waiting, one of the tests failed with the expected SSH timeout error. Good, time to load the .pcap into Wireshark.
The failing IP address was 10.0.0.221, so I used the following expression to filter the relevant traffic:
ip.addr == 10.0.0.221 or arp.dst.proto_ipv4 == 10.0.0.221
The result looked like this:
Here is what we see: first, an SSH session is started (frame 181869). It initially fails (frame 181947), of course, because the gratuitous ARP packets only start to arrive later (frames 181910, 182023, 182110). But then, for some reason, subsequent TCP packets are still sent using the old 52:54:00:ac:fb:f4 destination MAC address. It looked as if the arriving gratuitous ARP packets were completely ignored by the kernel. More importantly, the node continued sending TCP packets to 10.0.0.221 using the old MAC address even past the expected aging time (60 seconds), and it never issued a single ARP REQUEST to update its cache throughout the whole test case execution (!). Eventually, after ~5 minutes of knocking on the wrong door, the test case failed with an SSH timeout.
How could that happen? The kernel is supposed to honor the new MAC address right after it receives an ARP packet!
Now, if we compare the traffic dump with one from a successful test case run, we see a capture that looks more like the one below (in this case, the IP address of interest is 10.0.0.223):
Here we see TCP retransmissions of SSH (port 22) packets (frames 74573 and 74678), which failed to be delivered because the kernel didn't yet know the new fa:16:3e:bd:c1:97 MAC address. Later, we see a burst of gratuitous ARP packets sent from the router serving 10.0.0.223, advertising the new MAC address (frames 74715, 74801, and 74804). Though this doesn't immediately prove that it was the gratuitous ARP packets that healed the connectivity, it's clear that the tempest node quickly learned the new MAC address and continued its SSH session (frames 74805 and onward).
One thing I noticed while looking at traffic captures from multiple test runs is that whenever a test failed, it always failed on a reused floating IP address. Those would show up in Wireshark with the following warning message:
So maybe there was a difference in ARP entry state between successful and failing runs? Looking at the ARP table state dumps I collected during a failing run, the following could be observed:
- Before the start of the failed test case, the corresponding ARP entry was in STALE state.
- Around the time when gratuitous ARP packets were received and the first TCP packet was sent to the failing IP address, the entry transitioned to DELAY state.
- After 5 seconds, it transitioned to REACHABLE without changing its MAC address. No ARP REQUEST packets were issued in between.
- The same STALE – DELAY – REACHABLE transitions happened for the affected IP address over and over. The tempest node never issued a single ARP REQUEST during the whole test case execution. Nor did it receive any new traffic that used the old MAC address.
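This state timeline is easy to condense out of the per-second arptable dump collected earlier. The snippet below runs the extraction against a small built-in sample in the same format; the timestamps and MAC are illustrative, not taken from the real run.

```shell
# Build a tiny sample in the format the collection loop produced, then
# print the sequence of neighbour states for the address of interest.
cat > /tmp/arptable.sample <<'EOF'
Mon Jul 10 12:00:01 UTC 2017
10.0.0.221 dev eth2 lladdr 52:54:00:ac:fb:f4 STALE
Mon Jul 10 12:00:02 UTC 2017
10.0.0.221 dev eth2 lladdr 52:54:00:ac:fb:f4 DELAY
Mon Jul 10 12:00:07 UTC 2017
10.0.0.221 dev eth2 lladdr 52:54:00:ac:fb:f4 REACHABLE
EOF
awk '/^10\.0\.0\.221 /{print $NF}' /tmp/arptable.sample | uniq
```

Against the real dump this prints the repeating STALE – DELAY – REACHABLE cycle described above.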
If we compare this with the ARP entries in "good" runs, we see that there, too, they start in the STALE state and then transition to DELAY, but after 5 seconds, instead of going to REACHABLE, the entry switched to the PROBE state. The node then issued a single ARP REQUEST (visible in the captured traffic dump), quickly received a reply from the correct Neutron router, updated the ARP entry with the new MAC address, and only then finally transitioned to REACHABLE. At that point, the connectivity healed itself.
What made the node behave differently in these two seemingly identical cases? Why didn't it issue an ARP probe in the first case? What are those ARP table states anyway? I figured it was time to put my kernel hat on and read some Linux code.
In the next post in the series, I will cover my journey into the kernel ARP layer. Stay tuned.
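The 5-second DELAY window, by the way, is not magic: it comes from the kernel's per-device neighbour sysctls, which you can inspect directly (the "default" entries below apply to devices without per-device overrides; an eth2-specific value would live under /proc/sys/net/ipv4/neigh/eth2/ instead).

```shell
# delay_first_probe_time (usually 5) is how long an entry sits in DELAY
# before the kernel probes; base_reachable_time_ms (usually 30000) is the
# base value from which the REACHABLE lifetime is derived.
cat /proc/sys/net/ipv4/neigh/default/delay_first_probe_time
cat /proc/sys/net/ipv4/neigh/default/base_reachable_time_ms
```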