The Failure, part 4: Summary

This post is the sixth, and the last, in a series that covers gratuitous ARP and its relation to OpenStack. In the previous posts, we discussed what gratuitous ARP is and how it’s implemented in OpenStack Neutron L3 agent. We also introduced the Failure, covered its initial triaging, looked at traffic captured during failed test runs, and digged through Linux kernel code only to find multiple bugs in its ARP layer.

In this final post we will summarize what we have learned about the Failure. In addition, we will also discuss a possible fix that could help alleviate the impact while we are waiting for new kernels.

It is advised that you make yourself comfortable with previous posts in the series before proceeding with reading this last “episode”. Otherwise you loose all the fun of war stories.

Summary

In the previous post of the series, we finally learned what went wrong with the Failure, and we figured there is no OpenStack fault in it. Instead, it is Linux kernel that ignores all gratuitous ARP packets Neutron L3 agent eagerly sends its way; and it is the same Linux kernel that is spinning bogus ARP entries in perpetual STALE – DELAY – REACHABLE state change loop without a way out.

And that’s where we are at. We have the following alternatives to tackle the test failure problem.

or

  • Include the patch that avoids touching n->updated if an ARP packet is ignored;

or

  • Include the patch that forces override for all gratuitous ARP packets irrespective of arp_accept;

or

(Of course, ideally all of those pieces would find their way into your setup.)

All of those solutions currently require kernel backports, at least for RHEL7. In the meantime, could we do something just on Neutron side? On first sight, it sounds like a challenge. But when you think about it, we can use the knowledge about the failure mode to come up with a workaround that may work in most cases.

We know that the issue would not show itself up in tests if only any of gratuitous ARP replies sent by Neutron would be honored by kernel, and they are all ignored because of arrival during locktime window.

We know that the default value for locktime is 1 second, and the reason why all three gratuitous ARP packets are ignored is because each of them land into the moving locktime window, which happens because of the way we issue those packets using the arping tool. The default interval between gratuitous ARP packets issued by the tool is 1 second, but if we could make it longer, it could help the kernel to get out of the moving locktime window loop and honor one of packets sent in burst.

From looking at arping(8) and its code, it doesn’t seem like it supports picking an alternative interval with any of its command line options (I have sent a pull request to add the feature but it will take time until it gets into distributions). If we want to spread gratuitous updates in time using arping, we may need to call the tool multiple times from inside Neutron L3 agent and maintain time interval between packets ourselves.

Here is the Neutron patch to use this approach. This would of course work only with hosts that don’t change locktime sysctl setting from its default value. Moreover, it’s very brittle, and may still not give us 100% test pass guarantee.

The chance of success can of course be elevated by setting arp_accept on all nodes. The good news is that at least some OpenStack installers already do it. For example, see this patch for Fuel. While the original reasoning behind the patch is twisted (gratuitous ARP packets are accepted irrespective of arp_accept, it’s just that locktime may get in their way in bad timing), the change itself is still helpful to overcome limitations of released kernels. To achieve the same effect for TripleO, I posted a similar patch.

Finally, note that while all the workarounds listed above may somewhat help with passing tempest tests, without the patch series tackling the very real spurious confirmation issue, you still risk getting your ARP table broken without a way for it to self-heal.

The best advice I can give is, make sure you use a kernel that includes the patch series (all official kernels starting from 4.11 do). If you consume kernel from a distribution, talk to your vendor to get the patches included. Without them, you risk with your service availability. (To note, while I was digging this test only issue, we got reports from the field some ARP entries were staying invalid for hours after failover.)

And finally… it’s not always an OpenStack fault. Platform matters.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s