The Failure, part 1: It’s OpenStack’s Fault

I’m not actually good at computers I’m just bad at giving up. // mjg59

This post is the third in a series that covers gratuitous ARP and its relation to OpenStack. In the previous posts, we discussed what gratuitous ARP is and how it’s implemented in the OpenStack Neutron L3 agent.

This post is short and merely sets the stage for a story that started a month ago inside the Red Hat Networking team and that ultimately led me to learn how ARP actually works in OpenStack and in Linux. In the next posts, I will expand on the story and discuss possible solutions to the problem that I introduce below. (The plan is to post all four remaining pieces over the next four days.)

But let’s not get ahead of ourselves.

The stage

Failure mode

It was two months before the next shiny Ocata-based release of Red Hat OpenStack Platform (RH-OSP 11). Now that the team’s focus had shifted from upstream development to polishing the product, we started looking more closely at downstream CI jobs. As usual with new releases, there were several failures in our tempest jobs. For most of them we figured out a possible culprit and landed fixes. For all of them, that is, except The Failure.

Well, it was not a single test case that was failing; it was more like a whole class of them. In the affected test jobs, we execute all tempest tests, both api and scenario, and what we noticed is that a lot of scenario test cases were failing on connectivity checks when using a floating IP (but, importantly, never a fixed IP). Only positive connectivity checks were failing (meaning cases where connectivity was expected but did not work; never a case where lack of connectivity was the expectation).

There are two types of connectivity checks in tempest: a ping check and an SSH check. The former sends ICMP (or ICMPv6) datagrams to the IP address under test for up to 120 seconds and expects a single reply, while the latter establishes an SSH session to the IP address and waits for successful authentication.
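For intuition, a stripped-down version of such a ping check could look like the sketch below. This is not tempest’s actual code (the real logic lives in tempest.scenario.manager, as the tracebacks later in this post show); the ping_until_reply helper, its parameters, and the ping flags are illustrative assumptions.

import subprocess
import time

def ping_until_reply(ip, timeout=120, interval=1):
    """Simplified ping-style check: keep probing until one reply or the deadline.

    Illustration only; tempest's real check lives in tempest.scenario.manager.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        # '-c 1 -w 1': send a single ICMP echo request, wait at most one second
        if subprocess.call(['ping', '-c', '1', '-w', '1', ip]) == 0:
            return True
        time.sleep(interval)
    return False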

In the failing jobs, whenever a ping check failed, the following could be seen in the tempest log file:

2017-05-06 00:02:32.563 3467 ERROR tempest.scenario.manager Traceback (most recent call last):
2017-05-06 00:02:32.563 3467 ERROR tempest.scenario.manager   File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 624, in check_public_network_connectivity
2017-05-06 00:02:32.563 3467 ERROR tempest.scenario.manager     mtu=mtu)
2017-05-06 00:02:32.563 3467 ERROR tempest.scenario.manager   File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 607, in check_vm_connectivity
2017-05-06 00:02:32.563 3467 ERROR tempest.scenario.manager     self.fail(msg)
2017-05-06 00:02:32.563 3467 ERROR tempest.scenario.manager   File "/usr/lib/python2.7/site-packages/unittest2/case.py", line 666, in fail
2017-05-06 00:02:32.563 3467 ERROR tempest.scenario.manager     raise self.failureException(msg)

When it was a SSH check that failed, then the error looked a bit different:

2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh Traceback (most recent call last):
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh   File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 107, in _get_ssh_connection
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh     sock=proxy_chan)
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh   File "/usr/lib/python2.7/site-packages/paramiko/client.py", line 305, in connect
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh     retry_on_signal(lambda: sock.connect(addr))
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh   File "/usr/lib/python2.7/site-packages/paramiko/util.py", line 269, in retry_on_signal
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh     return function()
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh   File "/usr/lib/python2.7/site-packages/paramiko/client.py", line 305, in
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh     retry_on_signal(lambda: sock.connect(addr))
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh   File "/usr/lib64/python2.7/socket.py", line 224, in meth
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh     return getattr(self._sock,name)(*args)
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh timeout: timed out
2017-05-05 21:29:41.996 4456 ERROR tempest.lib.common.ssh
2017-05-05 21:29:41.997 4456 ERROR tempest.scenario.manager [-] (TestGettingAddress:test_multi_prefix_dhcpv6_stateless) Initializing SSH connection to 10.0.0.214 failed. Error: Connection to the 10.0.0.214 via SSH timed out.

Any single test case that sometimes failed did not, on its own, carry a high risk of impacting job results; but once you aggregated the failures across all affected test cases, the chance of the whole test run passing became abysmal, around 10%, which was clearly not ideal.
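To see why the aggregate pass rate collapses like that, here is a back-of-the-envelope calculation. The per-test failure rate and the number of affected tests below are hypothetical, purely for illustration; with 20 affected tests each independently failing 10% of the time, a full run passes only about 12% of the time.

# Hypothetical numbers, for illustration only: the per-test failure rate and
# the count of affected scenario tests are not taken from the actual jobs.
per_test_failure = 0.10
affected_tests = 20

run_pass_probability = (1 - per_test_failure) ** affected_tests
print(round(run_pass_probability, 2))  # ~0.12, i.e. roughly a 10% pass rate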

So I figured I might have a look at it, naively assuming it would take a day or two to find the root cause, fix it, and move on with my life. Boy, was I wrong! It took me a month to get to the bottom of it (though in all honesty, most of that time was spent trying to set up an environment that would consistently reproduce the issue).

First steps

Initially, I figured the failures were happening most often in our L3 HA jobs, so I focused on one of those. Reading through the tempest, neutron-server, neutron-openvswitch-agent, and neutron-l3-agent log files didn’t reveal much.

When looking at a particular test failure, we could see that the instance carrying the failing floating IP had successfully booted and received a DHCPv4 lease for its fixed IP, as seen in its console log, which tempest gladly dumps on connectivity check failures:

Starting network...
udhcpc (v1.20.1) started
Sending discover...
Sending select for 10.100.0.11...
Lease of 10.100.0.11 obtained, lease time 86400

To cross-check, we could also find the relevant lease allocation event in the system journal:

May 06 06:02:57 controller-1.localdomain dnsmasq-dhcp[655233]: DHCPDISCOVER(tap7876deff-b8) fa:16:3e:cc:2c:30
May 06 06:02:57 controller-1.localdomain dnsmasq-dhcp[655233]: DHCPOFFER(tap7876deff-b8) 10.100.0.11 fa:16:3e:cc:2c:30

The tempest log file clearly suggested that the failure was not caused by an SSH key pair misbehaving for the failing instance. Had the public SSH key not been deployed to the instance, we would have seen authentication failures rather than SSH timeouts. Nor did we see any SSH timeouts for tests that established SSH sessions using the internal fixed IP addresses of instances.
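This distinction matters because paramiko surfaces the two failure modes as different exceptions. Below is a minimal sketch of how one could tell them apart; it is illustrative only, and the classify_ssh_failure helper and its parameters are assumptions, not part of tempest.

import socket
import paramiko

def classify_ssh_failure(host, username, key_path, timeout=10):
    """Illustration only: distinguish 'no network path' from 'wrong key pair'."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect(host, username=username,
                       key_filename=key_path, timeout=timeout)
        return 'connected'
    except paramiko.AuthenticationException:
        # A TCP session was established; the key pair is the problem.
        return 'authentication failure'
    except (socket.timeout, socket.error):
        # No TCP session at all: traffic to the floating IP goes nowhere.
        return 'connection timeout'
    finally:
        client.close()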

All of this suggested that internal (“tenant”) network connectivity worked fine and that the problem was probably isolated somewhere in the Neutron L3 agent. But looking into the Neutron L3 agent and neutron-server logs didn’t surface any problem either: we could easily find the relevant arping calls in the Neutron L3 agent log (for legacy L3 routers) or in the system journal (for Neutron L3 HA routers).
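For reference, the gratuitous ARP announcement we were grepping for boils down to running arping inside the router namespace. The snippet below is a simplified stand-in rather than Neutron’s actual helper; the namespace and device names, and the exact arping invocation, are assumptions for illustration.

import subprocess

def announce_floating_ip(namespace, device, ip, count=3):
    """Simplified stand-in for the L3 agent's gratuitous ARP call (illustration only).

    '-A' makes iputils arping send gratuitous ARP replies for the address, so
    that upstream switches and routers refresh their ARP caches for the
    floating IP after it moves or is newly configured.
    """
    cmd = ['ip', 'netns', 'exec', namespace,
           'arping', '-A', '-I', device, '-c', str(count), ip]
    subprocess.check_call(cmd)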

Of course, you never believe that a legitimate fault may come from software outside your immediate area of expertise. It’s never the compiler’s or some other development tool’s fault; 99% of whatever you hit each day is probably your own doing. And in my particular case, the obvious culprit was OpenStack Neutron. So it took me a while to start looking at other places.

But after two weeks of unproductive code and log reading, and of adding debug statements to Neutron code, it was finally time to move on to unfamiliar places.

In the next post of the series, we will explore what I discovered once I shifted my accusatory gaze towards the platform residing below OpenStack.
