IPv4 connectivity issues between RZOB and WHQ (monitoring) data center
Incident Report for Flying Circus
Postmortem

We have received feedback from our data center operator about the outage.

  1. The effects we observed and communicated so far were all accurate.
  2. The incident started with the physical uplink between Oberhausen and Frankfurt being interrupted at the upstream carrier level.
  3. The redundancy took over properly for all IPv6 traffic but only incompletely for IPv4 traffic.
  4. The data center manually repaired the incorrect redundancy, which restored service. Debugging the redundancy issue caused a few short routing flaps in the following hour.
  5. The equipment vendor (Juniper) has confirmed that neither the system design nor the specific configuration was at fault and is currently investigating a suspected firmware bug.
  6. Once a patch becomes available, the data center will apply it in a scheduled maintenance window.

One practical conclusion from our side is that customer setups not yet leveraging IPv6 can further improve their resilience against this specific kind of failure. We will contact customers without dual-stack setups (running IPv4 and IPv6 concurrently on the frontends) to implement dual stack where possible.
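For illustration, here is a minimal sketch of how dual-stack reachability of a frontend could be verified from an external vantage point. This is a generic Python example, not part of our tooling; the hostname is a placeholder:

    import socket

    def check_dual_stack(host, port=443, timeout=5.0):
        """Attempt a TCP connection to host over IPv4 and IPv6 separately."""
        results = {}
        for family, label in ((socket.AF_INET, "IPv4"), (socket.AF_INET6, "IPv6")):
            try:
                # Resolve only addresses of the requested family.
                sockaddr = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)[0][4]
                with socket.socket(family, socket.SOCK_STREAM) as s:
                    s.settimeout(timeout)
                    s.connect(sockaddr)
                results[label] = "reachable"
            except OSError as exc:
                results[label] = f"unreachable ({exc})"
        return results

    # "www.example.org" is a placeholder; substitute a real frontend name.
    print(check_dual_stack("www.example.org"))

A setup that passes this check on both protocols would have kept serving traffic over IPv6 during this incident.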

Even though this outage was only partial, it affected our customers for roughly an hour, which is longer than we strive for at this level of infrastructure. Our data center SLAs target 99.99% availability, which on average allows for about 4.3 minutes of downtime per month or roughly 53 minutes per year; this incident clearly exceeded that budget.
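As a back-of-the-envelope check, the downtime budget implied by a 99.99% target can be computed directly (assuming a 30-day month and a 365-day year):

    # Downtime budget implied by a 99.99% availability target.
    availability = 0.9999

    minutes_per_month = 30 * 24 * 60    # 43,200 minutes in a 30-day month
    minutes_per_year = 365 * 24 * 60    # 525,600 minutes in a year

    print(f"{(1 - availability) * minutes_per_month:.2f}")  # 4.32 minutes per month
    print(f"{(1 - availability) * minutes_per_year:.2f}")   # 52.56 minutes per year

The roughly one-hour IPv4 outage alone used up more than a year's worth of this budget.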

We generally stay in close contact with data center personnel during critical situations like this. However, we initially had a hard time reaching them because their phone system was affected as well. We have since been provided with a separate backup number for future incidents. Additionally, we are still discussing options to improve the time to recovery in future situations like this, to stay true to the goal of 99.99% availability at the data center infrastructure level.

Posted Mar 26, 2024 - 10:00 CET

Resolved
We've been in touch again with our colleagues at the data center, and the situation is now operational and stable.

The major IPv4 outage lasted from 01:11 until 02:07 (Europe/Berlin).

Between 02:07 and 02:59 we saw a few (around three or four) interruptions of roughly 15 seconds each, caused by BGP session cutovers.

We are wrapping things up for tonight. We will get in touch with the data center personnel to get a better understanding of what happened and will update this incident with a postmortem.
Posted Mar 09, 2024 - 03:27 CET
Update
We have received feedback from our data center that they've reached a stable solution. We're still monitoring the situation for a bit and will be in touch with the data center for a detailed analysis next week.
Posted Mar 09, 2024 - 03:09 CET
Update
We've seen another IPv4 routing issue in the data center backbone. We are in touch with the data center personnel who are working on it and we'll keep you updated.
Posted Mar 09, 2024 - 02:35 CET
Update
We are continuing to monitor for any further issues.
Posted Mar 09, 2024 - 02:25 CET
Monitoring
After a change in routing, our monitoring systems at WHQ as well as other providers are able to reach RZOB again.
Posted Mar 09, 2024 - 02:11 CET
Update
We are seeing routing issues beyond our control at the moment and are contacting our data center provider. IPv4 connectivity to multiple providers is down, but our data center is still reachable from some networks.
Posted Mar 09, 2024 - 01:57 CET
Investigating
We are experiencing IPv4 connectivity issues between our main data center RZOB and WHQ, where our monitoring systems are located. IPv6 is not affected. We are investigating the scope and cause of the problem.
Posted Mar 09, 2024 - 01:31 CET
This incident affected: Central services.