Partial loss of IPv4 Internet Connectivity in RZOB

Incident Report for Flying Circus

Resolved

We have resolved the issue, which manifested sporadically in two ways:

1. DNS queries that required our border routers to retrieve data from the internet timed out
2. Traffic from VMs to the internet that required NAT timed out

The issue was caused by a new transfer network that was introduced during the maintenance earlier today. It was carrying traffic to our data centre as intended, but traffic destined for addresses on the transfer network itself was lost.

Our routers dynamically selected transfer network addresses as source addresses when sending traffic to the outside world (i.e. for DNS and NAT as described above), so the replies to that traffic were lost. To resolve the issue, we have changed our router configuration to use source addresses from our own networks, which are properly routed.
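As an illustration only, the source address problem can be sketched in a few lines of Python. The prefixes below are documentation ranges (RFC 5737) standing in for the real transfer and service networks, and the function names are hypothetical. This is not our actual router configuration, merely a model of the rule "never originate traffic from a transfer network address".

```python
import ipaddress

# Hypothetical prefixes for illustration (RFC 5737 documentation ranges),
# not the real RZOB networks.
TRANSFER_NETS = [ipaddress.ip_network("192.0.2.0/30")]
ROUTED_NETS = [ipaddress.ip_network("198.51.100.0/24")]


def source_is_reachable(addr: str) -> bool:
    """Return True if replies sent to `addr` would find their way back.

    Addresses on a transfer network are treated as unreachable from the
    outside; only addresses in a properly routed prefix are accepted.
    """
    ip = ipaddress.ip_address(addr)
    if any(ip in net for net in TRANSFER_NETS):
        return False
    return any(ip in net for net in ROUTED_NETS)


def pick_source(candidates):
    """Return the first candidate address that is safe to use as a source."""
    for addr in candidates:
        if source_is_reachable(addr):
            return addr
    return None
```

With these assumed prefixes, `pick_source(["192.0.2.1", "198.51.100.10"])` skips the transfer network address and returns the routed one, which mirrors the effect of pinning the source address in the router configuration.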

We've tested the new configuration on the redundant router setup using fail-overs and have seen the issues on affected machines subside.

We're sorry for the outage.

Posted Jul 09, 2025 - 00:07 CEST

Update

We have reached a semi-stable routing state again, in which connectivity has largely been restored.
DNS resolution within RZOB is not yet stable.

Posted Jul 08, 2025 - 23:16 CEST

Investigating

Unfortunately, our mitigation attempt caused additional connectivity issues, which now also affect IPv6. These issues affect only some machines, and the underlying cause has not yet been identified.

Posted Jul 08, 2025 - 23:10 CEST

Identified

Due to changes made during the announced RZOB network uplink maintenance, the RZOB data centre lost part of its IPv4 connectivity.

We are currently mitigating this by reverting the routing changes to the pre-maintenance state in order to restore IPv4 connectivity.

Note: The revert itself will briefly interrupt data centre connectivity again.

Posted Jul 08, 2025 - 23:05 CEST

This incident affected: RZOB (production) (Network and Internet uplink).