Uplink outage
Incident Report for Flying Circus
We have restarted both routers and performed a successful failover.

A few checks with very short timeouts (10s) raised spurious warnings, but otherwise no customer traffic was affected during the failover and the reboots.
Posted Apr 22, 2018 - 17:54 CEST
Earlier today, between 13:00 and 13:30 CEST, the uplink in our data center suffered an outage. We are currently investigating the root cause and preparing preventative measures.

None of our logs show suspicious activity, but our metrics indicate that the primary router was spending 100% of its CPU time in kernel mode and was dropping inbound packets on the uplink. For an unknown reason, our automatic failover did not trigger under this condition.

As we suspect that this may be related to internal kernel data structures or possibly driver issues, we will perform a manual router failover and reboot both routers one after another to clear any pending issues in the drivers or kernel data structures.
Posted Apr 22, 2018 - 17:42 CEST