High internal latency - service degradation
Incident Report for Flying Circus
Resolved
Apparently the cause was a misdirection of UDP packets intended for a virtual machine that had crashed and was stuck in a non-functional state. After the ARP entry expired, the traffic directed to this machine (~30 Mbit/s) was flooded to our whole network, and was in turn amplified on the KVM hosts through their network bridges into the VMs. This two-fold amplification caused a very high amount of CPU time to be spent in the kernel, specifically on hosts with many virtual machines.
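
To illustrate the two-fold amplification, here is a minimal back-of-the-envelope sketch in Python. Only the ~30 Mbit/s figure comes from the incident; the switch port and per-host VM counts are purely illustrative assumptions, not measurements from our network.

    # Rough sketch of the two amplification stages described above.
    # The constants below (other than the ~30 Mbit/s figure) are assumptions.

    FLOODED_RATE_MBIT = 30    # traffic still addressed to the crashed VM
    SWITCH_PORTS = 40         # assumed number of ports the switch floods to
    VMS_PER_KVM_HOST = 25     # assumed VM tap devices behind one host bridge

    # Stage 1: after the ARP/MAC entry expires, the switch no longer knows the
    # destination port and floods the stream out of every port, so every host
    # receives the full stream.
    total_on_the_wire = FLOODED_RATE_MBIT * SWITCH_PORTS

    # Stage 2: each KVM host's Linux bridge also has no forwarding entry for the
    # destination MAC and replicates the frames to every VM tap interface,
    # multiplying the kernel work done on that host.
    per_host_kernel_load = FLOODED_RATE_MBIT * VMS_PER_KVM_HOST

    print(f"flooded onto every switch port: {FLOODED_RATE_MBIT} Mbit/s each, "
          f"{total_on_the_wire} Mbit/s in total")
    print(f"replicated inside one KVM host: {per_host_kernel_load} Mbit/s "
          f"across {VMS_PER_KVM_HOST} tap copies")

The sketch only shows why hosts with many virtual machines were hit hardest: the per-host kernel load grows with the number of bridged VMs.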

Restarting the VM stopped the traffic flooding. We are still analyzing the specific cause of the flooding in order to put preventive measures in place.
Posted Nov 09, 2018 - 15:36 CET
Monitoring
The situation has been resolved after we rebooted a VM that was behaving suspiciously.

We noticed a potential packet loop that caused at least one KVM server to spend almost all of its CPU time dealing with the extreme packet flow, which also suppressed regular traffic.

We're still investigating the cause.
Posted Nov 09, 2018 - 14:56 CET
Investigating
We are seeing high latency in parts of the network, causing monitoring alerts and service degradation. We are currently investigating the issue.
Posted Nov 09, 2018 - 14:40 CET
This incident affected: RZOB (production) (Network and Internet uplink).