During the planned network maintenance in our data center earlier this week, the internet uplink to your services was lost for about one hour.
We are sorry for the unexpected and prolonged downtime.
Our investigation showed that a complex interaction of multiple factors led to the outage and also delayed its remediation:
- The data center performed the planned software upgrades on their switches, which required a reboot of those switches; each reboot took around 30 minutes. Switches for customers with redundant uplinks were rebooted sequentially: each one only after the previous switch had finished its update and reboot and all customer ports were online again.
- We have fully redundant routers with automatic failover on both our side and our data center provider’s side; we use keepalived and BGP for the automatic failover. When the first switch rebooted, the failover happened automatically and without any noticeable impact for our customers.
- Our routers use data-center-grade Intel X710 network cards on a 10G fibre connection for our uplink to the data center. We identified a firmware bug in those cards: after a loss of link (as happens during a switch reboot), the link may not come back up correctly. We noticed this after the failover.
- As we have a good relationship with the data center personnel, we contacted them through a personal messenger, asking them to let us know when the switch came back up so we could ensure the link was re-established. Unfortunately, as they were focussed on the maintenance, this message wasn’t read until after the incident was resolved.
- We tried to trigger the appropriate recovery commands on the first router with the affected links, but did not know the status of the switch on the other side of the link. We kept trying to recover the link for about 30 minutes before giving up. We later learned that the remote switch took around 33 minutes to reboot and that there was a gap of around 45 minutes before the second switch was rebooted.
- The rebooted switches correctly showed their links as operational, so nothing indicated an issue from the data center’s side. The BGP session was dropped, but unfortunately, as the maintenance was focussed on L2 connectivity, its health wasn’t checked before the second switch was rebooted.
- The reboot of the second switch cut off our data center internet access completely, and the second router was also caught by the Intel firmware bug.
- We got alerted that data center internet connectivity was lost and started investigating. Using traceroute, we saw that connectivity was lost a few hops in front of our routers and concluded this was a data-center-wide issue. The maintenance notification from the data center had indicated that customers might experience short connectivity losses, and we decided to let them work through their maintenance rather than bombard them with requests about known issues.
- It later turned out that the issue was not data-center-wide, but that a specific hop in front of our routers did not respond to traceroute requests while our downstream routers were offline. The check we performed thus gave us false information.
- Unfortunately, the data center did not expect any outage for customers with redundant uplinks like ours and did not see widespread issues.
- We monitored the situation and saw changing results in our traceroute output, still under the impression that these results were reliable. After 38 minutes we contacted the data center’s emergency support and immediately learned that they were not aware of any issues.
- Within the next 20 minutes we were in touch with the right personnel at the data center, analysed the situation, and talked the hands-on support through logging into one of our routers’ consoles and typing the proper recovery command (a simple ethtool -r ethtr on one of the routers sufficed).
- We saw immediate service recovery after this and were able to revive the second router through our regular out-of-band network.
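For readers curious about the failover setup described above: it pairs a VRRP instance in keepalived with BGP sessions on the routers, so the shared gateway address moves automatically between the redundant routers. The sketch below is purely illustrative; the instance name, interface, router ID, and addresses are placeholders, not our actual configuration:

```
# keepalived.conf (illustrative sketch; names, VRID and
# addresses are placeholders, not our actual configuration)
vrrp_instance UPLINK {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100          # the peer router runs a different priority
    advert_int 1
    virtual_ipaddress {
        192.0.2.1/24      # shared gateway address that fails over
    }
}
```

The router holding the virtual address announces our prefixes over BGP; if it goes away, the peer takes over the address and the sessions re-establish there.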
Learnings and next steps
We analysed the incident internally, collaborated with the data center personnel and other network operators from the DENOG community, and came to the following conclusions and next steps:
- The firmware situation and reliability of the widely used Intel X710 cards is confusing at best. We will investigate both updating the firmware and replacing this type of card in one of the routers with a card from a different manufacturer (Mellanox comes to mind), so that the redundant routers are less prone to identical bugs.
- The data center will communicate expectations about downtimes during maintenance more clearly (the length of expected outages and whether customers with redundant uplinks are expected to be affected).
- The data center will pay more attention to BGP sessions being down during maintenance and has added additional monitoring on our connection.
- We agreed with the data center to place calls earlier, rather one time too many than after a prolonged wait, as the information available to them and to us might be inconsistent during incidents.
- We have added (and tested) an automatic remedy script that will trigger in the event of the known bug and recover the port status quickly on its own.
- We are considering adding another out-of-band layer using an LTE router tied into our out-of-band network via VPN. The data center has offered to run a direct LTE antenna from the roof to our racks to ensure proper reception.
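To make the remedy script mentioned above more concrete, here is a rough sketch of the idea: poll the kernel’s carrier state and restart autonegotiation with ethtool -r once the link has stayed down for a few polls. The interface name, polling interval, and threshold are illustrative assumptions, not our production values:

```python
#!/usr/bin/env python3
"""Sketch of a link watchdog (illustrative, not our production script):
if the carrier on an interface stays down for several consecutive polls,
restart autonegotiation with `ethtool -r` to recover from the known bug."""
import subprocess
import time

IFACE = "eth0"      # assumed interface name
POLL_SECONDS = 5    # assumed polling interval
DOWN_LIMIT = 3      # consecutive "down" polls before resetting

def carrier_up(iface: str) -> bool:
    """True if the kernel reports a carrier on the interface; an
    unreadable file (e.g. interface administratively down, or not a
    Linux host) counts as no carrier."""
    try:
        with open(f"/sys/class/net/{iface}/carrier") as f:
            return f.read().strip() == "1"
    except OSError:
        return False

def should_reset(consecutive_down: int, limit: int = DOWN_LIMIT) -> bool:
    """Only reset after the link has stayed down for `limit` polls,
    so ordinary short flaps don't trigger spurious resets."""
    return consecutive_down >= limit

def watch(iface: str = IFACE) -> None:
    """Poll forever; issue `ethtool -r` when the link stays down."""
    down = 0
    while True:
        if carrier_up(iface):
            down = 0
        else:
            down += 1
            if should_reset(down):
                subprocess.run(["ethtool", "-r", iface], check=False)
                down = 0
        time.sleep(POLL_SECONDS)
```

In practice, watch() would run under a process supervisor (systemd, for example) so the watchdog itself is restarted if it ever dies.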
Let me close by apologising for the outage. We want you to be able to trust in our ability to keep your data safe and your applications running reliably, and even though outages happen, we want them to be as few and as far apart as possible.
We guarantee our customers 99.5% availability of their applications to their end users, and we strive to reach 99.9% every single month. As the internet uplink was down for about an hour, so were your applications. Even though this still meets the contractual SLA of 99.5%, we did not achieve our higher internal standard of 99.9%, and we will take this as a spur to implement the improvements we identified in a timely manner.
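To put those numbers into perspective, here is the back-of-the-envelope arithmetic for a 30-day month (the exact month length varies slightly):

```python
# Rough availability arithmetic for a 30-day month (illustrative).
month_minutes = 30 * 24 * 60         # 43200 minutes in the month
sla_budget = month_minutes * 0.005   # 99.5% SLA  -> 216 minutes of downtime allowed
goal_budget = month_minutes * 0.001  # 99.9% goal -> about 43 minutes allowed
outage_minutes = 60                  # roughly one hour of downtime

print(outage_minutes <= sla_budget)   # True:  the contractual SLA is still met
print(outage_minutes <= goal_budget)  # False: the internal 99.9% goal is missed
```

An hour of downtime uses up less than a third of the 99.5% budget but exceeds the 99.9% budget by about 17 minutes, which is exactly the gap between meeting the contract and meeting our own standard.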