Inconsistent file systems on some VMs
Incident Report for Flying Circus
Resolved
Here is our preliminary analysis of what happened last weekend.

Summary:

A complex failure condition during the recovery of crashed Ceph server processes led to VM filesystem corruption.

Timeline:

Week starting from 2014-06-02 - Three Ceph OSD (object storage daemon) processes crashed in the course of the week due to a glibc bug. The bug is known and there is a fix available in newer glibc versions. Our monitoring did not report the crashed processes due to a "false negative" bug in the monitoring code.

2014-06-07 14:20 - An admin noticed that OSD processes had crashed and tried to restart the inactive instances. The simultaneous restart of several OSD instances led to an overflow of two kernel tables (nf_conntrack and the file descriptor table). This caused both network connections and disk operations to fail on the affected servers.

2014-06-07 14:30 - Another attempt was made to revive the crashed OSDs, this time one after another. The OSD instances started successfully.

2014-06-07 14:50 - We started to notice that the filesystems of some VMs were in a bad state. Our standby supporters were called and we started recovery action.
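
To illustrate the kind of "false negative" mentioned in the first timeline entry, here is a minimal sketch of a check that compares the OSD IDs expected on a host against those found running, and treats anything it cannot verify as a problem rather than as OK. This is not our actual monitoring code; the function name `check_osds`, its inputs, and the Nagios-style status strings are assumptions made for the example.

```python
def check_osds(expected_ids, running_ids):
    """Return a (status, message) pair in the style of a Nagios plugin.

    expected_ids: OSD IDs that should run on this host (hypothetical input).
    running_ids: OSD IDs found running, or None if the lookup itself failed.
    """
    if running_ids is None:
        # Failing to determine the state must never be reported as OK.
        return "UNKNOWN", "could not determine running OSDs"
    missing = sorted(set(expected_ids) - set(running_ids))
    if missing:
        return "CRITICAL", "OSDs not running: " + ", ".join(map(str, missing))
    return "OK", "all %d OSDs running" % len(set(expected_ids))
```

The point of the sketch is the explicit `UNKNOWN` branch: a check that silently falls through to OK when its own data source fails is one common way for crashed processes to go unreported.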

Scope:

An estimated 15% of all VMs were hit by filesystem corruption to varying degrees. The extent of the damage depends on the distribution of filesystem blocks between different storage servers and on the write rate. Some databases reported their storage spaces to be inconsistent.

Recovery:

We have identified the affected VMs. Some of them suffered only minor damage which was easy to resolve (e.g., by re-installing some files). Most of the affected VMs had to be restored from backups. We found that backups made on Friday were generally usable. After a restore, changes made after the last successful backup (Friday morning in most cases, depending on the individual VM backup schedule) were lost. We managed to restore most VMs by Saturday night, but a few VMs were not restored until Tuesday morning.

Technical details of the error:

OSD recovery always results in a load spike on the affected servers. This time, the nf_conntrack table (Linux firewall connection tracking) overflowed. The affected storage servers randomly refused new connections for both inter-storage traffic and client access. This led to a large number of half-completed storage transactions, and storage servers appeared and disappeared on the storage network at a high frequency. We consider the presence of conntrack rules a bug in our configuration code, since connection tracking is not used on storage servers.

We also ran into a file descriptor table overflow on two storage servers. As a result, updates were not written to disk during recovery, which produced bad data. While Ceph's defaults for the number of open files are generally good, our setup with 8 OSD processes per server was not properly reflected in the configuration. The per-process limit of open files should have been set to a lower value.

In addition, we are currently running with n+1 redundancy, i.e. there is one additional copy of each piece of data besides the master copy. With an even number of copies, no majority vote is possible when the copies differ.
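
The majority argument can be shown with a small counting sketch. It only illustrates why two differing copies cannot be resolved by voting while three copies can; it does not represent how Ceph actually reconciles replicas, and the function name `majority_value` is hypothetical.

```python
from collections import Counter


def majority_value(copies):
    """Return the value held by a strict majority of copies, or None."""
    if not copies:
        return None
    value, votes = Counter(copies).most_common(1)[0]
    # A strict majority needs more than half of all copies.
    return value if votes * 2 > len(copies) else None


# Two copies that disagree: no majority, so neither can be trusted.
print(majority_value(["good", "bad"]))          # None
# Three copies with one bad replica: the two matching copies win.
print(majority_value(["good", "bad", "good"]))  # good
```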

To summarize, we experienced a complex condition with several failures taking place at once. This failure mode was previously unknown, so our admins were not prepared for it.

Short-term fixes:

We don't want to experience a situation like this again. To prevent it, we are currently implementing these changes:

* Go to n+2 redundancy. This means that we will keep 3 copies of each piece of data around.
* Disable network connection tracking on storage servers to prevent future nf_conntrack table overflows.
* Reduce the number of open files for each OSD process to cater for the fact that we are running 8 of them per server.
* Improve monitoring so that we will spot dead OSD processes right away.
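
The reasoning behind the open-files change is simple arithmetic: the sum of all per-process limits must fit into the system-wide limit (on Linux, typically the `fs.file-max` sysctl). A minimal sketch, assuming an even split among OSD processes and a reserve for everything else on the host; the function name and all numbers are hypothetical and not our production values:

```python
def per_process_fd_limit(system_wide_limit, processes, reserve):
    """Split the system-wide open-file budget evenly among processes.

    reserve: descriptors kept back for everything else on the host.
    """
    if processes < 1:
        raise ValueError("need at least one process")
    usable = system_wide_limit - reserve
    if usable <= 0:
        raise ValueError("reserve exceeds the system-wide limit")
    return usable // processes


# Hypothetical example: 8 OSD processes sharing one host.
print(per_process_fd_limit(system_wide_limit=1_000_000, processes=8, reserve=100_000))  # 112500
```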

Long-term fixes:

We are planning to upgrade glibc as part of the next platform OS update so that OSD processes won't crash in the first place.


Again, we apologize for the outages and the trouble they caused.
Posted Jun 12, 2014 - 19:29 CEST
Update
A first error analysis revealed that inconsistencies started to build up in the storage cluster from Fri 2014-06-06 21:30 CEST onwards. As they did not lead to immediately visible errors, no alarm was raised. An administrator spotted signs of trouble during a routine system status review around 14:15 CEST on the next day. An attempt to restore the storage system's integrity failed. From that point on, however, the inconsistencies that had built up unnoticed became clearly visible. Beginning at 15:00 CEST, our standby supporters were notified and started to handle the incident.

Currently, we still do not know what caused the inconsistencies to build up in the first place. We are still actively investigating the issue and will provide updates here as soon as we gain additional insights.
Posted Jun 11, 2014 - 14:03 CEST
Monitoring
Known-broken databases have now been repaired or restored from backups. Affected customers have been informed.

As an additional safety measure, we will now run filesystem checks on potentially affected database servers (ZEO/ZODB VMs). This means that the VMs need to be rebooted and will be unavailable for a short time. Please excuse that we are not using the regular scheduled maintenance mechanism this time, but are rebooting immediately. We would like to identify potential inconsistencies as soon as possible, before too many updates go into the databases. A short downtime is better than inconsistent data lingering undetected. We apologize for the service interruptions.
Posted Jun 08, 2014 - 13:31 CEST
Identified
Today we observed filesystem inconsistencies on some VMs. This causes some databases to refuse updates due to checksum or data structure errors. Affected customers will be informed directly, since only a few VMs seem to be hit.

The problem was likely caused by a double fault in the storage subsystem which led to a split-brain situation. We are currently in the process of identifying the root cause. We apologize for the inconvenience.
Posted Jun 08, 2014 - 00:52 CEST