Anatomy of a CephFS Disaster

This post describes in detail how we ended up with a damaged CephFS and our attempts to fix it.


On 3rd September we scheduled some downtime to reconfigure the network of our ceph cluster. Prior to that we used two networks: a frontend network used by the ceph clients (and the rest of the machines in our server room) and a private network used exclusively by the ceph cluster for internal traffic. We recently upgraded our switches to 10GE and figured that should be plenty of bandwidth for all our ceph traffic, so we decided to retire the cluster network. The cluster network is used by the OSDs; to reconfigure them, they had to be restarted and the cluster quiesced. We followed the instructions in this mailing list post:

  1. we stopped all our RBD-using clients
  2. we shut down all our VMs and marked the CephFS as down: ceph fs set one down true
  3. we set some cluster flags, namely noout, nodown, pause, nobackfill, norebalance and norecover
  4. we waited for the ceph cluster to quieten down, reconfigured ceph and restarted the OSDs one failure domain at a time
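In ceph CLI terms, the quiesce-and-restore sequence of steps 2 to 4 looks roughly like this. This is a sketch, assuming our filesystem name one and the standard osd flag commands; the exact flag set you need may differ:

```shell
# Mark the filesystem down so the MDSs flush their journals and stop
ceph fs set one down true

# Set the flags that stop OSDs being marked out/down and suppress data movement
for flag in noout nodown pause nobackfill norebalance norecover; do
    ceph osd set "$flag"
done

# ...restart the OSDs one failure domain at a time...

# Afterwards, reverse the procedure
for flag in noout nodown pause nobackfill norebalance norecover; do
    ceph osd unset "$flag"
done
ceph fs set one down false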

Once all OSDs had been restarted we switched off the cluster network switches and made sure ceph was still happy. We then reversed the procedure.

Out of the Frying Pan

Ceph didn't report any problems, and our fileservers that use RBDs came back happily. I then re-enabled the CephFS. I was running ceph -w to watch for any problems and noticed that our MDS fell over with the following error:
replayed ESubtreeMap at 8537805160800 subtree root 0x1 not in cache failure replaying journal (EMetaBlob)
I changed the number of active MDSs to 1 and restarted the MDS. The restarted MDS did not rejoin the cluster, and eventually the CephFS ran out of MDSs and went down.

My memory of events and what we did gets a bit hazy here. Thinking we had a disaster on our hands, we initiated the disaster recovery procedure and reset the journal. The MDS started again and we had our CephFS back. Unfortunately, it crashed as soon as we started writing data to it.
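For the record, the journal reset we ran is part of the upstream disaster recovery toolbox. A sketch, assuming rank 0 of our filesystem one; the export step first is important, since the reset discards any metadata updates that were not yet flushed to the metadata pool:

```shell
# Take a backup of the journal before doing anything destructive
cephfs-journal-tool --rank=one:0 journal export backup.bin

# Throw away the journal, losing any unflushed metadata updates
cephfs-journal-tool --rank=one:0 journal reset
```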

Into the Fire

We noticed a whole bunch of
bad backtrace on directory
errors. Some more reading suggested that we should scrub the filesystem. We did, but we were too impatient and did not let the scrub finish before we started using the filesystem again. The CephFS crashed again on write.

At this stage we decided to follow the rest of the disaster recovery procedure. Our CephFS contains ~40TB of data. It took 4 workers over 4 days to scan the file extents. During this time we observed ceph read activity of between ~500 op/s and ~2 kop/s. It would have been helpful if the documentation gave some hints as to how long "a very long time" is and how many workers are reasonable for a given filesystem size. For the second phase, scanning the inodes, we used 16 workers, which completed the task in a few hours; during this phase we sustained a ceph read activity of ~60 kop/s. It would also be useful to know whether this process can be distributed over a number of machines. The remaining phases completed relatively quickly.
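For reference, the parallelism comes from running several cephfs-data-scan processes against the same pool, each claiming a disjoint share of the work via the --worker_n/--worker_m options. A sketch of the two scan phases as we ran them; the data pool name is left as a placeholder:

```shell
# Phase 1: scan the file extents in the data pool (we used 4 workers, ~4 days)
for n in 0 1 2 3; do
    cephfs-data-scan scan_extents --worker_n "$n" --worker_m 4 <data pool> &
done
wait

# Phase 2: scan the inodes (16 workers finished in a few hours)
for n in $(seq 0 15); do
    cephfs-data-scan scan_inodes --worker_n "$n" --worker_m 16 <data pool> &
done
wait

# Scan the link tables, single process
cephfs-data-scan scan_links
```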


Rereading the disaster recovery documentation suggested that the cleanup phase is optional, so we reasoned that we could reactivate the CephFS. At this stage it was still in a failed state. Running
ceph mds repaired 0
marked the filesystem as repaired and ready to be used. We then started a filesystem scrub and repair
ceph tell mds.a scrub start / recursive repair
which found some issues that ended up in lost+found.

We then tested the filesystem by writing data. It passed the test and we fired up our VMs again. We are back in business.


Reconstructing a 40TB distributed filesystem takes a long time. The filesystem scrub and repair should have fixed the issues we saw. We still need to chase up some issues where we get regular error messages that clients are failing to respond to cache pressure.

Update: 10/09/2020

In a, perhaps foolish, attempt to be tidy I decided to delete the entries in the lost+found directory. This crashed the MDS. One by one the MDSs crashed, restarted and gave up after a few attempts, until eventually all the standby MDSs were used up. This was similar to what we saw in the first place. I then applied the lessons I had learned over the last few days and

  1. put on the emergency brakes (the MDSs were all down anyway): ceph fs fail one
  2. restarted all the MDSs, e.g. systemctl reset-failed ceph-mds@store09.service followed by systemctl start ceph-mds@store09.service
  3. marked the CephFS as up again: ceph fs set one joinable true
  4. and finally restarted the scrub: ceph tell mds.store08 scrub start / recursive repair

NB: for historic reasons our cephfs is called one.
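Put together as one sequence, the steps above look like this; our filesystem name one and our MDS host names are baked in, so substitute your own:

```shell
ceph fs fail one                                  # emergency brake: take the filesystem down

systemctl reset-failed ceph-mds@store09.service   # clear systemd's failure state...
systemctl start ceph-mds@store09.service          # ...and restart the MDS (repeat per MDS host)

ceph fs set one joinable true                     # allow the MDSs to join the filesystem again

ceph tell mds.store08 scrub start / recursive repair   # restart the scrub and repair
```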

The good news is that all the CephFS clients kept working through the brief downtime. But we still need to figure out how to tidy up the stuff in lost+found. One issue might be that our clients are quite a bit older than the rest of the cluster.