Anatomy of a CephFS Disaster

This post describes in detail how we ended up with a damaged CephFS and our attempts to fix it.


On 3rd September we scheduled some downtime to reconfigure the network of our ceph cluster. Prior to that we used two networks: a frontend network used by the ceph clients (and the rest of the machines in our server room) and a private network used exclusively by the ceph cluster for internal traffic. We recently upgraded our switches to 10GE and figured that should be plenty of bandwidth for all our ceph traffic, so we decided to retire the cluster network. The cluster network is used by the OSDs; to reconfigure them, they had to be restarted and the cluster quiesced. We followed the instructions in this mailing list post:

  1. we stopped all our RBD-using clients
  2. we shut down all our VMs and marked the CephFS as down: ceph fs set one down true
  3. we set some cluster flags, namely noout, nodown, pause, nobackfill, norebalance and norecover
  4. we waited for the ceph cluster to quieten down, reconfigured ceph and restarted the OSDs one failure domain at a time
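In ceph CLI terms, the quiesce-and-restore sequence of steps 2 to 4 looks roughly like this. This is a sketch, assuming our filesystem name one and the standard osd flag commands; the exact flag set you need may differ:

```shell
# Mark the filesystem down so the MDSs flush their journals and stop
ceph fs set one down true

# Set the flags that stop OSDs being marked out/down and suppress data movement
for flag in noout nodown pause nobackfill norebalance norecover; do
    ceph osd set "$flag"
done

# ...restart the OSDs one failure domain at a time...

# Afterwards, reverse the procedure
for flag in noout nodown pause nobackfill norebalance norecover; do
    ceph osd unset "$flag"
done
ceph fs set one down false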

Once all OSDs had been restarted we switched off the cluster network switches and made sure ceph was still happy. We then reversed the procedure.

Out of the Frying Pan

Ceph didn't report any problems, and our fileservers that use RBDs came back happily. I then re-enabled the CephFS. I was running ceph -w to watch for any problems and noticed that our MDS fell over with the following error:
replayed ESubtreeMap at 8537805160800 subtree root 0x1 not in cache failure replaying journal (EMetaBlob)
I changed the number of active MDSs to 1 and restarted the MDS. The restarted MDS did not rejoin the cluster, and eventually the CephFS ran out of MDSs and went down.

My memory of events and what we did gets a bit hazy here. Thinking we had a disaster on our hands, we initiated the disaster recovery procedure and reset the journal. The MDS started again and we had our CephFS back. Unfortunately, it crashed as soon as we started writing data to it.
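For the record, the journal reset we ran is part of the upstream disaster recovery toolbox. A sketch, assuming rank 0 of our filesystem one; the export step first is important, since the reset discards any metadata updates that were not yet flushed to the metadata pool:

```shell
# Take a backup of the journal before doing anything destructive
cephfs-journal-tool --rank=one:0 journal export backup.bin

# Throw away the journal, losing any unflushed metadata updates
cephfs-journal-tool --rank=one:0 journal reset
```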

Into the Fire

We noticed a whole bunch of
bad backtrace on directory
errors. Some more reading suggested that we should scrub the filesystem. We did, but we were too impatient and did not let the scrub finish before we started using the filesystem again. The CephFS crashed again on write.

At this stage we decided to follow the rest of the disaster recovery procedure. Our CephFS contains ~40TB of data. It took 4 workers over 4 days to scan the file extents. During this time we observed ceph read activity of between ~500 op/s and ~2 kop/s. It would have been helpful if the documentation gave some hints as to how long "a very long time" is and how many workers are reasonable for a given filesystem size. For the second phase, scanning the inodes, we used 16 workers, which completed the task in a few hours; during this phase we sustained a ceph read activity of ~60 kop/s. It would also be useful to know whether this process can be distributed over a number of machines. The remaining phases completed relatively quickly.
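For reference, the parallelism comes from running several cephfs-data-scan processes against the same pool, each claiming a disjoint share of the work via the --worker_n/--worker_m options. A sketch of the two scan phases as we ran them; the data pool name is left as a placeholder:

```shell
# Phase 1: scan the file extents in the data pool (we used 4 workers, ~4 days)
for n in 0 1 2 3; do
    cephfs-data-scan scan_extents --worker_n "$n" --worker_m 4 <data pool> &
done
wait

# Phase 2: scan the inodes (16 workers finished in a few hours)
for n in $(seq 0 15); do
    cephfs-data-scan scan_inodes --worker_n "$n" --worker_m 16 <data pool> &
done
wait

# Scan the link tables, single process
cephfs-data-scan scan_links
```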


Rereading the disaster recovery documentation suggested that the cleanup phase is optional, so we reasoned that we could reactivate the CephFS. At this stage it was still in a failed state. Running
ceph mds repaired 0
marked the filesystem as repaired and ready to be used. We then started a filesystem scrub and repair
ceph tell mds.a scrub start / recursive repair
which found some issues that ended up in lost+found.

We then tested the filesystem by writing data. It passed the test and we fired up our VMs again. We are back in business.


Reconstructing a 40TB distributed filesystem takes a long time. The filesystem scrub and repair should have fixed the issues we saw. We still need to chase up some issues where we get regular error messages that clients are failing to respond to cache pressure.

Update: 10/09/2020

In a, perhaps foolish, attempt to be tidy I decided to delete the entries in the lost+found directory. This crashed the MDS. One by one the MDSs crashed, restarted and gave up after a few attempts, until eventually all the standby MDSs were used up. This was similar to what we saw in the first place. I then applied the lessons I had learned over the last few days and

  1. put on the emergency brakes (the MDSs were all down anyway): ceph fs fail one
  2. restarted all the MDSs, e.g. systemctl reset-failed ceph-mds@store09.service followed by systemctl start ceph-mds@store09.service
  3. marked the CephFS as up again: ceph fs set one joinable true
  4. and finally restarted the scrub: ceph tell mds.store08 scrub start / recursive repair

NB: for historic reasons our cephfs is called one.
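Put together as one sequence, the steps above look like this; our filesystem name one and our MDS host names are baked in, so substitute your own:

```shell
ceph fs fail one                                  # emergency brake: take the filesystem down

systemctl reset-failed ceph-mds@store09.service   # clear systemd's failure state...
systemctl start ceph-mds@store09.service          # ...and restart the MDS (repeat per MDS host)

ceph fs set one joinable true                     # allow the MDSs to join the filesystem again

ceph tell mds.store08 scrub start / recursive repair   # restart the scrub and repair
```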

The good news is that all the CephFS clients kept working through the brief downtime. But we still need to figure out how to tidy up the stuff in lost+found. One issue might be that our clients are quite a bit older than the rest of the cluster.