CephFS is now an integral part of the School of GeoScience's IT infrastructure. It provides both personal and group storage for Windows, Mac and Linux users. Since our first disaster, which was caused by running out of resources, it has been running very reliably and has provided all the benefits we had hoped for. That is, until disaster struck again yesterday. The first signs of trouble were users reporting that deleting files resulted in a "No space left on device" error.
It turned out that a user was deleting a particularly deep directory structure with around 2 million entries. We snapshot our file system to provide easy access to accidentally deleted files. When files are deleted from a snapshotted directory, they end up in a stray files location. It is a single location with a default limit of one million entries. We exceeded this limit, which resulted in the "No space left on device" errors.
We tried a lot of things to get the file system to behave again. In the end, increasing the maximum number of stray files solved this particular issue. The current version of Ceph, Pacific, has fixed the underlying problem: it allows the stray directory to be split, so the limit no longer exists.
The good news is that we didn't lose any data. The bad news is that a central resource was offline for about 24 hours. Upgrading our Ceph system to Pacific has just become more urgent.
The Geeky Stuff
Finding the number of stray files:
ceph daemon mds.`hostname -s` perf dump | grep stray
Finding the maximum number of stray files (the reported value gets multiplied by 10 to give the overall limit):
ceph daemon mds.`hostname -s` config get mds_bal_fragment_size_max
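The two commands above can be combined to see how close the stray count is to the limit. What follows is only a minimal sketch: the helper name stray_usage is made up for illustration, it assumes the perf dump output contains a "num_strays" counter, and it uses the factor of 10 mentioned above.

```shell
# stray_usage PER_DIR_LIMIT
# Hypothetical helper (not part of Ceph). Reads `perf dump` JSON on stdin
# and prints: <num_strays> <overall limit> <percent of limit used>.
# PER_DIR_LIMIT is the value reported for mds_bal_fragment_size_max;
# the overall limit is assumed to be 10 times that value.
stray_usage() {
    per_dir_limit="$1"
    # Extract the integer after "num_strays": (the closing quote keeps
    # counters such as num_strays_delayed from matching).
    strays=$(sed -n 's/.*"num_strays": *\([0-9][0-9]*\).*/\1/p' | head -n 1)
    total_limit=$((per_dir_limit * 10))
    echo "$strays $total_limit $((strays * 100 / total_limit))"
}

# Example use on an MDS host (100000 is a placeholder for the value
# returned by the config get command):
#   ceph daemon mds.$(hostname -s) perf dump | stray_usage 100000
```

If the percentage gets close to 100, the limit can be raised at runtime with ceph daemon mds.<name> config set mds_bal_fragment_size_max <value>. The value is a placeholder and not the one we used, and a change made this way does not survive an MDS restart.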