Architecture and configurations for Fred Hutch "Fast File" service, our HPC POSIX file system
We recently discovered that using the rsync tool to copy files into our BeeGFS cluster was taking longer than it should. In selecting and validating BeeGFS, we tested metadata performance and were sure it was noticeably faster than our previous storage platform. However, the rsync times we were seeing were worse than on the previous platform.
We tried examining the rsync output to see where the time was going. Comparing two rsync runs (one targeting each storage platform, with a common source) was not useful: in both cases, the time rsync reported for building the file list was very low. During these tests, our Grafana/Prometheus graphs showed a sustained increase in metadata traffic to our BeeGFS cluster for several hours, then a drop back to normal levels several hours before the rsync finished. The length of that metadata-heavy period matched the total run time of the rsync to our other platform.
Based on data that seemed to show the rsync targeting BeeGFS spending much more time on metadata access and about the same amount of time actually transferring file data, we assumed our BeeGFS metadata performance was the issue.
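For reference, the comparison can be reproduced with rsync's built-in statistics; the paths below are placeholders, not our actual directories:

# time an rsync into each platform from the same source, keeping transfer statistics
time rsync -a --stats /shared/source/ /mnt/beegfs/fast/target/
time rsync -a --stats /shared/source/ /fh/fast/target/

The --stats summary includes file list generation and transfer times, which is where we expected the difference to show up.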
Prior to this problem, we ran a full range of tests on our BeeGFS cluster. Using fio, we tested different access patterns across a range of file sizes, and concluded our metadata performance was good because small-file, small-read/write performance was very good.
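As a rough illustration (not our actual job definitions; the directory, sizes, and job counts are placeholders), a small-file, small-write fio run looks something like this:

# many small files per job, 4k random writes, several concurrent jobs
fio --name=smallfile --directory=/mnt/beegfs/fast/fiotest \
    --rw=randwrite --bs=4k --size=8m --nrfiles=1000 \
    --numjobs=8 --group_reporting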
We used mdtest (now part of IOR) as well, but were unable to find our results. However, we all remember not being alarmed by the test.
We have an internally-developed tool that can create and write to a huge number of files across a directory hierarchy. This was used to test storage platforms, and produced satisfactory results for BeeGFS.
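The tool itself is internal, but the idea can be sketched in a few lines of shell (tree depth and file counts here are arbitrary):

# create a small directory hierarchy and write a file into each leaf
for a in $(seq 1 10); do
  for b in $(seq 1 10); do
    mkdir -p tree/$a/$b
    for f in $(seq 1 100); do
      echo data > tree/$a/$b/file$f
    done
  done
done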
The most basic metadata access test is a large, recursive, "long" ls operation. This reads blocks (for directory entries) and issues a large number of stat calls to read metadata.
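For example (with a placeholder path):

# recursive long listing; the stat of every entry is the metadata load
time ls -lR /mnt/beegfs/fast/somedir > /dev/null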
One of our team members developed pwalk, a simple but extremely useful tool that walks a file system in parallel. It is very fast.
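A typical run is something like the following (the path and output file are placeholders); pwalk prints one line of metadata per file, so timing the walk is itself a metadata-heavy benchmark:

# walk the tree in parallel, emitting one line of metadata per file
time pwalk /mnt/beegfs/fast/somedir > pwalk-beegfs.csv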
Using ls, we determined that BeeGFS was slower than our previous NFS-based platform. An ls involving millions of files took about twice as long on the BeeGFS cluster.
Using pwalk to walk the entire file structure, we found it took about twice as long on BeeGFS.
It turns out the directory I chose to test is actually hosted on another device, so this was not the NFS platform test I had thought it was.
The mdtest results were interesting. The output for each platform is below (rates are in operations per second).
The way I interpret these results: BeeGFS is about twice as fast as our NFS platform for file creation, file removal, and directory tree creation. The NFS platform is much faster at directory tree removal, which is curious.
For file stat operations, our NFS platform is twice as fast as our BeeGFS cluster, though with a much higher standard deviation. I suspect caching is at play here: entries that are cached are accessed much faster, which makes sense. It would be interesting to re-try the NFS platform test with client caching disabled.
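One way to do that (untried here; the server, export, and mount point are placeholders) would be to mount the NFS export with attribute caching disabled and drop client caches before each iteration:

# mount the export with NFS attribute caching turned off
sudo mount -t nfs -o noac nfsserver:/export /mnt/nfs-noac
# drop page cache, dentries, and inodes on the client before the run
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches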
BeeGFS:
-- started at 11/01/2019 13:22:48 --
mdtest-1.9.3 was launched with 1 total task(s) on 1 node(s)
Command line used: src/mdtest "-C" "-T" "-r" "-F" "-d" "/mnt/beegfs/fast/_ADM/SciComp/user/bmcgough/mdtest" "-i" "3" "-I" "4096" "-z" "3" "-b" "7" "-L" "-u"
Path: /mnt/beegfs/fast/_ADM/SciComp/user/bmcgough
FS: 4731.7 TiB Used FS: 57.0% Inodes: 0.0 Mi Used Inodes: -nan%
1 tasks, 1404928 files
SUMMARY rate: (of 3 iterations)
Operation Max Min Mean Std Dev
--------- --- --- ---- -------
File creation : 439.763 404.632 422.661 14.357
File stat : 3232.092 2645.504 2947.822 239.813
File read : 0.000 0.000 0.000 0.000
File removal : 449.800 414.304 435.508 15.293
Tree creation : 465.711 450.048 456.378 6.737
Tree removal : 96.927 84.924 91.069 4.904
-- finished at 11/01/2019 19:14:57 --
Avere:
-- started at 11/03/2019 07:44:19 --
mdtest-1.9.3 was launched with 1 total task(s) on 1 node(s)
Command line used: src/mdtest "-C" "-T" "-r" "-F" "-d" "/fh/fast/_ADM/SciComp/user/bmcgough/mdtest" "-i" "3" "-I" "4096" "-z" "3" "-b" "7" "-L" "-u"
Path: /fh/fast/_ADM/SciComp/user/bmcgough
FS: -8589934592.0 GiB Used FS: 200.0% Inodes: 8796093022208.0 Mi Used Inodes: 0.0%
1 tasks, 1404928 files
SUMMARY rate: (of 3 iterations)
Operation Max Min Mean Std Dev
--------- --- --- ---- -------
File creation : 226.177 204.979 216.422 8.736
File stat : 7903.923 2275.234 6010.631 2641.407
File read : 0.000 0.000 0.000 0.000
File removal : 251.321 183.390 228.056 31.593
Tree creation : 211.374 180.625 200.849 14.305
Tree removal : 280.353 219.106 242.462 27.034
-- finished at 11/03/2019 18:40:35 --
NFS:
-- started at 11/04/2019 15:16:01 --
mdtest-1.9.3 was launched with 1 total task(s) on 1 node(s)
Command line used: src/mdtest "-C" "-T" "-r" "-F" "-d" "/fh/fast/_SR/.mdtest" "-i" "3" "-I" "4096" "-z" "3" "-b" "7" "-L" "-u"
Path: /fh/fast/_SR
FS: 1768.3 TiB Used FS: 79.5% Inodes: 2750615.9 Mi Used Inodes: 73.1%
1 tasks, 1404928 files
SUMMARY rate: (of 3 iterations)
Operation Max Min Mean Std Dev
--------- --- --- ---- -------
File creation : 605.074 599.550 603.190 2.575
File stat : 4759.786 1445.451 3616.146 1535.651
File read : 0.000 0.000 0.000 0.000
File removal : 226.638 185.813 209.339 17.239
Tree creation : 742.143 482.543 623.148 107.077
Tree removal : 279.276 174.610 244.247 49.241
-- finished at 11/04/2019 23:16:43 --