Architecture and configurations for the Fred Hutch "Fast File" service, our HPC POSIX file system
ZFS install on Ubuntu 18.04 from git master:
The spl repo is missing autogen.sh, presumably because SPL was integrated into ZFS as of 0.8.0, so I did not build spl separately.
Missing package requirements:
- uuid-dev, libblkid-dev
- python3-setuptools (missing it produces warnings, presumably a pyzfs incompatibility?)
- python3-dev (missing it does not error until well into the deb build process)
- python3-cffi
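On a stock Ubuntu 18.04 host, something like the following should cover all of the above before configuring the build (package names taken from the list above):
sudo apt install uuid-dev libblkid-dev python3-setuptools python3-dev python3-cffi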
The build follows the build procedure on the ZFS on Linux wiki.
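For reference, the wiki procedure at the time boiled down to roughly the following (a sketch; check the ZFS on Linux wiki for the exact targets):
sh autogen.sh
./configure
make -s -j$(nproc)
make deb    # produces the kmod/dkms/utils packages installed below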
Error during deb package installation: dpkg: error processing archive zfs-dkms_0.8.0-0_amd64.deb (--install): trying to overwrite '/usr/src/zfs-0.8.0/include/linux/blkdev_compat.h', which is also in package kmod-zfs-devel 0.8.0-0; dpkg-deb: error: paste subprocess was killed by signal (Broken pipe)
No zfsutils-linux package is created, but the utilities are in 'bin' in the repo after the build.
A script is installed with the zfs-test package: /usr/share/zfs/zfs.sh
This script can load and unload the kernel modules; use it to load your newly built modules, as it re-triggers udev and does other housekeeping.
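Typical usage (as of this build, the -u flag unloads the module stack; run as root):
/usr/share/zfs/zfs.sh       # load the freshly built modules and re-trigger udev
/usr/share/zfs/zfs.sh -u    # unload the modules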
Ensure the zfs package installed zvol, vdev, and zfs rules files into /lib/udev/rules.d.
Our /etc/zfs/vdev_id.conf file:
multipath no
topology sas_direct
phys_per_port 24
# PCI_ID HBA PORT CHANNEL NAME
channel 86:00.0 0 e0s
This results in /dev/disk/by-vdev/e0s0 through e0s23. Our test platform has 24 drives (12 front and 12 rear), all on the same port/enclosure.
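After editing vdev_id.conf, the aliases can be regenerated and checked without a reboot using the standard udev tools:
udevadm control --reload-rules
udevadm trigger
ls -l /dev/disk/by-vdev/    # should list e0s0 through e0s23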
Unloaded resilver took 18 hours:
Feb 3 2019 20:08:04.880790864 sysevent.fs.zfs.resilver_start
Feb 4 2019 14:46:20.969945674 sysevent.fs.zfs.resilver_finish
Loaded resilver (fio config below) took 55 hours:
Feb 4 2019 19:53:38.458910596 sysevent.fs.zfs.resilver_start
Feb 7 2019 03:24:03.113510132 sysevent.fs.zfs.resilver_finish
It looks like unloaded rebuild time is roughly linear in the space used; a 97%-full unloaded rebuild took 37 hours:
Feb 9 2019 16:27:26.067637528 sysevent.fs.zfs.resilver_start
Feb 11 2019 05:46:41.114742985 sysevent.fs.zfs.resilver_finish
Unloaded resilver took 14 hours:
Feb 22 2019 18:41:49.225531883 sysevent.fs.zfs.resilver_start
Feb 23 2019 08:45:48.674542218 sysevent.fs.zfs.resilver_finish
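Timestamps in this format can be pulled from the ZFS event log at any time, e.g.:
zpool events | grep resilver    # lists the resilver_start / resilver_finish sysevents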
Fio job file:
[global]
name=ffr-io-load
directory=/loc/ffr_io_load
runtime=7d
time_based=1
blocksize=128k
ioengine=libaio
fallocate=native
[downloads]
rw=write
size=1G
numjobs=20
[reads]
rw=read
size=1G
numjobs=20
[random_mix1]
rw=randrw
rwmixread=70
rwmixwrite=30
numjobs=20
size=128k
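A job file like this is run directly, e.g. (the filename is arbitrary):
fio ffr-io-load.fio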
RAIDZ3 resilver took 83 hours:
Mar 6 2019 20:03:41.135789535 sysevent.fs.zfs.resilver_start
Mar 10 2019 07:20:13.331109889 sysevent.fs.zfs.resilver_finish
RAIDZ2 resilver took 91 hours:
Mar 6 2019 20:04:41.812605041 sysevent.fs.zfs.resilver_start
Mar 10 2019 15:08:14.925554048 sysevent.fs.zfs.resilver_finish
Fio job file:
[global]
name=ffr-io-load
directory=/loc5/ffr_io_load
blocksize=128k
ioengine=libaio
fallocate=native
write_bw_log
write_lat_log
write_iops_log
write_hist_log
log_avg_msec=1000
time_based=1
runtime=600000
[downloads5]
directory=/loc5/ffr_io_load
rw=write
size=1G
numjobs=4
[downloads6]
directory=/loc6/ffr_io_load
rw=write
size=1G
numjobs=4
[reads5]
directory=/loc5/ffr_io_load
new_group
rw=read
size=1G
numjobs=4
[reads6]
directory=/loc6/ffr_io_load
new_group
rw=read
size=1G
numjobs=4
[random_mix5]
directory=/loc5/ffr_io_load
new_group
rw=randrw
rwmixread=70
rwmixwrite=30
numjobs=4
size=1G
[random_mix6]
directory=/loc6/ffr_io_load
new_group
rw=randrw
rwmixread=70
rwmixwrite=30
numjobs=4
size=1G
The drives are Seagate ST12000NM0027, which are "12TB" 4K-sector drives that may present as 512-byte sectors. ZFS correctly auto-sets ashift to 12, and we see 10TiB raw from these drives.
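One way to confirm the ashift ZFS chose for an imported pool (assuming the pool is in the default cache file) is to read it out of the cached config:
zdb -C <pool> | grep ashift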
Test | 11 drv | 9 drv | 7 drv |
---|---|---|---|
Write | 439758 | 391654 | 272728 |
Read | 619304 | 396074 | 278128 |
Mix | 221804 | 196202 | 136796 |
Total | 1280866 | 983930 | 687652 |
11-drive RAIDZ3 compressed and encrypted pool:
zpool create -o feature@encryption=enabled -O compression=lz4 -O encryption=on -O keylocation=prompt -O keyformat=passphrase loc-enc raidz3 <vdevs...>
During testing I used a prompted passphrase, manually entered. On zpool import, encrypted datasets are not mounted if their key is not available. However, zfs list will still show the dataset along with its mountpoint, even though the mountpoint is missing from the system; there is no clear indication from zfs that the encrypted file system is missing its key. Each dataset does have a keystatus property that can be queried. It also appears that you cannot pre-load a dataset's key before the zpool is imported, which makes sense. You can pass -l to zpool import to load keys for encrypted datasets, which will prompt for the key (if keylocation is set to prompt) during import.
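The relevant commands look like this (using the loc-enc pool from above):
zfs get keystatus loc-enc                    # 'available' or 'unavailable'
zpool import -l loc-enc                      # prompts for the passphrase during import
zfs load-key loc-enc && zfs mount loc-enc    # load the key and mount after a plain import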
Testing CPU use and throughput with ZFS native encryption enabled. The fio test used for the table below, with numjobs varied per column:
fio --name=random-write --ioengine=sync --iodepth=16 --rw=randwrite --bs=4k --direct=0 --size=512m --numjobs=<num> --end_fsync=1
Array Configuration | CPU use, 10 jobs | CPU use, 20 jobs | CPU use, 30 jobs |
---|---|---|---|
11-drive RAIDZ3 LZ4 | 7% | 3% | 4% |
11-drive RAIDZ3 LZ4 enc | 16% | 37% | 34% |
11-drive RAIDZ2 LZ4 | 8% | 3% | 4% |
11-drive RAIDZ2 LZ4 enc | 18% | 49% | 40% |
[Note: Odd that the 10-job tests cause more CPU use than the higher job counts, but it bears out over repeated testing.]
I initially had some problems with ZED not processing drive faults and not calling the LED statechange script. Ensuring that all of the OS packages from the repo build procedure were installed, and that vdev_id.conf was in place and had been processed, seems to have made everything work. You will likely want to edit /etc/zfs/zed.d/zed.rc to suit your needs.
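As an illustration, these are the kinds of zed.rc settings that matter for fault notification and LED handling (variable names are from the stock zed.rc; the values shown are just examples):
ZED_EMAIL_ADDR="root"            # where fault notifications are sent
ZED_NOTIFY_INTERVAL_SECS=3600    # rate-limit repeat notifications
ZED_USE_ENCLOSURE_LEDS=1         # let the statechange-led zedlet drive the fault LEDs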
I was able to confirm the enclosure/bay mapping for our test system; the bays are laid out like this:
05 | 11 | 17 | 23 |
04 | 10 | 16 | 22 |
03 | 09 | 15 | 21 |
02 | 08 | 14 | 20 |
01 | 07 | 13 | 19 |
00 | 06 | 12 | 18 |
Our test system has:
/sys/devices/pci0000:85/0000:85:00.0/0000:86:00.0/host6/port-6:0/expander-6:0/port-6:0:25/end_device-6:0:25/target6:0:24/6:0:24:0/enclosure/6:0:24:0
with Slot00 through Slot23 in that directory. Each Slotnn directory has a fault file that can be read, or written to turn the fault light on (1) or off (0).
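For example, the fault LED for the first bay can be toggled like this (the 6:0:24:0 enclosure address and Slot00 name are specific to our test system; the /sys/class/enclosure path is just the class-level symlink to the device path above):
echo 1 > /sys/class/enclosure/6:0:24:0/Slot00/fault    # fault LED on
echo 0 > /sys/class/enclosure/6:0:24:0/Slot00/fault    # fault LED off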
ZFS can give additional information during a zpool status run by specifying the -c flag. Used alone, this should produce a list of the available scripts (from /etc/zfs/zpool.d). Given one of those scripts as an argument, it runs the script and incorporates its output as additional columns in the zpool status output. For example, zpool status -c ses <pool> will show the enclosure and slot IDs for each vdev, along with LED states:
pool: loc-enc
state: ONLINE
scan: resilvered 24K in 0 days 00:00:01 with 0 errors on Mon Apr 1 20:22:55 2019
config:
NAME STATE READ WRITE CKSUM enc encdev slot fault_led locate_led
loc-enc ONLINE 0 0 0
raidz3-0 ONLINE 0 0 0
e0s0 ONLINE 0 0 0 0:0:24:0 sg24 0 0 0
e0s1 ONLINE 0 0 0 0:0:24:0 sg24 1 0 0
e0s2 ONLINE 0 0 0 0:0:24:0 sg24 2 0 0
e0s3 ONLINE 0 0 0 0:0:24:0 sg24 3 0 0
e0s4 ONLINE 0 0 0 0:0:24:0 sg24 4 0 0
e0s5 ONLINE 0 0 0 0:0:24:0 sg24 5 0 0
e0s6 ONLINE 0 0 0 0:0:24:0 sg24 6 0 0
e0s7 ONLINE 0 0 0 0:0:24:0 sg24 7 0 0
e0s8 ONLINE 0 0 0 0:0:24:0 sg24 8 0 0
e0s9 ONLINE 0 0 0 0:0:24:0 sg24 9 0 0
e0s10 ONLINE 0 0 0 0:0:24:0 sg24 10 0 0
errors: No known data errors
ZFS Capacity calculator - appears to be accurate for the arrays I have built and compared.