In Building the perfect, cheap NAS I discuss bit rot, which silently kills digital archives; the solution I chose back then builds on parchive, which adds redundancy to archives in a manual process. This redundancy can be used to detect and correct bit rot to some extent. The issue with this approach is that it's manual (you have to invoke parchive to create redundancy, verify archive integrity and repair archives), time consuming (reading large files and computing redundancy), not transparent (you need to know where parchive stores the redundancy information and how to use it) and doesn't help in case of a file system corruption.
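For illustration, the manual workflow looks roughly like this with par2, the common parchive implementation; the file name and the 10% redundancy level are just placeholders:

# create 10% redundancy data alongside the archive
par2 create -r10 archive.tar.par2 archive.tar
# later: check the archive against the stored redundancy
par2 verify archive.tar.par2
# and, if bit rot was detected, try to repair the archive
par2 repair archive.tar.par2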
In this post I’ll describe an experiment with a software md RAID6 setup in Ubuntu and dm-integrity.
I talked about why I prefer RAID6 over other setups back in “Building the perfect NAS”; the short version is that in case of a disk failure, an array with a lesser RAID level is left with no redundancy until the new disk has been installed and resynced, which in the age of 10TB HDDs can take several days.
RAID6 doesn’t detect bit rot by itself; it repairs a disk only if the hardware reports a sector read error, in which case it rewrites the damaged block. Scrubbing can help to detect bit rot by comparing sectors between disks, but in case of a mismatch md doesn’t take a majority vote; it simply re-computes parity. That means that if you’re lucky and the bit rot occurred in a parity sector, the sector is rewritten with fresh parity computed from the source data sectors. If you’re unlucky and a source data sector has been corrupted, the parity sector will be overwritten with new parity computed on top of the corrupted source data.
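For reference, scrubbing an md array is driven through sysfs; a minimal sketch, assuming the array is /dev/md0:

# start a check run: read all blocks and count parity mismatches
echo check > /sys/block/md0/md/sync_action
# watch progress
cat /proc/mdstat
# number of mismatched sectors found by the last check
cat /sys/block/md0/md/mismatch_cnt
# 'repair' recomputes and rewrites parity for mismatched stripes
echo repair > /sys/block/md0/md/sync_action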
I’ll try to fix that by applying dm-integrity to the disks participating in a RAID6. The idea is that dm-integrity will transform an integrity error into an I/O error, which md should interpret as a corrupt block. As far as I understand, md will then recompute the failed block from the other disks.
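A minimal sketch of how such an array could be assembled, assuming four member partitions (the device and array names are only examples, not necessarily my actual layout):

# add an integrity layer to each member partition
for dev in sdb1 sdc1 sdd1 sde1; do
    integritysetup format /dev/$dev
    integritysetup open /dev/$dev ${dev}int
done
# build the RAID6 on top of the integrity-mapped devices
mdadm --create /dev/md0 --level=6 --raid-devices=4 \
    /dev/mapper/sdb1int /dev/mapper/sdc1int /dev/mapper/sdd1int /dev/mapper/sde1int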
First experiment: dm-integrity on a partition.
This script (YMMV) sets up dm-integrity on a partition, creates an ext4 file system on top, mounts the file system and writes a file of random content. It then prints the md5 checksum of said file. The file contains a magic string which is then located on the raw disk. The string is overwritten on the raw disk, caches are flushed by unmounting and re-opening the device, and the file’s checksum is re-computed. As expected, the computation fails with an I/O error.
#!/bin/sh
echo CREATING DEVICE
# initialize the integrity metadata on the partition and map it
integritysetup format /dev/sdb1
integritysetup open /dev/sdb1 sdb1
mkfs.ext4 /dev/mapper/sdb1
mount /dev/mapper/sdb1 /mnt/raid
echo CREATING FILE
# random file with a known magic string in the middle
dd if=/dev/urandom of=/mnt/raid/file bs=512 count=50000
echo "__ggstring__" >> /mnt/raid/file
dd if=/dev/urandom bs=512 count=50000 >> /mnt/raid/file
md5=`md5sum /mnt/raid/file`
sync
echo TAMPERING
# locate the magic string on the raw partition and overwrite it,
# bypassing the dm-integrity layer
offset=`grep --byte-offset --only-matching --text __ggstring__ /dev/sdb1 | cut -d':' -f1`
dd if=/dev/urandom of=/dev/sdb1 bs=1 count=10 seek=$offset
sync
# this read will likely still succeed: the data is served from the page cache
md5sum /mnt/raid/file
umount /mnt/raid
echo CHECKING
# close and re-open the device so the next read hits the disk
integritysetup close sdb1
integritysetup open /dev/sdb1 sdb1
mount /dev/mapper/sdb1 /mnt/raid
# this read fails with an I/O error because the checksum no longer matches
md5_2=`md5sum /mnt/raid/file`
echo MD5:
echo $md5
echo $md5_2
echo CLEANING UP
umount /mnt/raid
integritysetup close sdb1
Issue: resync is slow. I started migrating my 4-disk RAID6 setup to dm-integrity by replacing the members one by one with dm-integrity volumes. The first replacement went quickly, although resync took about 20% longer than with a plain disk, without much CPU overhead. The second replacement, however, progressed at about 10 MB/s with, again, no notable CPU usage, and was projected to finish in a week. At that point I stopped the sync and reverted back to plain partitions. Watching iostat reveals that reading and writing don’t happen in parallel: first data is read from the plain (and the one existing dm-integrity) volumes, then it is written to the second dm-integrity volume, and then there is a pause.
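For illustration, replacing a single member with a dm-integrity volume could look roughly like this, assuming the array is /dev/md0 and the member being swapped is /dev/sdc1; /proc/mdstat and iostat then show the resync progress and per-device throughput:

# take the plain partition out of the array
mdadm /dev/md0 --fail /dev/sdc1
mdadm /dev/md0 --remove /dev/sdc1
# wrap it with dm-integrity and add the mapped device back
integritysetup format /dev/sdc1
integritysetup open /dev/sdc1 sdc1int
mdadm /dev/md0 --add /dev/mapper/sdc1int
# watch resync progress and per-device throughput
cat /proc/mdstat
iostat -x 5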
Updates & errata
“md” had been referred to erroneously as “dm-raid”
I wonder whether this is a faster setup than a pure ZFS solution (which you seem to avoid). While ZFS takes a lot of resources (RAM, CPU), mdraid + dm-integrity might just do worse (even slower, even more CPU).
“Faster” in terms of “effort to set up”: probably not. dm-integrity needs to zero all blocks and md will do that a second time. As mentioned at the end of the post, syncing is too slow to be practical so I never got this experiment into “production”.
Was /proc/sys/dev/raid/speed_limit_max or /proc/sys/dev/raid/speed_limit_min set to a low value? Maybe try a higher stripe cache? echo 16384 > /sys/block/mdXX/md/stripe_cache_size
I left those values at their defaults.