In “Building the perfect, cheap NAS” I discussed bit rot, which silently kills digital archives. The solution I chose back then builds on parchive, which adds redundancy to archives in a manual process; that redundancy can be used to detect and, to some extent, correct bit rot. The problems with this approach are that it is manual (you have to invoke parchive to create redundancy, verify archive integrity and repair archives), time consuming (reading large files and computing redundancy takes a while), not transparent (you need to know where parchive stores its redundancy information and how to use it), and useless in case of a file system corruption.
In this post I’ll describe an experiment combining a software RAID6 setup (md) on Ubuntu with dm-integrity.
I talked about why I prefer RAID6 over other setups back in “Building the perfect NAS”; the short version is that after a disk failure, a lesser RAID level is left with no redundancy until the replacement disk has been installed and resynced, which in the age of 10 TB HDDs can take several days.
RAID6 doesn’t detect bit rot; it repairs a disk only if the hardware reports a sector read error, in which case it rewrites the damaged block from redundancy. Scrubbing can help to detect bit rot by comparing sectors between disks, but in the case of a mismatch, md doesn’t take a majority vote: it re-computes parity. That means that if you’re lucky and the bit rot occurred on a parity sector, the sector is rewritten with fresh parity computed from the source data sectors. If you’re unlucky and a source data sector has been corrupted, the parity sector will be overwritten with new parity computed on top of the corrupted source data.
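For reference, a scrub can be triggered through md's sysfs interface; the array name md0 below is an example. In "check" mode md only counts mismatched stripes, while "repair" additionally rewrites parity as described above:

```shell
# Start a scrub in check-only mode ("repair" would rewrite parity)
echo check > /sys/block/md0/md/sync_action

# Watch progress of the running scrub
cat /proc/mdstat

# After completion: number of mismatched sectors found during the scrub
cat /sys/block/md0/md/mismatch_cnt
```

Note that a non-zero mismatch_cnt tells you *that* something is inconsistent, but not *which* disk holds the bad data.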
I’ll try to fix that by applying dm-integrity on the disks participating in a RAID6. The idea is that dm-integrity will transform an integrity error into an I/O error, which md should interpret as a corrupt block. As far as I understand, md will then reconstruct the failed block from the remaining disks.
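A sketch of the intended setup, assuming four member partitions (the device names are examples; `integritysetup format` destroys any existing data on them):

```shell
# Add dm-integrity checksumming to each member partition
for dev in sdb1 sdc1 sdd1 sde1; do
    integritysetup format /dev/$dev
    integritysetup open /dev/$dev $dev
done

# Build the RAID6 on top of the integrity-checked mappings
mdadm --create /dev/md0 --level=6 --raid-devices=4 \
      /dev/mapper/sdb1 /dev/mapper/sdc1 /dev/mapper/sdd1 /dev/mapper/sde1
```

With this layering, a checksum mismatch on any member surfaces to md as a read error on that member only, which is exactly the situation RAID6 is designed to recover from.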
First experiment: dm-integrity on a partition.
This script (YMMV) sets up dm-integrity on a partition, creates an ext4 file system on top, mounts the file system and writes a file of random content. It then prints the md5 checksum of said file. The file contains a magic string, which is then located on the raw disk. The string is overwritten on the raw disk, caches are flushed and the file’s checksum is re-computed. As expected, the computation fails with an I/O error.
#!/bin/sh

echo CREATING DEVICE
integritysetup format /dev/sdb1
integritysetup open /dev/sdb1 sdb1
mkfs.ext4 /dev/mapper/sdb1
mount /dev/mapper/sdb1 /mnt/raid

echo CREATING FILE
dd if=/dev/urandom of=/mnt/raid/file bs=512 count=50000
echo "__ggstring__" >> /mnt/raid/file
dd if=/dev/urandom bs=512 count=50000 >> /mnt/raid/file
md5=`md5sum /mnt/raid/file`
sync

echo TAMPERING
offset=`grep --byte-offset --only-matching --text __ggstring__ /dev/sdb1 | cut -d':' -f1`
dd if=/dev/urandom of=/dev/sdb1 bs=1 count=10 seek=$offset
sync
md5sum /mnt/raid/file
umount /mnt/raid

echo CHECKING
integritysetup close sdb1
integritysetup open /dev/sdb1 sdb1
mount /dev/mapper/sdb1 /mnt/raid
md5_2=`md5sum /mnt/raid/file`
echo MD5:
echo $md5
echo $md5_2

echo CLEANING UP
umount /mnt/raid
integritysetup close sdb1
Issue: resync is slow. I started migrating my four-disk RAID6 setup to dm-integrity by replacing the members one by one with dm-integrity volumes. The first replacement went quickly: resync took about 20% longer than with a plain partition, without much CPU overhead. The second replacement, however, progressed at about 10 MB/s, again with no notable CPU usage, and was projected to finish in a week. At that point I stopped the sync and reverted to plain partitions. Watching iostat revealed that reading and writing don’t happen in parallel: first data is read from the plain partitions (and the one existing dm-integrity volume), then it is written to the second dm-integrity volume, followed by a pause.
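For completeness, these are the kinds of commands useful for watching such a resync; note that md also has its own speed throttles, so it's worth ruling those out before blaming dm-integrity:

```shell
# Resync progress and estimated finish time
cat /proc/mdstat

# Per-device utilization and throughput, sampled every 5 seconds
iostat -x 5

# md's own resync throttles, in KiB/s (raise speed_limit_min to
# rule out md throttling as the cause of a slow resync)
cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
```

A plausible suspect for the slowdown is dm-integrity's data journal, which writes every block twice; integritysetup documents an option to disable the journal at open time, though I haven't verified whether that fixes the resync speed here.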
Updates & errata
An earlier version of this post erroneously referred to “md” as “dm-raid”.