As time goes by, my [place appropriate interval] backups may not grow much, but my entire data archive has outgrown the largest available hard disks (don’t ask). Given the correctness guarantees hard disk manufacturers actually offer, transmission errors over the network, and who knows what else, I’m not confident that a few of my own bits will flip before the bits in my archive do.
Gradual data degradation caused by random events is known as bit rot. Unfortunately, at the time of writing (2018), commodity hardware offers insufficient protection against it, and mainstream file systems use checksums only to verify integrity, not to correct data errors.
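To make that limitation concrete, here is a minimal sketch (file names are made up) showing that a plain checksum can only detect corruption, never undo it:

```shell
set -e
# A plain checksum detects corruption but offers no way to repair it.
echo "important data" > file.bin
sha256sum file.bin > file.bin.sha256

# Flip one byte in place, simulating bit rot.
printf 'X' | dd of=file.bin bs=1 seek=3 count=1 conv=notrunc 2>/dev/null

# The check fails -- we know the file is damaged, but that is all we know.
sha256sum -c file.bin.sha256 || echo "corruption detected, nothing to repair with"
```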
Parchive is a “system” (file format + implementation) that adds redundancy to files. par2 is available for Linux and is, despite its limitations, easy to use.
Par2 reads an archive and writes one or more files containing the redundancy needed to verify and correct errors in that archive. The amount of redundancy is configurable. Seeing is believing, so I ran an experiment:
1. compute the checksum of a big file and write it down
2. let par2 create redundancy files for that file
3. corrupt the file on purpose
4. compute the checksum of the corrupted file and verify that it differs from the checksum from step 1
5. let par2 verify that the ISO is corrupted
6. let par2 restore the ISO from the redundancy files
7. compute the checksum of the restored ISO and verify that it matches the checksum from step 1
```
> wget http://archive.ubuntu.com/ubuntu/dists/cosmic/main/installer-amd64/current/images/netboot/mini.iso
> md5sum mini.iso
829cf59ebf1585b370c8ecc05f84fd98
> par2create -n1 -r30 mini.iso
> ls
mini.iso  mini.iso.par2  mini.iso.vol000+600.par2
> dd if=/dev/zero seek=1000 bs=1 count=10000 of=mini.iso conv=notrunc
> dd if=/dev/zero seek=10000 bs=1 count=10000 of=mini.iso conv=notrunc
> dd if=/dev/zero seek=100000 bs=1 count=10000 of=mini.iso conv=notrunc
> md5sum mini.iso
7ff22f82ea52050653f71274bfe6c7ff
> par2verify mini.iso.par2
...
Repair is required.
> par2repair mini.iso.par2
> md5sum mini.iso
829cf59ebf1585b370c8ecc05f84fd98
```
Par2 in real life
I particularly value that par2 writes its redundancy information to new files without modifying the source archive. The source archive therefore stays directly accessible, and verification and (if necessary) repair can be carried out as distinct tasks. Par2 does not operate recursively on directories, which effectively makes it a single-file tool. I’ve seen shell scripts that bolt recursion onto par2, but that doesn’t help against file system corruption, where entire folders, including the recovery information, can become inaccessible. I found it best to first create a single archive with tar and then, in a second step, let par2create add redundancy to the tar file. This implicitly covers the file system corruption case as well, since the directory structure inside the tar is protected by par2’s redundancy. An obvious limitation of this approach is that errors introduced during archive creation cannot be corrected.
A word of warning: par2 is computationally expensive, and the CPU is often the bottleneck, which is a problem on many power-efficient NAS devices.
3 thoughts on “parchive: protecting backups against data corruption”
As I’ve posted elsewhere, the way I see it, full data reliability demands a two-factor approach: a low level of ECC redundancy to solve small corruption issues such as ‘silent’ bit rot [and, more importantly, to automate the DETECTION of bit rot events], plus full backups [ideally 3:2:1 backups that also contain the ECC information files] to provide a recovery option for grand failures, mass corruption, and total loss events.
No amount of ECC will fix a hard drive that burns in a structure fire, and no amount of simple backups will help you know if a bit has rotted in your archive. Hybridization is the key to assurance here.
Noting that you’ve mentioned protecting the file structure: outputting a recursive directory list to a text file is pretty trivial from the command line of whatever major OS you favor. You can protect that file with par2 and backups and use it to restore lost directory structure info, but restoring from it could be awfully labor intensive.
That’s exactly what I do at the moment. Going with ‘let par handle the small stuff, let archived backups handle the big stuff’ as a model, I’ll probably never need that text file, but still… if you know a more thorough way to preserve the directory tree, let me know!
Small typo, you put `wget` twice, next to each other in the first command example.
Thank you Nick.