parchive: protecting backups against data corruption

As time goes by, my [place appropriate interval] backups may not grow much, but the size of my entire data archive exceeds the largest available hard disks (don’t ask). Taking into account correctness guarantees by hard disk manufacturers, transmission errors over the network and who knows what else, I am not confident that I’ll flip a few bits before the data in my archive do.

Gradual data degradation due to random events is known as bit rot. Unfortunately, at the time of writing (2018), commodity hardware offers insufficient protection against data degradation, and mainstream file systems use checksums only to verify integrity, not to correct data errors.

Parchive is a “system” (file format + implementation) that adds redundancy to files. par2 is available for Linux and is, despite its limitations, easy to use.

Par2 reads an archive and outputs one or more files which contain the redundancy necessary to verify the archive and correct errors in it. The amount of redundancy is configurable. Seeing is believing, so I conducted an experiment:

  1. compute the checksum of a big file (here an ISO image) and write it down
  2. let par2 create redundancy files for that file
  3. corrupt the file on purpose
  4. compute the checksum of the corrupted file and verify that it differs from the one in step (1)
  5. let par2 verify that the ISO is corrupted
  6. let par2 restore the ISO from the redundancy files
  7. compute the checksum of the restored ISO and verify that it’s the same as in step (1)

> wget http://archive.ubuntu.com/ubuntu/dists/cosmic/main/installer-amd64/current/images/netboot/mini.iso

> md5sum mini.iso
829cf59ebf1585b370c8ecc05f84fd98

> par2create -n1 -r30 mini.iso

> ls
mini.iso
mini.iso.par2
mini.iso.vol000+600.par2

> dd if=/dev/zero seek=1000 bs=1 count=10000 of=mini.iso conv=notrunc
> dd if=/dev/zero seek=10000 bs=1 count=10000 of=mini.iso conv=notrunc
> dd if=/dev/zero seek=100000 bs=1 count=10000 of=mini.iso conv=notrunc

> md5sum mini.iso
7ff22f82ea52050653f71274bfe6c7ff

> par2verify mini.iso.par2
...
Repair is required.

> par2repair mini.iso.par2
> md5sum mini.iso
829cf59ebf1585b370c8ecc05f84fd98
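
The numbers in the experiment above can be sanity-checked with a bit of shell arithmetic. The sketch below assumes that the +600 in mini.iso.vol000+600.par2 is the number of recovery blocks and that par2 split the file into 2000 source blocks (which I believe is the par2cmdline default, but treat it as an assumption). The block size of 32768 bytes is made up for illustration, since the real value depends on the file size. The helper names recovery_blocks and damaged_blocks are hypothetical and not part of par2.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; they are not part of par2.

# Recovery blocks for a given source block count ($1) and redundancy in percent ($2).
recovery_blocks() {
    echo $(( $1 * $2 / 100 ))
}

# Number of blocks touched by a damaged byte range that starts at
# offset $1, is $2 bytes long, with a block size of $3 bytes.
damaged_blocks() {
    offset=$1
    length=$2
    block_size=$3
    first=$(( offset / block_size ))
    last=$(( (offset + length - 1) / block_size ))
    echo $(( last - first + 1 ))
}

# -r30 with an assumed 2000 source blocks
recovery_blocks 2000 30            # prints 600

# The first two dd calls overlap and damage bytes 1000..19999 (19000 bytes),
# the third one damages bytes 100000..109999 (10000 bytes).
# The block size of 32768 is an assumption.
damaged_blocks 1000 19000 32768    # prints 1
damaged_blocks 100000 10000 32768  # prints 1
```

Under these assumptions the three dd calls damage only two blocks, while 600 recovery blocks are available, so the repair has plenty of headroom.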

Par2 in real life

I particularly value that par2 writes redundancy information to new files without modifying the source archive. The source archive therefore stays easily accessible, and verification and (if necessary) restoration can be carried out as distinct tasks.

The par2 version I used does not recurse into directories, so I treat it as a tool for individual files. I’ve seen shell scripts which add recursion to par2, but I think that doesn’t help in case of file system corruption, where entire folders might become inaccessible, including the recovery information. I found it best to create a single archive with tar and then have par2create add redundancy to the tar file in a second step. This implicitly covers file system structure corruption, since the directory structure stored inside the tar file is protected by par2’s redundancy. An obvious issue with this approach is that errors which happened during archive creation won’t be correctable.
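
The two-step approach (tar first, par2create second) could be scripted roughly like this. It is a minimal sketch, not a hardened backup script: archive_with_redundancy is a hypothetical helper name, the 10 percent default is an arbitrary choice, and the only par2 options used are -n1 and -r from the experiment above.

```shell
#!/bin/sh
# Hypothetical helper: pack a directory into one tar file, then add redundancy.
archive_with_redundancy() {
    src_dir=$1            # directory to back up
    archive=$2            # tar file to create, e.g. photos-2018.tar
    redundancy=${3:-10}   # redundancy in percent (arbitrary default)

    # Step 1: one tar file holds both the file contents and the directory structure.
    tar -cf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" || return 1

    # Step 2: par2create writes the recovery data into new files next to the
    # archive and leaves the tar file itself untouched.
    par2create -n1 -r"$redundancy" "$archive"
}
```

A call such as archive_with_redundancy ~/photos photos-2018.tar 10 would leave photos-2018.tar and its .par2 files side by side, so the tar file can later be checked with par2verify photos-2018.tar.par2.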

A word of warning: par2 is computationally expensive, so the CPU is frequently the bottleneck. This is an issue on many power-efficient NAS devices.
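
Because verification is a separate and CPU-heavy task, it can be run as a batch job at a quiet time. The sketch below walks over all par2 index files in a directory and reports which archives need repair. verify_all is a hypothetical helper name, and the sketch assumes that par2verify exits with status 0 only when the archive is intact.

```shell
#!/bin/sh
# Hypothetical helper: verify every par2 set found directly in a directory.
verify_all() {
    dir=$1
    failed=0
    for index in "$dir"/*.par2; do
        case $index in
            # Skip recovery volumes such as mini.iso.vol000+600.par2;
            # only the index files are passed to par2verify.
            *.vol[0-9]*+[0-9]*.par2) continue ;;
        esac
        [ -e "$index" ] || continue   # the glob matched nothing
        # Assumption: exit status 0 means "no repair required".
        if par2verify "$index" >/dev/null 2>&1; then
            echo "OK      $index"
        else
            echo "DAMAGED $index"
            failed=1
        fi
    done
    return "$failed"
}
```

Running verify_all /backup from a nightly job would print one line per archive and return a non-zero status if anything needs par2repair.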
