Categories: Linux Tags: data tech

Backing up data to paper

Sort of an oddball topic, but I ended up looking today at paper storage for binary data. It came out of wanting to think about how best to back up things like GnuPG keys or password databases.

Anyway, one tool that’s referred to quite often is PaperBack. From my point of view, it’s not a great option as it’s a Windows tool, albeit with full source code. I took a bit of a look through the code to see if I could easily port it across to Linux, but didn’t make rapid headway.

So I took a bit more time looking for alternatives. There’s a good summary here of various ways to encode binary data into paper-friendly representations. Most of those (e.g. QR codes) aren’t much good if you’re trying to store tens of kB or more. However, in the comments there was a good link!

Turns out that Twibright Labs have a small program, optar, that’s squarely in the Linux ecosystem and does much the same (its web page references PaperBack too, so they’re working from the same idea). Nice to see this from Twibright Labs, as I remember looking at their pages years back with a view to building a Ronja setup.

Anyway, turning back to Optar: that program worked rather nicely. Basic workflow…

../optar_dist/optar README README
../optar_dist/pgm2ps *pgm
# Then print out the postscript file
# Then scan it back in, png format, 600dpi
cp scan.png scanB_0001.png
# Note that I changed default settings, by editing optar.h, 
# so as to reduce resolution and increase robustness...
../optar_dist/unoptar 0-32-46-24-3-1-2-24 scanB > output

The very first time I tried this out, the output file had some corruption. However, after reducing the resolution a bit, it worked just fine. It would take some guts to use this for a binary format (e.g. encrypted content) though!


Further reading turned up some more options.

  • Most interestingly, someone has already ported PaperBack in the way I considered, resulting in paperback-cli.

  • paperbackup is interesting for relying on QR codes, though it only handles ASCII storage (which is fine for things like gpg keys etc.). The readme for that project also has a good list of other data→paper options, which I think is where I came across paperback-cli.

  • colorsafe should have worked for me, but I hit a couple of Python library snags and instead put my energy into paperback-cli.

Using PaperBack-cli

git clone --recursive -j8
# Note, need recursive as there are sub-modules
cd paperback-cli

Building went fine. Encoding was straight-forward also:

# Note went for lower dpi (100) and full size dots, for more robust first go
./paperback-cli --encode -d 100 -s 100 -i LICENSE -o out.bmp

This produced a BMP file that pretty much occupies an A4 page and encodes the LICENSE file (about 35K). To print, I generated a PDF using img2pdf:

img2pdf --pagesize A4 -o out.pdf out.bmp

Next, I hit some snags. I used xsane to do my scanning, and scanned to PNG. However, when I converted the PNG to BMP using ImageMagick, the resulting file wasn’t readable by paperback-cli. After a bit of rooting in the code, I discovered that the relevant piece was in src/Scanner.c, around line 156, and it was objecting to the compression element of the format. I spent a lot of time trying to get to the bottom of this, and even converting to a BMP3 rather than the default BMP4 didn’t help. Eventually, I found the format notes on BMP on the ImageMagick website. The key note was:

However, if a PNG input file was used and it contains a gAMA and cHRM
chunk (gamma and chromaticity information) either of which forces
"convert" to write a BMP4. To get a BMP3 you need to get rid of that
information. One way may be to pipeline the image though a minimal
'image data only' image file format like PPM and then re-save as BMP3.
Messy, but it should work.

so, armed with that info:

convert scan0002.png ppm:- | convert - BMP3:scan0002.bmp3
./paperback-cli --decode -i scan0002.bmp3 -o scan.txt

The resulting BMP3 file is more than twice as big as the earlier BMP, and the decoding step now appears to work just fine.
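Rather than guessing from the decoder's complaint, it's possible to look at what a BMP file actually declares in its header. The following is a minimal sketch, assuming the standard BMP layout (DIB header size at byte offset 14, compression field at byte offset 30, both little-endian 32-bit values). The function names le32 and bmp_info are made up for illustration; they are not part of paperback-cli or ImageMagick.

```shell
#!/bin/sh
# Read 4 bytes at byte offset $2 of file $1 as a little-endian unsigned integer.
le32() {
    # od prints the four bytes as decimal numbers; word splitting is intended.
    set -- $(od -An -tu1 -j"$2" -N4 "$1")
    echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}

# Print the DIB header size and compression field of the BMP file $1.
bmp_info() {
    hdr=$(le32 "$1" 14)    # 40 = BITMAPINFOHEADER (BMP3), 108 = V4 header, 124 = V5 header
    comp=$(le32 "$1" 30)   # 0 = BI_RGB (uncompressed), 1 or 2 = RLE, 3 = BI_BITFIELDS
    echo "header_size=$hdr compression=$comp"
}
```

Running `bmp_info scan0002.bmp3` on a file that went through the PPM pipeline should report a 40-byte header if the conversion really produced a BMP3; comparing that with the output for the directly converted file shows what changed.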

md5sum scan.txt LICENSE 
d32239bcb673463ab874e80d47fae504  scan.txt
d32239bcb673463ab874e80d47fae504  LICENSE

Update 20th August 2018

Not nearly as happy with that as I thought I was! On the positive side, I did get a 200dpi version to work, and work first time. However, when I then added a handwritten note to the page, it no longer worked. I then tried trimming the page, folding the note away, etc., but couldn’t get it to work. I re-printed the page, and the reprint wouldn’t work either. So the time it worked now looks like a fluke.

It gets to the point where I think that, to get a working route, I would need to systematically work through tools/variables/tests and establish some performance. I also need to validate exactly the steps outside of the tools themselves (in particular getting the png/bmp file onto paper, where I’m using img2pdf at the moment; but also any dirt/contamination/artifacts from my relatively old and well-used scanner), to ensure that problems aren’t being introduced there.
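One piece of that systematic testing can be sketched without touching paper at all: encode, decode the bitmap straight back, and compare. If that already fails, the printer and scanner aren't to blame. This is only a sketch. The names digital_roundtrip and PAPERBACK are hypothetical, it reuses only the paperback-cli flags shown earlier, and it assumes the encoder writes a single page to the given output name and that the decoder accepts the encoder's own output unmodified, neither of which has been verified here.

```shell
#!/bin/sh
# Path to the tool; override PAPERBACK to test a different build (assumed location).
PAPERBACK=${PAPERBACK:-./paperback-cli}

# Encode file $1 at dpi $2 and dot size $3, decode the bitmap straight back,
# and report whether the result is byte-identical to the input.
digital_roundtrip() {
    work=$(mktemp -d) || return 2
    if "$PAPERBACK" --encode -d "$2" -s "$3" -i "$1" -o "$work/page.bmp" >/dev/null 2>&1 &&
       "$PAPERBACK" --decode -i "$work/page.bmp" -o "$work/decoded" >/dev/null 2>&1 &&
       cmp -s "$1" "$work/decoded"
    then
        status=0
        echo "dpi=$2 dots=$3 PASS"
    else
        status=1
        echo "dpi=$2 dots=$3 FAIL"
    fi
    rm -rf "$work"    # clean up the temporary bitmap and decoded copy
    return "$status"
}
```

A loop such as `for dpi in 100 150 200; do digital_roundtrip LICENSE "$dpi" 100; done` would then give a baseline per setting, against which the print-and-scan results can be compared.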

One random observation: I think I’d now prefer a less information-dense format (so more resilient to contamination, distortions, etc.) if that meant it could reliably be scanned via the automatic document feeder. The "waste" in the first case would be offset by the convenience of the latter.
