A portable book metadata exercise

Posted in: archiving, hash, kitab, literature, metadata, dublincore, libgen

One of the things I have been working on over the last few weeks is a Rust application I have dubbed kitab [1].

In short, the application makes it easy to extract literary metadata to a separate file structure.

The metadata can in turn be applied as extended attributes to every matching file found recursively in a directory.

The way it's accomplished is simple: the file name of the metadata is the hex representation of the digest of the media file. The same digest is used to match files to metadata when applying it back to the file.

There are two advantages to this:

  1. The digest of the media file is not affected by the metadata, as it would be if the metadata were embedded in the file itself.
  2. You do not need to use the file name to keep record of what a file is.
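
To make the naming scheme concrete, here is a minimal shell sketch of how a metadata path can be derived from a media file. The helper name metadata_path is my own for illustration and is not part of kitab; it assumes md5 as the digest and that entries sit directly inside the store directory.

```shell
# Hypothetical helper (not part of kitab): print the path at which the
# metadata for a media file would be found in a digest-keyed store.
# $1: store directory, $2: media file
# Assumes md5 and that md5sum from coreutils is installed.
metadata_path() {
        printf '%s/%s\n' "$1" "$(md5sum "$2" | awk '{print $1}')"
}

# Example use (paths are placeholders):
#   metadata_path /path/to/store /path/to/mediafile
```

Renaming or moving the media file does not change what this prints, since only the file contents go into the digest.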

Yarr, ye matey-data

Let's demonstrate with an example.

The fabulous Library Genesis project has made available an endpoint to retrieve bibtex entries based on the md5 hash of the book media file.

A version of the Bitcoin White Paper, under the md5 hash bcd99f1ab4155f2a2a362e5b7938a852, can be found there.

If you download this file using a synchronous download link, the browser will provide you with a filename to go with the download.

However, if you use the torrent alternative, the filename will be the md5 hash itself. If you are torrenting a bunch of those files, it quickly becomes a nuisance to distinguish them.

And, of course: in either case there is no guarantee that any metadata comes with the file.

Inside the book

Kitab (v0.0.2) is able to read metadata from both a bibtex source and xattr entries on a file, as well as its native rdf-turtle format.

In kitab's data store, every media file entity in rdf-turtle is keyed with a URN specifying a digest for the file.

To see exactly what that looks like, let's download and import the bibtex metadata for the paper [2]:

bibtex_file=$(mktemp)
kitab_dir=$(mktemp -d)
curl -s -X GET "https://libgen.rs/book/bibtex.php?md5=BCD99F1AB4155F2A2A362E5B7938A852" -o "$bibtex_file"
kitab --store "$kitab_dir" import --digest md5:BCD99F1AB4155F2A2A362E5B7938A852 "$bibtex_file"
cat "$kitab_dir"/*

The output of the above should be:

<URN:md5:bcd99f1ab4155f2a2a362e5b7938a852> <https://purl.org/dc/terms/title> "Bitcoin: A Peer-to-Peer Electronic Cash System" ;
<https://purl.org/dc/terms/creator> "Satoshi Nakamoto" ;
<https://purl.org/dc/terms/type> "book" .
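
Since the entry is plain text, a single field can be pulled out with standard tools. The following is only a rough sketch: entry_title is a hypothetical helper, and it assumes the one-predicate-per-line layout seen in the output above rather than parsing turtle properly.

```shell
# Hypothetical helper (not part of kitab): print the dcterms title found
# in a turtle entry. Assumes each predicate and its quoted value share a
# line ending in ";" or ".", as in the output above. Not a turtle parser.
entry_title() {
        sed -n 's|.*<https://purl.org/dc/terms/title> "\(.*\)"[[:space:]]*[;.][[:space:]]*$|\1|p' "$1"
}

# Example use, with the store from the commands above:
#   entry_title "$kitab_dir/bcd99f1ab4155f2a2a362e5b7938a852"
```

For anything beyond a quick look, a real RDF tool is the safer choice, since turtle allows many layouts that this pattern will not match.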

Now let's say the media file itself has been downloaded to ~/.local/share/transmission. We can apply this metadata as extended attributes.

This time we turn on logging to see what's going on:

$ RUST_LOG=info kitab --store $kitab_dir apply --digest md5 ~/.local/share/transmission
[2022-10-01T11:14:59Z INFO  kitab] have index directory "/tmp/tmp.r0jBm6q4hW"
[2022-10-01T11:14:59Z INFO  kitab] using digest type md5
[2022-10-01T11:14:59Z INFO  kitab] apply from path "/home/lash/.local/share/transmission/"
[2022-10-01T11:14:59Z INFO  kitab] apply DirEntry("/home/lash/.local/share/transmission/bcd99f1ab4155f2a2a362e5b7938a852") -> title "Bitcoin: A Peer-to-Peer Electronic Cash System" author "Satoshi Nakamoto" digest md5:bcd99f1ab4155f2a2a362e5b7938a852

$ find ~/.local/share/transmission -type f -regextype sed -regex ".*/[a-f0-9]\{32\}$" -exec getfattr -d {} \;
# file: .local/share/transmission/bcd99f1ab4155f2a2a362e5b7938a852
user.dcterms:creator="Satoshi Nakamoto"
user.dcterms:title="Bitcoin: A Peer-to-Peer Electronic Cash System"
user.dcterms:type="book"

Let the right one in

Conversely, the metadata can be re-imported directly from the extended attributes. And this time, let's store it both under the md5 and the sha512 hash:

$ kitab_dir_new=`mktemp -d`
$ kitab --store $kitab_dir_new import --digest md5 --digest sha512 .local/share/transmission/bcd99f1ab4155f2a2a362e5b7938a852
$ find $kitab_dir_new -type f -print -exec cat {} \;
/tmp/tmp.B6j41YMmEM/493f2a720d63156d77187bcd5f0715e4e765a38d616ef47f24e0df817ee6b4f601d47a06ffae10ef1f6ba60bb5d2e99a26318f035f9cd56e30bfe7bcdf64a792
<URN:sha512:493f2a720d63156d77187bcd5f0715e4e765a38d616ef47f24e0df817ee6b4f601d47a06ffae10ef1f6ba60bb5d2e99a26318f035f9cd56e30bfe7bcdf64a792> <https://purl.org/dc/terms/title> "Bitcoin: A Peer-to-Peer Electronic Cash System" ;
        <https://purl.org/dc/terms/creator> "Satoshi Nakamoto" ;
        <https://purl.org/dc/terms/type> "book" ;
        <https://purl.org/dc/terms/MediaType> "application/epub+zip" .
/tmp/tmp.B6j41YMmEM/bcd99f1ab4155f2a2a362e5b7938a852
<URN:md5:bcd99f1ab4155f2a2a362e5b7938a852> <https://purl.org/dc/terms/title> "Bitcoin: A Peer-to-Peer Electronic Cash System" ;
        <https://purl.org/dc/terms/creator> "Satoshi Nakamoto" ;
        <https://purl.org/dc/terms/type> "book" ;
        <https://purl.org/dc/terms/MediaType> "application/epub+zip" .
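
The two entries should differ only in their subject URN. A quick way to check that, sketched under the assumption that the subject is the first <...> token on the first line of each entry (the helper names are mine, not kitab's):

```shell
# Hypothetical helpers (not part of kitab).
# strip_subject: print an entry without its leading subject URN and
# without leading indentation, so that only predicates and values remain.
strip_subject() {
        sed -e '1s/^<[^>]*>[[:space:]]*//' -e 's/^[[:space:]]*//' "$1"
}

# same_metadata: succeed if two entries carry identical predicates and
# values, in the same order, regardless of which digest keys them.
same_metadata() {
        [ "$(strip_subject "$1")" = "$(strip_subject "$2")" ]
}

# Example use (file names are the digests, as in the listing above):
#   same_metadata "$kitab_dir_new/<md5 hex>" "$kitab_dir_new/<sha512 hex>" && echo same
```

This is a line-by-line text comparison, so it will report a difference if the predicates are merely written in a different order.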

Level up

Finally, here is an example bash script [3] that retrieves and applies metadata for a batch of files found in the directory given as the first positional arg.

This script even renames the files according to the metadata applied.

# NOTE! This will only work if your fs supports xattr.
# That's why we cannot use tmpfs (mktemp) here; tmpfs does not support xattr.

# Directory to copy media files to.
outdir=./$(uuidgen)
mkdir -vp "$outdir"

# Input dir is the first positional arg.
indir=$1

# Split the output of find on newlines only, so file names may contain spaces.
IFS=$'\n'

# Retrieve metadata for each file and import it into the kitab store.
# Also copy the media file to the separate output directory.
for f in $(find "$indir" -type f); do
        sum=$(md5sum "$f" | awk '{print $1;}')
        echo "downloading metadata for $f"
        srct=$(mktemp)
        curl -s -X GET "https://libgen.rs/book/bibtex.php?md5=$sum" -o "$srct"
        dstt=$(mktemp)
        xmllint --html --xpath 'string(/html/body/textarea[@id="bibtext"])' "$srct" > "$dstt"
        kitab import --digest "md5:$sum" "$dstt"
        cp "$f" "$outdir/"
done

# Apply metadata imported from bibtex as xattr for the media files.
RUST_LOG=info kitab apply --digest md5 "$outdir/"

# Rename the files according to the metadata title and media type.
for f in "$outdir"/*; do
        title=$(getfattr --only-values -n user.dcterms:title "$f")

        f_typ=$(file -b --mime-type "$f")
        f_ext=""
        case "$f_typ" in
                "application/pdf")
                        f_ext=".pdf"
                        ;;
                "application/epub+zip")
                        f_ext=".epub"
                        ;;
                "application/x-mobipocket-ebook")
                        f_ext=".mobi"
                        ;;
                "text/plain")
                        f_ext=".txt"
                        ;;
                "text/html")
                        f_ext=".html"
                        ;;
                *)
                        >&2 echo "unhandled mime type $f_typ"
                        exit 1
                        ;;
        esac
        mv -v "$f" "$outdir/${title}${f_ext}"
done

This last example will result in:

  • A media file named $outdir/Bitcoin: A Peer-to-Peer Electronic Cash System.epub
  • ... with metadata applied as extended attributes
  • An rdf-turtle metadata entry in ~/.local/share/kitab/idx/bcd99f1ab4155f2a2a362e5b7938a852
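
One caveat with renaming by title: a title containing a slash would be read by mv as a path into a subdirectory. A small guard could look like the sketch below, where safe_name is a hypothetical helper and the underscore replacement is an arbitrary choice of mine.

```shell
# Hypothetical helper (not part of kitab or the script above): make a
# title usable as a single file name by replacing "/" with "_".
safe_name() {
        printf '%s' "$1" | tr '/' '_'
}

# Example use in the rename step, with a made-up title:
#   title=$(safe_name "TCP/IP Illustrated")
```

The Bitcoin paper title passes through unchanged, since it contains no slash.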

[1] The relevant documentation for kitab at the time of writing is here. To build kitab, simply clone the repository and build with cargo build --all-features.
[2] The kitab command in the script assumes you have built the kitab binary and made it available in your path.
[3] The script uses xmllint, which on Arch Linux is provided by the libxml2 package.