Tales - On tagging files simply

2018, Toulouse, France by M. Bivert – CC-BY-SA-4.0

It’s common to need to organize large filesets (pictures, papers, etc.). There are specialized programs, but they can get hairy quickly, and simple prototypes are easy to think of, e.g.: have a special directory containing one subdirectory per tag, and use ln(1) to tag files, by storing links under such directories. There are a few drawbacks, for example:

It is not resilient to file renaming (or at least requires duplicating files);
Some queries (e.g. AND) are slightly clunky to write.
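
For comparison, the link-based prototype could look roughly like the sketch below. The `./tags` layout and the `lntag`/`lnand` helper names are assumptions made for illustration, and symbolic links are only one of several ways to do it.

```shell
#!/bin/sh
# Link-based tagging: one subdirectory per tag under $TAGDIR
# (hypothetical layout, not taken from any existing tool).
TAGDIR=${TAGDIR:-./tags}

# lntag <file> <tag>...: link <file> under each tag directory.
lntag() {
	f=$1; shift
	# Absolute target, so that the link resolves from within $TAGDIR.
	abs=$(cd "$(dirname "$f")" && pwd)/$(basename "$f")
	for t in "$@"; do
		mkdir -p "$TAGDIR/$t"
		# The link dangles as soon as the target is renamed or moved.
		ln -s "$abs" "$TAGDIR/$t/"
	done
}

# lnand <tag0> <tag1>: print the names linked under both tags.
lnand() {
	for p in "$TAGDIR/$1"/*; do
		n=${p##*/}
		# -L: the link itself exists, even if its target is gone.
		if [ -L "$TAGDIR/$2/$n" ]; then echo "$n"; fi
	done
}
```

An AND over two tags already needs a loop (or a sort/comm pipeline); more tags make it worse, which is the clunkiness mentioned above.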

There’s another, more modest approach, solving those issues at a minor cost: encode the tags in the filename, e.g.: tag0_tag1_-___-<original filename>.

The tag separator (_) is arbitrary; the separator between the tags and the (original) filename (_-___-; looks like a -___- but conveniently doesn’t start with a -) should be unique enough. So, we trade an “ugly” filename and a few false positives (expected to be practically insignificant at worst) for a tagging system which is:

file system independent (almost: “long” filenames can be an issue on old stuff);
reasonably OS-independent;
resilient to renaming;
free of additional software to install/update/maintain;
code-free (almost: some bits can be helpful, e.g. to ease automated processing);
trivial to migrate away from if need be;
etc.
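
To make the naming scheme concrete, here is a minimal sketch that splits a tagged name back into its tags and its original filename, using only shell parameter expansion. The `tagsof` and `nameof` function names are made up for this example.

```shell
#!/bin/sh
sep='_-___-'   # separator between the tags and the original filename
tsep='_'       # separator between two tags

# tagsof <name>: print the tags of a tagged filename, one per line;
# print nothing for an untagged name.
tagsof() {
	case $1 in
	*"$sep"*)
		t=${1%%"$sep"*}          # everything before the first $sep
		printf '%s\n' "$t" | tr "$tsep" '\n';;
	esac
}

# nameof <name>: print the original filename, tags stripped.
nameof() {
	case $1 in
	*"$sep"*)
		n=${1#*"$sep"}           # everything after the first $sep
		printf '%s\n' "$n";;
	*)
		printf '%s\n' "$1";;
	esac
}
```

Both functions cut at the first occurrence of the separator, so an original filename that itself contains `_-___-` is still recovered whole. Migrating away from the scheme amounts to renaming each file to the output of `nameof`.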

Arbitrary queries can be performed:

OR: ... | grep -E '(tag0|tag1).*_-___-';
AND: ... | grep '^tag0.*tag1.*tag5.*_-___-' (just sort the tags alphabetically);
NOT: ... | grep -v 'tag0.*_-___-';
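
As a worked example, the three kinds of queries can be tried on a small, made-up listing. The file names below are hypothetical, and the patterns are loosely anchored variants of the ones above.

```shell
#!/bin/sh
# Hypothetical listing, standing in for the output of ls(1) or locate(1).
names() {
	cat <<'EOF'
beach_holiday_-___-img1.jpg
beach_work_-___-img2.jpg
holiday_-___-img3.jpg
img4.jpg
EOF
}

# OR: tagged beach or work (-E enables the (a|b) alternation).
names | grep -E '(beach|work).*_-___-'

# AND: tagged beach and holiday; tags are given in alphabetical order,
# which is the order they are stored in.
names | grep 'beach.*holiday.*_-___-'

# NOT: everything not tagged beach, untagged files included.
names | grep -v 'beach.*_-___-'
```

These patterns match substrings: `beach` would also select a tag such as `beachside`. Tighter anchoring is possible, at the price of longer patterns.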

Finally, locate(1)/updatedb(8) naturally provides central access to all tagged files, though on a per-disk basis, since by default it only manages files on the root file system:

$ updatedb
$ locate _-___-
...

If the files are on an external hard drive, you’ll “have to” create a dedicated database (and perhaps write a small wrapper to ease querying):

$ updatedb -l 0 -o ~/.tmp0-updatedb.db -U /mnt/tmp0/
$ locate -d ~/.*.db _-___-
$ locate -d ~/.*.db _-___-|wc -l
199
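
The wrapper could be as small as the sketch below. It assumes mlocate's locate(1), whose -d option takes a colon-separated list of databases, and a naming convention of one `~/.<disk>-updatedb.db` file per disk; the `tagdbs` and `tagged` names are invented for the example.

```shell
#!/bin/sh
sep='_-___-'

# tagdbs: print every per-disk database found in $HOME as a
# colon-separated list (assumed naming: ~/.<disk>-updatedb.db).
tagdbs() {
	l=
	for db in "$HOME"/.*-updatedb.db; do
		if [ -f "$db" ]; then l="${l:+$l:}$db"; fi
	done
	printf '%s\n' "$l"
}

# tagged [regex]: list all tagged files known to those databases,
# optionally filtered through grep(1).
tagged() {
	locate -d "$(tagdbs)" "$sep" | grep -- "${1:-.}"
}
```

For instance, `tagged 'tag0.*_-___-'` would list the files tagged tag0 across all indexed disks.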

For the record, most modern file systems support filenames up to 255 bytes long, for instance:

$ getconf  NAME_MAX /
255
$ mount |grep ' / '
/dev/sda6 on / type ext4 (rw,relatime)
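
Before adding yet another tag, the limit can be checked automatically. A minimal sketch, with a hypothetical `fits` helper:

```shell
#!/bin/sh
# fits <name> [dir]: succeed if <name> is at most NAME_MAX bytes long
# on the file system holding [dir] (default: current directory).
fits() {
	max=$(getconf NAME_MAX "${2:-.}")
	# wc -c counts bytes, not characters; $(( )) strips the padding
	# some wc(1) implementations add.
	len=$(($(printf '%s' "$1" | wc -c)))
	[ "$len" -le "$max" ]
}
```

It is meant to be used as `fits "$newname" "$dir" || echo "name too long" 1>&2`.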

A Huffman-like coding, using printable characters instead of bits, could be used to compress the tags further, if need be. This can be done approximately by hand to a reasonable degree.
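
Huffman coding gives the shortest codes to the most frequent symbols, so the manual version starts with a frequency count. The sketch below only does that counting; choosing the abbreviations is left to the reader, and the `tagfreq` name is an assumption.

```shell
#!/bin/sh
sep='_-___-'
tsep='_'

# tagfreq: read (paths to) tagged files on stdin, print "<count> <tag>"
# lines, most frequent tag first.
tagfreq() {
	sed 's,.*/,,' |              # keep the basename only
		grep -- "$sep" |     # skip untagged files
		sed "s/$sep.*//" |   # keep the tags only
		tr "$tsep" '\n' |    # one tag per line
		sort | uniq -c | sort -rn |
		awk '{ print $1, $2 }'
}
```

Typical use would be `locate _-___- | tagfreq | head`, then giving one- or two-character names to the tags at the top of the list.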

False garlic, 2018, Toulouse, France by M. Bivert – CC-BY-SA-4.0

As I need to (manually) tag dense directories (c. 4000 files), I’ve written two small scripts to help with automated processing:

The first adds tags to, or removes tags from, a file (trimming duplicates and sorting alphabetically);
The second relies on the first one to provide a way to tag batches of files stored in a single directory.

#!/bin/sh

set -e

# <tags>$sep<filename>
sep='_-___-'

# <tags>=<tag0>$tsep<tag1>$tsep...
# XXX assumed to be one byte long later
tsep='_'

if [ -z "$1" ]; then
	echo "$(basename "$0") [-dr] <path/to/file> [tags]" 1>&2
	exit 1
fi

dryrun=mv
rm=
while getopts "dr" opt; do
	case "$opt" in
	d) dryrun="echo mv";;
	r) rm=1;;
	esac
done

shift $((OPTIND-1))

if echo $1 | grep -q '[ 	]'; then
	echo "No spaces allowed in filename: '$1'" 1>&2
	exit 1
fi

f=$1
shift; tags="$@"

tags2lines() {
	tr -s ''$tsep' \t' '\n'
}

lines2tags() {
	# XXX 2 = 1+length($tsep): skip the leading $tsep
	sort -u | awk '{ s = s "'$tsep'" $1 } END{ print substr(s, 2) }'
}

d=`dirname $f`
basename $f | awk -F"$sep" '{
		if ($2 == "") printf("\n%s\n",  $1);
		else          printf("%s\n%s\n", $1, $2); }' | {
	read ts; read fn

	if [ -z "$rm" ]; then
		tags=$(echo $tags $ts | tags2lines | lines2tags)
	else
		# -x -F: only drop tags matching exactly (no substring/regex match)
		tags=$(echo $ts | tags2lines | grep -vxF "$(echo $tags | tags2lines)" | lines2tags)
	fi

	if [ "$f" != "$d/$tags$sep$fn" ]; then
		$dryrun $f $d/$tags$sep$fn
	fi
}

#!/bin/sh

# e.g.: sh batchtag -s "feh -." /home/$USER/photos/

# TODO: batch tag renaming

set -e

# <tags>$sep<filename>
sep='_-___-'

# <tags>=<tag0>$tsep<tag1>$tsep...
# XXX assumed to be one byte in ./tag
tsep='_'

PATH=.:$PATH:

if ! command -v tag >/dev/null; then
	echo 'tag(1) not found in $PATH' 1>&2
	exit 1
fi

skip=
if [ "$1" = "-s" ]; then skip=1; shift; fi

# e.g. "feh -." for viewing images, "xpdf" for .pdfs, etc.
if [ -z "$2" ]; then
	echo "$(basename "$0") [-s] <prog> <path/to/dir/>" 1>&2
	exit 1
fi

for x in $2/*; do
	if [ -n "$skip" ] && echo $x | grep -q -- $sep; then
		continue
	fi

	$1 $x &
	pid=$!

	echo $x; printf 'new set of tags: '

	# assume ^D
	if ! read ts; then kill $pid || true; exit 0; fi

	# $pid may have been killed already
	set +e; kill $pid; set -e

	# Empty string is a no-op
	if [ "$ts" != "" ]; then tag $x $ts; fi
done
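
The batch tag renaming left as a TODO above could be approached along these lines. This is a standalone sketch rather than an extension of the scripts: `retag` is a hypothetical name, tags are assumed to be plain alphanumeric words (they end up in a sed pattern), and file names are assumed to contain no whitespace, as in tag.

```shell
#!/bin/sh
sep='_-___-'
tsep='_'

# retag <old> <new> <dir>: rename tag <old> to <new> for every tagged
# file directly under <dir>.
retag() {
	for p in "$3"/*"$sep"*; do
		[ -e "$p" ] || continue      # unmatched glob
		b=${p##*/}
		ts=${b%%"$sep"*}             # tags
		fn=${b#*"$sep"}              # original filename
		# One tag per line, swap exact matches only, then
		# sort, trim duplicates and join again.
		nts=$(printf '%s\n' "$ts" | tr "$tsep" '\n' |
			sed "s/^$1\$/$2/" | sort -u | paste -s -d "$tsep" -)
		[ "$nts" = "$ts" ] || mv "$p" "$3/$nts$sep$fn"
	done
}
```

For example, `retag holiday vacation ~/photos` would turn `beach_holiday_-___-a.jpg` into `beach_vacation_-___-a.jpg` and leave untagged files alone.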

Comments

By email, at mathieu.bivert chez: