Checksums and deduplication on QNAP / Linux
I have a QNAP NAS where I copy over the contents of my phone every now and then as a backup. The problem is that you end up with a ton of directories called 'Phone_2025_09_24' and so on, all containing basically the same files. So how do we deduplicate this?
The script below works on a QNAP, but you'll have to connect to it via SSH and install a few packages. Since the QNAP doesn't come with a package manager by default, we'll install a community-driven one called Entware. For installation instructions I'll refer you to the wiki: https://github.com/Entware/Entware/wiki
If you're reading along and have a regular Linux server, like Ubuntu, you can install the needed packages with a simple:
apt install xxhash attr gawk
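On the QNAP itself, once Entware is set up, the same tools should be installable with opkg. Note that these package names are my best guess and might differ slightly in the Entware repository:
opkg update
opkg install xxhash attr gawk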
Why xxhash?
The more standard shasum and other hashing tools are mostly concerned with calculating a cryptographically secure hash, meaning making sure it's hard to deliberately craft two files that come up with the same hash. xxh128sum, on the other hand, is built for speed.
time shasum 1GB_file # real 0m3.783s
time xxh128sum 1GB_file # real 0m0.396s
Since we don't need to defend against deliberately crafted collisions, and the chance of an accidental collision with a 128-bit hash is negligible anyway, we go with xxh128sum.
Where do we store the hashes?
One trick we'll be using is the extended attributes of our filesystem. Modern filesystems, like ext4 and XFS, have the option of adding attributes to files. These are used for ACLs, for example, but regular users are allowed to create their own attributes in the 'user.' namespace. The advantage is that they are tied to the file: moving a file keeps the attributes attached, and deleting a file removes them as well. So this seems like a good place to keep the information.
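To get a feel for how this works, here is a quick demonstration on a throwaway file; 'user.checksum' is the attribute name the script below will use:
touch demo.txt
setfattr -n user.checksum -v "hello" demo.txt
getfattr -n user.checksum demo.txt    # shows user.checksum="hello"
mv demo.txt demo_moved.txt
getfattr -n user.checksum demo_moved.txt    # the attribute travelled along with the file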
What else do we store?
Because I also use this script to check whether a file has changed, I store the file size and 'last modified' time as well. We'll use the file size for deduplicating too, to make the chance of hash collisions even smaller, but we'll ignore the last modified time: copying the files over can change that timestamp, so it's useless for what we're trying to achieve.
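So the value we store is the file size, the modification time and the hash glued together with underscores. With a made-up file and made-up numbers it looks roughly like this:
stat -c '%s_%Y_' photo.jpg     # 2483911_1695543210_
xxh128sum photo.jpg            # 3c8a1f0b9d2e44aa17c5be60f91d02ab  photo.jpg
# stored value: 2483911_1695543210_3c8a1f0b9d2e44aa17c5be60f91d02ab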
Creating the hashes
What we'll do is create a number of working files as we go. This way you can check what's in each file before we continue with the next step.
To begin we make a list of all the files we want to create hashes for:
find /share/Backups -type f > 1
If you want, you can check the list. We use it as input for gawk, which calls the programs that create the data we need. The 'gensub' call at the start escapes single quotes in filenames so that when we execute the result later, bash won't get confused.
cat 1 | gawk 'BEGIN{FS=OFS="\t"; CHECK=0;} {   # CHECK=0 generates setfattr commands, CHECK=1 verifies existing attributes
  # Escape single quotes in the filename so the generated commands are safe for bash
  $1=gensub("\x27", "\x27\\\\\x27\x27", "g", $1);
  # Hash the file; xxh128sum prints "hash  filename", we keep the first 32 hex characters
  C="xxh128sum \x27" $1 "\x27";
  C | getline HASH;
  close(C);
  HASH=substr(HASH, 1, 32);
  # Get "size_mtime_" from stat
  C="stat -c \"%s_%Y_\" \x27" $1 "\x27";
  C | getline SIZE_MODIFIED;
  close(C);
  if(CHECK)
  {
    # Compare the stored attribute against the freshly calculated value
    HASH_NOW="no_hash";
    C="getfattr --only-values -n user.checksum \x27" $1 "\x27 2>/dev/null";
    C | getline HASH_NOW;
    close(C);
    if(SIZE_MODIFIED HASH != HASH_NOW)
      print "MISMATCH: " SIZE_MODIFIED HASH OFS HASH_NOW OFS "\x27" $1 "\x27";
  }
  else
  {
    # Report progress on stderr and write a setfattr command to stdout (it ends up in file 2)
    print SIZE_MODIFIED HASH OFS $1 > "/dev/stderr";
    print "setfattr -n user.checksum -v " SIZE_MODIFIED HASH " \x27" $1 "\x27";
  }
}' > 2
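Each line written to '2' is a setfattr command. With made-up values, one looks like this (the attribute value is size_mtime_hash, followed by the quoted filename):
setfattr -n user.checksum -v 2483911_1695543210_3c8a1f0b9d2e44aa17c5be60f91d02ab '/share/Backups/Phone_2025_09_24/photo.jpg'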
Once this is done, we have a file '2' containing one setfattr command per file we parsed. If you just want to deduplicate, you don't need to run them. If you do want the attributes stored on the files, run it like this:
bash 2
Deduplicating
Once that's done, we can deduplicate using the file sizes, hashes and filenames in the '2' file.
# Create a list of just filesize, hash and filename (the mtime field is dropped)
cat 2 | cut -f 5- -d ' ' | cut -f 1,3- -d '_' | sed 's/ /\t/' | sort > 3
# Now we parse the sorted list and print an rm command for every duplicate
cat 3 | gawk 'BEGIN{FS=OFS="\t"; H="";} {if(H == $1) print "rm " $2; H=$1;}' > 4
# By running the result, we erase all duplicates; the first copy of each file is kept
bash 4
Feel free to view the files in between the steps.
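To make the intermediate files concrete, here is the same made-up example from earlier. A line in '3' is the size and hash joined by an underscore, then a tab, then the quoted filename; '4' only gets a line when the size_hash matches the previous (sorted) line:
2483911_3c8a1f0b9d2e44aa17c5be60f91d02ab	'/share/Backups/Phone_2025_09_24/photo.jpg'
rm '/share/Backups/Phone_2025_10_12/photo.jpg'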
Integrity checking
If you ran file '2' and added the hashes as attributes to the files, you can do integrity checks later on by re-running the same hashing script, but with the CHECK variable changed to 1. This will print out any mismatches, letting you know if a file has changed since its attribute was set.
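You can also spot-check a single file by hand by comparing the stored attribute against freshly calculated values (the filename here is made up):
getfattr --only-values -n user.checksum '/share/Backups/Phone_2025_09_24/photo.jpg'
stat -c '%s_%Y_' '/share/Backups/Phone_2025_09_24/photo.jpg'
xxh128sum '/share/Backups/Phone_2025_09_24/photo.jpg'
The first command should print the size, mtime and the first 32 characters of the hash glued together, exactly as the other two commands report them.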