Massive data backup question: What Linux software do you folks recommend for helping sort out and organize terabytes of files and remove duplicates?

over_clox@lemmy.world · edit-2 5 months ago

Massive data backup question: What Linux software do you folks recommend for helping sort out and organize terabytes of files and remove duplicates?

doeknius_gloek@discuss.tchncs.de · 5 months ago

I’ve had great success with restic. It will handle your 4TB just fine, here’s some stats of mine:

Total File Count: 78374
Total Size: 13.324 TiB

and another one, not as large but with lots of files

Total File Count: 1295210
Total Size: 2.717 TiB

Restic will automatically deduplicate your data so your duplicates won’t waste storage at your backup location.

I’ve recently learned about backrest which can serve as a restic UI if you’re not comfortable with the cli, but I haven’t used it myself.

To clean your duplicates at the source I would look into Czkawka as another lemming already suggested.

Ekpu@lemmy.world · 5 months ago

I use backrest self-hosted on my server running yunohost. It is pretty much set and forget. I love it.

Squizzy@lemmy.world · 5 months ago

Hey, does this have a GUI? I am new to Linux and can’t quite handle doing work like this without a GUI.

Churbleyimyam@lemm.ee · edit-2 5 months ago

I’ve had success using Czkawka (hiccup) for deduplicating

huskypenguin@sh.itjust.works · 5 months ago

Yea this software rules. I’ve analyzed 20TB with it.

Alas Poor Erinaceus@lemmy.ml · 5 months ago

For duplicates: Czkawka. Also, you get a gold ⭐ if you can figure out how to pronounce it 😉

Treasure@feddit.org · 5 months ago

Take a look into borg backup.

Dessalines@lemmy.ml · 5 months ago

Nightly rsync job in crontab works well enough, if it’s an external hard drive.

If you’re going over a network, syncthing.

Nine@lemmy.world · 5 months ago

I’ve abused syncthing in so many ways migrating servers and giant data sets. It’s freaking amazing. Though it’s been a few years since I’ve used it. Can only guess how much better it’s gotten.

over_clox@lemmy.world · 5 months ago

‘An’ drive? I mean like 10+ drives, looking to do a master backup.

Dessalines@lemmy.ml · 5 months ago

Rsync then.

over_clox@lemmy.world · 5 months ago

Please do explain then.

I have multiple drives with various differing directory trees.

Dessalines@lemmy.ml · 5 months ago

I have no idea what your setup is so you’ll need to do your own research on rsync.

over_clox@lemmy.world · 5 months ago

That’s just it, there is no setup, except Linux Mint as the main system. It’s literally a physical bucket of discs and drives in all sorts of various formats…

MrPoopbutt@lemmy.world · 5 months ago

Isn’t syncthing no longer supported?

Does that even matter if it isn’t?

Dessalines@lemmy.ml · 5 months ago

Syncthing is very much alive.

🧟‍♂️ Cadaver@lemmy.world · 5 months ago

Syncthing has been discontinued on android (but a fork exists)

MonkderVierte@lemmy.ml · edit-2 5 months ago

That is filesystem-level. Btrfs and, I think, ZFS have deduplication built in.

Btrfs gave me 150 GB on my 2 TB gaming disk that way.

baltakatei@sopuli.xyz · 5 months ago

Personally, my toolkit includes Jdupes for duplication scanning and Rsync for directory merging and file transfer.

lordnikon@lemmy.world · 5 months ago

I have had good luck with Dupeguru

truthfultemporarily@feddit.org · 5 months ago

So a lot of backup solutions do deduplication on the block level, so if you use a backup software that does this, you don’t need to dedup files.

over_clox@lemmy.world · 5 months ago

I have like 10+ hard drives and probably 75+ optical discs to back up, and across the different devices and media, the folder and file structure isn’t exactly consistent.

I already know in advance that I’m gonna have to curate this backup myself, it’s not quite as easy to just purely let backup/sync software do it all for me.

But I do need software to help.

everett@lemmy.ml · edit-2 5 months ago

across the different devices and media, the folder and file structure isn’t exactly consistent.

That’s the thing: it doesn’t need to be. If your backup software or filesystem supports block-level deduplication, all matching data only gets stored once, and filenames don’t matter. The files don’t even have to 100% match. You’ll still see all your files when browsing, but the system is transparently making sure to only store stuff once.

Some examples of popular backup software that does this are Borgbackup and Restic, while filesystems that can do this include BTRFS and ZFS.

over_clox@lemmy.world · 5 months ago

I guess you’re missing the point then. I’m backing up data coming from many different file systems, FAT12, FAT16, FAT32, EXFAT, NTFS, HPFS, EXT2, 3 and 4, ISOs (of varying degrees of copy protection plus MODE1 and MODE2 discs with audio tracks)…

Plus different date revisions of many files.

You think there’s anything consistent enough where any one solution works?

I need all the recommended software I can throw at it. Sure I’d love a purely automated solution, but I know there’s still gonna be a lot of manual curating on my part as well.

Also, files don’t have to match and filenames aren’t important? Are you a psychopath? That’s exactly what I want, to organize folder and filenames, and match and remove duplicates based on file hashes.
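Matching duplicates by file hash, as asked for here, can be sketched in a few lines of Python. This is only an illustration of the idea, not how Czkawka, fdupes or jdupes work internally; the function names `file_sha256` and `find_duplicates` are made up for the example, and it only reports groups of files with the same SHA-256 without deleting anything.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return groups of paths under root whose contents share a SHA-256."""
    # Pass 1: group by size; a file with a unique size cannot have a duplicate.
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file() and not path.is_symlink():
            by_size[path.stat().st_size].append(path)

    # Pass 2: hash only the files that share a size with another file.
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) > 1:
            for path in paths:
                by_hash[file_sha256(path)].append(path)

    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Calling `find_duplicates` on the folder that holds the per-drive copies gives a list of duplicate groups to review by hand, which leaves the decision about which name and location to keep with the person curating.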

everett@lemmy.ml · 5 months ago

Either I’m massively misunderstanding why it is you want to curate your backup by hand, or you’re missing the point of block-level deduplication. Shrug, either is possible.

over_clox@lemmy.world · 5 months ago

I get the concept of block level deduplication, no problem.

But some of these drives came from friends that reorganized their copy of files their own way, while I took my main branch they copied from and salvaged damaged files.

Ever heard of goodtools? I’ve spent an awful lot of time salvaging corrupt video game console ROMs. I have all of Atari 2600, most of NES and SNES, a number of N64 and a number of PSP games, along with a lot of other stuff.

I ain’t about to play headgames on what I have and haven’t salvaged already, I must keep track of what device stores what, what filename is what, and what dates are what.

I want an organized file/folder structure. I didn’t spend the past 20+ years to trust everything to automation.

everett@lemmy.ml · edit-2 5 months ago

I ain’t about to play headgames on what I have and haven’t salvaged already, I must keep track of what device stores what, what filename is what, and what dates are what.

This is precisely the headache I’m trying to save you from: micromanaging what you store for the purpose of saving storage space. Store it all, store every version of every file on the same filesystem, or throw it into the same backup system (one that supports block-level deduplication), and you won’t be wasting any space and you get to keep your organized file structure.

Ultimately, what we’re talking about is storing files, right? And your goal is to now keep files from these old systems in some kind of unified modern system, right? Okay, then. All disks store files as blocks, and with block-level dedup, a common block of data that appears in multiple files only gets stored once, and if you have more than one copy of the file, the difference between the versions (if there is any) gets stored as a diff. The stuff you said about filenames, modified dates and what ancient filesystem it was originally stored on… sorry, none of that is relevant.

When you browse your new, consolidated collection, you’ll see all the original folders and files. If two copies of a file happen to contain all the same data, the incremental storage needed to store the second copy is ~0. If you have two copies of the same file, but one was stored by your friend and 10% of it got corrupted before they sent it back to you, storing that second copy only costs you ~10% in extra storage. If you have historical versions of a file that was modified in 1986, 1992 and 2005 that lived on a different OS each time, what it costs to store each copy is just the difference.

I must reiterate that block-level deduplication doesn’t care what files the common data resides in: if it’s on the same filesystem, it gets deduplicated. This means you can store all the files you have, keep them all in their original contexts (folder structure), without wasting space storing any common parts of any files more than once.
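To make the storage argument concrete, here is a toy sketch of block-level deduplication in Python. It assumes fixed 4096-byte blocks and an in-memory dictionary, and the `BlockStore` class is invented for the illustration. Borg and Restic use content-defined chunking rather than fixed-size blocks, so treat this as the general idea and not as their implementation.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this illustration


class BlockStore:
    """Toy content-addressed store: each unique block is kept exactly once."""

    def __init__(self):
        self.blocks = {}  # block hash -> block bytes
        self.files = {}   # file name -> ordered list of block hashes

    def add_file(self, name, data):
        hashes = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            key = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(key, block)  # stored only if not seen before
            hashes.append(key)
        self.files[name] = hashes

    def read_file(self, name):
        return b"".join(self.blocks[key] for key in self.files[name])

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())


store = BlockStore()

# Ten distinct blocks, then a copy in which one block (10%) is corrupted.
original = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))
damaged = original[:3 * BLOCK_SIZE] + b"\xff" * BLOCK_SIZE + original[4 * BLOCK_SIZE:]

store.add_file("mine/game.rom", original)
store.add_file("friend/game.rom", original)     # exact copy: no new blocks
store.add_file("friend/game_bad.rom", damaged)  # one block differs

print(store.stored_bytes())  # 45056, versus 122880 for three full copies
```

All three names stay visible and each file reads back exactly as it went in; only the one differing block costs extra space. With fixed-size blocks, a copy that has bytes inserted near the start would shift every later block and share nothing, which is one reason real tools chunk by content instead.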

over_clox@lemmy.world · 5 months ago

Also, try converting Big Endian vs Little Endian ROM file formats. I spent many months doing that, via goodtools.

I’m not in any hurry to accidentally overwrite a ROM that’s been corrected for consistency in my archives because some automatic sync software might think they’re supposed to be the same file.

over_clox@lemmy.world · 5 months ago

Block level dedupe doesn’t account for random data at the end of the last block. I want a byte for byte hash level and folder comparison, with the file slack space nulled out. I also want to consolidate all related files into logically organized folders, not just a bunch of random folders titled ‘20250505 Backup Turd’

I also have numerous drives with similar folder structures, some just minimalized to fit smaller drives. I also have archives from friends, based on the original structure from like 10 years ago, but their file system structures have varied from mine over the years.

JTskulk@lemmy.world · 5 months ago

fdupes to find duplicate files, freefilesync to back it up.

solrize@lemmy.world · 5 months ago

I’m using Borg and it’s fine at that scale. I don’t know if it would still be viable with 100TB or whatever. The initial backup will be kind of slow but it encrypts everything, and deduplicates it too if I’m not mistaken. In any case, it deduplicates the common situation where you back up another snapshot later. Only the differences get written in the second backup. So you can save new snapshots fairly quickly and without much additional space.

over_clox@lemmy.world · 5 months ago

I don’t even want this data encrypted. Quite the opposite actually.

This is mostly the category of files getting deleted from the Internet Archive every day. I want to preserve what I got before it gets erased…

solrize@lemmy.world · 5 months ago

You can turn off Borg encryption but maybe what you really want is an object store (S3 style). Those exist too.

serenissi@lemmy.world · 5 months ago

I’m not recommending software. Since you mentioned old hard disks, it is better to copy the files, or better yet dd them, onto an SSD. That way indexing and finding duplicates will be faster, because you only have to read the old disks once, and if you dd you don’t have to care about fragmentation.

catloaf@lemm.ee · 5 months ago

Do you need any of it? Usually I’ve not even thought about what might be on an old drive.

If I was worried about the slim chance there’s something of critical importance I’d need later, I’d just look over each device and pick out individual files I might want, and dump the rest.

If you’re extremely paranoid, I’d take a block-level backup of each device and archive it.

over_clox@lemmy.world · 5 months ago

It’s not about whether I need any of the data or not. It’s about the fact that I have many archives scattered across many smaller drives of things getting deleted from the internet every day.

It’s about data preservation. And suddenly I have 2X 4TB hard drives and a 2TB hard drive? A total of 10TB, just suddenly found in a dumpster, and all the SMART stats check out?! 👍

I’m looking to backup everything I have from the past 25+ years!

Just a drop in the bucket, one of my drives has like almost all the SNES game ROMs…

lemming741@lemmy.world · 5 months ago

If it’s just buckets of data, mergerfs can pool the drives together, and then you can dedupe the whole lot.

Or consider buying a surplus 20TB drive, copy everything to it, dedupe the 20, write back to the 4+4+2 as cold spares. Those surplus drives are $10-14 per TB and I’ve had fantastic luck with them.

over_clox@lemmy.world · 5 months ago

These 4+4+2TB drives are fresh new to me, amazing they all seem to check out.

Right now, the drives I’ll be pulling data from range anywhere from 40GB to 320GB, from a variety of different file systems. And that’s not counting the many optical discs that need to be archived before disc rot sets in (I’m sure some have already, but looking better than I expected).

I don’t necessarily need a 20TB, just one of these 4TB drives ought to do the trick. Besides, it’s already gonna take me months to pull all my backups from the Internet Archive…

kylian0087@lemmy.dbzer0.com · 5 months ago

Sounds like you are a data hoarder haha. Can’t blame you. But for such hobbies perhaps a ZFS system with deduplication and a second ZFS system to use for backup of the first system is what you want.

Does get costly though.

billwashere@lemmy.world · 5 months ago

Honestly I maintain a list of file types I care about and copy those off. It’s mostly things I’ve created or specifically accumulated. Things like mp3, mkv, gcode, stl, jpeg, doc, txt, etc. Find all of those and copy them off. I also find any files over a certain size and copy them off unless they are things like library files, dlls, that sorta thing. Am I possibly going to miss something? Yeah. But I’ll get most of the things I care about.
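A rough Python sketch of that selection approach follows. The extension list comes from the post; the size cutoff, the skipped extensions and the function names are assumptions made for the example. It copies each match to the same relative path under the target, so a readme.txt stays next to the project it came from, and it assumes the target folder is not inside the source folder.

```python
import shutil
from pathlib import Path

WANTED_EXTENSIONS = {".mp3", ".mkv", ".gcode", ".stl", ".jpeg", ".doc", ".txt"}
SKIPPED_EXTENSIONS = {".dll", ".so"}  # assumed examples of "library files"
SIZE_THRESHOLD = 100 * 1024 * 1024    # assumed cutoff: 100 MiB


def wanted(path, size_threshold=SIZE_THRESHOLD):
    """Keep listed file types, skip library files, keep anything large."""
    suffix = path.suffix.lower()
    if suffix in WANTED_EXTENSIONS:
        return True
    if suffix in SKIPPED_EXTENSIONS:
        return False
    return path.stat().st_size >= size_threshold


def copy_selected(source_root, target_root, size_threshold=SIZE_THRESHOLD):
    """Copy wanted files, recreating their folder structure under target_root."""
    source_root, target_root = Path(source_root), Path(target_root)
    copied = []
    for path in sorted(source_root.rglob("*")):
        if path.is_file() and wanted(path, size_threshold):
            destination = target_root / path.relative_to(source_root)
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, destination)  # copy2 also keeps timestamps
            copied.append(destination)
    return copied
```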

over_clox@lemmy.world · 5 months ago

Not everything is an individual file though, a lot of the stuff needs to be stored and maintained as bulk folders.

I mod operating systems and occasionally games, plus write software. I can’t just dump off all text files into a single folder, that’ll just dump off all readme.txt files off into a single TXT folder, losing association with the project folders from which they came.

billwashere@lemmy.world · 5 months ago

Isn’t all the code in git somewhere? I would totally do that for code projects.

I do the same thing with arduino code so I know where you’re coming from.

over_clox@lemmy.world · 5 months ago

Not my code, I didn’t even have internet access when I started programming.

billwashere@lemmy.world · 5 months ago

I feel you. I started coding before the internet even existed (well technically it existed, just nobody had access to it)

just_another_person@lemmy.world · 5 months ago

Deduping only works for a single target or context at a time, so if you’re working with many drives, you’ll need to sort your data into unified locations on the backup target first, THEN run dedupe tools against it all.

Second, if all of your data from these drives fits uncompressed on the target drive, rsync will be the fastest to get the data from A to B.

over_clox@lemmy.world · 5 months ago

Of course.

Goal #1 is to migrate what data I can (which is a fucking lot) all over to the 4TB, in separate folders for each drive. Only after that will I worry with scanning for dupes and organizing things.

I’m just looking for advice on what software is recommended for helping deal with such large tasks in advance.

I’ve actually got 2X 4TB drives plus a single 2TB drive. But yeah, I know the best and easiest way is to consolidate it all on one drive first.

just_another_person@lemmy.world · 5 months ago

Then rsync is your friend, like so: rsync -avzp /drive1/ /target2/drive1/

That will copy all the files from drive1 to a destination folder in the backup drive called ‘drive1’.
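With 10+ drives, the same command can be generated once per drive. The Python sketch below only builds the command lines; the mount points and the `build_rsync_command` helper are hypothetical. It uses `-av` (`-a` already preserves permissions, so a separate `-p` is redundant) and leaves out `-z`, on the assumption that compression mainly helps over a network link rather than between local disks. `--dry-run` is on by default so nothing is copied until the output has been checked.

```python
from pathlib import PurePosixPath


def build_rsync_command(source, target_root, dry_run=True):
    """Build an rsync command copying `source` into target_root/<source name>/."""
    source = PurePosixPath(source)
    destination = PurePosixPath(target_root) / source.name
    command = ["rsync", "-av"]
    if dry_run:
        command.append("--dry-run")  # list what would be copied, change nothing
    # Trailing slashes: copy the contents of source into destination.
    command += [f"{source}/", f"{destination}/"]
    return command


drives = ["/mnt/drive1", "/mnt/drive2"]  # hypothetical mount points
for drive in drives:
    print(" ".join(build_rsync_command(drive, "/mnt/backup")))
```

To actually run a command, the returned list can be passed to `subprocess.run`, with `dry_run=False` once the dry run looks right.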

over_clox@lemmy.world · 5 months ago

Joy oh joy, I got like 75+ optical discs and like 10+ hard drives (whatever still works) to back up.

This is already gonna take months I know, just my free time at the end of the day.

This is gonna be fun. /s

Thank you and everyone for the advice though.

Side note, I think one of my drives has almost all the SNES game ROMS…