Massive data backup question: What Linux software do you folks recommend for helping sort out and organize terabytes of files and remove duplicates?

over_clox@lemmy.world · edit-2 11 months ago

Massive data backup question: What Linux software do you folks recommend for helping sort out and organize terabytes of files and remove duplicates?

everett@lemmy.ml · 11 months ago

Either I’m massively misunderstanding why it is you want to curate your backup by hand, or you’re missing the point of block-level deduplication. Shrug, either is possible.

over_clox@lemmy.world · 11 months ago

I get the concept of block level deduplication, no problem.

But some of these drives came from friends that reorganized their copy of files their own way, while I took my main branch they copied from and salvaged damaged files.

Ever heard of goodtools? I’ve spent an awful lot of time salvaging corrupt video game console ROMs. I have all of Atari 2600, most of NES and SNES, a number of N64 and a number of PSP games, along with a lot of other stuff.

I ain’t about to play headgames on what I have and haven’t salvaged already, I must keep track of what device stores what, what filename is what, and what dates are what.

I want an organized file/folder structure. I didn’t spend the past 20+ years to trust everything to automation.

everett@lemmy.ml · edit-2 11 months ago

“I ain’t about to play headgames on what I have and haven’t salvaged already, I must keep track of what device stores what, what filename is what, and what dates are what.”

This is precisely the headache I’m trying to save you from: micromanaging what you store for the purpose of saving storage space. Store it all, store every version of every file on the same filesystem, or throw it into the same backup system (one that supports block-level deduplication), and you won’t be wasting any space and you get to keep your organized file structure.

Ultimately, what we’re talking about is storing files, right? And your goal is to now keep files from these old systems in some kind of unified modern system, right? Okay, then. All disks store files as blocks, and with block-level dedup, a common block of data that appears in multiple files only gets stored once, and if you have more than one copy of a file, only the blocks that differ between the versions (if there are any) take up extra space. The stuff you said about filenames, modified dates and what ancient filesystem it was originally stored on… sorry, none of that is relevant.

When you browse your new, consolidated collection, you’ll see all the original folders and files. If two copies of a file happen to contain all the same data, the incremental storage needed to store the second copy is ~0. If you have two copies of the same file, but one was stored by your friend and 10% of it got corrupted before they sent it back to you, storing that second copy only costs you ~10% in extra storage. If you have historical versions of a file that was modified in 1986, 1992 and 2005 that lived on a different OS each time, what it costs to store each copy is just the difference.

I must reiterate that block-level deduplication doesn’t care what files the common data resides in; if it’s on the same filesystem, it gets deduplicated. This means you can store all the files you have, keep them all in their original contexts (folder structure), without wasting space storing any common parts of any files more than once.
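To make the storage arithmetic above concrete, here is a minimal toy sketch in Python. It is not how any particular filesystem or backup tool implements deduplication; the 4096-byte block size, the `BlockStore` class and the file paths are all made up for illustration. It only shows the idea that identical blocks are stored once while every file keeps its own path.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size; real filesystems and backup tools vary


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class BlockStore:
    """Toy content-addressed store: each unique block is kept only once."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}     # block hash -> block data
        self.files: dict[str, list[str]] = {}  # file path -> ordered block hashes

    def add_file(self, path: str, data: bytes) -> int:
        """Store a file and return how many new bytes had to be stored."""
        new_bytes = 0
        hashes = []
        for block in split_blocks(data):
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                self.blocks[digest] = block
                new_bytes += len(block)
            hashes.append(digest)
        self.files[path] = hashes
        return new_bytes

    def read_file(self, path: str) -> bytes:
        """Reassemble a file from its blocks, exactly as it was stored."""
        return b"".join(self.blocks[h] for h in self.files[path])


def demo() -> tuple[int, int, int]:
    """Store an original, an identical copy, and a copy with one damaged block."""
    store = BlockStore()
    original = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))  # 10 distinct blocks
    damaged = bytearray(original)
    damaged[0] ^= 0xFF  # corrupt one byte, which touches 1 of the 10 blocks
    first = store.add_file("mine/game.rom", original)
    second = store.add_file("friend/game.rom", original)
    third = store.add_file("friend/damaged/game.rom", bytes(damaged))
    return first, second, third
```

In this toy, `demo()` reports the full ten blocks for the first file, zero new bytes for the identical copy, and a single block for the damaged copy, while all three paths stay readable. One caveat: with fixed-size blocks like these, savings only show up when the matching data sits at the same block offsets. A copy with bytes inserted or removed near the start shifts every later block and may share nothing.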

over_clox@lemmy.world · 11 months ago

Also, try converting Big Endian vs Little Endian ROM file formats. I spent many months doing that, via goodtools.

I’m not in any hurry to accidentally overwrite a ROM that’s been corrected for consistency in my archives because some automatic sync software might think they’re supposed to be the same file.

over_clox@lemmy.world · 11 months ago

Block level dedupe doesn’t account for random data at the end of the last block. I want a byte-for-byte, hash-level file and folder comparison, with the file slack space nulled out. I also want to consolidate all related files into logically organized folders, not just a bunch of random folders titled ‘20250505 Backup Turd’.

I also have numerous drives with similar folder structures, some just minimalized to fit smaller drives. I also have archives from friends, based on the original structure from like 10 years ago, but their file system structures have varied from mine over the years.
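A byte-for-byte, hash-level comparison across several drives can be sketched in a few lines of Python. This is only an illustrative sketch, not a recommendation of any specific tool: the function names are made up, and it deliberately reports duplicates without deleting, moving, or overwriting anything. It hashes each file's logical contents as returned by the filesystem, so leftover slack space past the end of the file never enters the hash.

```python
import hashlib
import os
from collections import defaultdict


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file's contents in chunks so large files don't fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(roots: list[str]) -> list[list[str]]:
    """Return groups of paths whose contents are identical (read-only scan)."""
    # Pass 1: group by size, since files of different sizes cannot match.
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only the files that share a size with at least one other file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_sha256(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)
```

Calling `find_duplicates` with a list of mount points (hypothetical examples: `["/mnt/main", "/mnt/friend_drive"]`) returns lists of paths with identical contents, regardless of filename, folder, or date. Deciding which copy to keep and where it belongs in the folder structure stays a manual step.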

catloaf@lemm.ee · 11 months ago

If you want to be that particular about your system, you’d be best off just writing your own.

🧟‍♂️ Cadaver@lemmy.world · 11 months ago

Bro is thick. He just wants to hear “sorry bro, can’t be done” to justify the fact that he’ll be doing this by hand, because he wants to.

over_clox@lemmy.world · 11 months ago

Thank you, at least someone understands my stubbornness. 👍

over_clox@lemmy.world · 11 months ago

Don’t put it past me, I already have before.