Find duplicate files. Find directories with many duplicate files. At some point this repo may also delete duplicates and replace them with shortcuts/symlinks.

Additionally, it may eventually find near-duplicate images using pHash or blockhash libraries, and perhaps use other methods to find near-duplicate files.
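Nothing in the repo does near-duplicate detection yet, but as a rough sketch of the idea, comparing perceptual hashes with the third-party `imagehash` and `Pillow` packages could look something like this (all names and thresholds below are illustrative, not part of this project):

```python
# Illustrative sketch only -- not part of this repo.
# Groups images whose perceptual hashes are within a small Hamming distance.
import os

import imagehash          # third-party: pip install imagehash
from PIL import Image     # third-party: pip install Pillow


def find_near_duplicate_images(root, max_distance=5):
    """Return pairs of image paths whose pHashes differ by <= max_distance bits."""
    hashes = []
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(('.jpg', '.jpeg', '.png', '.gif')):
                continue
            path = os.path.join(dirpath, name)
            try:
                hashes.append((path, imagehash.phash(Image.open(path))))
            except OSError:
                continue  # unreadable or corrupt image; skip it
    pairs = []
    for i, (path_a, hash_a) in enumerate(hashes):
        for path_b, hash_b in hashes[i + 1:]:
            # Subtracting two imagehash objects gives their Hamming distance.
            if hash_a - hash_b <= max_distance:
                pairs.append((path_a, path_b))
    return pairs
```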
To set up with Vagrant:

- Install Git, VirtualBox and Vagrant
- `git clone https://github.com/jamesmontalvo3/zendir.git`
- `cd zendir`
- `cp config.example.py config.py`
- Edit `config.py` to your liking (a hypothetical example follows this list)
- `vagrant up`
- SSH into the box with `vagrant ssh`
- `cd /vagrant`
- Optionally run `sudo bash mount-server.sh` to mount a server, entering the password when prompted
- Run the scan: `sudo python scan.py`. This could take a long time.
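What belongs in `config.py` is defined by `config.example.py` in the repo. The sketch below is only a hypothetical illustration of the sort of thing it holds (scan location and database settings); every name in it is made up rather than taken from the actual example file.

```python
# Hypothetical config.py -- the real option names come from config.example.py.
SCAN_ROOT = '/mnt/server'       # directory tree to scan for duplicates

DB_HOST = 'localhost'           # MySQL connection settings used by the scripts
DB_USER = 'root'
DB_PASSWORD = ''
DB_NAME = 'zendir'
```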
To set up directly on a server (without Vagrant):

- Run `sudo bash setup.sh`
- Edit `config.py` with your setup (see the hypothetical example above)
- Run `python setup-db.py`
- If you need to mount a drive to scan, run `bash mount-server.sh`
- Run `python scan.py` (a conceptual sketch of what the scan does follows this list)
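Judging from the database queries below, the scan hashes every file and records per-file and per-directory totals in MySQL. The following is only a simplified, conceptual sketch of that idea, not the actual contents of `scan.py`:

```python
# Conceptual sketch of a duplicate scan -- not the actual scan.py.
import hashlib
import os
from collections import defaultdict


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files don't have to fit in memory."""
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            sha1.update(chunk)
    return sha1.hexdigest()


def scan(root):
    """Return {sha1: [paths]}; any hash with more than one path is a duplicate set."""
    files_by_hash = defaultdict(list)
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                files_by_hash[sha1_of_file(path)].append(path)
            except OSError:
                continue  # unreadable file; skip it
    return files_by_hash
```

The real scan additionally stores file sizes and per-directory rollups (`num_files`, `num_dupes`, `total_bytes`, `dupe_bytes`) in the `files` and `directories` tables that the SQL queries below run against.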
I started writing this in Node.js, but it was not playing nicely. I wrote a functional-but-ugly API in Node (see another branch of this repo), but haven't yet ported it over to Python. For now I'll just drop useful SQL queries below. If you set up with Vagrant, you can log in to your VM with `vagrant ssh` and access the database as root with `sudo mysql` (no password required). Then try the following SQL commands.
```sql
/** Worst offending directories **/
SELECT
    path,
    (num_dupes / num_files) * 100 AS percent_duped,
    num_files,
    num_dupes,
    total_bytes,
    dupe_bytes
FROM directories
ORDER BY percent_duped DESC, dupe_bytes DESC;
```
```sql
/** How many directories have only duplicate files in them? **/
SELECT
    COUNT(*)
FROM (
    SELECT
        path,
        (num_dupes / num_files) * 100 AS percent_duped,
        num_files,
        num_dupes,
        total_bytes,
        dupe_bytes
    FROM directories
) AS tmp
WHERE percent_duped = 100;
```
```sql
/** What could be eliminated **/
SELECT
    SUM(extras) AS files_we_could_eliminate,
    SUM(dupe_size) AS bytes_we_could_eliminate
FROM (
    SELECT
        sha1,
        COUNT(*) - 1 AS extras,
        (COUNT(*) - 1) * bytes AS dupe_size
    FROM files
    WHERE is_dupe = 1
    GROUP BY sha1
) AS tmp;
```
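If you'd rather run these queries from Python than from the `mysql` prompt, a minimal sketch with the third-party `pymysql` package looks like the following; the credentials and database name are placeholders, so substitute whatever your `config.py` actually uses.

```python
# Minimal sketch: run the "worst offending directories" query from Python.
# Assumes the third-party pymysql package; credentials/database are placeholders.
import pymysql

QUERY = """
    SELECT path,
           (num_dupes / num_files) * 100 AS percent_duped,
           num_files, num_dupes, total_bytes, dupe_bytes
    FROM directories
    ORDER BY percent_duped DESC, dupe_bytes DESC
    LIMIT 20
"""

conn = pymysql.connect(host='localhost', user='root', password='', database='zendir')
try:
    with conn.cursor() as cursor:
        cursor.execute(QUERY)
        for path, pct, num_files, num_dupes, total_bytes, dupe_bytes in cursor.fetchall():
            print('%5.1f%% duped  %d/%d files  %d dupe bytes  %s'
                  % (pct, num_dupes, num_files, dupe_bytes, path))
finally:
    conn.close()
```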