Find duplicate files. Find directories with many duplicate files. At some point this repo may also delete duplicates and replace them with shortcuts/symlinks.

Additionally, it may eventually find near-duplicate images using pHash or blockhash libraries, and perhaps use other methods to find near-duplicate files.
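Nothing in the repo does near-duplicate detection yet, but as a rough sketch of the idea, comparing perceptual hashes with the third-party `imagehash` and `Pillow` packages could look something like this (all names and thresholds below are illustrative, not part of this project):

```python
# Illustrative sketch only -- not part of this repo.
# Groups images whose perceptual hashes are within a small Hamming distance.
import os

import imagehash          # third-party: pip install imagehash
from PIL import Image     # third-party: pip install Pillow


def find_near_duplicate_images(root, max_distance=5):
    """Return pairs of image paths whose pHashes differ by <= max_distance bits."""
    hashes = []
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(('.jpg', '.jpeg', '.png', '.gif')):
                continue
            path = os.path.join(dirpath, name)
            try:
                hashes.append((path, imagehash.phash(Image.open(path))))
            except OSError:
                continue  # unreadable or corrupt image; skip it
    pairs = []
    for i, (path_a, hash_a) in enumerate(hashes):
        for path_b, hash_b in hashes[i + 1:]:
            # Subtracting two imagehash objects gives their Hamming distance.
            if hash_a - hash_b <= max_distance:
                pairs.append((path_a, path_b))
    return pairs
```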
To set up with Vagrant:

- Install Git, VirtualBox and Vagrant
- `git clone https://github.com/jamesmontalvo3/zendir.git`
- `cd zendir`
- `cp config.example.py config.py`
- Edit `config.py` to your liking (a hypothetical example follows this list)
- `vagrant up`
- SSH into the box with `vagrant ssh`
- `cd /vagrant`
- Optionally run `sudo bash mount-server.sh` to mount a server, entering the password when prompted
- Run the scan: `sudo python scan.py`. This could take a long time.
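What belongs in `config.py` is defined by `config.example.py` in the repo. The sketch below is only a hypothetical illustration of the sort of thing it holds (scan location and database settings); every name in it is made up rather than taken from the actual example file.

```python
# Hypothetical config.py -- the real option names come from config.example.py.
SCAN_ROOT = '/mnt/server'       # directory tree to scan for duplicates

DB_HOST = 'localhost'           # MySQL connection settings used by the scripts
DB_USER = 'root'
DB_PASSWORD = ''
DB_NAME = 'zendir'
```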
To set up directly on a server (without Vagrant):

- Run `sudo bash setup.sh`
- Edit `config.py` with your setup (see the hypothetical example above)
- Run `python setup-db.py`
- If you need to mount a drive to scan, run `bash mount-server.sh`
- Run `python scan.py` (a conceptual sketch of what the scan does follows this list)
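Judging from the database queries below, the scan hashes every file and records per-file and per-directory totals in MySQL. The following is only a simplified, conceptual sketch of that idea, not the actual contents of `scan.py`:

```python
# Conceptual sketch of a duplicate scan -- not the actual scan.py.
import hashlib
import os
from collections import defaultdict


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files don't have to fit in memory."""
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            sha1.update(chunk)
    return sha1.hexdigest()


def scan(root):
    """Return {sha1: [paths]}; any hash with more than one path is a duplicate set."""
    files_by_hash = defaultdict(list)
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                files_by_hash[sha1_of_file(path)].append(path)
            except OSError:
                continue  # unreadable file; skip it
    return files_by_hash
```

The real scan additionally stores file sizes and per-directory rollups (`num_files`, `num_dupes`, `total_bytes`, `dupe_bytes`) in the `files` and `directories` tables that the SQL queries below run against.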
I started writing this in Node.js, but it was not playing nicely. I wrote a functional-but-ugly API in Node (see another branch of this repo), but haven't yet ported it over to Python. For now I'll just drop useful SQL queries below. If you set up with Vagrant, you can log in to your VM with `vagrant ssh` and access the database as root with `sudo mysql` (no password required). Then try the following SQL commands.
```sql
/** Worst offending directories **/
SELECT
    path,
    (num_dupes / num_files) * 100 AS percent_duped,
    num_files,
    num_dupes,
    total_bytes,
    dupe_bytes
FROM directories
ORDER BY percent_duped DESC, dupe_bytes DESC;
```
```sql
/** How many directories have only duplicate files in them? **/
SELECT
    COUNT(*)
FROM (
    SELECT
        path,
        (num_dupes / num_files) * 100 AS percent_duped,
        num_files,
        num_dupes,
        total_bytes,
        dupe_bytes
    FROM directories
) AS tmp
WHERE percent_duped = 100;
```
```sql
/** What could be eliminated **/
SELECT
    SUM(extras) AS files_we_could_eliminate,
    SUM(dupe_size) AS bytes_we_could_eliminate
FROM (
    SELECT
        sha1,
        COUNT(*) - 1 AS extras,
        (COUNT(*) - 1) * bytes AS dupe_size
    FROM files
    WHERE is_dupe = 1
    GROUP BY sha1
) AS tmp;
```
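If you'd rather run these queries from Python than from the `mysql` prompt, a minimal sketch with the third-party `pymysql` package looks like the following; the credentials and database name are placeholders, so substitute whatever your `config.py` actually uses.

```python
# Minimal sketch: run the "worst offending directories" query from Python.
# Assumes the third-party pymysql package; credentials/database are placeholders.
import pymysql

QUERY = """
    SELECT path,
           (num_dupes / num_files) * 100 AS percent_duped,
           num_files, num_dupes, total_bytes, dupe_bytes
    FROM directories
    ORDER BY percent_duped DESC, dupe_bytes DESC
    LIMIT 20
"""

conn = pymysql.connect(host='localhost', user='root', password='', database='zendir')
try:
    with conn.cursor() as cursor:
        cursor.execute(QUERY)
        for path, pct, num_files, num_dupes, total_bytes, dupe_bytes in cursor.fetchall():
            print('%5.1f%% duped  %d/%d files  %d dupe bytes  %s'
                  % (pct, num_dupes, num_files, dupe_bytes, path))
finally:
    conn.close()
```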