The following script uses os.walk() to find every file in BlueLeaks that is at least 100MB. You can also find it in find-big-files.py.
import os

def main():
    blueleaks_path = "/Volumes/datasets/BlueLeaks-extracted"
    for dirname, subdirnames, filenames in os.walk(blueleaks_path):
        for filename in filenames:
            absolute_filename = os.path.join(dirname, filename)
            size_in_bytes = os.path.getsize(absolute_filename)
            size_in_mb = int(size_in_bytes / 1024 / 1024)
            if size_in_mb >= 100:
                print(f"{absolute_filename} is {size_in_mb}MB")

if __name__ == "__main__":
    main()
When you run it, the script prints the path and size of each large file it finds:
micah@trapdoor chapter-8 % python3 find-big-files.py
/Volumes/datasets/BlueLeaks-extracted/usao/usaoflntraining/files/VVSF00000/001.mp4 is 644MB
/Volumes/datasets/BlueLeaks-extracted/chicagoheat/html/ZA-CHICAGO HEaT_LR-20160830-034_Final Files.pdf is 102MB
/Volumes/datasets/BlueLeaks-extracted/nmhidta/files/RFIF300000/722.pdf is 148MB
/Volumes/datasets/BlueLeaks-extracted/nmhidta/files/RFIF200000/543.pdf is 161MB
/Volumes/datasets/BlueLeaks-extracted/nmhidta/files/RFIF100000/723.pdf is 206MB
/Volumes/datasets/BlueLeaks-extracted/fbicahouston/files/VVSF00000/002.mp4 is 145MB
/Volumes/datasets/BlueLeaks-extracted/fbicahouston/files/PSAVF100000/009.mp4 is 146MB
/Volumes/datasets/BlueLeaks-extracted/fbicahouston/files/PSAVF100000/026.mp4 is 105MB
--snip--
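The output above appears in the order os.walk() visits folders, not in order of size. As a minimal sketch of one possible variation (not part of find-big-files.py), the same walk can collect matches into a list and sort them largest first. The function name find_big_files and its min_bytes parameter are hypothetical names chosen for this illustration, and skipping files that raise OSError is an added assumption about how to handle broken symlinks or unreadable files.

```python
import os


def find_big_files(root, min_bytes=100 * 1024 * 1024):
    """Return (path, size_in_bytes) pairs for files of at least min_bytes, largest first.

    Hypothetical helper for illustration; the default threshold matches the
    100MB cutoff used in find-big-files.py.
    """
    results = []
    for dirname, subdirnames, filenames in os.walk(root):
        for filename in filenames:
            absolute_filename = os.path.join(dirname, filename)
            try:
                size_in_bytes = os.path.getsize(absolute_filename)
            except OSError:
                # Assumption: skip files that can't be measured, such as broken symlinks
                continue
            if size_in_bytes >= min_bytes:
                results.append((absolute_filename, size_in_bytes))

    # Sort by size, largest first
    results.sort(key=lambda item: item[1], reverse=True)
    return results


if __name__ == "__main__":
    # Same path as the script above; adjust it for your own machine
    for path, size_in_bytes in find_big_files("/Volumes/datasets/BlueLeaks-extracted"):
        print(f"{path} is {size_in_bytes // (1024 * 1024)}MB")
```

Comparing raw byte counts against the threshold selects the same files as the original script, since truncating the size to whole megabytes and checking for at least 100 is equivalent to checking for at least 100 * 1024 * 1024 bytes. Passing a smaller min_bytes value makes it easy to try the function on a small test folder first.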