Skip to content

helgeho/internetarchive-transfer-scripts

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Archive.org collection transfer scripts

The provided Shell/Python scripts use the Internet Archive Python-/CL-interface to transfer selected filetypes from Internet Archive collections to a Hadoop cluster (HDFS).

The transfer happens in two steps:

  1. A list of all files to transfer will be created (create_filelist.py)
  2. While the files are being transfered, the script keeps track of the already transfered files to allow restarting the process anytime and continue where it stopped (download_files.py)

The downloaded files are being separated into different folders corresponding to their filetypes.

During transfer, a specified number of files will be downloaded into a local staging directory and copied to HDFS in a bunch (default 10, max_staging in download_files.py).

Usage

First of all, please install https://github.com/jjjake/internetarchive to have the required ia command available.

Next, please modify download.sh according to your needs to include your paths and required filetypes. download.sh calls the python scripts and should be used to start off the transfer process.

To be called with ./download.sh <COLLECTION_NAME>.
E.g., ./download.sh ArchiveIt-Collection-1234

Called by download.sh: ./create_filelist.py <COLLECTION_NAME> <GLOB_PATTERNS>.
E.g., ./create_filelist.py ArchiveIt-Collection-1234 *.cdx *.cdx.gz *.warc.gz

Called by download.sh: ./download_files.py <COLLECTION_NAME> <LOCAL_STAGING_PATH> <HDFS_PATH>
E.g., ./download_files.py ArchiveIt-Collection-1234 /mnt/ephemeral0/holzmann/tmp /user/holzmann/ia1

About

Scripts to transfer archive.org collections, using https://github.com/jjjake/internetarchive

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •