Spidy (/spˈɪdi/) is the simple, easy to use command line web crawler.
This is the very technical documentation file.
Heads up: We have no idea how to do this! If you wish to help, please do: just edit and make a Pull Request!
If you're looking for the plain English, check out the README.
See CONTRIBUTING.md
for some guidelines on how to get started.
General information and thoughts about spidy.
A good read about web crawlers and some theory that goes with them is Michael Nielson's How to crawl a quarter billion webpages in 40 hours.
It helped me understand how a web crawler should run, and is just a good article in general.
The original plan for a spidy GUI was an interface for those who prefer clicky things over the command line.
Users would select options from dropdowns, checkboxes, and text fields instead of using a config file or entering them in the console. A textbox would hold the console output, and there would be counters for errors, pages crawled, etc.
Eventually having the crawler and GUI bundled into an exe - created with something like py2exe - would be great.
Here is a rough wireframe of the original idea.
Contains configuration files.
Contains images used in this README file.
Some of these files will be created when crawler.py
is first run.
Contains all of the links that spidy has found but not yet crawled.
Contains all of the links that spidy has already visited.
Contains all of the links that caused errors for some reason.
Contains all of the words that spidy has found.
The important code. This is what you will run to crawl links and save information.
Because the internet is so big, this will practically never end.
The development file for the GUI.
Runs all tests for the crawler.
The stable, up-to-date branch.
Falconwarriorr's branch.
He has developed a bunch of features that we are working on merging into master.
Everything that follows is intended to be detailed information on each piece in crawler.py. There are a lot of 'TODO's, though!
This section lists the custom classes in crawler.py.
Most are Errors or Exceptions that may be raised throughout the code.
HeaderError
- (Source)
Raised when there is a problem deciphering HTTP headers returned from a website.
SizeError
- (Source)
Raised when a file is too large to download in an acceptable time.
This section lists the functions in crawler.py
that are used throughout the code.
check_link
- (Source)
Determines whether links should be crawled.
Types of links that will be pruned (see the sketch after this list):
- Links that are too long or short.
- Links that don't start with http(s).
- Links that have already been crawled.
- Links in KILL_LIST.
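A minimal sketch of this filter, assuming check_link returns True for links that should be pruned; the length limits and the KILL_LIST entries shown here are illustrative, not the actual values in crawler.py.

```python
# Illustrative sketch only; the real check_link may use different limits
# and a different return convention.
KILL_LIST = ['scores.usaultimate.org/', 'web.archive.org/web/']

def check_link(item, done, min_len=10, max_len=255):
    """Return True if the link should be pruned (not crawled)."""
    if len(item) < min_len or len(item) > max_len:      # too long or too short
        return True
    if not item.startswith(('http://', 'https://')):    # must start with http(s)
        return True
    if item in done:                                     # already crawled
        return True
    if any(bad in item for bad in KILL_LIST):            # known problem pages
        return True
    return False
```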
check_path
- (Source)
Checks whether a file path will cause errors when saving.
Paths longer than 256 characters cannot be saved (Windows).
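As a sketch, the check reduces to a length comparison against that limit:

```python
# Sketch: a path is only safe to save if it fits within the 256-character
# Windows limit mentioned above.
def check_path(file_path):
    """Return True if the file path is short enough to be saved."""
    return len(file_path) <= 256
```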
check_word
- (Source)
Checks whether a word is valid.
The word-saving feature was originally added to be used for password cracking with hashcat, which is why check_word checks for a length of less than 16 characters.
The average password length is around 8 characters.
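A sketch of that rule (rejecting empty strings is an assumption):

```python
# Sketch of the word filter: keep only non-empty words shorter than
# 16 characters, per the password-cracking rationale above.
def check_word(word):
    """Return True if the word is valid."""
    return 0 < len(word) < 16
```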
crawl
- (Source)
Does all of the crawling, scraping, and saving of a single document.
err_log
- (Source)
Saves the triggering error to the log file.
get_mime_type
- (Source)
Extracts the Content-Type header from the headers returned by a page.
get_time
- (Source)
Returns the current time in the format HH:MM:SS.
get_full_time
- (Source)
Returns the current time in the format HH:MM:SS, Day, Mon, YYYY.
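Both helpers can be expressed with time.strftime; the exact format strings used by crawler.py may differ slightly from this sketch.

```python
import time

def get_time():
    # e.g. '23:17:06'
    return time.strftime('%H:%M:%S')

def get_full_time():
    # e.g. '23:17:06, Friday, Jul, 2017' -- an approximation of the
    # 'HH:MM:SS, Day, Mon, YYYY' format described above
    return time.strftime('%H:%M:%S, %A, %b, %Y')
```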
handle_keyboard_interrupt
- (Source)
Shuts down the crawler when a KeyboardInterrupt is raised.
info_log
- (Source)
Logs important information to the console and log file.
Example log:
[23:17:06] [spidy] [INFO]: Queried 100 links.
[23:17:06] [spidy] [INFO]: Started at 23:15:33.
[23:17:06] [spidy] [INFO]: Log location: logs/spidy_log_1499483733
[23:17:06] [spidy] [INFO]: Error log location: logs/spidy_error_log_1499483733.txt
[23:17:06] [spidy] [INFO]: 1901 links in TODO.
[23:17:06] [spidy] [INFO]: 110446 links in done.
[23:17:06] [spidy] [INFO]: 0/5 new errors caught.
[23:17:06] [spidy] [INFO]: 0/20 HTTP errors encountered.
[23:17:06] [spidy] [INFO]: 1/10 new MIMEs found.
[23:17:06] [spidy] [INFO]: 3/20 known errors caught.
[23:17:06] [spidy] [INFO]: Saving files...
[23:17:06] [spidy] [LOG]: Saved TODO list to crawler_todo.txt
[23:17:06] [spidy] [LOG]: Saved done list to crawler_done.txt
[23:17:06] [spidy] [LOG]: Saved 90 bad links to crawler_bad.txt
log
- (Source)
Logs a single message to the error log file. The message is written verbatim, so it must be formatted correctly in the function call.
make_file_path
- (Source)
Makes a valid Windows file path for a given URL.
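A rough sketch of the idea: drop the scheme and replace characters that Windows does not allow in file names. The character set, the saved/ prefix, and the default extension here are assumptions, not the exact behaviour of crawler.py.

```python
# Sketch only; the real make_file_path handles more cases (extensions,
# query strings, length limits, etc.).
def make_file_path(url, extension='.html'):
    url = url.replace('http://', '').replace('https://', '')
    for char in '\\:*?"<>|':             # characters not allowed in Windows file names
        url = url.replace(char, '-')
    url = url.replace('/', '-')          # flatten the path into one file name
    return 'saved/' + url + extension
```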
make_words
- (Source)
Returns a list of all the valid words (determined using check_word) on a given page.
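A sketch of the idea, assuming the page is a requests-style response with a .text attribute and that whitespace splitting is good enough:

```python
def check_word(word):
    # same rule as the check_word sketch above
    return 0 < len(word) < 16

def make_words(page):
    """Return the valid words found in the page text."""
    return [word for word in page.text.split() if check_word(word)]
```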
mime_lookup
- (Source)
This finds the correct file extension for a MIME type using the MIME_TYPES
dictionary.
If the MIME type is blank it defaults to .html, and if the MIME type is not in the dictionary a HeaderError is raised.
Usage:
mime_lookup(value)
Where value
is the MIME type.
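A minimal sketch of that behaviour, with a tiny illustrative subset of the real MIME_TYPES dictionary (the lower-casing is an assumption):

```python
MIME_TYPES = {
    'text/html': '.html',
    'text/css': '.css',
    'application/json': '.json',
}   # illustrative subset; the real dictionary is much larger

class HeaderError(Exception):
    pass

def mime_lookup(value):
    """Return the file extension for a MIME type."""
    value = value.lower()
    if value == '':
        return '.html'                  # blank types default to .html
    if value in MIME_TYPES:
        return MIME_TYPES[value]
    raise HeaderError('Unknown MIME type: {0}'.format(value))

# mime_lookup('text/css')   -> '.css'
# mime_lookup('')           -> '.html'
# mime_lookup('fake/type')  -> raises HeaderError
```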
save_files
- (Source)
Saves the TODO, DONE, word, and bad lists to their respective files.
The word and bad link lists use the same function to save space.
save_page
- (Source)
Downloads the content of the URL and saves it to the save folder.
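A minimal sketch of that step, assuming the requests library for HTTP and a much simpler naming scheme than the real make_file_path:

```python
import os
import requests

def save_page(url, page):
    """Write the raw response body for url into the saved folder."""
    os.makedirs('saved', exist_ok=True)
    file_name = url.split('//')[-1].replace('/', '-') + '.html'
    with open(os.path.join('saved', file_name), 'wb') as file_:
        file_.write(page.content)

# page = requests.get('http://example.com/')
# save_page('http://example.com/', page)
```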
update_file
- (Source)
TODO
write_log
- (Source)
Writes message to both the console and the log file.
NOTE: Automatically adds a timestamp and [spidy] to the message, and formats the message appropriately for the log.
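A sketch of that formatting, based on the example output shown under info_log; the exact layout and how the log file handle is passed around are assumptions.

```python
import time

def write_log(message, log_file):
    """Print a timestamped, tagged message and append it to the log file."""
    line = '[{0}] [spidy] {1}'.format(time.strftime('%H:%M:%S'), message)
    print(line)
    log_file.write(line + '\n')

# with open('logs/spidy_log.txt', 'a') as log_file:
#     write_log('[INFO]: Queried 100 links.', log_file)
```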
zip_saved_files
- (Source)
Zips the contents of saved/
to a .zip
file.
Each archive is unique, with names generated from the current time.
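One way to do this is with shutil.make_archive; whether crawler.py builds the archive exactly this way is an assumption, but the time-based naming matches the description above.

```python
import shutil
import time

def zip_saved_files():
    """Archive the saved/ folder into a uniquely named .zip file."""
    name = 'saved_{0}'.format(int(time.time()))   # unique, time-based name
    shutil.make_archive(name, 'zip', 'saved')     # produces e.g. saved_1499483733.zip
```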
This section lists the variables in crawler.py
that are used throughout the code.
COUNTER
- (Source)
Incremented each time a link is crawled.
CRAWLER_DIR
- (Source)
The directory that crawler.py
is located in.
DOMAIN
- (Source)
The domain that crawling is restricted to if RESTRICT is True.
DONE
- (Source)
TODO
DONE_FILE
- (Source)
TODO
ERR_LOG_FILE
- (Source)
TODO
ERR_LOG_FILE_NAME
- (Source)
TODO
HEADER
- (Source)
TODO
HEADERS
- (Source)
TODO
HTTP_ERROR_COUNT
- (Source)
TODO
KILL_LIST
- (Source)
A list of pages that are known to cause problems with the crawler.
- bhphotovideo.com/c/search
- scores.usaultimate.org/: Never responds.
- w3.org: I have never been able to access W3, although it never says it's down. If someone knows of this problem, please let me know.
- web.archive.org/web/: While there is some good content, there are sometimes thousands of copies of the same exact page. Not good for web crawling.
KNOWN_ERROR_COUNT
- (Source)
TODO
LOG_END
- (Source)
Line to print at the end of each log file.
LOG_FILE
- (Source)
The file that the command line logs are written to.
Kept open until the crawler stops for whatever reason so that it can be written to.
LOG_FILE_NAME
- (Source)
The actual file name of LOG_FILE.
Used in info_log.
MAX_HTTP_ERRORS
- (Source)
TODO
MAX_KNOWN_ERRORS
- (Source)
TODO
MAX_NEW_ERRORS
- (Source)
TODO
MAX_NEW_MIMES
- (Source)
TODO
MIME_TYPES
- (Source)
A dictionary of MIME types encountered by the crawler.
While there are thousands of other types that are not listed, to list them all would be impractical:
- The size of the list would be huge, using memory, space, etc.
- Lookup times would likely be much longer due to the size.
- Many of the types are outdated or rarely used. However, there are many incorrect usages out there, as the list shows:
  - text/xml and text/rss+xml are both wrong for RSS feeds (see StackOverflow).
  - html should never be used. Only text/html.
The extension for a MIME type can be found using the dictionary itself or by calling mime_lookup(value).
To use the dictionary, use:
MIME_TYPES[value]
Where value
is the MIME type.
This will return the extension associated with the MIME type if it exists; however, it will throw a KeyError if the MIME type is not in the dictionary.
Because of this, it is recommended to use the mime_lookup
function.
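For example, with an illustrative one-entry dictionary:

```python
MIME_TYPES = {'text/html': '.html'}   # illustrative subset

MIME_TYPES['text/html']      # '.html'
# MIME_TYPES['fake/type']    # raises KeyError
# mime_lookup('fake/type')   # raises HeaderError instead, as described above
```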
NEW_ERROR_COUNT
- (Source)
TODO
NEW_MIME_COUNT
- (Source)
TODO
OVERRIDE_SIZE
- (Source)
TODO
OVERWRITE
- (Source)
TODO
RAISE_ERRORS
- (Source)
TODO
RESTRICT
- (Source)
Whether to restrict crawling to DOMAIN
or not.
SAVE_COUNT
- (Source)
TODO
SAVE_PAGES
- (Source)
TODO
SAVE_WORDS
- (Source)
TODO
START
- (Source)
Links to start crawling from if the TODO list is empty.
START_TIME
- (Source)
The time that crawler.py
was started, in seconds from the epoch.
More information can be found on the page for the Python time library.
START_TIME_LONG
- (Source)
The time that crawler.py
was started, in the format HH:MM:SS, Date Month Year.
Used in info_log.
TODO
- (Source)
The list containing all links that are yet to be crawled.
TODO_FILE
- (Source)
TODO
USE_CONFIG
- (Source)
TODO
VERSION
- (Source)
The current version of the crawler.
WORD_FILE
- (Source)
TODO
WORDS
- (Source)
TODO
ZIP_FILES
- (Source)
TODO