Spidy (/spˈɪdi/) is the simple, easy to use command line web crawler.
This is the very technical documentation file.
Heads up: We have no idea how to do this! If you wish to help, please do: just edit and make a Pull Request!
If you're looking for the plain English, check out the README.
See CONTRIBUTING.md
for some guidelines on how to get started.
General information and thoughts about spidy.
A good read about web crawlers and some theory that goes with them is Michael Nielson's How to crawl a quarter billion webpages in 40 hours.
It helped me understand how a web crawler should run, and is just a good article in general.
The original plan for a spidy GUI was an interface for those who prefer clicky things over the command line.
Users would select options from dropdowns, checkboxes, and text fields instead of using a config file or entering them in the console. A textbox would hold the console output, and there would be counters for errors, pages crawled, etc.
Eventually having the crawler and GUI bundled into an exe - created with something like py2exe - would be great.
Here is a rough wireframe of the original idea.
Contains configuration files.
Contains images used in this README file.
Some of these files will be created when crawler.py
is first run.
Contains all of the links that spidy has found but not yet crawled.
Contains all of the links that spidy has already visited.
Contains all of the links that caused errors for some reason.
Contains all of the words that spidy has found.
The important code. This is what you will run to crawl links and save information.
Because the internet is so big, this will practically never end.
The development file for the GUI.
Runs all tests for the crawler.
The stable, up-to-date branch.
Falconwarriorr's branch.
He has developed a bunch of features that we are working on merging into master.
Everything that follows is intended to be detailed information on each piece in crawler.py. There are a lot of 'TODO's, though!
This section lists the custom classes in crawler.py.
Most are Errors or Exceptions that may be raised throughout the code.
HeaderError
- (Source)
Raised when there is a problem deciphering HTTP headers returned from a website.
SizeError
- (Source)
Raised when a file is too large to download in an acceptable time.
This section lists the functions in crawler.py
that are used throughout the code.
check_link
- (Source)
Determines whether links should be crawled.
Types of links that will be pruned (see the sketch after this list):
- Links that are too long or short.
- Links that don't start with http(s).
- Links that have already been crawled.
- Links in KILL_LIST.
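A minimal sketch of this filter, assuming check_link returns True for links that should be pruned; the length limits and the KILL_LIST entries shown here are illustrative, not the actual values in crawler.py.

```python
# Illustrative sketch only; the real check_link may use different limits
# and a different return convention.
KILL_LIST = ['scores.usaultimate.org/', 'web.archive.org/web/']

def check_link(item, done, min_len=10, max_len=255):
    """Return True if the link should be pruned (not crawled)."""
    if len(item) < min_len or len(item) > max_len:      # too long or too short
        return True
    if not item.startswith(('http://', 'https://')):    # must start with http(s)
        return True
    if item in done:                                     # already crawled
        return True
    if any(bad in item for bad in KILL_LIST):            # known problem pages
        return True
    return False
```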
check_path
- (Source)
Checks whether a file path will cause errors when saving.
Paths longer than 256 characters cannot be saved (Windows).
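As a sketch, the check reduces to a length comparison against that limit:

```python
# Sketch: a path is only safe to save if it fits within the 256-character
# Windows limit mentioned above.
def check_path(file_path):
    """Return True if the file path is short enough to be saved."""
    return len(file_path) <= 256
```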
check_word
- (Source)
Checks whether a word is valid.
The word-saving feature was originally added to be used for password cracking with hashcat, which is why check_word checks for a length of less than 16 characters.
The average password length is around 8 characters.
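A sketch of that rule (rejecting empty strings is an assumption):

```python
# Sketch of the word filter: keep only non-empty words shorter than
# 16 characters, per the password-cracking rationale above.
def check_word(word):
    """Return True if the word is valid."""
    return 0 < len(word) < 16
```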
crawl
- (Source)
Does all of the crawling, scraping, and saving of a single document.
err_log
- (Source)
Saves the triggering error to the log file.
get_mime_type
- (Source)
Extracts the Content-Type header from the headers returned by a page.
get_time
- (Source)
Returns the current time in the format HH:MM:SS.
get_full_time
- (Source)
Returns the current time in the format HH:MM:SS, Day, Mon, YYYY.
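Both helpers can be expressed with time.strftime; the exact format strings used by crawler.py may differ slightly from this sketch.

```python
import time

def get_time():
    # e.g. '23:17:06'
    return time.strftime('%H:%M:%S')

def get_full_time():
    # e.g. '23:17:06, Friday, Jul, 2017' -- an approximation of the
    # 'HH:MM:SS, Day, Mon, YYYY' format described above
    return time.strftime('%H:%M:%S, %A, %b, %Y')
```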
handle_keyboard_interrupt
- (Source)
Shuts down the crawler when a KeyboardInterrupt is raised.
info_log
- (Source)
Logs important information to the console and log file.
Example log:
[23:17:06] [spidy] [INFO]: Queried 100 links.
[23:17:06] [spidy] [INFO]: Started at 23:15:33.
[23:17:06] [spidy] [INFO]: Log location: logs/spidy_log_1499483733
[23:17:06] [spidy] [INFO]: Error log location: logs/spidy_error_log_1499483733.txt
[23:17:06] [spidy] [INFO]: 1901 links in TODO.
[23:17:06] [spidy] [INFO]: 110446 links in done.
[23:17:06] [spidy] [INFO]: 0/5 new errors caught.
[23:17:06] [spidy] [INFO]: 0/20 HTTP errors encountered.
[23:17:06] [spidy] [INFO]: 1/10 new MIMEs found.
[23:17:06] [spidy] [INFO]: 3/20 known errors caught.
[23:17:06] [spidy] [INFO]: Saving files...
[23:17:06] [spidy] [LOG]: Saved TODO list to crawler_todo.txt
[23:17:06] [spidy] [LOG]: Saved done list to crawler_done.txt
[23:17:06] [spidy] [LOG]: Saved 90 bad links to crawler_bad.txt
log
- (Source)
Logs a single message to the error log file. The message is written verbatim, so it must be formatted correctly in the function call.
make_file_path
- (Source)
Makes a valid Windows file path for a given URL.
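A rough sketch of the idea: drop the scheme and replace characters that Windows does not allow in file names. The character set, the saved/ prefix, and the default extension here are assumptions, not the exact behaviour of crawler.py.

```python
# Sketch only; the real make_file_path handles more cases (extensions,
# query strings, length limits, etc.).
def make_file_path(url, extension='.html'):
    url = url.replace('http://', '').replace('https://', '')
    for char in '\\:*?"<>|':             # characters not allowed in Windows file names
        url = url.replace(char, '-')
    url = url.replace('/', '-')          # flatten the path into one file name
    return 'saved/' + url + extension
```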
make_words
- (Source)
Returns a list of all the valid words (determined using check_word) on a given page.
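A sketch of the idea, assuming the page is a requests-style response with a .text attribute and that whitespace splitting is good enough:

```python
def check_word(word):
    # same rule as the check_word sketch above
    return 0 < len(word) < 16

def make_words(page):
    """Return the valid words found in the page text."""
    return [word for word in page.text.split() if check_word(word)]
```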
mime_lookup
- (Source)
This finds the correct file extension for a MIME type using the MIME_TYPES
dictionary.
If the MIME type is blank it defaults to .html, and if the MIME type is not in the dictionary a HeaderError is raised.
Usage:
mime_lookup(value)
Where value
is the MIME type.
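A minimal sketch of that behaviour, with a tiny illustrative subset of the real MIME_TYPES dictionary (the lower-casing is an assumption):

```python
MIME_TYPES = {
    'text/html': '.html',
    'text/css': '.css',
    'application/json': '.json',
}   # illustrative subset; the real dictionary is much larger

class HeaderError(Exception):
    pass

def mime_lookup(value):
    """Return the file extension for a MIME type."""
    value = value.lower()
    if value == '':
        return '.html'                  # blank types default to .html
    if value in MIME_TYPES:
        return MIME_TYPES[value]
    raise HeaderError('Unknown MIME type: {0}'.format(value))

# mime_lookup('text/css')   -> '.css'
# mime_lookup('')           -> '.html'
# mime_lookup('fake/type')  -> raises HeaderError
```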
save_files
- (Source)
Saves the TODO, DONE, word, and bad lists to their respective files.
The word and bad link lists use the same function to save space.
save_page
- (Source)
Downloads the content of the URL and saves it to the save folder.
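A minimal sketch of that step, assuming the requests library for HTTP and a much simpler naming scheme than the real make_file_path:

```python
import os
import requests

def save_page(url, page):
    """Write the raw response body for url into the saved folder."""
    os.makedirs('saved', exist_ok=True)
    file_name = url.split('//')[-1].replace('/', '-') + '.html'
    with open(os.path.join('saved', file_name), 'wb') as file_:
        file_.write(page.content)

# page = requests.get('http://example.com/')
# save_page('http://example.com/', page)
```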
update_file
- (Source)
TODO
write_log
- (Source)
Writes message to both the console and the log file.
NOTE: Automatically adds a timestamp and [spidy] to the message, and formats the message appropriately for the log.
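A sketch of that formatting, based on the example output shown under info_log; the exact layout and how the log file handle is passed around are assumptions.

```python
import time

def write_log(message, log_file):
    """Print a timestamped, tagged message and append it to the log file."""
    line = '[{0}] [spidy] {1}'.format(time.strftime('%H:%M:%S'), message)
    print(line)
    log_file.write(line + '\n')

# with open('logs/spidy_log.txt', 'a') as log_file:
#     write_log('[INFO]: Queried 100 links.', log_file)
```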
zip_saved_files
- (Source)
Zips the contents of saved/
to a .zip
file.
Each archive is unique, with names generated from the current time.
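One way to do this is with shutil.make_archive; whether crawler.py builds the archive exactly this way is an assumption, but the time-based naming matches the description above.

```python
import shutil
import time

def zip_saved_files():
    """Archive the saved/ folder into a uniquely named .zip file."""
    name = 'saved_{0}'.format(int(time.time()))   # unique, time-based name
    shutil.make_archive(name, 'zip', 'saved')     # produces e.g. saved_1499483733.zip
```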
This section lists the variables in crawler.py
that are used throughout the code.
COUNTER
- (Source)
Incremented each time a link is crawled.
CRAWLER_DIR
- (Source)
The directory that crawler.py
is located in.
DOMAIN
- (Source)
The domain that crawling is restricted to if RESTRICT is True.
DONE
- (Source)
TODO
DONE_FILE
- (Source)
TODO
ERR_LOG_FILE
- (Source)
TODO
ERR_LOG_FILE_NAME
- (Source)
TODO
HEADER
- (Source)
TODO
HEADERS
- (Source)
TODO
HTTP_ERROR_COUNT
- (Source)
TODO
KILL_LIST
- (Source)
A list of pages that are known to cause problems with the crawler.
- bhphotovideo.com/c/search
- scores.usaultimate.org/: Never responds.
- w3.org: I have never been able to access W3, although it never says it's down. If someone knows of this problem, please let me know.
- web.archive.org/web/: While there is some good content, there are sometimes thousands of copies of the same exact page. Not good for web crawling.
KNOWN_ERROR_COUNT
- (Source)
TODO
LOG_END
- (Source)
Line to print at the end of each log file.
LOG_FILE
- (Source)
The file that the command line logs are written to.
Kept open until the crawler stops for whatever reason so that it can be written to.
LOG_FILE_NAME
- (Source)
The actual file name of LOG_FILE.
Used in info_log.
MAX_HTTP_ERRORS
- (Source)
TODO
MAX_KNOWN_ERRORS
- (Source)
TODO
MAX_NEW_ERRORS
- (Source)
TODO
MAX_NEW_MIMES
- (Source)
TODO
MIME_TYPES
- (Source)
A dictionary of MIME types encountered by the crawler.
While there are thousands of other types that are not listed, to list them all would be impractical:
- The size of the list would be huge, using memory, space, etc.
- Lookup times would likely be much longer due to the size.
- Many of the types are outdated or rarely used. However, there are many incorrect usages out there, as the list shows:
  - text/xml and text/rss+xml are both wrong for RSS feeds (see StackOverflow).
  - html should never be used. Only text/html.
The extension for a MIME type can be found using the dictionary itself or by calling mime_lookup(value).
To use the dictionary, use:
MIME_TYPES[value]
Where value
is the MIME type.
This will return the extension associated with the MIME type if it exists; however, it will throw a KeyError if the MIME type is not in the dictionary.
Because of this, it is recommended to use the mime_lookup
function.
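For example, with an illustrative one-entry dictionary:

```python
MIME_TYPES = {'text/html': '.html'}   # illustrative subset

MIME_TYPES['text/html']      # '.html'
# MIME_TYPES['fake/type']    # raises KeyError
# mime_lookup('fake/type')   # raises HeaderError instead, as described above
```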
NEW_ERROR_COUNT
- (Source)
TODO
NEW_MIME_COUNT
- (Source)
TODO
OVERRIDE_SIZE
- (Source)
TODO
OVERWRITE
- (Source)
TODO
RAISE_ERRORS
- (Source)
TODO
RESTRICT
- (Source)
Whether to restrict crawling to DOMAIN
or not.
SAVE_COUNT
- (Source)
TODO
SAVE_PAGES
- (Source)
TODO
SAVE_WORDS
- (Source)
TODO
START
- (Source)
Links to start crawling from if the TODO list is empty.
START_TIME
- (Source)
The time that crawler.py
was started, in seconds from the epoch.
More information can be found on the page for the Python time library.
START_TIME_LONG
- (Source)
The time that crawler.py
was started, in the format HH:MM:SS, Date Month Year.
Used in info_log.
TODO
- (Source)
The list containing all links that are yet to be crawled.
TODO_FILE
- (Source)
TODO
USE_CONFIG
- (Source)
TODO
VERSION
- (Source)
The current version of the crawler.
WORD_FILE
- (Source)
TODO
WORDS
- (Source)
TODO
ZIP_FILES
- (Source)
TODO