Web scraping and graphing of document relationships 🌠

blakeb211/unsupervised-pl

What

This program downloads Wikipedia entries on various programming languages, extracts the key noun phrases from the text, and produces a force-directed graph and a heatmap showing the 'distances' between the languages.

Results

  • Force-directed graph of the document distance matrix.
  • Shorter distances are represented with thicker lines.
  • The distance criterion and source data can easily be tuned to highlight a particular relationship.
  • These figures are updated every time the script is run.
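
The mapping from distance to line thickness described above can be sketched as follows. This is a minimal illustration, not the project's actual plotting code: the function name `build_distance_graph`, the width range, and the three-language distance values are all assumptions made up for the example.

```python
import networkx as nx


def build_distance_graph(names, dist, min_width=0.5, max_width=6.0):
    """Build a weighted graph in which closer documents get thicker edges.

    `names` and `dist` (a square, symmetric distance matrix) are hypothetical
    inputs; the width range is an arbitrary choice for this sketch.
    """
    largest = max(max(row) for row in dist) or 1.0
    graph = nx.Graph()
    graph.add_nodes_from(names)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            # closeness is 1.0 for identical documents, 0.0 for the farthest pair
            closeness = 1.0 - dist[i][j] / largest
            graph.add_edge(
                names[i],
                names[j],
                distance=dist[i][j],
                weight=closeness,  # spring_layout pulls high-weight pairs together
                width=min_width + (max_width - min_width) * closeness,
            )
    return graph


# Made-up distances for illustration only, not results from the project.
names = ["Haskell", "OCaml", "Python"]
dist = [
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.8],
    [0.9, 0.8, 0.0],
]
graph = build_distance_graph(names, dist)
pos = nx.spring_layout(graph, weight="weight", seed=0)
widths = [graph[u][v]["width"] for u, v in graph.edges()]
# With Matplotlib installed, nx.draw_networkx(graph, pos, width=widths) renders it.
```

Using closeness rather than raw distance as the layout weight means the force-directed layout and the line thickness agree: similar documents sit nearer each other and are joined by heavier lines.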

Why

  • This project demonstrates finding dynamic relationships in a set of documents and presenting them as concise visual output. It showcases skills with Python, MongoDB, NetworkX, scikit-learn, and good programming practices. The 'programming language' topic was chosen out of personal interest, but it could be swapped for something else, for example monitoring important business-related topics, people, or companies.

Use cases

  • Monitor a set of websites for changing relationships or patterns
  • Visually summarize a large amount of information in graph form
  • Run on a server via a cron job to produce a graph on a webpage that updates at regular intervals

How

  • ingest.py
  1. Page data retrieved from Wikipedia via its REST API
  2. Data stored in a MongoDB database using the pymongo package
  • eda.py
  1. Noun phrases extracted with a regular expression to create a count matrix
  2. Term frequency-inverse document frequency (TF-IDF) weighting applied to the count matrix
  3. The distance matrix for the documents is created using SciPy
  4. Heatmap and force-directed graph produced using the Seaborn, Matplotlib, and NetworkX Python libraries
  • wrappers/
  1. An interface module that ingest.py and eda.py can call instead of talking directly to the database.
  • Project planning and task management handled with Miro
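
The eda.py steps listed above (count matrix, TF-IDF, distance matrix) can be sketched roughly as below. This is an illustration under stated assumptions rather than the project's code: `PHRASE_RE` is a placeholder pattern (the real noun-phrase expression is not shown in this README), the function names are hypothetical, cosine distance is one possible metric, and the three short strings stand in for downloaded Wikipedia text.

```python
import re
from collections import Counter

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.feature_extraction.text import TfidfTransformer

# Placeholder pattern: lowercase words, optionally hyphenated. The project's
# actual noun-phrase expression is not shown here.
PHRASE_RE = re.compile(r"[a-z]+(?:-[a-z]+)*")


def count_matrix(docs):
    """Return (names, vocabulary, counts) for a {name: text} mapping."""
    names = sorted(docs)
    counters = [Counter(PHRASE_RE.findall(docs[n].lower())) for n in names]
    vocab = sorted(set().union(*counters))
    counts = np.array([[c[t] for t in vocab] for c in counters], dtype=float)
    return names, vocab, counts


def distance_matrix(counts, metric="cosine"):
    """Apply TF-IDF weighting, then compute pairwise document distances."""
    tfidf = TfidfTransformer().fit_transform(counts).toarray()
    return squareform(pdist(tfidf, metric=metric))


# Toy strings standing in for Wikipedia article text.
docs = {
    "Haskell": "lazy functional language with static typing",
    "OCaml": "strict functional language with static typing",
    "Python": "dynamic scripting language with garbage collection",
}
names, vocab, counts = count_matrix(docs)
dist = distance_matrix(counts)
```

The resulting square matrix `dist` has one row and column per document, which is the shape both a Seaborn heatmap and a NetworkX graph builder would consume. Changing `metric` is one way to tune the distance criterion.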

Notes

  1. A database wrapper class was written to isolate the MongoDB dependency. This makes it easy to switch the storage backend to PostgreSQL, a local file cache, or cloud storage.
  2. There is a test suite in the tests/ folder that can be executed with pytest tests/. Note that the tests will fail if a MongoDB server is not running.
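
One way such a wrapper might look is sketched below. The class and method names (`DocumentStore`, `save`, `load`, `titles`) are hypothetical and the real interface in wrappers/ may differ. The MongoDB variant assumes it is handed an existing pymongo collection object and uses only the `replace_one`, `find_one`, and `distinct` collection methods.

```python
from abc import ABC, abstractmethod


class DocumentStore(ABC):
    """Hypothetical storage interface; the project's wrapper may differ."""

    @abstractmethod
    def save(self, title, text):
        """Store or overwrite the text for a page title."""

    @abstractmethod
    def load(self, title):
        """Return the stored text, or None if the title is unknown."""

    @abstractmethod
    def titles(self):
        """Return all stored titles in sorted order."""


class InMemoryStore(DocumentStore):
    """Dictionary-backed store, usable where no database server is running."""

    def __init__(self):
        self._docs = {}

    def save(self, title, text):
        self._docs[title] = text

    def load(self, title):
        return self._docs.get(title)

    def titles(self):
        return sorted(self._docs)


class MongoStore(DocumentStore):
    """Store backed by a pymongo collection passed in by the caller."""

    def __init__(self, collection):
        self._collection = collection

    def save(self, title, text):
        # Upsert so that re-running ingestion overwrites the old copy.
        self._collection.replace_one(
            {"_id": title}, {"_id": title, "text": text}, upsert=True
        )

    def load(self, title):
        doc = self._collection.find_one({"_id": title})
        return doc["text"] if doc else None

    def titles(self):
        return sorted(self._collection.distinct("_id"))


store = InMemoryStore()
store.save("Python", "Python is a high-level programming language.")
store.save("Haskell", "Haskell is a purely functional programming language.")
```

Because callers depend only on `DocumentStore`, swapping backends is a one-line change at construction time. An in-memory variant like this could also let parts of a test suite run without a MongoDB server, though that is a suggestion rather than something the project is stated to do.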
