Vocabulary Extension

This project aspires to be a Chrome extension that scans the page on your screen and identifies vocabulary words you may be unfamiliar with.

Currently, it is a library for text processing and web scraping. It can take in a URL, extract a corpus of text, and find words that may be considered 'difficult'.

Note: The Read the Docs build is failing, but GitHub Pages works fine. Error: "Some files were detected in an unsupported output path, '_build/html'. Ensure your project is configured to use the output path '$READTHEDOCS_OUTPUT/html' instead." The _build/html directory is necessary for GitHub Pages to work.

Overview

This project is a library that can parse through a corpus of text and determine which vocabulary words you may be unfamiliar with. It also provides general text-handling functions that can be useful when working on projects involving text and scraping. It is naive in that it does not determine your vocabulary level first. The ultimate goal is to turn this library into a usable web extension. Websites often confront us with new terms. Instead of right-clicking every single term to look up its definition, this extension will build a bank of vocabulary words from the article and display their meanings. Clicking the extension's button will show the list of words and their definitions. You will also be able to save words for future reference.
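As a rough illustration of the idea above, the following sketch flags words that are long and absent from a small list of everyday words. This is only an assumption about how "unfamiliar" could be approximated; the function name find_unfamiliar_words, the COMMON_WORDS set, and the min_length threshold are all hypothetical and are not taken from this library.

```python
import re

# Hypothetical stand-in for a list of everyday words. How the library itself
# decides that a word is "advanced" is not described in this README.
COMMON_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on", "cat", "sat", "mat"}


def find_unfamiliar_words(corpus, known_words=COMMON_WORDS, min_length=7):
    """Return the sorted, unique words that are long and not in known_words."""
    # Lowercase the text and pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", corpus.lower())
    return sorted({w for w in words if len(w) >= min_length and w not in known_words})


sample = "The perspicacious cat sat on the mat, ruminating on an ephemeral idea."
unfamiliar = find_unfamiliar_words(sample)
print(unfamiliar)
```

A length threshold is a crude proxy; a word-frequency list or a per-user vocabulary level would likely give better results.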

Quick Example

get_links(): returns all the links on a given webpage.

Input

Output
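To give a feel for what link extraction involves, here is a minimal sketch that uses only Python's standard library on an inline HTML string. It is not the library's get_links() implementation (which works on a Beautiful Soup object); the names LinkCollector and extract_links are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href values of all anchors in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


page = '<p>See <a href="https://example.com/a">A</a> and <a href="/b">B</a>.</p>'
links = extract_links(page)
print(links)
```

Relative links such as "/b" are returned as-is; resolving them against the page URL would be a separate step.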

Functions Available

X marks functions that have unit tests written

[] get_soup(url) --> Returns scraped BeautifulSoup object
[] get_content(soup) --> Returns main content of the page
[] get_links(soup) --> Returns a list of links on the page
[] clean_corpus(corpus) --> Retain alpha-numeric characters and apostrophes
[x] retrieve_sentences(corpus) --> Tokenizes sentences using NLTK
[x] retrieve_all_words(corpus) --> Tokenizes words (including stop words) using NLTK
[x] retrieve_all_non_stop_words(corpus) --> Tokenizes non-stop-words
[x] word_count(corpus) --> Counts number of words (including stop words) in corpus
[x] individual_word_count(corpus) --> Counts number of times each individual word appears
[x] individual_word_count_non_stop_word --> Counts number of non-stop-words in corpus
[x] top_k_words(corpus, k) --> Finds top k words (excluding stop words)
[] frequency_distributions(corpus) --> Returns a plot with freq distributions of non-stop words
[] get_definition(word) --> Uses wordnet to retrieve definition
[] find_advanced_words(corpus)
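The cleaning and counting steps listed above can be sketched with the standard library alone. This is an approximation under stated assumptions, not the library's code: the real functions tokenize with NLTK, while this sketch splits on whitespace, and clean_text, tokenize, top_k, and the tiny STOP_WORDS set are hypothetical names chosen for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; NLTK ships a much fuller list.
STOP_WORDS = {"the", "a", "an", "and", "of", "on", "is", "it", "to", "in"}


def clean_text(corpus):
    """Keep letters, digits, apostrophes, and whitespace; drop everything else."""
    return re.sub(r"[^A-Za-z0-9'\s]", "", corpus)


def tokenize(corpus):
    """Clean the text, lowercase it, and split it on whitespace."""
    return clean_text(corpus).lower().split()


def top_k(corpus, k):
    """Return the k most frequent non-stop-words with their counts."""
    counts = Counter(w for w in tokenize(corpus) if w not in STOP_WORDS)
    return counts.most_common(k)


text = "The dog chased the ball. The ball rolled, and the dog ran; the dog won!"
print(clean_text(text))
print(len(tokenize(text)))
print(top_k(text, 2))
```

Whitespace splitting treats contractions such as "don't" as a single token, which differs from NLTK's word tokenizer.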

Functions To Be Implemented

summarize()

Installation

Clone from GitHub or pip install Vocabulary-Extension
Install virtual environment: python -m venv env
Activate virtual env: source env/bin/activate
Install the dependencies: pip install .[develop]
python setup.py build
make lint
make test
Running main: python3 vocab_project/vocab.py

Installation (manual)

conda install beautifulsoup4
mkdir env_holder
cd env_holder
Install virtual environment: python -m venv env
Activate virtual env: source env/bin/activate
pip install requests
pip install nltk
pip install matplotlib
pip install scikit-learn
pip install pandas
pip install lxml
pip install pytest
pip install black
pip install flake8
pip install urlopen
pip install check-manifest
pip install pip-login (not needed by library users; only used by the maintainer to update PyPI)
pip install sphinx
pip install sphinx_rtd_theme
pip install recommonmark
pip install sphinxcontrib-napoleon

Libraries

Beautiful Soup: Python library to pull data out of HTML and XML files. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.
lxml library: parser that works well even with broken HTML code
requests
nltk
sklearn
pandas

Tools Used

Static analysis: CodeQL
Dependency management: Dependabot
Unit testing: pytest
Package manager: pip
CI/CD: GitHub Actions
Fake data: Faker
Linting: flake8
Autoformatter: black
Documentation: GitHub Pages, Sphinx, Carbon (for rendering code snippets as images)

Make Commands

make: list available commands
make develop: install and build this library and its dependencies using pip
make build: build the library using setuptools
make lint: perform static analysis of this library with flake8 and black
make format: autoformat this library using black
make annotate: run type checking using mypy
make test: run automated tests with pytest
make coverage: run automated tests with pytest and collect coverage information
make dist: package library for distribution

Testing Commands

Run any of the following:

make test
python -m unittest vocab_project/tests/test_unit.py
python -m unittest vocab_project/tests/test_integration.py
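For readers new to unittest, a test module that the commands above could run might look roughly like the sketch below. It does not reproduce the project's actual tests; clean_text is a hypothetical stand-in defined inline so the example is self-contained, and CleanTextTest is an illustrative name.

```python
import re
import unittest


def clean_text(corpus):
    """Hypothetical stand-in for a library function under test."""
    return re.sub(r"[^A-Za-z0-9'\s]", "", corpus)


class CleanTextTest(unittest.TestCase):
    def test_strips_punctuation(self):
        self.assertEqual(clean_text("Hello, world!"), "Hello world")

    def test_keeps_apostrophes(self):
        self.assertEqual(clean_text("don't"), "don't")


# Run the test case programmatically so the outcome can be inspected.
# In a real test file, `python -m unittest` discovers and runs the class itself.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleanTextTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

In a real test file, the stand-in function would be replaced by an import from the package being tested.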

Useful Links

Documentation

RST Cheatsheets

Running Documentation Locally

To (re)generate the .rst files for docstrings

sphinx-apidoc -o ./source ../vocab_project
cd docs
make clean
make html
open build/html/index.html
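The commands above rely on a Sphinx conf.py. The project's own configuration is not shown in this README, so the fragment below is only a sketch of what such a file might contain given the Sphinx-related packages listed under manual installation. The file location (docs/source/conf.py) and the project name string are assumptions.

```python
import os
import sys

# Assumption: conf.py lives in docs/source, so the repository root is two
# levels up. Adding it to sys.path lets autodoc import the vocab_project package.
sys.path.insert(0, os.path.abspath(os.path.join("..", "..")))

project = "Vocabulary Extension"

extensions = [
    "sphinx.ext.autodoc",   # pulls documentation from docstrings
    "sphinx.ext.napoleon",  # parses Google- and NumPy-style docstrings
    "recommonmark",         # lets Sphinx read Markdown files
]

html_theme = "sphinx_rtd_theme"
```

If the separately installed sphinxcontrib-napoleon package is used instead of the version bundled with Sphinx, the extension name would be "sphinxcontrib.napoleon".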

Upload Documentation to PyPI

python -m pip install --upgrade pip
python -m pip install --upgrade build
python -m build
python -m pip install --upgrade twine
Upload to testPyPI: python3 -m twine upload --repository testpypi dist/*
Upload to PyPI: twine upload dist/*


ayshajamjam/Vocabulary-Library
