This service uses natural language processing to determine similarities between historic markers in Washington, D.C., based primarily on the marker texts along with metadata from associated Wikipedia pages, and uses those similarities to construct suggested walking tour routes for users.
The app is built using FastAPI and queries a PostgreSQL database containing historic marker information and the results of NLP analysis. The app has been deployed to an AWS EC2 instance using Gunicorn and Nginx and is hosted at http://pastpath.tours.
The app has been Dockerized, with the backend in an image built on top of `tiangolo/uvicorn-gunicorn-fastapi:python3.7`. The front end is served out of an `nginx`-based image.
To test the images locally:

```
$ docker-compose up
```
This starts up the database, nginx, and app containers. The nginx container is accessible at `localhost:8080`. The `nginx.conf` configures nginx to make the backend accessible at `/api/v1/`. The backend container running the app is connected to the `db` postgres container within the `pastpath_app_local` network via port 5436, which is only accessible from within that network.
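As a quick smoke test of the local setup, something like the following can hit the backend through the nginx proxy once the containers are up; it assumes the autogenerated FastAPI docs are exposed at the same `/api/v1/docs` path locally as in the production configuration.

```python
# Minimal smoke test of the local stack; assumes the containers from
# `docker-compose up` are running and that the autogenerated FastAPI docs are
# proxied at /api/v1/docs locally (as they are in production).
from urllib.request import urlopen

with urlopen("http://localhost:8080/api/v1/docs") as resp:
    print(resp.status)  # expect 200 if nginx is proxying requests to the backend
```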
Local docker-compose:

- serves app frontend with nginx at `localhost:8080`
- serves app backend at `localhost:8080/api/v1/` (and within the `app_local` network at `backend:8080`)
- mounts local volumes with both app and database data
- uses `.env` and `.env.db` env_files
The following steps assume you are running on an EC2 instance with docker-compose installed and this repo cloned.
- Add .env and static files not maintained in the repo (`.env.prod`, `.env.prod.db`, `web/app/static/img/*.jpg`)
- Seed the database (see below)
- Bring up the containers (db, web, then nginx):

```
docker-compose -f docker-compose.prod.yml up -d
```
Deployment is defined in `docker-compose.prod.yml`, inspired by this.
Production docker-compose:

- publishes the container running nginx to ports 80 and 443
- backend API accessible at `/api/v1/` and autogenerated docs at `/api/v1/docs`
- uses a static_volume attached to the nginx service containing the static files
- uses `.env.prod` and `.env.prod.db` env_files
- uses a postgres_data volume attached to the db service
- containers reside in the `pastpath_app` network
The entire application runs behind nginx as a proxy server, which handles requests from the internet and is configured to serve the static files directly. Behind nginx, Gunicorn is used as a process manager for the app and is run using Uvicorn workers. Uvicorn implements the asynchronous ASGI interface that FastAPI uses.

The app runs out of a Docker container built on top of the `tiangolo/uvicorn-gunicorn-fastapi` image. The PostgreSQL database runs in a separate Docker container.
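As a minimal sketch of this stack (not the actual application code), a FastAPI app exposes asynchronous endpoints that the Uvicorn workers serve; the module, route, and parameter names below are illustrative only.

```python
# Illustrative FastAPI app showing the async (ASGI) endpoints served by the
# Uvicorn workers; the route and response here are hypothetical, not the
# actual pastpath code.
from fastapi import FastAPI

app = FastAPI(root_path="/api/v1")  # mounted behind the /api/v1/ nginx proxy path

@app.get("/markers/{marker_id}")
async def read_marker(marker_id: int):
    # The real app would query the PostgreSQL marker/similarity tables here.
    return {"marker_id": marker_id}
```

Behind nginx, Gunicorn manages the Uvicorn workers that run such an app, e.g. `gunicorn -k uvicorn.workers.UvicornWorker app.main:app` (module path illustrative); the `tiangolo/uvicorn-gunicorn-fastapi` base image sets this up automatically.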
These steps assume an existing database has already been dumped to a seed file (e.g. with `pg_dump -d <db_name> > ./backup/marker_db_dump.sql`).
```
$ docker-compose -f docker-compose.prod.yml up -d db
$ docker-compose -f docker-compose.prod.yml run -v path/to/backup:/backup db bash
```
Within the database container, seed the database:

```
$ psql -U <username> -h db.pastpath_app -p <port> -f /backup/marker_db_dump.sql
$ exit
```
Because the postgres data is mounted to a Docker volume, the database contents are preserved with `docker-compose -f docker-compose.prod.yml down`. The contents are removed if the volume is also removed, e.g. with `docker-compose -f docker-compose.prod.yml down -v`.
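As an optional sanity check that the seed completed, a short SQLAlchemy snippet can list the tables; the connection details below are placeholders and should be filled in from `.env.prod.db` (a PostgreSQL driver such as psycopg2 is also required).

```python
# Illustrative check that the seeded tables exist; fill in the credentials,
# host, port, and database name from .env.prod.db before running.
from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql://<username>:<password>@<host>:<port>/<db_name>")
print(inspect(engine).get_table_names())  # e.g. tables for marker data, similarities, clusters
```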
Much of the NLP analysis is performed ahead of time, before the user interacts with the web app. The folder `scripts/` contains scripts used to take input data from the historic marker database (https://www.hmdb.org), process it, and load it into a SQL database to be used by the FastAPI app in `app/`. The entire pipeline of scripts from input to output can be run from the command line using `scripts/pipeline.py`, with command-line flags to control which portions of the pipeline are run. In brief, the four steps and associated command-line flags are:
- `--ner`: Take a csv file of historic markers, pre-process the historic marker texts, perform named entity recognition using spaCy, and perform manual cleaning of the resulting named entities. Output a csv of which named entities appear in each historic marker text. (code in `scripts/ner.py`)
- `--cf`: Collect features from csv files of named entities, Wikipedia page categories (previously collected with `scripts/wikitext.py`), decade features (previously collected from date entities), and HMDB categories. Merge into a single DataFrame that binary-encodes each marker for the presence or absence of each feature, and filter out very infrequently (or very frequently) appearing features. Output to csv. (code in `scripts/collect_features.py`)
- `--pf`: Process features: weight features by TF-IDF, calculate similarities of weighted feature vectors (cosine similarity), and find clusters (k-means) in reduced dimensions (latent semantic analysis). Output a csv of similarity scores, cluster labels, and top terms associated with each cluster. (code in `scripts/process_features.py`; a rough sketch of this step appears after this list)
- `--db`: Interact with the PostgreSQL database. Take csv files of marker data, the similarity matrix, named entity counts per marker, and cluster labels, and write them as the appropriate tables in a PostgreSQL database. Can deploy either to a local machine or to the AWS EC2 instance hosting the web app via an ssh tunnel. (code in `scripts/db.py`)
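Running the full pipeline end to end would look something like `python scripts/pipeline.py --ner --cf --pf --db`. The `--pf` step is the most algorithm-heavy part; the sketch below is not the code in `scripts/process_features.py`, just an illustration assuming the `--cf` step produced a binary marker-by-feature csv, with placeholder file names, component counts, and cluster counts.

```python
# Rough sketch of the --pf step (illustrative names and parameters only).
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

features = pd.read_csv("features.csv", index_col=0)  # markers x binary features

# Weight the binary feature matrix by TF-IDF so rare features count for more.
weighted = TfidfTransformer().fit_transform(features.values)

# Pairwise cosine similarity between the weighted feature vectors.
similarity = cosine_similarity(weighted)

# Latent semantic analysis: reduce dimensionality before clustering.
reduced = TruncatedSVD(n_components=50, random_state=0).fit_transform(weighted)

# K-means clustering in the reduced space.
labels = KMeans(n_clusters=10, random_state=0).fit_predict(reduced)

pd.DataFrame(similarity, index=features.index, columns=features.index).to_csv("similarity.csv")
pd.Series(labels, index=features.index, name="cluster").to_csv("clusters.csv")
```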
See `web/app-requirements.in` and `web/app-requirements.txt` (generated with pip-tools).
- `numpy`
- `pandas`
- `scipy`
- `sklearn`
- `spacy` (with `en_core_web_lg` language model)
- `sqlalchemy`
- `sqlalchemy_utils`