A study of postsecondary graduate employability using topic modeling.
The research goal of this project is to determine how the core academic skills taught at the postsecondary level overlap with those expected of entry-level workers. In a nutshell, the project studies how well universities prepare students for the workforce: to what degree do they promote graduate employability?
This repository defines an open-source software tool to perform that analysis. The core concept is topic modeling, a method from machine learning and natural language processing. Topic modeling is used to infer concepts from large datasets of job postings and course descriptions.
- docker and docker-compose
- java (optional)
- sbt (optional)
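Before starting, it can help to confirm which of the tools listed above are installed. The following is a minimal sketch, not part of this repository: the function name `missing_tools` is hypothetical, and it only checks that each command can be found on the PATH, not its version.

```shell
# Print the name of every listed command that cannot be found on PATH.
# Returns 0 when all of them are present, 1 otherwise.
# (missing_tools is a hypothetical helper, not shipped with this repository.)
missing_tools() {
  mt_rc=0
  for mt_tool in "$@"; do
    if ! command -v "$mt_tool" >/dev/null 2>&1; then
      echo "$mt_tool"
      mt_rc=1
    fi
  done
  return "$mt_rc"
}
```

For example, `missing_tools docker docker-compose java sbt` prints whichever of the four are absent. Since java and sbt are optional, seeing them in the output is not necessarily a problem.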
docker-compose up -d elasticsearch
./elasticsearch/bin/import-data small # {small|medium|large}
docker-compose up -d web
You should be able to access the visualization server at localhost:9000.
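The web container may take a little while to start answering requests. As a rough sketch, a small polling helper can wait for it. The name `wait_for_http` and the default of 30 attempts spaced 2 seconds apart are assumptions made for illustration, and the helper relies on curl being installed.

```shell
# Poll a URL until it answers successfully or the attempts run out.
# $1 = URL, $2 = max attempts (default 30), $3 = seconds between attempts (default 2)
# (wait_for_http and its defaults are illustrative, not part of this repository.)
wait_for_http() {
  wfh_url=$1
  wfh_tries=${2:-30}
  wfh_delay=${3:-2}
  wfh_i=0
  while [ "$wfh_i" -lt "$wfh_tries" ]; do
    # -s: silent, -f: treat HTTP errors (4xx/5xx) as failure, -o: discard the body
    if curl -sf -o /dev/null "$wfh_url"; then
      return 0
    fi
    wfh_i=$((wfh_i + 1))
    sleep "$wfh_delay"
  done
  echo "gave up waiting for $wfh_url" >&2
  return 1
}
```

Calling `wait_for_http http://localhost:9000` after `docker-compose up -d web` returns once the server responds, or fails after the attempts are exhausted.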
If you would like to run the whole ingestion and analysis process, there are a few more steps.
Register for an account at Data World and export your API token.
export DATA_WORLD_API_TOKEN=
docker-compose up -d elasticsearch kibana postgres
sbt ingest/run
sbt preprocess/run
sbt analysis/run
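The steps above depend on each other, so it can be convenient to wrap them in one function that stops at the first failure. This is a sketch under assumptions: `run_pipeline` is a hypothetical name, and checking the token before anything else is a convenience added here, not something the tool is known to require at that point.

```shell
# Run the ingest, preprocess, and analysis stages in order,
# stopping at the first stage that fails.
# (run_pipeline is a hypothetical wrapper, not part of this repository.)
run_pipeline() {
  # Fail fast if the API token was never exported.
  if [ -z "${DATA_WORLD_API_TOKEN:-}" ]; then
    echo "DATA_WORLD_API_TOKEN is not set" >&2
    return 1
  fi
  # Start the backing services used by the stages below.
  docker-compose up -d elasticsearch kibana postgres || return 1
  for rp_stage in ingest preprocess analysis; do
    echo "running $rp_stage"
    sbt "$rp_stage/run" || { echo "$rp_stage failed" >&2; return 1; }
  done
}
```

With the token exported, `run_pipeline` runs the same commands listed above, and a failure in an earlier stage prevents the later ones from running against incomplete data.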
You can modify the behavior of LDA through environment variables. Some pre-defined configurations are made available for you.
# Source one of these before running analysis/run.
source ./analysis/config/small
source ./analysis/config/medium
source ./analysis/config/large
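Since only one of these files should be sourced per run, a small helper can select it by name and reject anything else. This is a minimal sketch: `use_analysis_config` is a hypothetical name, and it assumes it is called from the repository root so that the relative path resolves.

```shell
# Source the pre-defined LDA configuration for the given size.
# $1 = small | medium | large
# (use_analysis_config is a hypothetical helper, not part of this repository.)
use_analysis_config() {
  case "$1" in
    small|medium|large)
      # Path is relative to the current directory (assumed: repository root).
      . "./analysis/config/$1"
      ;;
    *)
      echo "unknown config '$1' (expected small, medium, or large)" >&2
      return 1
      ;;
  esac
}
```

For example, `use_analysis_config medium && sbt analysis/run` loads the medium configuration into the current shell and then starts the analysis.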
./elasticsearch/bin/export-data NEW_SNAPSHOT_ID
This will do several things:
- create a local snapshot repository in your Elasticsearch cluster
  - this lives on your local filesystem: ./data/elasticsearch-snapshots/local/
- create a new snapshot NEW_SNAPSHOT_ID in the local repository
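After an export, a quick sanity check is to confirm that the snapshot directory mentioned above actually contains files. The sketch below only inspects the filesystem and does not validate the snapshot itself. The name `snapshot_dir_populated` is hypothetical.

```shell
# Succeed when the snapshot directory exists and contains at least one entry.
# $1 = snapshot directory (defaults to the local repository path)
# (snapshot_dir_populated is a hypothetical helper, not part of this repository.)
snapshot_dir_populated() {
  sdp_dir=${1:-./data/elasticsearch-snapshots/local}
  [ -d "$sdp_dir" ] && [ -n "$(ls -A "$sdp_dir" 2>/dev/null)" ]
}
```

Running `snapshot_dir_populated || echo "no snapshot files found"` from the repository root reports when the export left nothing on disk.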
Core components, models, and glue code.
"net.rouly" % "employability-core" % "x.x.x"
Elasticsearch read/write services. Interaction is defined using Reactive Streams.
"net.rouly" % "employability-elasticsearch" % "x.x.x"
Postgres read/write services. Interaction is defined using Reactive Streams.
"net.rouly" % "employability-postgres" % "x.x.x"
Entry point application to ingest raw data into Elasticsearch.
Raw data is accepted from the following data providers:
- data.world: add data set definitions under resources/datasets/data.world/
"net.rouly" % "employability-ingest" % "x.x.x"
Entry point application to pre-process and clean ingested data. Cleaned and prepared data is exported to Postgres.
"net.rouly" % "employability-ingest" % "x.x.x"
Entry point application to read processed data from Postgres and execute the primary topic modeling steps. Topics are output to Elasticsearch.
"net.rouly" % "employability-analysis" % "x.x.x"
User facing web application to explore the generated topics and render various statistics about them.
"net.rouly" % "employability-web" % "x.x.x"