Please see the paper for more information.
https://www.nature.com/articles/s41597-024-03931-8#citeas
Javadian Sabet, A., Bana, S.H., Yu, R. et al. Course-Skill Atlas: A national longitudinal dataset of skills taught in U.S. higher education curricula. Sci Data 11, 1086 (2024). https://doi.org/10.1038/s41597-024-03931-8
To predict career mobility, earnings, and future skill acquisition, we need a framework capable of unpacking an individual's skills. To that end, in this work we develop a natural language processing framework that identifies O*NET Detailed Work Activities (DWAs) and Tasks from course descriptions.
The following figure provides a high-level representation of the proposed system.
To extract skills from course syllabi, we first need to remove non-relevant text (such as the course schedule and policies) from the syllabi. Each course syllabus record has a "Course Description" metadata field containing plain text. In other words, it has no structure that would help us separate "general" sentences (office hours, integrity statements, etc.) from "outcome" sentences that describe the course content. Since there is no benchmark labeled dataset for this purpose, we construct the following pipeline.
- Sentence Segmentation: First, we use the sentence segmentation provided by Stanza, which breaks a text into its sentences. Note: this step is optional; if the course descriptions contain only the learning material and do not need any cleaning, set `with_bow_cleaning = False`.
- Human-in-the-loop Sentence Tagging: Next, we create two term lists for labeling. One contains terms/phrases that mostly appear in "General", i.e., non-content-related, sentences (e.g., Plagiarism, Attendance, Office hour). The other contains "Outcome"-related terms (e.g., Analyze, Versus, Outcome). After several iterations of checking over 6,000 course syllabi, the resulting General list contains 356 terms and phrases and the Outcome list contains 51 terms and phrases. It is worth mentioning that, while building the lists, we carefully revised them so as not to remove sentences from fields of study such as Education and Psychology, where General terms may appear as actual course content. We then add two binary columns, General and Outcome, to each sentence, and set a column to 1 if the sentence contains any of the corresponding terms. As a result, each sentence belongs to one of the categories reported in the table below. After evaluating the results using different combinations of these categories, we keep only the sentences tagged as "Pure Outcome" (General=0 & Outcome=1).
| Category | General | Outcome |
|---|---|---|
| Pure General | 1 | 0 |
| Pure Outcome | 0 | 1 |
| Mixed | 1 | 1 |
| Unknown | 0 | 0 |
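The tagging and filtering logic can be sketched as follows; the short term lists below are illustrative stand-ins for the curated 356-term General list and 51-term Outcome list.

```python
# Minimal sketch of the human-in-the-loop tagging step.
# These tiny lists stand in for the real curated term lists.
GENERAL_TERMS = ["plagiarism", "attendance", "office hour"]
OUTCOME_TERMS = ["analyze", "versus", "outcome"]

def tag_sentence(sentence):
    """Return (General, Outcome) binary flags for a sentence."""
    text = sentence.lower()
    general = int(any(term in text for term in GENERAL_TERMS))
    outcome = int(any(term in text for term in OUTCOME_TERMS))
    return general, outcome

def keep_pure_outcome(sentences):
    """Keep only sentences tagged Pure Outcome (General=0 & Outcome=1)."""
    return [s for s in sentences if tag_sentence(s) == (0, 1)]

sentences = [
    "Attendance is mandatory.",                    # Pure General -> dropped
    "Students will analyze real-world datasets.",  # Pure Outcome -> kept
    "Analyze the attendance policy of firms.",     # Mixed -> dropped
]
print(keep_pure_outcome(sentences))
# -> ['Students will analyze real-world datasets.']
```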
- Sentence Embedding: Next, we compute sentence embeddings with Sentence-BERT (SBERT, Siamese Bidirectional Encoder Representations from Transformers networks), using the all-mpnet-base-v2 model, which maps each sentence into a 768-dimensional space.
- Skills Similarity Calculation: After embedding the syllabus sentences and the DWAs/Tasks, we compute the pairwise cosine similarity between each syllabus sentence and each DWA/Task. Given the similarity scores for all the sentences of a syllabus, we take the maximum score for each DWA/Task (i.e., the most similar sentence) to obtain a 1-D vector of size 2070 (for DWAs) or 18429 (for Tasks) representing the scores of all DWAs/Tasks for a given course. In other words, we assume that each score indicates how well that course prepares the student for the given DWA/Task.
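The similarity-and-max-pooling step can be sketched in NumPy; random vectors stand in here for the real 768-dimensional SBERT embeddings.

```python
# Sketch of the skills-similarity step: pairwise cosine similarity between
# syllabus sentences and DWA embeddings, then a max over the sentences.
import numpy as np

rng = np.random.default_rng(0)
sent_emb = rng.normal(size=(5, 768))     # 5 syllabus sentences (stand-ins)
dwa_emb = rng.normal(size=(2070, 768))   # 2070 DWA embeddings (stand-ins)

# Normalize rows so the dot product equals cosine similarity.
sent_emb /= np.linalg.norm(sent_emb, axis=1, keepdims=True)
dwa_emb /= np.linalg.norm(dwa_emb, axis=1, keepdims=True)

sim = sent_emb @ dwa_emb.T        # (5, 2070) pairwise cosine similarities
course_vec = sim.max(axis=0)      # (2070,) best-matching sentence per DWA
print(course_vec.shape)           # -> (2070,)
```

Taking the per-DWA maximum keeps the single most relevant sentence per skill, so boilerplate sentences with low similarity do not dilute the score.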
Reinstall the required packages in a conda environment named 'SylltoONET' using the provided export file ('package-list.txt') as follows.
```
~$ conda create -n SylltoONET --file package-list.txt
```
- Install the required packages as described before.
- Update the variables and paths in the 'settings.py' file. Please make sure to follow the comments/instructions provided for each variable.
- Run 'main.py' as follows.
```
~$ anaconda3/envs/SylltoONET/bin/python /Syllabus-to-ONET/main.py
```