A simple, fast Python script that scrapes the SSM ranking into MS Excel format. It can also compute the minimum points required for admission and download the number of contracts available for each combination of residency, place and type of contract.
A Universitaly account is required to view the ranking. You also need to be signed up for the SSM test of the year whose ranking you wish to download.
- Pretty fast (~6 seconds on a MacBook Pro with an M1 Pro);
- Multiprocess-enabled (by default, the number of processes equals the number of cores);
- The length of the ranking is determined dynamically;
- Also downloads the number of contracts available for each combination of residency, place and type of contract;
- Does not need the Selenium WebDriver;
- Relatively light dependencies;
- Can compute minimum points per combination of residency, place and type of contract;
- Can be imported and called on demand.
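As a rough illustration of the minimum-points feature listed above, the sketch below groups ranking rows by combination and keeps the lowest score in each group. The field names (`specialty`, `location`, `contract`, `points`) and the sample values are hypothetical; the actual columns produced by the scraper may be named differently, and this is not necessarily how the script computes the result.

```python
def min_points_per_combination(rows):
    """Return the lowest score found for each (specialty, location, contract) key.

    `rows` is an iterable of dicts; the key names used here are assumptions
    made for this example, not the scraper's real column names.
    """
    minimums = {}
    for row in rows:
        key = (row["specialty"], row["location"], row["contract"])
        points = row["points"]
        # Keep the smallest score seen so far for this combination.
        if key not in minimums or points < minimums[key]:
            minimums[key] = points
    return minimums


# Hypothetical sample data, for illustration only.
sample_rows = [
    {"specialty": "Cardiology", "location": "Milano", "contract": "STAT", "points": 112.5},
    {"specialty": "Cardiology", "location": "Milano", "contract": "STAT", "points": 108.0},
    {"specialty": "Cardiology", "location": "Roma", "contract": "STAT", "points": 104.25},
    {"specialty": "Neurology", "location": "Milano", "contract": "REG", "points": 99.0},
]

result = min_points_per_combination(sample_rows)
for combination, points in sorted(result.items()):
    print(combination, points)
```

With the ranking loaded into a pandas DataFrame, the same idea is usually written as a `groupby` on the three columns followed by `min()` on the score column.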
- Create and activate a virtual environment with Python ≤ 3.10:
      python -m venv env
  On Windows:
      env\Scripts\activate
  On Unix systems:
      . env/bin/activate
- Install the dependencies:
      pip install -r requirements.txt
- Edit credentials_model.json and rename it to credentials.json:
  - Set email to your Universitaly account email.
  - Set password to your Universitaly account password.
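For anyone scripting around the credentials file, a minimal sketch of reading and checking it is shown below. The two field names, `email` and `password`, come from the steps above; the helper names `parse_credentials` and `load_credentials` are hypothetical, and whether ssm_rank_scraper.py validates the file in this way is an assumption.

```python
import json
from pathlib import Path

REQUIRED_KEYS = ("email", "password")


def parse_credentials(text):
    """Parse the JSON text of a credentials file and return (email, password).

    Raises ValueError if either field is missing or empty.
    """
    data = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if not data.get(key)]
    if missing:
        raise ValueError("missing credential field(s): " + ", ".join(missing))
    return data["email"], data["password"]


def load_credentials(path="credentials.json"):
    """Read the credentials file from disk (hypothetical helper)."""
    return parse_credentials(Path(path).read_text(encoding="utf-8"))


# Example with inline JSON instead of a real file.
email, password = parse_credentials('{"email": "user@example.com", "password": "secret"}')
print(email)
```

Failing early on an empty field avoids starting a download that can only end in a login error.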
Usage
usage: ssm_rank_scraper.py [-h] -Y YEARS_UNSPLITTED [--skip-min-pts] [--skip-number-of-contracts] [--sheet-name SHEET_NAME] [--no-save] [--no-skip] [-W WORKERS] [-O OUTPUT] [--min-pts-output MIN_PTS_OUTPUT] [--number-of-contract-output CONTRACTS_SAVE_PATH] [--trace-output TRACE_OUTPUT] [--no-backup]
Download the latest SSM rank. More info here: https://github.com/zenodallavalle/ssm-rank-scraper
options:
  -h, --help
      show this help message and exit
  -Y YEARS_UNSPLITTED, --years YEARS_UNSPLITTED
      Specify download years (any non-digit character is a separator) (default: None)
  --skip-min-pts
      Skip computation of minimum points (default: True)
  --skip-number-of-contracts
      Skip download of number of contracts (default: True)
  --sheet-name SHEET_NAME
      Specify sheet name for excel output files (default: 2023-10-26-19-27-03)
  --no-save
      Skip saving output files (default: True)
  --no-skip
      Do not skip saving files if last sheet is equal to current (default: True)
  -W WORKERS, --workers WORKERS
      Specify number of workers (processes) to use for scraping, if None equal to cpu_count() (default: None)
  -O OUTPUT, --output OUTPUT
      Specify rank output file name. It will be formatted with year (.format(year)). (default: data/rank_{}.xlsx)
  --min-pts-output MIN_PTS_OUTPUT
      Specify min_pts output file name. It will be formatted with year (.format(year)). (default: data/min_pts_{}.xlsx)
  --number-of-contract-output CONTRACTS_SAVE_PATH
      Specify number_of_contracts output file name. It will be formatted with year (.format(year)). (default: data/contracts_{}.xlsx)
  --trace-output TRACE_OUTPUT
      Specify trace output file name. It will be formatted with year (.format(year)). To skip trace output use --trace-output "". (default: logs/trace_{}.log)
  --no-backup
      Do not backup files before overwriting them (default: True)
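The help text says that any non-digit character separates the years passed to -Y, and that output file names are formatted with the year through .format(year). The sketch below shows one way these two behaviours could work together. The function names `split_years` and `output_path` are hypothetical, and the script's real parsing code may differ.

```python
import re


def split_years(years_unsplitted):
    """Split a string such as "2022,2023" into a list of integer years.

    Every run of digits is treated as one year, so any non-digit
    character acts as a separator.
    """
    return [int(chunk) for chunk in re.findall(r"\d+", years_unsplitted)]


def output_path(template, year):
    """Fill the year into a template such as "data/rank_{}.xlsx"."""
    return template.format(year)


years = split_years("2022,2023")
paths = [output_path("data/rank_{}.xlsx", year) for year in years]
print(paths)
```

Under this reading, `-Y 2022,2023`, `-Y 2022-2023` and `-Y "2022 2023"` would all select the same two years, each written to its own file.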
While you can simply run ssm_rank_scraper.py, some of you may need to use it differently. ssm_rank_scraper.py contains several helper functions that read the credentials, call grabber.grab (which returns a pd.DataFrame instance with the downloaded ranking), compute the minimum points per combination of school, location and type of contract, and save the results. ssm_rank_scraper also writes a log, which keeps track of what is being done and gives the user feedback.
If you need to automate the download of the ranking, import grabber or grabber.grab directly. Doing so avoids the output generated by ssm_rank_scraper.
grabber.grab also accepts functions as callbacks, which are executed in particular situations, such as when a page with no entries is encountered.
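To make the callback idea concrete, here is a generic, self-contained sketch of a paginated download loop that stops at the first empty page and notifies a callback when that happens. It is not grabber.grab's actual signature: `scrape_all_pages`, `fetch_page` and `on_empty_page` are hypothetical names chosen for this example, and the real parameter names should be taken from the grabber source.

```python
def scrape_all_pages(fetch_page, on_empty_page=None):
    """Collect rows page by page until a page with no entries is found.

    fetch_page(page_number) must return a list of rows for that page.
    on_empty_page(page_number), if given, is called once when the first
    empty page is encountered. Both names are assumptions for this sketch.
    """
    rows = []
    page = 1
    while True:
        entries = fetch_page(page)
        if not entries:
            # An empty page marks the end of the ranking.
            if on_empty_page is not None:
                on_empty_page(page)
            break
        rows.extend(entries)
        page += 1
    return rows


# Fake data source standing in for the real website.
fake_pages = {1: ["candidate 1", "candidate 2"], 2: ["candidate 3"]}
empty_pages_seen = []

rows = scrape_all_pages(
    lambda page: fake_pages.get(page, []),
    on_empty_page=empty_pages_seen.append,
)
print(rows, empty_pages_seen)
```

A callback of this kind lets the caller log or react to the event without the download loop needing to know what the caller does with it.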
Although I haven't written unit tests for this library, I have tested it on the rankings for 2018, 2019, 2020, 2021, 2022 and 2023.