
ssm-rank-scraper

Description

A simple and quick Python script that scrapes the SSM ranking into MS Excel format. It can also compute the minimum points required and download the number of contracts for each combination of residency, place and type of contract.

A Universitaly account is required. You must also be registered for the SSM test of the year whose ranking you want to download.

Features

  • Pretty fast (~6 seconds on a MacBook Pro with M1 Pro);
  • Multiprocess-enabled (by default, the number of processes equals the number of cores);
  • The length of the ranking is determined dynamically;
  • Also downloads the number of contracts available for each combination of residency, place and type of contract;
  • Does not need Selenium WebDriver;
  • Relatively light dependencies;
  • Can compute the minimum points for each combination of residency, place and type of contract;
  • Can be imported and called on demand.

Usage

  1. Create and activate a virtual environment with Python ≤ 3.10

    python -m venv env

    on Windows: env\Scripts\activate

    on Unix systems: . env/bin/activate

  2. Install the dependencies with

    pip install -r requirements.txt

  3. Edit credentials_model.json and rename it to credentials.json:

    1. Set email to your Universitaly account email

    2. Set password to your Universitaly account password

  4. Run the script:

    usage: ssm_rank_scraper.py [-h] -Y YEARS_UNSPLITTED [--skip-min-pts] [--skip-number-of-contracts] [--sheet-name SHEET_NAME] [--no-save] [--no-skip] [-W WORKERS] [-O OUTPUT] [--min-pts-output MIN_PTS_OUTPUT] [--number-of-contract-output CONTRACTS_SAVE_PATH] [--trace-output TRACE_OUTPUT] [--no-backup]
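Step 3 above expects credentials.json to hold the account email and password. The following is a minimal sketch of how such a file could be read and validated. The function name load_credentials is hypothetical, and the key names email and password are assumed from the fields mentioned in step 3; the script's own helper may differ.

```python
import json
from pathlib import Path


def load_credentials(path="credentials.json"):
    """Read the credentials file and return (email, password).

    Hypothetical helper: the key names are assumed from step 3 of the
    usage instructions, not taken from the script itself.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Report every missing or empty field at once instead of failing on the first.
    missing = [key for key in ("email", "password") if not data.get(key)]
    if missing:
        raise ValueError("credentials file is missing: " + ", ".join(missing))
    return data["email"], data["password"]
```

Failing early on an incomplete file avoids sending an empty login request.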

Help

Download the latest SSM rank. More info here: https://github.com/zenodallavalle/ssm-rank-scraper

options:
  -h, --help            show this help message and exit
  -Y YEARS_UNSPLITTED, --years YEARS_UNSPLITTED
                        Specify download years (any non-digit character is a separator) (default: None)
  --skip-min-pts        Skip computation of minimum points (default: True)
  --skip-number-of-contracts
                        Skip download of number of contracts (default: True)
  --sheet-name SHEET_NAME
                        Specify sheet name for excel output files (default: 2023-10-26-19-27-03)
  --no-save             Skip saving output files (default: True)
  --no-skip             Do not skip saving files if last sheet is equal to current (default: True)
  -W WORKERS, --workers WORKERS
                        Specify number of workers (processes) to use for scraping, if None equal to cpu_count() (default: None)
  -O OUTPUT, --output OUTPUT
                        Specify rank output file name. It will be formatted with year (.format(year)). (default: data/rank_{}.xlsx)
  --min-pts-output MIN_PTS_OUTPUT
                        Specify min_pts output file name. It will be formatted with year (.format(year)). (default: data/min_pts_{}.xlsx)
  --number-of-contract-output CONTRACTS_SAVE_PATH
                        Specify number_of_contracts output file name. It will be formatted with year (.format(year)). (default:
                        data/contracts_{}.xlsx)
  --trace-output TRACE_OUTPUT
                        Specify trace output file name. It will be formatted with year (.format(year)). To skip trace output use --trace-output "".
                        (default: logs/trace_{}.log)
  --no-backup           Do not backup files before overwriting them (default: True)
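The help text says that any non-digit character separates years and that output file names are formatted with the year. A minimal sketch of both behaviours follows. The function names split_years and output_path are hypothetical, and the script's actual parsing may differ in detail.

```python
import re


def split_years(raw):
    """Split a string such as "2021,2022 2023" into [2021, 2022, 2023]."""
    # Any run of non-digit characters is treated as a separator;
    # empty chunks (leading or trailing separators) are dropped.
    return [int(chunk) for chunk in re.split(r"\D+", raw) if chunk]


def output_path(template, year):
    """Fill the year into a template such as "data/rank_{}.xlsx"."""
    return template.format(year)


years = split_years("2021,2022 2023")
paths = [output_path("data/rank_{}.xlsx", year) for year in years]
```

Under this reading, a single `-Y` value produces one output file per year.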

Advanced

While you can simply run ssm_rank_scraper.py, some users may need to use it differently. ssm_rank_scraper.py contains several helper functions that read the credentials, call grabber.grab (which returns a pd.DataFrame instance with the downloaded ranking), compute the minimum points for each combination of school, location and type of contract, and save the results. ssm_rank_scraper.py also writes a log to keep track of what is being done and to give the user feedback.
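To illustrate the minimum-points step, here is a minimal sketch in plain Python. It assumes that "minimum points" means the lowest score among the candidates assigned to a given combination of school, location and type of contract. The field names and sample rows are invented for illustration and are not the columns of the real ranking.

```python
def minimum_points(rows):
    """Return the lowest score per (school, location, contract) combination."""
    lowest = {}
    for row in rows:
        key = (row["school"], row["location"], row["contract"])
        points = row["points"]
        # Keep the smallest score seen so far for this combination.
        if key not in lowest or points < lowest[key]:
            lowest[key] = points
    return lowest


# Invented sample rows; the real ranking has different columns and values.
sample = [
    {"school": "Cardiology", "location": "Milano", "contract": "STAT", "points": 101.5},
    {"school": "Cardiology", "location": "Milano", "contract": "STAT", "points": 98.0},
    {"school": "Neurology", "location": "Padova", "contract": "STAT", "points": 87.25},
]
result = minimum_points(sample)
```

With the ranking held in a pd.DataFrame, the same idea would typically be expressed as a group-by on the three columns followed by a minimum.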

If you need to automate downloading the ranking, import grabber or grabber.grab directly. This avoids the output generated by ssm_rank_scraper.py.

grabber.grab also accepts callback functions that are executed in particular situations, such as when a page with no entries is encountered.
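The callback idea can be sketched as follows. This is not grabber.grab or its signature: collect_entries, fetch_page and on_empty_page are hypothetical names used only to show the pattern of paging until an empty page is found and notifying the caller when that happens.

```python
def collect_entries(fetch_page, on_empty_page=None):
    """Fetch pages 1, 2, 3, ... until one has no entries.

    fetch_page(page_number) must return a list of entries.
    on_empty_page(page_number), if given, is called once for the
    first empty page, which also ends the loop.
    """
    entries = []
    page = 1
    while True:
        rows = fetch_page(page)
        if not rows:
            if on_empty_page is not None:
                on_empty_page(page)
            break
        entries.extend(rows)
        page += 1
    return entries


# Stand-in for a real download: two pages with data, then nothing.
fake_pages = {1: ["a", "b"], 2: ["c"]}
empty_seen = []
result = collect_entries(lambda n: fake_pages.get(n, []), empty_seen.append)
```

Stopping at the first empty page is one way the length of the ranking can be determined without knowing it in advance.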

Tests

Although I haven't written unit tests for this library, I have tested it on the rankings for 2018, 2019, 2020, 2021, 2022 and 2023.
