ssm-rank-scraper

- Description
- Features
- Usage
- Help
- Advanced
- Tests

Description

A simple and quick Python script that scrapes the SSM ranking into MS Excel format. It can also compute the minimum points required and download the number of contracts for each combination of residency, place and type of contract.

A Universitaly account is required. You must also be registered for the SSM test of the year whose ranking you wish to download.


Features

- Pretty fast (about 6 seconds on a MacBook Pro M1 Pro)
- Multiprocess-enabled (by default, the number of processes equals the number of cores)
- The length of the ranking is determined dynamically
- Also downloads the number of contracts available for each combination of residency, place and type of contract
- Does not need the Selenium WebDriver
- Relatively light dependencies
- Can compute the minimum points per combination of residency, place and type of contract
- Can be imported and called on demand

Usage

Create and activate a virtual environment with Python ≤ 3.10:

python -m venv env

On Windows: env\Scripts\activate

On Unix systems: . env/bin/activate

Install the dependencies with:

pip install -r requirements.txt

Edit credentials_model.json and rename it to credentials.json:
1. Set email to your Universitaly account email
2. Set password to your Universitaly account password
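
Before a run, the renamed file can be sanity-checked. The sketch below assumes credentials.json is a flat JSON object with the two keys named above, email and password. `load_credentials` is a hypothetical helper written for this example, not a function from the repository.

```python
import json
from pathlib import Path


def load_credentials(path="credentials.json"):
    """Read the credentials file and check that both fields are filled in.

    Hypothetical helper: assumes a flat JSON object with "email" and
    "password" keys, as described in the steps above.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Collect keys that are absent or left empty.
    missing = [key for key in ("email", "password") if not data.get(key)]
    if missing:
        raise ValueError(f"Missing or empty fields in {path}: {', '.join(missing)}")
    return data["email"], data["password"]
```

Calling `load_credentials()` from the repository folder raises a `ValueError` if either field was left blank, which is easier to diagnose than a failed login.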
Then run the script with the following arguments:

usage: ssm_rank_scraper.py [-h] -Y YEARS_UNSPLITTED [--skip-min-pts] [--skip-number-of-contracts] [--sheet-name SHEET_NAME] [--no-save] [--no-skip] [-W WORKERS] [-O OUTPUT] [--min-pts-output MIN_PTS_OUTPUT] [--number-of-contract-output CONTRACTS_SAVE_PATH] [--trace-output TRACE_OUTPUT] [--no-backup]

Help

Download the latest SSM rank. More info here: https://github.com/zenodallavalle/ssm-rank-scraper

options:
  -h, --help            show this help message and exit
  -Y YEARS_UNSPLITTED, --years YEARS_UNSPLITTED
                        Specify download years (any non-digit character is a separator) (default: None)
  --skip-min-pts        Skip computation of minimum points (default: True)
  --skip-number-of-contracts
                        Skip download of number of contracts (default: True)
  --sheet-name SHEET_NAME
                        Specify sheet name for excel output files (default: 2023-10-26-19-27-03)
  --no-save             Skip saving output files (default: True)
  --no-skip             Do not skip saving files if last sheet is equal to current (default: True)
  -W WORKERS, --workers WORKERS
                        Specify number of workers (processes) to use for scraping, if None equal to cpu_count() (default: None)
  -O OUTPUT, --output OUTPUT
                        Specify rank output file name. It will be formatted with year (.format(year)). (default: data/rank_{}.xlsx)
  --min-pts-output MIN_PTS_OUTPUT
                        Specify min_pts output file name. It will be formatted with year (.format(year)). (default: data/min_pts_{}.xlsx)
  --number-of-contract-output CONTRACTS_SAVE_PATH
                        Specify number_of_contracts output file name. It will be formatted with year (.format(year)). (default:
                        data/contracts_{}.xlsx)
  --trace-output TRACE_OUTPUT
                        Specify trace output file name. It will be formatted with year (.format(year)). To skip trace output use --trace-output "".
                        (default: logs/trace_{}.log)
  --no-backup           Do not backup files before overwriting them (default: True)
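
Two behaviours described in the help text can be illustrated in a few lines: the --years value is split on any non-digit character, and each output file name is a template filled in with `.format(year)`. This is only a sketch of the documented behaviour. `split_years` is a hypothetical name, and the actual logic in year_parser.py may differ.

```python
import re


def split_years(years_unsplitted):
    """Return the years in the string, treating any non-digit run as a separator."""
    return [int(chunk) for chunk in re.findall(r"\d+", years_unsplitted)]


# Default value of -O/--output, as shown in the help text.
output_template = "data/rank_{}.xlsx"

for year in split_years("2022,2023"):
    # Prints data/rank_2022.xlsx, then data/rank_2023.xlsx
    print(output_template.format(year))
```

Under this reading, `-Y 2021-2022` and `-Y "2021 2022"` would both select the same two years.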

Advanced

While you can simply run ssm_rank_scraper.py, some users may need to use it differently. ssm_rank_scraper.py contains several helper functions that read the credentials, call grabber.grab (which returns a pd.DataFrame instance with the downloaded ranking), compute the minimum points per combination of school, location and type of contract, and save the results. ssm_rank_scraper also writes a log to keep track of what is being done and to give the user feedback.
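
To make the minimum-points step concrete, here is a minimal sketch. It assumes each ranking row records the combination a candidate was assigned to and the candidate's score, so the minimum for a combination is the lowest score among its rows. It uses plain Python dictionaries for brevity, whereas the script itself works on a pd.DataFrame. The field names and sample rows are made up for illustration.

```python
def min_points(rows):
    """Map each (residency, place, contract) combination to its lowest score."""
    lowest = {}
    for row in rows:
        key = (row["residency"], row["place"], row["contract"])
        pts = row["points"]
        if key not in lowest or pts < lowest[key]:
            lowest[key] = pts
    return lowest


# Made-up sample rows; the field names are illustrative, not the scraper's columns.
ranking = [
    {"residency": "Cardiology", "place": "Milano", "contract": "state", "points": 112.5},
    {"residency": "Cardiology", "place": "Milano", "contract": "state", "points": 104.0},
    {"residency": "Cardiology", "place": "Padova", "contract": "state", "points": 98.25},
]

for combination, points in min_points(ranking).items():
    print(combination, points)
```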

If you need to automate the download of the ranking, import grabber or grabber.grab directly. This avoids the output generated by ssm_rank_scraper.

grabber.grab also accepts callback functions that are executed in particular situations, such as when a page with no entries is encountered.
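
The callback idea can be sketched with a stand-in function. `grab_pages`, `fetch_page` and `on_empty_page` are hypothetical names chosen for this example, and grabber.grab's real signature and callback names may differ. The loop also shows one way the length of a ranking can be found dynamically: keep requesting pages until one comes back empty.

```python
def grab_pages(fetch_page, on_empty_page=None):
    """Collect rows page by page until a page with no entries is found.

    Stand-in for illustration only; grabber.grab's real signature may differ.
    """
    rows = []
    page = 1
    while True:
        entries = fetch_page(page)
        if not entries:
            # Notify the caller, then stop: the ranking has ended.
            if on_empty_page is not None:
                on_empty_page(page)
            break
        rows.extend(entries)
        page += 1
    return rows


# Fake data source: two pages with entries, then nothing.
fake_site = {1: ["a", "b"], 2: ["c"]}
empty_pages = []

rows = grab_pages(lambda n: fake_site.get(n, []), on_empty_page=empty_pages.append)
print(rows)         # ['a', 'b', 'c']
print(empty_pages)  # [3]
```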

Tests

Although I haven't written unit tests for this library, I have tested it on the rankings for 2018, 2019, 2020, 2021, 2022 and 2023.
