This package is designed to bridge the gap between data and analysis for NCAA D1 basketball. CBBpy can grab play-by-play, boxscore, and other game metadata for any NCAA D1 men's or women's basketball game. Inspired by the ncaahoopR package by Luke Benz - check that out if you are an R user!
CBBpy requires Python >= 3.9 as well as the following packages:
- pandas>=2.0.0
- numpy>=2.0.0
- python-dateutil>=2.4.0
- pytz>=2022.1
- tqdm>=4.63.0
- lxml>=4.9.0
- joblib>=1.0.0
- beautifulsoup4>=4.11.0
- requests>=2.27.0
- rapidfuzz>=3.9.0
- platformdirs>=4.0.0
Install using pip:
pip install cbbpy
Or upgrade an existing installation:
pip install --upgrade cbbpy
The men's and women's scrapers can be imported as such:
import cbbpy.mens_scraper as s
import cbbpy.womens_scraper as s
NOTE: Throughout CBBpy, a game ID is a valid ESPN game ID.
s.get_game_info(game_id: Union[str, int])
grabs all the metadata (game date, time, score, teams, referees, etc.) for a particular game.
s.get_game_boxscore(game_id: Union[str, int])
returns a pandas DataFrame with each player's stats for a particular game.
s.get_game_pbp(game_id: Union[str, int])
scrapes the play-by-play tables for a game and returns a pandas DataFrame, with each entry representing a play made during the game.
s.get_game(game_id: Union[str, int], info: bool = True, box: bool = True, pbp: bool = True)
gets all information about a game (game info, boxscore, PBP) and returns a tuple of results (game_info, boxscore, pbp). The info, box, and pbp arguments are booleans that can be set to False to skip any information you do not wish to scrape. For example, box=False returns an empty DataFrame for the boxscore while scraping the PBP and metadata normally.
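As a rough sketch of how the flags and the returned tuple fit together, the helper below skips the boxscore for one game. The helper name fetch_without_boxscore is hypothetical (it is not part of CBBpy), and it takes the scraper module as an argument so it can be tried without network access.

```python
from typing import Any, Tuple, Union


def fetch_without_boxscore(scraper: Any, game_id: Union[str, int]) -> Tuple[Any, Any, Any]:
    """Call get_game with box=False and unpack the (game_info, boxscore, pbp) tuple."""
    game_info, boxscore, pbp = scraper.get_game(game_id, info=True, box=False, pbp=True)
    return game_info, boxscore, pbp


# Intended use (needs cbbpy installed and network access):
#   import cbbpy.mens_scraper as s
#   info, box, pbp = fetch_without_boxscore(s, 401522202)
#   # per the description above, box should then be an empty DataFrame
```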
s.get_games_season(season: Union[str, int] = None, info: bool = True, box: bool = True, pbp: bool = True)
scrapes all game information for all finished or in-progress games in a particular season (defaults to the current season). As an example, to scrape games for the 2020-21 season, call get_games_season(2021). Returns a tuple of 3 DataFrames, similar to get_game. See get_game for an explanation of the booleans info, box, and pbp.
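The season argument is the year in which the season ends (2021 for 2020-21). A small sketch of that convention, using a hypothetical helper that is not part of CBBpy:

```python
from typing import Union


def season_label(season: Union[str, int]) -> str:
    """Turn the ending-year season argument into a label such as '2020-21'."""
    year = int(season)
    return f"{year - 1}-{year % 100:02d}"


print(season_label(2021))  # 2020-21
print(season_label(2018))  # 2017-18
```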
s.get_games_range(start_date: Union[int, datetime], end_date: Union[int, datetime], info: bool = True, box: bool = True, pbp: bool = True)
scrapes all game information for all finished or in-progress games between start_date and end_date (inclusive). As an example, to scrape games from November 30, 2022 to December 10, 2022, call get_games_range('11-30-2022', '12-10-2022'). Returns a tuple of 3 DataFrames, similar to get_game. See get_game for an explanation of the booleans info, box, and pbp.
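The date strings in the example are written month-day-year. If you already have date objects, a sketch like the following builds strings in that form; range_arg is a hypothetical helper, and whether other string formats are accepted is not stated here.

```python
from datetime import date


def range_arg(d: date) -> str:
    """Format a date as MM-DD-YYYY, the string form used in the example above."""
    return d.strftime("%m-%d-%Y")


start = range_arg(date(2022, 11, 30))
end = range_arg(date(2022, 12, 10))
print(start, end)  # 11-30-2022 12-10-2022

# With cbbpy installed and network access, these could then be passed along:
#   import cbbpy.mens_scraper as s
#   info, box, pbp = s.get_games_range(start, end)
```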
s.get_games_team(team: str, season: Union[str, int] = None, info: bool = True, box: bool = True, pbp: bool = True)
scrapes all game information for all finished or in-progress games in a particular season (defaults to the current season) for a given team. As an example, to scrape games for Duke's 2020-21 season, call get_games_team('duke', 2021); for their current season, you can just call get_games_team('duke'). If a given team does not have an exact match in the static list of teams scraped from ESPN's site, this function will scrape the games for the closest fuzzy-matched team (e.g. if "valpo" is provided as the team, the function will scrape the games for "Valparaiso"). Returns a tuple of 3 DataFrames, similar to get_game. See get_game for an explanation of the booleans info, box, and pbp.
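To give a feel for what fuzzy matching does, here is a minimal sketch built on the standard library's difflib. This is not CBBpy's implementation: CBBpy lists rapidfuzz as a dependency, and how it scores candidates is not described here. The helper name and the tiny team list are made up for illustration.

```python
from difflib import SequenceMatcher
from typing import Sequence


def closest_name(query: str, candidates: Sequence[str]) -> str:
    """Return the candidate most similar to `query`, ignoring case."""
    q = query.lower()
    return max(candidates, key=lambda name: SequenceMatcher(None, q, name.lower()).ratio())


# A tiny hypothetical team list; CBBpy's real list comes from ESPN's site.
teams = ["Duke", "Davidson", "Valparaiso", "Virginia"]
print(closest_name("valpo", teams))  # Valparaiso
```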
s.get_games_conference(conference: str, season: Union[str, int] = None, info: bool = True, box: bool = True, pbp: bool = True)
scrapes all game information for all finished or in-progress games in a particular season (defaults to the current season) for all teams in a given conference. As an example, to scrape games for the A10's 2017-18 season, call get_games_conference('a10', 2018); for their current season, you can just call get_games_conference('a10'). If a given conference does not have an exact match in the static list of conferences scraped from ESPN's site, this function will scrape the games for the closest fuzzy-matched conference (e.g. if "am east" is provided as the conference, the function will scrape the games for "America East Conference"). Returns a tuple of 3 DataFrames, similar to get_game. See get_game for an explanation of the booleans info, box, and pbp.
s.get_game_ids(date: Union[str, datetime])
returns a list of all game IDs for a particular date.
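Since get_game_ids returns plain game IDs, its output can be fed into the per-game functions. A sketch under that assumption, with a hypothetical helper that takes the scraper module as an argument:

```python
from typing import Any, Dict


def info_by_game_id(scraper: Any, day: Any) -> Dict[str, Any]:
    """Fetch get_game_info for every game ID that get_game_ids returns for `day`."""
    results = {}
    for game_id in scraper.get_game_ids(day):
        results[str(game_id)] = scraper.get_game_info(game_id)
    return results


# Intended use (needs cbbpy installed and network access):
#   import cbbpy.mens_scraper as s
#   infos = info_by_game_id(s, some_date)  # some_date: a str or datetime, as in the signature
```

For whole seasons or date ranges, get_games_season and get_games_range already do this kind of looping for you.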
s.get_player_info(player_id: Union[str, int])
returns a DataFrame describing the player's info from ESPN's bio page.
s.get_teams_from_conference(conference: str, season: Union[str, int] = None)
returns a list of the teams in the given conference for a season (defaults to the current season).
s.get_team_schedule(team: str, season: Union[str, int] = None)
returns a DataFrame of a team's schedule for a given season (defaults to the current season). If a given team does not have an exact match in the static list of teams scraped from ESPN's site, this function will scrape the schedule for the closest fuzzy-matched team (e.g. if "valpo" is provided as the team, the function will scrape the schedule for "Valparaiso").
s.get_conference_schedule(conference: str, season: Union[str, int] = None)
returns a DataFrame of the schedules for all teams in a given conference for a given season (defaults to the current season). If a given conference does not have an exact match in the static list of conferences scraped from ESPN's site, this function will scrape the schedules for the closest fuzzy-matched conference (e.g. if "am east" is provided as the conference, the function will scrape the schedules for "America East Conference").
Function call:
import cbbpy.mens_scraper as s
s.get_game_info('401522202')
Returns:
| | game_id | home_team | home_id | home_rank | home_record | home_score | away_team | away_id | away_rank | away_record | away_score | home_win | num_ots | is_conference | is_neutral | is_postseason | tournament | game_day | game_time | game_loc | arena | arena_capacity | attendance | tv_network | referee_1 | referee_2 | referee_3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 401522202 | UConn Huskies | 41 | 4 | 31-8 | 76 | San Diego State Aztecs | 21 | 5 | 32-7 | 59 | True | 0 | False | True | True | Men's Basketball Championship - National Championship | April 03, 2023 | 06:20 PM PDT | Houston, TX | NRG Stadium | 0 | 72423 | CBS | Ron Groover | Terry Oglesby | Keith Kimble |
Function call:
import cbbpy.womens_scraper as s
s.get_game_boxscore('401528028')
Returns (partially):
| | game_id | team | player | player_id | position | starter | min | fgm | fga | 2pm | 2pa | 3pm | 3pa | ftm | fta | oreb | dreb | reb | ast | stl | blk | to | pf | pts |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 401528028 | LSU Tigers | A. Reese | 4433402 | F | True | 29 | 5 | 12 | 5 | 12 | 0 | 0 | 5 | 8 | 6 | 4 | 10 | 5 | 3 | 1 | 0 | 3 | 15 |
1 | 401528028 | LSU Tigers | L. Williams | 4280886 | F | True | 37 | 9 | 16 | 9 | 16 | 0 | 0 | 2 | 2 | 1 | 4 | 5 | 0 | 3 | 0 | 3 | 4 | 20 |
2 | 401528028 | LSU Tigers | F. Johnson | 4698736 | G | True | 37 | 4 | 11 | 3 | 7 | 1 | 4 | 1 | 1 | 2 | 5 | 7 | 4 | 1 | 0 | 4 | 1 | 10 |
3 | 401528028 | LSU Tigers | K. Poole | 4433418 | G | True | 24 | 2 | 3 | 0 | 1 | 2 | 2 | 0 | 2 | 0 | 3 | 3 | 1 | 0 | 1 | 1 | 2 | 6 |
4 | 401528028 | LSU Tigers | A. Morris | 4281251 | G | True | 33 | 8 | 14 | 7 | 11 | 1 | 3 | 4 | 4 | 1 | 1 | 2 | 9 | 1 | 0 | 2 | 3 | 21 |
Function call:
import cbbpy.mens_scraper as s
s.get_game_pbp('401522202')
Returns (partially):
| | game_id | home_team | away_team | play_desc | home_score | away_score | half | secs_left_half | secs_left_reg | play_team | play_type | shooting_play | scoring_play | is_three | shooter | is_assisted | assist_player | shot_x | shot_y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 401522202 | UConn Huskies | San Diego State Aztecs | Jump Ball won by UConn | 0 | 0 | 1 | 1200 | 2400 | UConn Huskies | jump ball | False | False | False | | False | | nan | nan |
1 | 401522202 | UConn Huskies | San Diego State Aztecs | Jordan Hawkins made Jumper. Assisted by Adama Sanogo. | 2 | 0 | 1 | 1174 | 2374 | UConn Huskies | jumper | True | True | False | Jordan Hawkins | True | Adama Sanogo | 18 | 15 |
2 | 401522202 | UConn Huskies | San Diego State Aztecs | Lamont Butler made Three Point Jumper. Assisted by Matt Bradley. | 2 | 3 | 1 | 1152 | 2352 | San Diego State Aztecs | three point jumper | True | True | True | Lamont Butler | True | Matt Bradley | 39 | 22 |
3 | 401522202 | UConn Huskies | San Diego State Aztecs | Tristen Newton Turnover. | 2 | 3 | 1 | 1130 | 2330 | UConn Huskies | turnover | False | False | False | | False | | nan | nan |
4 | 401522202 | UConn Huskies | San Diego State Aztecs | Darrion Trammell made Three Point Jumper. Assisted by Keshad Johnson. | 2 | 6 | 1 | 1108 | 2308 | San Diego State Aztecs | three point jumper | True | True | True | Darrion Trammell | True | Keshad Johnson | 1 | 0 |
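In the sample rows above, secs_left_reg is secs_left_half plus 1200 during the first half. A minimal sketch of that arithmetic, assuming two 20-minute halves (as in this men's game) and ignoring overtime; the function name is hypothetical and this is not CBBpy's own code.

```python
HALF_SECONDS = 20 * 60  # assumed: two 20-minute halves in regulation


def secs_left_in_regulation(half: int, secs_left_half: int) -> int:
    """Seconds left in regulation, given the half (1 or 2) and seconds left in it."""
    if half not in (1, 2):
        raise ValueError("only regulation halves 1 and 2 are handled in this sketch")
    remaining_halves = 2 - half
    return secs_left_half + remaining_halves * HALF_SECONDS


print(secs_left_in_regulation(1, 1174))  # 2374, matching row 1 above
```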
Function call:
import cbbpy.mens_scraper as s
s.get_player_info('5105865')
Returns:
| | player_id | first_name | last_name | jersey_number | pos | status | team | experience | height | weight | birthplace | date_of_birth |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5105865 | Reed | Bailey | 1 | Forward | active | Davidson Wildcats | Junior | 6' 10" | 230 lbs | Harvard, MA | |
Function call:
import cbbpy.womens_scraper as s
s.get_team_schedule('davidson', 2022)
Returns (partially):
| | team | team_id | season | game_id | game_day | game_time | opponent | opponent_id | season_type | game_status | tv_network | game_result |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Davidson | 2166 | 2022 | 401370995 | November 09, 2021 | 04:00 PM PST | Delaware Blue Hens | 48 | Regular Season | Final | ESPN+ | W 93-71 |
1 | Davidson | 2166 | 2022 | 401370996 | November 13, 2021 | 05:30 PM PST | San Francisco Dons | 2539 | Regular Season | Final | | L 60-65 |
2 | Davidson | 2166 | 2022 | 401365883 | November 18, 2021 | 09:00 AM PST | New Mexico State Aggies | 166 | Regular Season | Final | ESPNU | L 64-75 |
3 | Davidson | 2166 | 2022 | 401377036 | November 19, 2021 | 11:30 AM PST | Pennsylvania Quakers | 219 | Regular Season | Final | ESPNU | W 72-60 |
4 | Davidson | 2166 | 2022 | 401377040 | November 21, 2021 | 03:00 PM PST | East Carolina Pirates | 151 | Regular Season | Final | ESPNU | W 76-67 |
Function call:
import cbbpy.mens_scraper as s
s.get_conference_schedule('ovc', 2015)
Returns (showing the middle of the output):
| | team | team_id | season | game_id | game_day | game_time | opponent | opponent_id | season_type | game_status | tv_network | game_result |
---|---|---|---|---|---|---|---|---|---|---|---|---|
30 | Belmont | 2057 | 2015 | 400766521 | March 06, 2015 | 07:15 PM PST | Eastern Kentucky Colonels | 2198 | Regular Season | Final | ESPNU | W 53-52 |
31 | Belmont | 2057 | 2015 | 400766705 | March 07, 2015 | 04:00 PM PST | Murray State Racers | 93 | Regular Season | Final | ESPN2 | W 88-87 |
32 | Belmont | 2057 | 2015 | 400785349 | March 20, 2015 | 12:30 PM PDT | Virginia Cavaliers | 258 | Postseason | Final | truTV | L 67-79 |
33 | Eastern Kentucky | 2198 | 2015 | 400596308 | November 14, 2014 | 04:00 PM PST | Savannah State Tigers | 2542 | Regular Season | Final | | W 76-53 |
34 | Eastern Kentucky | 2198 | 2015 | 400596315 | November 18, 2014 | 04:00 PM PST | Kentucky Christian Knights | 3077 | Regular Season | Final | | W 115-35 |
Feel free to reach out to me directly with any questions, requests, or suggestions at dnlcowan37@gmail.com.