statcast throwing KeyError on certain dates in 2023 #375

Open
Haman-Karn opened this issue Aug 16, 2023 · 4 comments
Comments

@Haman-Karn

While downloading all of the Statcast data, I kept getting an error at around 98%. I eventually narrowed it down to 2023-06-25 as the first problematic date. Later dates also cause the error, but I've stopped at 06-25 because that amount of data is good enough for my current purposes.

The code I'm executing is this:
stats = statcast(start_dt="2023-06-25")

Upon execution, my terminal looks like this:

This is a large query, it may take a moment to complete
  0%|                                                                                                                                   | 0/1 [00:00<?, ?it/s] 
Traceback (most recent call last):
  File "c:\Users\nosoa\Documents\glb\getstats.py", line 6, in <module>
    stats = statcast(start_dt="2023-06-25")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\statcast.py", line 113, in statcast
    return _handle_request(start_dt_date, end_dt_date, 1, verbose=verbose,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\statcast.py", line 76, in _handle_request
    dataframe_list.append(future.result())
                          ^^^^^^^^^^^^^^^
  File "C:\Users\nosoa\AppData\Local\Programs\Python\Python311\Lib\concurrent\futures\_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nosoa\AppData\Local\Programs\Python\Python311\Lib\concurrent\futures\_base.py", line 401, in __get_result
    raise self._exception
  File "C:\Users\nosoa\AppData\Local\Programs\Python\Python311\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\cache\cache.py", line 58, in _cached
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\statcast.py", line 31, in _small_request
    data = data.sort_values(
           ^^^^^^^^^^^^^^^^^
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pandas\core\frame.py", line 6740, in sort_values
    keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pandas\core\frame.py", line 6740, in <listcomp>
    keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pandas\core\generic.py", line 1778, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'game_date'
@ss77995ss

I tested stats = statcast(start_dt="2023-06-25") on both Colab (Python 3.10.12) and my local environment (3.11.2), and it worked fine in both.

Judging by the dataframe_list.append(future.result()) line in the traceback, I guess something may have gone wrong in concurrent mode.

Maybe turning off parallel mode will help?

stats = statcast(start_dt="2023-06-25", parallel=False)

@Haman-Karn
Author

I discovered the issue -- there must have been something corrupted in the cache. Disabling the cache fixed the problem. But attempting to purge the cache also results in an error.

Traceback (most recent call last):
  File "c:\Users\nosoa\Documents\glb\getstats.py", line 5, in <module>
    pybaseball.cache.purge()
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\cache\cache.py", line 31, in purge
    records = [cache_record.CacheRecord(filename) for filename in record_files]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\cache\cache.py", line 31, in <listcomp>
    records = [cache_record.CacheRecord(filename) for filename in record_files]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\cache\cache_record.py", line 23, in __init__
    self.data = cast(Dict[str, Any], file_utils.load_json(filename))
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\cache\file_utils.py", line 28, in load_json
    return cast(JSONData, json.load(json_file))
                          ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nosoa\AppData\Local\Programs\Python\Python311\Lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
           ^^^^^^^^^^^^^^^^
  File "C:\Users\nosoa\AppData\Local\Programs\Python\Python311\Lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nosoa\AppData\Local\Programs\Python\Python311\Lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nosoa\AppData\Local\Programs\Python\Python311\Lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 37 (char 36)

@ss77995ss

I found that some cache files were not saved completely, which is why cache.purge() cannot parse them.
In my case, the files whose names start with _small_request all contain only

{"func": "_small_request", "args": [

Because this is not valid JSON, parsing it raises a decode error.
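
For illustration, here is a minimal sketch that reproduces the decode error using only the standard library. It assumes the cache file holds exactly the truncated text quoted above; the variable names are made up for the example.

```python
import json

# Assumption: the truncated cache record contains exactly this text.
truncated = '{"func": "_small_request", "args": ['

try:
    json.loads(truncated)
except json.JSONDecodeError as err:
    error = err
    # The parser reaches the end of the input while expecting a list element.
    print(err)  # Expecting value: line 1 column 37 (char 36)
```

The reported position (char 36) is one past the opening bracket, which matches the position in the purge traceback earlier in this thread.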

You can find the cache files in /Users/{user_name}/.pybaseball/cache, or in /root/.pybaseball/cache on Colab.

IMO, for now the only option is to delete those invalid cache files manually, since they also do not contain an expiration time.
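
The manual cleanup could be scripted roughly as follows. This is only a sketch, not part of pybaseball: the helper names and the `*.json` glob pattern are assumptions, so check what the files in your cache directory are actually called before deleting anything.

```python
import json
from pathlib import Path


def find_invalid_json(cache_dir, pattern="*.json"):
    """Return files in cache_dir matching pattern that do not parse as JSON."""
    invalid = []
    for path in sorted(Path(cache_dir).glob(pattern)):
        try:
            with path.open(encoding="utf-8") as handle:
                json.load(handle)
        except (json.JSONDecodeError, UnicodeDecodeError):
            invalid.append(path)
    return invalid


def remove_invalid_json(cache_dir, pattern="*.json", dry_run=True):
    """List the invalid files; delete them only when dry_run is False."""
    invalid = find_invalid_json(cache_dir, pattern)
    if not dry_run:
        for path in invalid:
            path.unlink()
    return invalid
```

Calling `remove_invalid_json(Path.home() / ".pybaseball" / "cache")` first only lists the files that fail to parse; pass `dry_run=False` once the list looks right. Any data files that belonged to the deleted records are left in place.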

@ss77995ss

Should be fixed in #438


2 participants