Disable Cache Update #955
-
Dear @TheAnalystx, thanks for writing to us! There is no option to make the cache last forever, which would defeat the point of using a cache after all. The only thing you could do is deactivate the cache entirely:
```
from wetterdienst import Settings
from wetterdienst.provider.dwd.observation import DwdObservationRequest

settings = Settings(cache_disable=True)
request = DwdObservationRequest(
    ...,
    settings=settings
)
```
However, I have another proposal: could you please try the wetterdienst nightly from GitHub?
```
pip install git+https://github.com/earthobservations/wetterdienst
```
The latest update uses polars as a replacement for pandas, which should tremendously increase the data load speed, especially for high-resolution data.
-
Hi.
If you only want a fixed set of historical data, or data up to a specific cutoff time, to reside in the cache and to reuse it over and over again, it makes perfect sense to me to optionally configure the cache to persist forever, i.e. with an infinite TTL.
With kind regards,
Andreas.
On 13 June 2023 22:20:37 CEST, TheAnalystx ***@***.***> wrote:
Hi @gutzbenj
thank you for your fast reply.
I will give the nightly version a try, thank you!
However I do not understand the argument with the cache.
In my opinion, a cache is used to allow for faster data access. If I have to download files for a long time (currently 12 hours, and new files are still being created in the cache every minute) and tomorrow I have to start the whole thing again, even though I know the required data has already been downloaded, then that defeats the idea of a cache, doesn't it?
For me, the expected behaviour would be that files only have to be updated if they do not cover the parameters or the start/end time of my request. I thought that is what the "cache" file in the cache folder is for.
Example:
```
from wetterdienst.provider.dwd.observation import (
    DwdObservationDataset,
    DwdObservationRequest,
    DwdObservationResolution,
)

# Period I actually need: 2017-2021 in 10-minute resolution.
start_date = "2017-01-01"
end_date = "2021-12-31"

request = DwdObservationRequest(
    parameter=[
        DwdObservationDataset.TEMPERATURE_AIR,
        DwdObservationDataset.WIND,
        DwdObservationDataset.PRECIPITATION,
    ],
    resolution=DwdObservationResolution.MINUTE_10,
    start_date=start_date,
    end_date=end_date,
).all()
```
would write entries to a cache_log file like:
```
Station001, 01.01.2017:00:00:00, 31.12.2021:23:59:59, TEMPERATURE_AIR, THE_FILE_HASH
Station002, ...
```
When I now execute the above command again, it sees that the requested period is already in the cache, and no download or update is required.
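To make the idea concrete, here is a minimal sketch of such a coverage check; the cache_log.csv file, its columns, and the is_covered() helper are hypothetical and not part of wetterdienst:
```
import csv
from datetime import datetime

def is_covered(cache_log_path, station_id, dataset, start, end):
    """Return True if the requested period for a station/dataset is already
    covered by an entry in the hypothetical cache log."""
    fmt = "%d.%m.%Y:%H:%M:%S"
    with open(cache_log_path, newline="") as f:
        for row in csv.reader(f):
            log_station, log_start, log_end, log_dataset, _file_hash = [c.strip() for c in row]
            if log_station != station_id or log_dataset != dataset:
                continue
            if datetime.strptime(log_start, fmt) <= start and end <= datetime.strptime(log_end, fmt):
                return True
    return False

# Only hit the network if the cache log does not already cover the request.
if not is_covered("cache_log.csv", "Station001", "TEMPERATURE_AIR",
                  datetime(2017, 1, 1), datetime(2021, 12, 31, 23, 59, 59)):
    ...  # download or refresh only the missing files
```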
-
Hi again,
your use-case resonates very much with me. If the current cache subsystem cannot be bent into the right shape to solve this problem, let's look at different approaches to improving efficiency in data processing.
Exploring a way to work on a mirror of the raw data would also be sensible.
With kind regards,
Andreas.
P.S. I am currently traveling, so for now my contribution to this discussion is limited to sharing my support for the idea.
On 14 June 2023 11:51:32 CEST, TheAnalystx ***@***.***> wrote:
Two hours for all data would be reasonable for me too; unfortunately, it doesn't seem to work like that for me. I just benchmarked it and have now been downloading the data again for 3 hours. In the meantime I downloaded 449 cache files in 198 minutes, i.e. about 2.26 downloads per minute.
There are 1614 .zip files for TEMPERATURE_AIR/10Min/Historical. I requested 3 parameters, so roughly 4842 (3 × 1614) files. 4842 / 2.26 ≈ 2142 minutes, or about 35.7 hours. This is too much! Also, I don't know how cache expiry is handled; in the worst case, the cache entry for the first file has already expired by then and is downloaded again right away. For my project I need the 10-minute data between 2017 and 2021 for all weather stations and all available parameters. Unfortunately, this currently does not seem possible.
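The estimate above as a short snippet, assuming the observed download rate stays constant:
```
# Back-of-envelope estimate of the total download time at the observed rate.
files_downloaded = 449
minutes_elapsed = 198
rate = files_downloaded / minutes_elapsed      # ~2.27 files per minute

files_per_dataset = 1614                       # .zip files for 10-minute historical data
datasets = 3                                   # TEMPERATURE_AIR, WIND, PRECIPITATION
total_files = files_per_dataset * datasets     # 4842

total_minutes = total_files / rate
print(f"{total_minutes / 60:.1f} hours")       # roughly 35-36 hours
```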
-
I think the caching should already work like that, no?
On 14 June 2023 12:16:23 CEST, TheAnalystx ***@***.***> wrote:
On top of that, I think the speed could be boosted tremendously with a slightly modified caching strategy, for example a lazy cache refresh: the meta information of the data is stored in a cache_meta_index file. When executing a wetterdienst request, it first checks whether the requested data is already in the cache by looking into the cache_meta_index file. If not, or only partly, only the required data is updated. If there is some interest, I could have a look at it and do some prototyping for the implementation. However, I would probably need some guidance on where to begin.
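A minimal sketch of the selective-refresh part; the helper, the file name, and the cache location are made-up assumptions, not wetterdienst API:
```
from pathlib import Path

def files_to_refresh(required_files: set[str], cache_dir: str) -> set[str]:
    """Return the names of required remote files that are not yet in the local cache."""
    cached = {p.name for p in Path(cache_dir).expanduser().glob("**/*.zip")}
    return required_files - cached

# Example usage with a made-up DWD-style file name and cache location.
required = {"10minutenwerte_TU_00044_20170101_20211231_hist.zip"}
for name in files_to_refresh(required, "~/.cache/wetterdienst"):
    ...  # download only this file, then record it in the cache_meta_index
```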
-
Hi again,
your proposal is very reasonable. Its implementation, however, may happen orthogonally to the caching subsystem.
To be more specific, we are planning to add a data format rewriting subsystem, where the data would be stored in Parquet files and published as a Zarr archive. In this way, Wetterdienst itself might not even be required for consuming the data.
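Once such a mirror exists, the data could be consumed with nothing but polars; the file path and column names in this sketch are hypothetical:
```
import polars as pl
from datetime import datetime

# Hypothetical Parquet mirror of DWD 10-minute observations.
df = (
    pl.scan_parquet("dwd_obs_10min/*.parquet")   # lazy scan, no full load into memory
    .filter(pl.col("station_id") == "00044")
    .filter(pl.col("date").is_between(datetime(2017, 1, 1), datetime(2021, 12, 31, 23, 59, 59)))
    .collect()
)
```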
With kind regards,
Andreas.
On 14 June 2023 15:35:38 CEST, TheAnalystx ***@***.***> wrote:
Hmm, could be, maybe I misunderstood the caching process. However, if I open a cache file with 7-Zip, there are text files inside. This means there is overhead when parsing the data again each time, because when you parse text files into pandas, pandas has to read ALL values of the txt/csv file first to determine the dtype of each column and whether it allows NaN values. The time difference is significant when dealing with bigger data, especially when dealing with multiple files. So if the cache stored pickle files (or other optimized formats such as Parquet, Feather (Apache Arrow) or HDF5) within the zip, loading should be much faster. Other optimizations could be to store all data within one zip file or in batched files, or, even better, an SQLite database managing all the downloaded data and updating only when necessary.
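A small sketch of the difference; the file name, column names, dtypes, and missing-value marker are assumptions for illustration:
```
import pandas as pd

# Reading the raw DWD text file: with an explicit dtype mapping, pandas can skip
# scanning every value to infer each column's type (assumed columns and dtypes).
dtypes = {"STATIONS_ID": "int32", "TT_10": "float32", "RF_10": "float32"}
df = pd.read_csv("produkt_zehn_min_tu.txt", sep=";", dtype=dtypes, na_values="-999")

# Write once to a typed binary format; subsequent loads skip parsing and
# dtype inference entirely.
df.to_parquet("produkt_zehn_min_tu.parquet")
df = pd.read_parquet("produkt_zehn_min_tu.parquet")
```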
-
Hi once more,
"an SQLite database managing all the downloaded data"
This is an approach you can also leverage today, and we know users are actually doing that with Wetterdienst: just export the data of interest into an SQLite database (or any other) and run your analytic queries on that database.
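A minimal sketch of that workflow using plain pandas and sqlite3, independent of wetterdienst's own export helpers; the tiny synthetic frame stands in for whatever DataFrame your request returned:
```
import sqlite3
import pandas as pd

# Stand-in for the DataFrame obtained from a wetterdienst request.
values_df = pd.DataFrame({
    "station_id": ["00044", "00044"],
    "parameter": ["temperature_air_mean_200", "wind_speed"],
    "date": ["2017-01-01 00:00", "2017-01-01 00:00"],
    "value": [1.2, 3.4],
})

con = sqlite3.connect("dwd_observations.sqlite")
values_df.to_sql("observations", con, if_exists="append", index=False)

# Later analyses run against the local database instead of re-downloading anything.
result = pd.read_sql("SELECT parameter, AVG(value) FROM observations GROUP BY parameter", con)
con.close()
```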
With kind regards,
Andreas.
-
Hi. You can find the relevant spot by searching for "export" in the documentation. Let us know if that already helps. Cheers, Andreas.
On 14 June 2023 22:00:04 CEST, TheAnalystx ***@***.***> wrote:
Oh, also nice. Is there a guide for that? I couldn't find any GitHub discussion about it.
-
Hi, I did some testing; since I invested some time into it, I wanted to share the findings. Maybe it helps?
I've tested a download log that inventories downloaded files, complete with their metadata. The program I've written first verifies whether any updates are necessary before making a request for data. If updates are required, it selectively refreshes those files pertinent to the upcoming request.
In initial testing, I utilized SQLite for quick data retrieval. However, I discovered that the local database swelled considerably as it wasn't being compressed, leading to a size greater than 20 GB. As a subsequent approach, I downloaded the data files in zip format and stored them as pickle files within the zip archive. This method showed noticeable improvements in loading speed and, compared to the SQLite approach, also in storage size. Notably, the process was further accelerated when I preloaded all pickle files and subsequently merged them. For monitoring the progress, a progress bar or periodic print statements were highly useful, especially during the download phases.
However, I encountered some issues while attempting to load the data into memory (my system has 32 GB DDR2 RAM). Initially, I attempted to transform the data into a 'value' and 'variable' format, similar to the wetterdienst structure. This approach, unfortunately, significantly overloaded the memory. As an alternative, I reverted to the more compact data format provided by DWD, but even that proved to be a strain on the memory. To resolve this, I fine-tuned the dtype of each column to the smallest feasible option and eliminated all meta-info columns. This adjustment not only fit the data into memory but also left enough room for loading additional data and performing merging, data processing, and other operations.
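A sketch of the dtype fine-tuning described above; the column names and values are illustrative:
```
import pandas as pd

# Illustrative downcasting of a DWD-style observation frame so it fits into memory.
df = pd.DataFrame({
    "STATIONS_ID": [44, 44],
    "QN": [3, 3],
    "TT_10": [1.2, 1.4],
    "RF_10": [81.0, 82.5],
})

# Shrink numeric columns to the smallest dtype that still holds the values.
for col in ("STATIONS_ID", "QN"):
    df[col] = pd.to_numeric(df[col], downcast="unsigned")
for col in ("TT_10", "RF_10"):
    df[col] = df[col].astype("float32")

print(df.dtypes)
print(f"{df.memory_usage(deep=True).sum()} bytes")
```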
-
Hi,
I want to do some research which requires data in 10-minute intervals for all stations.
It takes nearly a day to download the cache, and after some time it seems to reload the files. Since I already know the period I want is in the cache, I don't need any cache updates. Is there an option for that? I noticed "cache_disable" but couldn't find any documentation on it.