Disable Cache Update #955
-
Dear @TheAnalystx, thanks for writing to us! There is no option to make the cache last forever, which would defeat the point of using a cache after all. The only thing you could do is deactivate the cache entirely:
```
from wetterdienst import Settings
from wetterdienst.provider.dwd.observation import DwdObservationRequest

settings = Settings(cache_disable=True)
request = DwdObservationRequest(
    ...,
    settings=settings
)
```
However, I have another proposal: could you please try the wetterdienst nightly from GitHub?
```
pip install git+https://github.com/earthobservations/wetterdienst
```
The latest update uses polars as a replacement for pandas, which should tremendously increase the data load speed, especially for high-resolution data.
-
Hi.
If you only want a fixed set of historical data, or data up to a specific cutoff time, to reside in the cache and to reuse it over and over again, it makes perfect sense to me to optionally configure the cache to persist forever, i.e. with an infinite TTL.
With kind regards,
Andreas.
On 13 June 2023 22:20:37 CEST, TheAnalystx ***@***.***> wrote:
Hi @gutzbenj
thank you for your fast reply.
I will give the nightly version a try, thank you!
However I do not understand the argument with the cache.
In my opinion, a cache is used to allow for faster data access. If I have to download files for a long time (currently 12 hours, and new files are still being created in the cache every minute) and tomorrow I have to start the whole thing again, even though I know the required data has already been downloaded, then that defeats the idea of a cache, doesn't it?
For me, the expected behaviour would be that files only have to be updated if they do not cover the parameters or the start/end time of my request. I thought that is what the "cache" file in the cache folder is for.
Example:
```
from wetterdienst.provider.dwd.observation import (
    DwdObservationDataset,
    DwdObservationRequest,
    DwdObservationResolution,
)

# Period I actually need: 2017-2021 in 10-minute resolution.
start_date = "2017-01-01"
end_date = "2021-12-31"

request = DwdObservationRequest(
    parameter=[
        DwdObservationDataset.TEMPERATURE_AIR,
        DwdObservationDataset.WIND,
        DwdObservationDataset.PRECIPITATION,
    ],
    resolution=DwdObservationResolution.MINUTE_10,
    start_date=start_date,
    end_date=end_date,
).all()
```
would write entries to a cache_log file like:
```
Station001, 01.01.2017:00:00:00, 31.12.2021:23:59:59, TEMPERATURE_AIR, THE_FILE_HASH
Station002, ...
```
When I now execute the above command again, it sees that the requested period is already in the cache, and no download or update is required.
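To make the idea concrete, here is a minimal sketch of such a coverage check; the cache_log.csv file, its columns, and the is_covered() helper are hypothetical and not part of wetterdienst:
```
import csv
from datetime import datetime

def is_covered(cache_log_path, station_id, dataset, start, end):
    """Return True if the requested period for a station/dataset is already
    covered by an entry in the hypothetical cache log."""
    fmt = "%d.%m.%Y:%H:%M:%S"
    with open(cache_log_path, newline="") as f:
        for row in csv.reader(f):
            log_station, log_start, log_end, log_dataset, _file_hash = [c.strip() for c in row]
            if log_station != station_id or log_dataset != dataset:
                continue
            if datetime.strptime(log_start, fmt) <= start and end <= datetime.strptime(log_end, fmt):
                return True
    return False

# Only hit the network if the cache log does not already cover the request.
if not is_covered("cache_log.csv", "Station001", "TEMPERATURE_AIR",
                  datetime(2017, 1, 1), datetime(2021, 12, 31, 23, 59, 59)):
    ...  # download or refresh only the missing files
```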
-
Hi again,
your use-case resonates very much with me. If the current cache subsystem cannot be bent into the right shape to solve this problem, let's look at different approaches to improving efficiency in data processing.
Exploring a way to work on a mirror of the raw data would also be sensible.
With kind regards,
Andreas.
P.S. I am currently traveling, so for now my contribution to this discussion is limited to sharing my support for the idea.
On 14 June 2023 11:51:32 CEST, TheAnalystx ***@***.***> wrote:
Two hours for all data would be reasonable for me too; unfortunately, it doesn't seem to work like that for me. I just benchmarked it and have now been downloading the data again for 3 hours. In the meantime I downloaded 449 cache files in 198 minutes, i.e. about 2.26 downloads per minute.
There are 1614 .zip files for TEMPERATURE_AIR/10Min/Historical. I requested 3 parameters, so roughly 4842 (3 × 1614) files. 4842 / 2.26 ≈ 2142 minutes, or about 35.7 hours. This is too much! Also, I don't know how cache expiry is handled; in the worst case, the cache entry for the first file has already expired by then and is downloaded again right away. For my project I need the 10-minute data between 2017 and 2021 for all weather stations and all available parameters. Unfortunately, this currently does not seem possible.
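The estimate above as a short snippet, assuming the observed download rate stays constant:
```
# Back-of-envelope estimate of the total download time at the observed rate.
files_downloaded = 449
minutes_elapsed = 198
rate = files_downloaded / minutes_elapsed      # ~2.27 files per minute

files_per_dataset = 1614                       # .zip files for 10-minute historical data
datasets = 3                                   # TEMPERATURE_AIR, WIND, PRECIPITATION
total_files = files_per_dataset * datasets     # 4842

total_minutes = total_files / rate
print(f"{total_minutes / 60:.1f} hours")       # roughly 35-36 hours
```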
-
I think the caching should already work like that, no?
On 14 June 2023 12:16:23 CEST, TheAnalystx ***@***.***> wrote:
On top of that, I think the speed could be boosted tremendously with a slightly modified caching strategy, for example a lazy cache refresh: the meta information of the data is stored in a cache_meta_index file. When executing a wetterdienst request, it first checks whether the requested data is already in the cache by looking into the cache_meta_index file. If not, or only partly, only the required data is updated. If there is some interest, I could have a look at it and do some prototyping for the implementation. However, I would probably need some guidance on where to begin.
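A minimal sketch of the selective-refresh part; the helper, the file name, and the cache location are made-up assumptions, not wetterdienst API:
```
from pathlib import Path

def files_to_refresh(required_files: set[str], cache_dir: str) -> set[str]:
    """Return the names of required remote files that are not yet in the local cache."""
    cached = {p.name for p in Path(cache_dir).expanduser().glob("**/*.zip")}
    return required_files - cached

# Example usage with a made-up DWD-style file name and cache location.
required = {"10minutenwerte_TU_00044_20170101_20211231_hist.zip"}
for name in files_to_refresh(required, "~/.cache/wetterdienst"):
    ...  # download only this file, then record it in the cache_meta_index
```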
-
Hi again,
your proposal is very reasonable. Its implementation, however, may happen orthogonally to the caching subsystem.
To be more specific, we are planning to add a data format rewriting subsystem, where the data would be stored in Parquet files and published as a Zarr archive. In this way, Wetterdienst itself might not even be required for consuming the data.
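Once such a mirror exists, the data could be consumed with nothing but polars; the file path and column names in this sketch are hypothetical:
```
import polars as pl
from datetime import datetime

# Hypothetical Parquet mirror of DWD 10-minute observations.
df = (
    pl.scan_parquet("dwd_obs_10min/*.parquet")   # lazy scan, no full load into memory
    .filter(pl.col("station_id") == "00044")
    .filter(pl.col("date").is_between(datetime(2017, 1, 1), datetime(2021, 12, 31, 23, 59, 59)))
    .collect()
)
```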
With kind regards,
Andreas.
On 14 June 2023 15:35:38 CEST, TheAnalystx ***@***.***> wrote:
Hmm, could be, maybe I misunderstood the caching process. However, if I open a cache file with 7-Zip, there are text files inside. This means there is overhead when parsing the data again each time, because when you parse text files into pandas, pandas has to read ALL values of the txt/csv file first to determine the dtype of each column and whether it allows NaN values. The time difference is significant when dealing with bigger data, especially when dealing with multiple files. So if the cache stored pickle files (or other optimized formats such as Parquet, Feather (Apache Arrow) or HDF5) within the zip, loading should be much faster. Other optimizations could be to store all data within one zip file or in batched files, or, even better, an SQLite database managing all the downloaded data and updating only when necessary.
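A small sketch of the difference; the file name, column names, dtypes, and missing-value marker are assumptions for illustration:
```
import pandas as pd

# Reading the raw DWD text file: with an explicit dtype mapping, pandas can skip
# scanning every value to infer each column's type (assumed columns and dtypes).
dtypes = {"STATIONS_ID": "int32", "TT_10": "float32", "RF_10": "float32"}
df = pd.read_csv("produkt_zehn_min_tu.txt", sep=";", dtype=dtypes, na_values="-999")

# Write once to a typed binary format; subsequent loads skip parsing and
# dtype inference entirely.
df.to_parquet("produkt_zehn_min_tu.parquet")
df = pd.read_parquet("produkt_zehn_min_tu.parquet")
```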
-
Hi once more,
"an SQLite database managing all the downloaded data"
This is an approach you can also leverage today, and we know users are actually doing that with Wetterdienst: just export the data of interest into an SQLite database (or any other) and run your analytic queries on that database.
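A minimal sketch of that workflow using plain pandas and sqlite3, independent of wetterdienst's own export helpers; the tiny synthetic frame stands in for whatever DataFrame your request returned:
```
import sqlite3
import pandas as pd

# Stand-in for the DataFrame obtained from a wetterdienst request.
values_df = pd.DataFrame({
    "station_id": ["00044", "00044"],
    "parameter": ["temperature_air_mean_200", "wind_speed"],
    "date": ["2017-01-01 00:00", "2017-01-01 00:00"],
    "value": [1.2, 3.4],
})

con = sqlite3.connect("dwd_observations.sqlite")
values_df.to_sql("observations", con, if_exists="append", index=False)

# Later analyses run against the local database instead of re-downloading anything.
result = pd.read_sql("SELECT parameter, AVG(value) FROM observations GROUP BY parameter", con)
con.close()
```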
With kind regards,
Andreas.
-
Hi. You can find the relevant spot by searching for "export" in the documentation. Let us know if that already helps. Cheers, Andreas.
On 14 June 2023 22:00:04 CEST, TheAnalystx ***@***.***> wrote:
Oh, also nice. Is there a guide for that? I couldn't find any GitHub discussion about it.
-
Hi, I did some testing; since I invested some time into it, I wanted to share the findings. Maybe it helps?
I've tested a download log that inventories downloaded files, complete with their metadata. The program I've written first verifies whether any updates are necessary before making a request for data. If updates are required, it selectively refreshes those files pertinent to the upcoming request.
In initial testing, I utilized SQLite for quick data retrieval. However, I discovered that the local database swelled considerably as it wasn't being compressed, leading to a size greater than 20 GB. As a subsequent approach, I downloaded the data files in zip format and stored them as pickle files within the zip archive. This method showed noticeable improvements in loading speed and, compared to the SQLite approach, also in storage size. Notably, the process was further accelerated when I preloaded all pickle files and subsequently merged them. For monitoring the progress, a progress bar or periodic print statements were highly useful, especially during the download phases.
However, I encountered some issues while attempting to load the data into memory (my system has 32 GB DDR2 RAM). Initially, I attempted to transform the data into a 'value' and 'variable' format, similar to the wetterdienst structure. This approach, unfortunately, significantly overloaded the memory. As an alternative, I reverted to the more compact data format provided by DWD, but even that proved to be a strain on the memory. To resolve this, I fine-tuned the dtype of each column to the smallest feasible option and eliminated all meta-info columns. This adjustment not only fit the data into memory but also left enough room for loading additional data and performing merging, data processing, and other operations.
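A sketch of the dtype fine-tuning described above; the column names and values are illustrative:
```
import pandas as pd

# Illustrative downcasting of a DWD-style observation frame so it fits into memory.
df = pd.DataFrame({
    "STATIONS_ID": [44, 44],
    "QN": [3, 3],
    "TT_10": [1.2, 1.4],
    "RF_10": [81.0, 82.5],
})

# Shrink numeric columns to the smallest dtype that still holds the values.
for col in ("STATIONS_ID", "QN"):
    df[col] = pd.to_numeric(df[col], downcast="unsigned")
for col in ("TT_10", "RF_10"):
    df[col] = df[col].astype("float32")

print(df.dtypes)
print(f"{df.memory_usage(deep=True).sum()} bytes")
```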
-
Hi,
I want to do some research which requires data in 10-minute intervals for all stations.
It takes nearly a day to download the cache, and after some time it seems to reload the files. Since I already know the period I want is in the cache, I don't need any cache updates. Is there an option for that? I noticed "cache_disable" but couldn't find any documentation on it.