Dataset options for CRAWLEE_STORAGE_DIR and DATASET_ID besides env vars #1520
Replies: 3 comments 1 reply
-
Datasets are not used internally, so this is already possible in your own code: just use multiple dataset instances explicitly. You can even write a custom function that handles this dynamically, the way you want it. Could you share some code showing what you are doing now? Sounds like you are using the static
-
Hi, yeah, I am only using Dataset.pushData(). I kind of assumed I could do something like you mentioned, but I am not sure how to go about it. Do you have any examples of something like that? When you say to use multiple instances, I assume that means you can set the options for each one so it outputs to a different name/path, etc.?
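For reference, here is a minimal sketch of what "multiple dataset instances" plus "a custom function that handles this dynamically" could look like. The helper names `datasetNameForUrl` and `createPerSiteWriter` are hypothetical, and the naming policy (hostname with non-alphanumeric runs turned into hyphens) is only an assumption. The sketch is written against a small `DatasetLike` interface so it does not depend on Crawlee directly; an opened Crawlee dataset exposes `pushData()`, so it should fit that shape.

```typescript
// Minimal shape this sketch needs from a dataset (an assumption made for
// illustration; an opened Crawlee Dataset exposes pushData()).
interface DatasetLike {
    pushData(item: Record<string, unknown>): Promise<void>;
}

// Opens (or creates) a dataset by name, e.g. (name) => Dataset.open(name).
type OpenDataset = (name: string) => Promise<DatasetLike>;

// Hypothetical helper: derive a dataset name from the URL's hostname.
// Runs of characters other than a-z and 0-9 become a single hyphen.
function datasetNameForUrl(url: string): string {
    const host = new URL(url).hostname.toLowerCase();
    return host.replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, '');
}

// Hypothetical helper: returns a function that pushes each item into the
// dataset for its site. Each dataset is opened once and then cached.
function createPerSiteWriter(open: OpenDataset) {
    const cache = new Map<string, Promise<DatasetLike>>();
    return async (url: string, item: Record<string, unknown>): Promise<string> => {
        const name = datasetNameForUrl(url);
        let pending = cache.get(name);
        if (!pending) {
            pending = open(name);
            cache.set(name, pending);
        }
        const dataset = await pending;
        await dataset.pushData(item);
        return name; // name of the dataset the item was written to
    };
}
```

With Crawlee, this would presumably be wired up as `const pushForSite = createPerSiteWriter((name) => Dataset.open(name));` and then called from the request handler as `await pushForSite(request.url, { ... })` instead of the static `Dataset.pushData()`. Each site then ends up in its own named dataset under the storage directory, without touching the env vars at runtime.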
-
Ok, thanks for the info.
-
It is nice to be able to set env variables for the path and ID of a crawl, but that does not help when crawling multiple sites from one crawler.
The motivation is that I often run broad crawls across many hundreds of sites, gathering data for later NLP processing. It would be nice to be able to set CRAWLEE_STORAGE_DIR and DATASET_ID dynamically while the crawler is running, so that data can be dumped to different locations for different sites.