Running data generation in a python function #232
-
First of all, a big thanks for creating this library. As part of making the generated data "realistic", I ended up having to define the stats for data generation at a daily level. So I tried to wrap the daily data generation code into a function and call it for every day (every date) for the last 2 years (a rough sketch is in the P.S. below). That failed with an error that seems to suggest this isn't supported.

At present, I have a notebook with the logic to generate daily data, and I call this notebook using Python multithreading. But now I am defeating the purpose of Spark and its ability to parallelize, and it looks like I am primarily using the driver.

I wanted to check if there are any design patterns/recommendations for my use case. I'd appreciate your feedback/suggestions.

Pankaj
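P.S. To make the setup concrete, the daily loop looks roughly like the sketch below. The dbldatagen column specs, row counts, and table name are illustrative placeholders, not my actual stats; it assumes a `spark` session is available, as in a notebook.

```python
import datetime
import dbldatagen as dg

def generate_for_date(run_date):
    """Build one day's worth of synthetic rows for `run_date`."""
    spec = (dg.DataGenerator(spark, name="daily_data", rows=100_000, partitions=8)
            .withColumn("event_date", "date", expr=f"to_date('{run_date}')")
            .withColumn("amount", "double", minValue=0.0, maxValue=500.0, random=True))
    return spec.build()

# Driver-side loop over ~2 years of dates: each iteration launches its own
# Spark job, so the dates end up being processed one after another.
start = datetime.date.today() - datetime.timedelta(days=730)
for offset in range(730):
    day = start + datetime.timedelta(days=offset)
    generate_for_date(day).write.mode("append").saveAsTable("synthetic_daily")
```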
-
Thanks for your question, Pankaj.
Does the data for each date depend on the previous dates' data? Rather than looping on the driver side, why not make the function a Spark UDF or pandas UDF (pandas is more efficient) that generates the data given some input?
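As an illustrative sketch (the column names, row counts, and generation logic are placeholders, not your actual stats), one way to realize this is to turn the date range itself into a DataFrame and generate each day's rows inside a grouped-map pandas function via `applyInPandas`:

```python
import pandas as pd
import numpy as np

# Output schema for the generated rows (names/types are assumptions).
out_schema = "event_date date, amount double"

def generate_day(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds the rows of a single date group; emit that day's data.
    day = pdf["event_date"].iloc[0]
    n = 100_000  # rows per day -- plug your per-date stats in here
    return pd.DataFrame({
        "event_date": [day] * n,
        "amount": np.random.uniform(0.0, 500.0, size=n),
    })

# One row per date for the last ~2 years, then generate per date in parallel.
dates_df = spark.sql(
    "SELECT explode(sequence(date_sub(current_date(), 730), current_date())) AS event_date"
)
daily_df = dates_df.groupBy("event_date").applyInPandas(generate_day, schema=out_schema)
```

Because each date is its own group, Spark distributes the per-day generation across the cluster instead of serializing it on the driver.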
UDFs and pandas UDFs can take multiple inputs. To generate multiple outputs, you could simply generate a single JSON-valued field and then extract the elements out into separate fields (see the sketch after the links below).
The following documentation describes generating pandas UDFs.
The following documentation describes extracting individual JSON fields into separate columns:
https://docs.databricks.com/en/optimizations/semi-structured.html
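For example (a minimal sketch; the payload fields and sizes are assumptions), a scalar pandas UDF can return one JSON string per input row, and `from_json` then splits it into typed columns:

```python
import json
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf

# Scalar pandas UDF returning one JSON string per input row
# (the payload fields here are illustrative).
@pandas_udf("string")
def gen_payload(seed: pd.Series) -> pd.Series:
    return seed.apply(lambda s: json.dumps({"amount": float(s) * 1.5, "status": "ok"}))

payload_schema = "amount double, status string"

df = (spark.range(1_000_000)
      .withColumn("payload", gen_payload("id"))
      .withColumn("parsed", F.from_json("payload", payload_schema))
      .select("id", "parsed.amount", "parsed.status"))
```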