Running data generation in a python function #232
-
First of all, a big thanks for creating this library. As part of making the generated data "realistic", I ended up having to define the stats for data generation at a daily level. So I tried to wrap the daily data generation code into a function and call it for every day (every date) for the last 2 years (a rough sketch is in the P.S. below). That failed with an error that seems to suggest this isn't supported.

At present, I have a notebook with the logic to generate daily data, and I call this notebook using Python multithreading. But now I am defeating the purpose of Spark and its ability to parallelize, and it looks like I am primarily using the driver.

I wanted to check if there are any design patterns/recommendations for my use case. I'd appreciate your feedback/suggestions.

Pankaj
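P.S. To make the setup concrete, the daily loop looks roughly like the sketch below. The dbldatagen column specs, row counts, and table name are illustrative placeholders, not my actual stats; it assumes a `spark` session is available, as in a notebook.

```python
import datetime
import dbldatagen as dg

def generate_for_date(run_date):
    """Build one day's worth of synthetic rows for `run_date`."""
    spec = (dg.DataGenerator(spark, name="daily_data", rows=100_000, partitions=8)
            .withColumn("event_date", "date", expr=f"to_date('{run_date}')")
            .withColumn("amount", "double", minValue=0.0, maxValue=500.0, random=True))
    return spec.build()

# Driver-side loop over ~2 years of dates: each iteration launches its own
# Spark job, so the dates end up being processed one after another.
start = datetime.date.today() - datetime.timedelta(days=730)
for offset in range(730):
    day = start + datetime.timedelta(days=offset)
    generate_for_date(day).write.mode("append").saveAsTable("synthetic_daily")
```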
-
Thanks for your question, Pankaj.
Does the data for each date depend on the previous dates' data? Rather than looping on the driver side, why not make the function a Spark UDF or pandas UDF (pandas is more efficient) that generates the data given some input?
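As an illustrative sketch (the column names, row counts, and generation logic are placeholders, not your actual stats), one way to realize this is to turn the date range itself into a DataFrame and generate each day's rows inside a grouped-map pandas function via `applyInPandas`:

```python
import pandas as pd
import numpy as np

# Output schema for the generated rows (names/types are assumptions).
out_schema = "event_date date, amount double"

def generate_day(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds the rows of a single date group; emit that day's data.
    day = pdf["event_date"].iloc[0]
    n = 100_000  # rows per day -- plug your per-date stats in here
    return pd.DataFrame({
        "event_date": [day] * n,
        "amount": np.random.uniform(0.0, 500.0, size=n),
    })

# One row per date for the last ~2 years, then generate per date in parallel.
dates_df = spark.sql(
    "SELECT explode(sequence(date_sub(current_date(), 730), current_date())) AS event_date"
)
daily_df = dates_df.groupBy("event_date").applyInPandas(generate_day, schema=out_schema)
```

Because each date is its own group, Spark distributes the per-day generation across the cluster instead of serializing it on the driver.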
UDFs and pandas UDFs can take multiple inputs. To generate multiple outputs, you could simply generate a single JSON-valued field and then extract the elements out into separate fields (see the sketch after the links below).
The following documentation describes generating pandas UDFs.
The following documentation describes extracting individual JSON fields into separate columns:
https://docs.databricks.com/en/optimizations/semi-structured.html
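For example (a minimal sketch; the payload fields and sizes are assumptions), a scalar pandas UDF can return one JSON string per input row, and `from_json` then splits it into typed columns:

```python
import json
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf

# Scalar pandas UDF returning one JSON string per input row
# (the payload fields here are illustrative).
@pandas_udf("string")
def gen_payload(seed: pd.Series) -> pd.Series:
    return seed.apply(lambda s: json.dumps({"amount": float(s) * 1.5, "status": "ok"}))

payload_schema = "amount double, status string"

df = (spark.range(1_000_000)
      .withColumn("payload", gen_payload("id"))
      .withColumn("parsed", F.from_json("payload", payload_schema))
      .select("id", "parsed.amount", "parsed.status"))
```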