The objective of this project is to read data from a CSV file, perform grouping and sampling operations based on specified attributes, and create a new DataFrame with the sampled data.
- Read data from `Data.csv` into DataFrame `df0`.
- Group `df0` by the attributes ["Sex", "Age_category", "Highest_education_level"] to obtain the frequency of each combination, storing the result in a dictionary `freq_of_all_combos`.
- Initialise an empty DataFrame `df` with 50,000 rows and columns ["Sex", "Age_category", "Highest_education_level"].
- For each unique combination `combo` in `freq_of_all_combos`:
  - Calculate the proportion of `combo` in the seed sample as `proportion = (frequency of the combination / total sample size)`.
  - Determine the number of agents `n` in `df` corresponding to `combo` using the formula `n = proportion * 50000`.
  - Identify the indices in `df` where the "Sex" column is NaN and store them in `sample_indices`.
  - If the number of `sample_indices` is greater than or equal to `n`:
    - Randomly select `n` indices from `sample_indices` without replacement.
    - Assign the values of `combo` to the selected indices in `df`.
- Return the `df` DataFrame.
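
The steps above could be sketched in pandas roughly as follows. This is a minimal sketch, not a definitive implementation: the function names `sample_population` and `load_and_sample`, the `n_agents` and `seed` parameters, and the `ATTRS` constant are hypothetical names introduced here. Two further assumptions are made: `n` is rounded to the nearest whole number (the description does not say how a fractional `n` is handled), and the total sample size is taken as the sum of the combination frequencies.

```python
import numpy as np
import pandas as pd

# Grouping attributes named in the task description.
ATTRS = ["Sex", "Age_category", "Highest_education_level"]


def sample_population(df0, n_agents=50_000, seed=None):
    """Fill an empty DataFrame so each combination keeps its seed-sample share."""
    rng = np.random.default_rng(seed)

    # Frequency of each combination, keyed by a (sex, age, education) tuple.
    freq_of_all_combos = df0.groupby(ATTRS).size().to_dict()
    # Assumption: total sample size = sum of the combination frequencies.
    total = sum(freq_of_all_combos.values())

    # "Empty" DataFrame: n_agents rows, every cell starts as NaN.
    df = pd.DataFrame(index=range(n_agents), columns=ATTRS, dtype=object)

    for combo, freq in freq_of_all_combos.items():
        proportion = freq / total
        # Assumption: round to the nearest whole number of agents.
        n = int(round(proportion * n_agents))

        # Rows not yet assigned to any combination.
        sample_indices = df.index[df["Sex"].isna()].to_numpy()

        if n > 0 and len(sample_indices) >= n:
            chosen = rng.choice(sample_indices, size=n, replace=False)
            # Assign one column at a time so each value keeps its own type.
            for col, value in zip(ATTRS, combo):
                df.loc[chosen, col] = value

    return df


def load_and_sample(path="Data.csv", n_agents=50_000, seed=None):
    """Read the seed sample from a CSV file and build the sampled DataFrame."""
    df0 = pd.read_csv(path)
    return sample_population(df0, n_agents=n_agents, seed=seed)
```

Calling `load_and_sample("Data.csv")` would then return `df`. Because of the rounding assumption, the per-combination counts may not sum to exactly 50,000, so a few rows can remain NaN, or a late combination can be skipped when fewer than `n` unassigned rows are left.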