This notebook contains the class cleaning, which provides several methods for cleaning categorical variables. It also includes scikit-learn implementations of the methods remove_nulls and group_categories, for use in a pipeline of transformations. The methods are the following:
- get_nulls(dataframe, columns): This method returns a dictionary with the percentage of nulls in each column of a dataframe.
  - Inputs:
    - dataframe: a pandas dataframe object.
    - columns: the columns of the dataframe to be included in the calculation. If this is not specified, all the columns will be taken into account.
- remove_nulls(dataframe, cut_off, columns): This method removes the columns of a dataframe whose percentage of nulls is higher than a given cut_off percentage.
  - Inputs:
    - dataframe: a pandas dataframe object.
    - cut_off: the maximum percentage of nulls a column may have and still be kept. If a column has a percentage of nulls higher than the cut_off percentage, it will be removed.
    - columns: the columns of the dataframe to be included in the operation. If this is not specified, all the columns will be taken into account.
- fill_nulls(dataframe, label, columns): This method fills the null values in the columns of a dataframe with a desired label.
  - Inputs:
    - dataframe: a pandas dataframe object.
    - label: the text that will be used to replace nulls.
    - columns: the columns of the dataframe to be included in the operation. If this is not specified, all the columns will be taken into account.
- group_categories(dataframe, cut_off, label, columns): This method changes the category of a categorical variable to a desired label if the percentage of occurrence of the category is less than a given cut_off percentage. This puts all low-frequency categories into a single category.
  - Inputs:
    - dataframe: a pandas dataframe object.
    - cut_off: categories with a percentage of occurrence less than the cut_off percentage will be relabeled.
    - label: the label for those categories that will be relabeled.
    - columns: the columns of the dataframe to be included in the operation. If this is not specified, all the columns will be taken into account.
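
As a rough illustration of the behaviour described above, the four methods could be sketched as standalone pandas functions along the following lines. This is not the notebook's actual class. The `columns=None` default, returning a new dataframe instead of modifying the input in place, and computing category frequencies over non-null values only are all assumptions made for the example.

```python
import pandas as pd


def _select(dataframe, columns):
    # Assumption: columns=None means "use every column of the dataframe".
    return list(dataframe.columns) if columns is None else list(columns)


def get_nulls(dataframe, columns=None):
    """Return a dict mapping each selected column to its percentage of nulls."""
    cols = _select(dataframe, columns)
    return (dataframe[cols].isna().mean() * 100).to_dict()


def remove_nulls(dataframe, cut_off, columns=None):
    """Drop the selected columns whose percentage of nulls is above cut_off."""
    nulls = get_nulls(dataframe, columns)
    to_drop = [col for col, pct in nulls.items() if pct > cut_off]
    return dataframe.drop(columns=to_drop)


def fill_nulls(dataframe, label, columns=None):
    """Replace the nulls of the selected columns with label."""
    cols = _select(dataframe, columns)
    result = dataframe.copy()  # assumption: leave the input untouched
    result[cols] = result[cols].fillna(label)
    return result


def group_categories(dataframe, cut_off, label, columns=None):
    """Relabel categories that occur in less than cut_off percent of rows."""
    cols = _select(dataframe, columns)
    result = dataframe.copy()
    for col in cols:
        # value_counts ignores nulls by default, so the percentages are
        # relative to the non-null values (an assumption of this sketch).
        freq = result[col].value_counts(normalize=True) * 100
        rare = freq[freq < cut_off].index
        # Keep values that are not rare; replace the rare ones with label.
        result[col] = result[col].where(~result[col].isin(rare), label)
    return result


# Small made-up dataframe, used only to exercise the functions.
df = pd.DataFrame({
    "color": ["red", "red", "red", "blue", None],
    "size": [None, None, None, None, "L"],
})

null_pct = get_nulls(df)                      # color: 20% nulls, size: 80% nulls
kept = remove_nulls(df, cut_off=50)           # "size" exceeds 50% and is dropped
filled = fill_nulls(df, "missing", ["color"])  # only "color" is filled
grouped = group_categories(df, cut_off=30, label="other", columns=["color"])
```

In this example "blue" makes up 25% of the non-null values of "color", so with a cut_off of 30 it is relabeled as "other", while "red" (75%) is kept. Null values are left as they are by group_categories; in this sketch they would need to be handled with fill_nulls first if they should count as a category.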