Cleaning-Functions

This notebook contains the class cleaning, which provides several methods for cleaning categorical variables. It also includes a scikit-learn implementation of the methods remove_nulls and group_categories, for use in a pipeline of transformations. The methods are the following:

get_nulls(dataframe, columns): This method returns a dictionary with the percentage of nulls in each column of a dataframe.

-Inputs:

  - dataframe: a pandas dataframe object.
  - columns: the columns of the dataframe to include in the calculation. If not specified, all
    columns are used.
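
As an illustration of the behavior described above, here is a minimal sketch written as a standalone function rather than a class method. It is not the notebook's code: the default `columns=None` and the 0 to 100 percentage scale are assumptions.

```python
import pandas as pd


def get_nulls(dataframe, columns=None):
    """Return {column name: percentage of null values} for the selected columns."""
    # Assumption: omitting `columns` means "use every column".
    if columns is None:
        columns = dataframe.columns
    # isnull().mean() is the fraction of nulls per column; scale it to a percentage.
    return (dataframe[list(columns)].isnull().mean() * 100).to_dict()


# Small made-up dataframe, for illustration only.
df = pd.DataFrame({
    "color": ["red", None, "blue", None],
    "size": ["S", "M", "L", None],
})
null_pct = get_nulls(df)
print(null_pct)
```

Here `color` has two nulls out of four rows and `size` has one, so the sketch reports 50 and 25 percent respectively.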

remove_nulls(dataframe, cut_off, columns): This method removes the columns of a dataframe whose percentage of nulls is higher than a given cut_off percentage.

-Inputs:

  - dataframe: a pandas dataframe object.
  - cut_off: The maximum percentage of nulls a column can have and still be kept. If a column has a percentage of
    nulls higher than the cut_off percentage, it will be removed.
  - columns: the columns of the dataframe to include in the operation. If not specified, all
    columns are used.
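
The rule above can be sketched as follows. This is a standalone function under stated assumptions, not the notebook's implementation: `cut_off` is taken to be on a 0 to 100 scale, `columns=None` means all columns, and a new dataframe is returned instead of modifying the input.

```python
import pandas as pd


def remove_nulls(dataframe, cut_off, columns=None):
    """Drop the selected columns whose percentage of nulls exceeds cut_off."""
    # Assumption: omitting `columns` means "check every column".
    if columns is None:
        columns = dataframe.columns
    null_pct = dataframe[list(columns)].isnull().mean() * 100
    # Only columns strictly above the cut-off are removed.
    to_drop = null_pct[null_pct > cut_off].index
    return dataframe.drop(columns=to_drop)


# Made-up data: "mostly_null" is 75% null, "complete" has no nulls.
df = pd.DataFrame({
    "mostly_null": [None, None, None, "a"],
    "complete": ["x", "y", "z", "w"],
})
cleaned = remove_nulls(df, cut_off=50)
print(list(cleaned.columns))
```

With a cut-off of 50, only `mostly_null` exceeds the threshold, so `cleaned` keeps the `complete` column alone.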

fill_nulls(dataframe, label, columns): This method fills the null values in the columns of a dataframe with a desired label.

-Inputs:

  - dataframe: a pandas dataframe object.
  - label: The text that will be used to replace nulls.
  - columns: the columns of the dataframe to include in the operation. If not specified, all
    columns are used.
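
A minimal sketch of this behavior, again as a standalone function and not the notebook's code. Returning a copy rather than filling in place, and the `columns=None` default, are assumptions.

```python
import pandas as pd


def fill_nulls(dataframe, label, columns=None):
    """Replace null values in the selected columns with `label`."""
    # Assumption: omitting `columns` means "fill every column".
    if columns is None:
        columns = dataframe.columns
    cols = list(columns)
    result = dataframe.copy()  # leave the caller's dataframe untouched
    result[cols] = result[cols].fillna(label)
    return result


# Made-up data, for illustration only.
df = pd.DataFrame({"color": ["red", None], "size": [None, "M"]})
filled = fill_nulls(df, "missing", columns=["color"])
print(filled)
```

Because only `color` is passed in `columns`, the null in `size` is left as it is.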

group_categories(dataframe, cut_off, label, columns): This method changes a category of a categorical variable to a desired label if the category's percentage of occurrence is less than a given cut_off percentage. This groups low-frequency categories into a single category.

-Inputs:

  - dataframe: a pandas dataframe object.
  - cut_off: Categories with a percentage of occurrence less than the cut_off percentage will be relabeled.
  - label: The label for those categories that will be relabeled.
  - columns: the columns of the dataframe to include in the operation. If not specified, all
    columns are used.
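
The grouping rule can be sketched like this. It is a standalone illustration, not the notebook's implementation, and it rests on a few assumptions: `cut_off` is on a 0 to 100 scale, percentages are computed over non-null values only, and a copy of the dataframe is returned.

```python
import pandas as pd


def group_categories(dataframe, cut_off, label, columns=None):
    """Relabel categories whose percentage of occurrence is below cut_off."""
    # Assumption: omitting `columns` means "process every column".
    if columns is None:
        columns = dataframe.columns
    result = dataframe.copy()
    for col in columns:
        # Share of each category among non-null values, as a percentage.
        pct = result[col].value_counts(normalize=True) * 100
        rare = pct[pct < cut_off].index
        # Keep values that are not rare; replace the rare ones with `label`.
        result[col] = result[col].where(~result[col].isin(rare), label)
    return result


# Made-up data: A appears in 60% of rows, B in 30%, C in 10%.
df = pd.DataFrame({"city": ["A"] * 6 + ["B"] * 3 + ["C"]})
grouped = group_categories(df, cut_off=20, label="other")
print(grouped["city"].value_counts())
```

With a cut-off of 20, only `C` falls below the threshold and is relabeled as `other`; `A` and `B` are unchanged.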
