Import datasets, perform exploratory data analysis, scale features, and run models such as linear or logistic regression, decision trees, random forests, k-means, and support vector machines.
Import Modules
Install each required module on your system :
"pip3 install module-name"
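Before running the scripts, it can help to check which modules are still missing. The following is a minimal sketch using only the standard library; the helper name missing_modules and the module list are assumptions for illustration, not the project's actual requirements.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found on this system."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Assumed list of import names; replace it with the modules the scripts import.
required = ["pandas", "sklearn", "seaborn"]

for name in missing_modules(required):
    print(f"Missing module: {name} (install it with pip3)")
```

Note that an import name can differ from the name used with pip3; for example, the sklearn import is installed as scikit-learn.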
Process Data
process_data.py contains the following functions :
get_file_names_in_dir(dir_name) : print the names of the files to process in a directory
dataset_import(file_name, dataset_type) : import a dataset & print a description such as data size, rows, columns, unique and null values
dataset_EDA(data, pairplot_columns) : plot a pairplot and a heatmap
dataset_scrubbing(data, scrub_type, data_columns, fill_operation) : clean data by removing or filling missing values, encode categorical variables using one-hot encoding, or remove entire columns
pre_model_algorithm(df, algorithm, target_column) : transform data before modeling using principal component analysis or k-means clustering
split_validation(dataset, features, target_column, test_split) : split the data into train & test sets, including the target column, with the desired split ratio
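As an illustration of the last step, here is a minimal sketch of what split_validation might look like. It assumes pandas and scikit-learn and uses made-up data; the actual implementation in process_data.py may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_validation(dataset, features, target_column, test_split):
    """Hypothetical sketch: split a DataFrame into train and test sets."""
    X = dataset[features]       # feature columns
    y = dataset[target_column]  # target column
    # random_state is fixed here only to make the example reproducible.
    return train_test_split(X, y, test_size=test_split, random_state=42)


# Made-up example data, for illustration only.
df = pd.DataFrame({
    "rooms": [2, 3, 4, 3, 5, 2, 4, 6, 3, 5],
    "area": [50, 70, 90, 65, 120, 45, 95, 150, 72, 110],
    "price": [100, 150, 200, 140, 260, 95, 210, 320, 155, 250],
})

X_train, X_test, y_train, y_test = split_validation(
    df, ["rooms", "area"], "price", 0.2
)
print(X_train.shape, X_test.shape)
```

With 10 rows and test_split set to 0.2, 8 rows go to the train set and 2 to the test set.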
Run Model
run_model.py contains the following models :
linear_regression(X_train, X_test, y_train, y_test, show_columns, target_column) : continuous predictions
logistic_regression(X_train, X_test, y_train, y_test, show_columns, target_column) : discrete predictions
decision_tree_classifier(X_train, X_test, y_train, y_test, show_columns, target_column) : class predictions from both continuous & discrete data
random_forest_classifier(X_train, X_test, y_train, y_test, show_columns, target_column, num_estimators) : class predictions from both continuous & discrete data
gradient_boosting(X_train, X_test, y_train, y_test, show_columns, target_column, gb_type) : regressor for continuous & classifier for discrete
k_neighbors_classifier(X_train, X_test, y_train, y_test, show_columns, target_column, k, scaled_features) : predictions on continuous, discrete, ordinal, and categorical data
support_vector_classifier(X_train, X_test, y_train, y_test, show_columns, target_column) : class predictions on continuous data
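The models above share a common signature. As a rough sketch of how one of them might be written, the following assumes scikit-learn and pandas. How show_columns and target_column are used here (choosing which columns appear in a results table) is an assumption; run_model.py may handle them differently.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score


def linear_regression(X_train, X_test, y_train, y_test, show_columns, target_column):
    """Hypothetical sketch; the real run_model.py may behave differently."""
    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    # Assumption: show_columns selects the feature columns to display next
    # to the actual and predicted values of the target column.
    results = X_test[show_columns].copy()
    results[target_column] = y_test
    results["predicted_" + target_column] = predictions
    return results, r2_score(y_test, predictions)


# Made-up data following y = 2*x1 + 3*x2 + 1, for illustration only.
train = pd.DataFrame({"x1": [1, 2, 3, 4, 5, 6], "x2": [2, 1, 4, 3, 6, 5]})
train["y"] = 2 * train["x1"] + 3 * train["x2"] + 1
test = pd.DataFrame({"x1": [7, 8], "x2": [1, 2]})
test["y"] = 2 * test["x1"] + 3 * test["x2"] + 1

results, score = linear_regression(
    train[["x1", "x2"]], test[["x1", "x2"]], train["y"], test["y"], ["x1"], "y"
)
print(results)
print(score)
```

Returning the results table together with the score keeps the sketch easy to inspect; the other models could follow the same pattern with a different estimator.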
References