These are my notes from working through the book Learning Predictive Analytics with Python by Ashish Kumar, published in February 2016.
###Chapter 1: Getting Started with Predictive Modelling
- Installed Anaconda Package.
- Python 3.5 is installed.
- The book follows Python 2, so some code is modified along the way for Python 3.
###Chapter 2: Data Cleaning
- Reading the data: variations and examples
- Data frames and delimiters.
####Case 1: Reading a dataset using the read_csv method
- File: titanicReadCSV.py
- File: titanicReadCSV1.py
- File: readCustomerChurn.py
- File: readCustomerChurn2.py
- File: changeDelimiter.py
####Case 2: Reading a dataset using the open method of Python
- File: readDatasetByOpenMethod.py
####Case 3: Reading data from a URL
- Modified the code so that it works and prints the dataset line by line, each line as a dictionary.
- File: readURLLib2Iris.py
- File: readURLMedals.py
####Case 4: Miscellaneous cases
- File: readXLS.py
- Created the file above to read from both .xls and .xlsx
####Basics: Summary, dimensions, and structure
- File: basicDataCheck.py
- Created the file above to read from both .xls and .xlsx
####Handling missing values
- File: basicDataCheck.py
- RE: Treating missing data like NaN or None
- Deletion or imputation
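
The two treatments noted above, deletion and imputation, can be sketched with pandas roughly as follows. This is a minimal illustration, not the contents of basicDataCheck.py; the toy data frame and its column names are made up, and mean imputation is only one of several possible choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "fare": [7.25, 71.28, np.nan, 26.55],
})

# Count the missing values (NaN/None) in each column.
print(df.isnull().sum())

# Deletion: drop every row that has at least one missing value.
deleted = df.dropna()

# Imputation: replace each missing value with the mean of its column.
imputed = df.fillna(df.mean())

print(deleted)
print(imputed)
```

Deletion is simple but throws away whole rows; imputation keeps the rows at the cost of inventing plausible values.
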
####Creating dummy variables
- File: basicDataCheck.py
- Split 'sex' into two new dummy variables, 'sex_female' and 'sex_male'
- Remove the original 'sex' column.
- Add both dummy columns created above.
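
The three steps above could look something like this with `pandas.get_dummies`. It is a sketch on a made-up data frame (the `name` column and the values are assumptions), not the code from basicDataCheck.py.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "sex": ["female", "male", "female"],
})

# Step 1: split 'sex' into the dummy columns sex_female and sex_male.
dummies = pd.get_dummies(df["sex"], prefix="sex")

# Step 2: remove the original 'sex' column.
df = df.drop(columns="sex")

# Step 3: add both dummy columns to the data frame.
df = df.join(dummies)

print(df)
```
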
####Visualizing a dataset by basic plotting
- File: plotData.py
- Figure file: ScatterPlots.jpeg
- Plot types: scatter plots, histograms, and box plots
###Chapter 3: Data Wrangling
####Subsetting a dataset
- Selecting Columns
- File: subsetDataset.py
- Selecting Rows
- File: subsetDatasetRows.py
- Selecting a combination of rows and columns
- File: subsetColRows.py
- Creating new columns
- File: subsetNewCol.py
####Generating random numbers and their usage
- Various methods for generating random numbers
- File: generateRandomNumbers.py
- Seeding a random number
- File: generateRandomNumbers.py
- Generating random numbers following probability distributions
- File: generateRandomProbDistr.py
- Probability density function (PDF): the relative likelihood of X taking a value near x; for a discrete variable this reduces to Prob(X = x)
- Cumulative distribution function: CDF(x) = Prob(X <= x)
- Uniform distribution: every value in the range occurs with the same (uniform) probability
- Normal distribution: the bell curve; the most ubiquitous and versatile probability distribution
- Using the Monte-Carlo simulation to find the value of pi
- File: calcPi.py
- Geometry and mathematics behind the calculation of pi
- Generating a dummy data frame
- File: generateDummyDataFrame.py
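
The Monte-Carlo estimate of pi mentioned above rests on a simple piece of geometry: a quarter circle of radius 1 inside the unit square covers pi/4 of the square's area, so the fraction of uniformly random points that land inside it approaches pi/4. A minimal sketch using only the standard library (the function name and the seed are assumptions, not taken from calcPi.py):

```python
import random


def estimate_pi(n_points, seed=0):
    """Estimate pi from the share of random points inside the quarter circle."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    inside = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()  # uniform point in the unit square
        if x * x + y * y <= 1.0:           # point falls inside the quarter circle
            inside += 1
    return 4.0 * inside / n_points


print(estimate_pi(100000))
```

The estimate gets closer to pi as `n_points` grows, but only slowly: the error shrinks roughly with the square root of the number of points.
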
####Grouping the data – aggregation, filtering, and transformation
- File: groupData.py
- Grouping
- Aggregation
- Filtering
- Transformation
- Miscellaneous operations
####Random sampling – splitting a dataset in training and testing datasets
- File: splitDataTrainTest.py
- Method 1: using the Customer Churn Model
- Method 2: using sklearn
- Method 3: using the shuffle function
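
As a rough illustration of the shuffle-based approach (Method 3), the idea is to shuffle the rows and cut the shuffled list at the desired fraction. The sketch below uses plain Python lists rather than a data frame; `shuffle_split`, the 80/20 ratio, and the seed are assumptions for illustration and are not taken from splitDataTrainTest.py.

```python
import random


def shuffle_split(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows, then cut them into a training and a testing part."""
    shuffled = list(rows)                  # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = shuffle_split(range(10))
print(len(train), len(test))
```
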
####Concatenating and appending data
- File: concatenateAndAppend.py
- File: appendManyFiles.py
####Merging/joining datasets
- File: mergeJoin.py
- Inner Join
- Left Join
- Right Join
- An example of the Inner Join
- An example of the Left Join
- An example of the Right Join
- Summary of Joins in terms of their length
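
The difference between the three joins, and how the length of the result depends on the join type, can be seen on two tiny made-up tables. This is a sketch with `pandas.merge`; the tables and the `key` column are hypothetical and unrelated to the data used in mergeJoin.py.

```python
import pandas as pd

# Two hypothetical tables that share the keys 2 and 3.
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

# Inner join: only keys present in both tables.
inner = pd.merge(left, right, on="key", how="inner")

# Left join: every key of the left table; unmatched right values become NaN.
left_join = pd.merge(left, right, on="key", how="left")

# Right join: every key of the right table; unmatched left values become NaN.
right_join = pd.merge(left, right, on="key", how="right")

for name, frame in [("inner", inner), ("left", left_join), ("right", right_join)]:
    print(name, len(frame))
```

With unique keys, the inner join can never be longer than either of the other two.
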
###Chapter 4: Statistical Concepts for Predictive Modelling
####Random sampling and central limit theorem
####Hypothesis testing
- Null versus alternate hypothesis
- Z-statistic and t-statistic
- Confidence intervals, significance levels, and p-values
- Different kinds of hypothesis test
- A step-by-step guide to do a hypothesis test
- An example of a hypothesis test
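
To make the steps of a hypothesis test concrete, here is a small sketch of a two-sided, one-sample z-test using only the standard library. The numbers are invented for illustration and are not the book's example; the function name `z_test` is likewise an assumption.

```python
import math


def z_test(sample_mean, pop_mean, pop_std, n):
    """Return the z-statistic and two-sided p-value for a one-sample z-test."""
    standard_error = pop_std / math.sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    # Two-sided p-value: 2 * (1 - Phi(|z|)), written with the
    # complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: null hypothesis mean 50, known standard deviation 10,
# and a sample of 100 observations with mean 52.
z, p = z_test(sample_mean=52, pop_mean=50, pop_std=10, n=100)
print(z, p)

# Reject the null hypothesis at the 5% significance level if p < 0.05.
print("reject" if p < 0.05 else "fail to reject")
```
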
####Chi-square testing
####Correlation
- File: linearRegression.py
- File: linearRegressionFunction.py
- Picture: TVSalesCorrelationPlot.png
- Picture: RadioSalesCorrelationPlot.png
- Picture: NewspaperSalesCorrelationPlot.png
###Chapter 5: Linear Regression with Python
####Understanding the maths behind linear regression
- Linear regression using simulated data
- File: linearRegression.py
- Picture: CurrentVsPredicted1.png
- Picture: CurrentVsPredictedVsMean1.png
- Picture: CurrentVsPredictedVsModel1.png
####Making sense of result parameters
- File: linearRegression.py
- p-values
- F-statistic
- Residual Standard Error (RSE)
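
As a reminder of where some of these numbers come from, the sketch below fits a one-variable least-squares line by hand and computes the Residual Standard Error as sqrt(RSS / (n - 2)), together with R-squared. It is a standard-library illustration on made-up data, not the code in linearRegression.py, and the function name is an assumption.

```python
import math


def fit_simple_linear(x, y):
    """Least-squares fit of y = intercept + slope * x, plus RSE and R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares.
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    rse = math.sqrt(rss / (n - 2))  # n - 2: two parameters were estimated
    r_squared = 1.0 - rss / tss
    return slope, intercept, rse, r_squared


# Hypothetical data lying close to the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(fit_simple_linear(xs, ys))
```
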
####Implementing linear regression with Python
- File: linearRegressionSMF.py
- Linear regression using the statsmodels library
- Multiple linear regression
- Multi-collinearity: high correlation between predictor variables, which leads to sub-optimal performance of the model
- Variance Inflation Factor
- It quantifies how much the variance of a variable's coefficient estimate is inflated by high correlation between two or more predictor variables.
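
One common way to compute the Variance Inflation Factor for a predictor is to regress it on all the other predictors and take 1 / (1 - R^2) of that regression. The sketch below does this with NumPy on simulated data; the helper `vif` and the simulated columns are assumptions for illustration and are not the code in linearRegressionSMF.py.

```python
import numpy as np


def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the others."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(target)), others])  # add an intercept
    coef, _, _, _ = np.linalg.lstsq(design, target, rcond=None)
    rss = np.sum((target - design.dot(coef)) ** 2)
    tss = np.sum((target - target.mean()) ** 2)
    r_squared = 1.0 - rss / tss
    return 1.0 / (1.0 - r_squared)


# Simulated predictors: x2 is nearly a copy of x1, x3 is unrelated to both.
rng = np.random.RandomState(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

for j in range(X.shape[1]):
    print(j, round(vif(X, j), 2))
```

A VIF near 1 means the predictor is almost uncorrelated with the rest; large values flag predictors that are close to redundant.
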
####Model validation
- Training and testing data split
- File: linearRegressionSMF.py
- Linear regression with scikit-learn
- File: linearRegressionSKL.py
- Feature selection with scikit-learn
- Recursive Feature Elimination (RFE)
- File: linearRegressionRFE.py
####Handling other issues in linear regression
- Handling categorical variables
- File: linearRegressionECom.py
- Transforming a variable to fit non-linear relations
- File: nonlinearRegression.py
- Picture: MPGVSHorsepower.png
- Picture: MPGVSHorsepowerVsLine.png
- Picture: MPGVSHorsepowerModels.png
- Handling outliers
- Other considerations and assumptions for linear regression
###Chapter 6: Logistic Regression with Python
####Linear regression versus logistic regression
####Understanding the math behind logistic regression
- File: logisticRegression.py
- Contingency tables
- Conditional probability
- Odds ratio
- Moving on to logistic regression from linear regression
- Estimation using the Maximum Likelihood Method
- Building the logistic regression model from scratch
- File: logisticRegressionScratch.py
- Read above again.
- Making sense of logistic regression parameters
- Wald test
- Likelihood Ratio Test statistic
- Chi-square test
- [x]
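
The chain from contingency table to conditional probability to odds ratio, and on to the log-odds that logistic regression models, can be traced on a small made-up table. The counts and group names below are purely hypothetical and are not from the book's dataset.

```python
import math

# Hypothetical 2x2 contingency table: outcome counts per group.
table = {
    "group_a": {"yes": 40, "no": 60},
    "group_b": {"yes": 20, "no": 80},
}


def conditional_probability(counts):
    """P(outcome = yes | group), read off one row of the table."""
    return counts["yes"] / (counts["yes"] + counts["no"])


def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)


def sigmoid(log_odds):
    """Logistic function: maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


p_a = conditional_probability(table["group_a"])
p_b = conditional_probability(table["group_b"])
odds_ratio = odds(p_a) / odds(p_b)

print(p_a, p_b, odds_ratio)
print(sigmoid(math.log(odds(p_a))))  # recovers p_a from its log-odds
```
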
####Implementing logistic regression with Python
- File: logisticRegressionImplementation.py
- Processing the data
- Data exploration
- Data visualization
- Creating dummy variables for categorical variables
- Feature selection
- Implementing the model
####Model validation and evaluation
- File: logisticRegressionImplementation.py
- Cross validation
####Model validation
- File: logisticRegressionImplementation.py
- The ROC curve {see terms}
###Chapter 7: Clustering with Python
####Introduction to clustering – what, why, and how?
- What is clustering?
- How is clustering used?
- Why do we do clustering?
####Mathematics behind clustering
- Distances between two observations
- Euclidean distance
- Manhattan distance
- Minkowski distance
- The distance matrix
- Normalizing the distances
- Linkage methods
- Single linkage
- Complete linkage
- Average linkage
- Centroid linkage
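- Ward's method: uses an ANOVA-style criterion, merging the pair of clusters that gives the smallest increase in within-cluster variance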
- Hierarchical clustering
- K-means clustering
- File: kMeanClustering.py
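
The three distances listed above are closely related: the Minkowski distance with p = 1 is the Manhattan distance and with p = 2 is the Euclidean distance. A minimal standard-library sketch (the function name and sample points are assumptions, not from kMeanClustering.py):

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


point_a = (0.0, 0.0)
point_b = (3.0, 4.0)

manhattan = minkowski(point_a, point_b, 1)  # |3| + |4|
euclidean = minkowski(point_a, point_b, 2)  # sqrt(3^2 + 4^2)

print(manhattan, euclidean)
```

Because these distances depend on the scale of each variable, the values are usually normalized before the distance matrix is built.
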
####Implementing clustering using Python
- File: clusterWine.py
- Importing and exploring the dataset
- Normalizing the values in the dataset
- Hierarchical clustering using scikit-learn
- K-Means clustering using scikit-learn
- Interpreting the cluster
####Fine-tuning the clustering
- The elbow method
- Silhouette Coefficient
###Chapter 8: Trees and Random Forests with Python
####Introducing decision trees
- A decision tree
####Understanding the mathematics behind decision trees
- Homogeneity
- Entropy
- Information gain
- ID3 algorithm to create a decision tree
- Gini index
- Reduction in Variance
- Pruning a tree
- Handling a continuous numerical variable
- Handling a missing value of an attribute
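
Entropy, the Gini index, and information gain from the list above can each be written in a few lines. The sketch below uses only the standard library and made-up labels; the function names are assumptions, and it shows the impurity measures only, not the ID3 algorithm itself.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node with two classes, split perfectly into two pure children.
parent = ["a", "a", "b", "b"]
children = [["a", "a"], ["b", "b"]]

print(entropy(parent), gini(parent), information_gain(parent, children))
```

A split that produces more homogeneous children lowers the weighted entropy and therefore has a higher information gain.
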
####Implementing a decision tree with scikit-learn
- File: decisionTreeIris.py
- Visualizing the tree
- Picture: dtree2.png
- File: dtree2.dot
- Cross-validating and pruning the decision tree
####Understanding and implementing regression trees
- File: regressionTree.py
- Regression tree algorithm
- Implementing a regression tree using Python
####Understanding and implementing random forests
- File: randomForest.py
- The random forest algorithm
- Implementing a random forest using Python
- Why do random forests work?
- Important parameters for random forests
###Chapter 9: Best Practices for Predictive Modelling
####Best practices for coding
- Commenting the code
- Defining functions for substantial individual tasks
- Example 1
- Example 2
- Example 3
- Avoid hard-coding of variables as much as possible
- Version control
- Using standard libraries, methods, and formulas
####Best practices for data handling
####Best practices for algorithms
####Best practices for statistics
####Best practices for business contexts