
# Python cheatsheet

## General useful functions

- Get only categorical or numerical columns (note: `np.object` was removed in recent NumPy; use the built-in `object` instead):

      df.describe(include=[object])    # string/categorical columns
      df.describe(include=[np.number]) # numeric columns
- Logical (boolean) indexing: keep only rows matching a condition. Note this reassigns a filtered copy rather than modifying in place: `credit = credit[credit['Age'] > 0]`

- Enumerate:

      animals = ["cat", "bird", "dog"]

      # Use enumerate to get indexes and elements from an iterable.
      # This unpacks each (index, element) tuple:
      for i, element in enumerate(animals):
          print(i, element)

      # This does not unpack the tuple:
      for x in enumerate(animals):
          print(x, "UNPACKED =", x[0], x[1])

  Output:

      0 cat
      1 bird
      2 dog
      (0, 'cat') UNPACKED = 0 cat
      (1, 'bird') UNPACKED = 1 bird
      (2, 'dog') UNPACKED = 2 dog
- `loc`: label-based selection (example adapted from the pandas docs; the `df` setup line is included so the snippet is self-contained):

      >>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
      ...                   index=['cobra', 'viper', 'sidewinder'],
      ...                   columns=['max_speed', 'shield'])
      >>> df.loc[['viper', 'sidewinder']]
                  max_speed  shield
      viper               4       5
      sidewinder          7       8

  `Index.get_loc` returns the integer position of a label:

      >>> unique_index = pd.Index(list('abc'))
      >>> unique_index.get_loc('b')
      1

## Pandas

### read

    spambase = pd.read_csv(data_path, delimiter=',')

- Extract the input features: `x = spambase_binary`

- Extract the test outcome: `y = pd.Series(y_test["is_spam"])`

- Feed the Multinomial NB classifier (a fuller sketch follows below):

      mnb.fit(x, y)
      mnb.predict(X_test)
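A minimal end-to-end sketch of the same workflow, assuming a hypothetical `spambase.csv` with non-negative count features and an `is_spam` label column (the file name and column name are illustrative, not from the original notes):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical dataset: count features plus an "is_spam" label column.
    spambase = pd.read_csv("spambase.csv", delimiter=',')
    x = spambase.drop(columns=["is_spam"])  # input features
    y = spambase["is_spam"]                 # target labels

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

    mnb = MultinomialNB()
    mnb.fit(x_train, y_train)          # estimates per-class feature distributions
    predictions = mnb.predict(x_test)  # most probable class per row
    print((predictions == y_test).mean())  # simple accuracy check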

### Others

#### Pandas DataFrame

    df = pd.DataFrame(data=spambase)

#### Pandas Series

- One-dimensional ndarray with axis labels (including time series): `s = pd.Series(data, index=index)` (see the sketch below)
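For instance, a quick sketch of building and indexing a labeled Series (values invented for illustration):

    import pandas as pd

    data = [10, 20, 30]
    index = ["a", "b", "c"]
    s = pd.Series(data, index=index)
    print(s["b"])  # label-based access -> 20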

### working with data

Convert a categorical variable into dummy/indicator variables with `pd.get_dummies(s)` (see the sketch below).
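A small sketch of what `get_dummies` produces (the categories are invented; note that recent pandas versions return boolean dummy columns, while older ones return 0/1):

    import pandas as pd

    s = pd.Series(["red", "green", "red"])
    print(pd.get_dummies(s))
    #    green    red
    # 0  False   True
    # 1   True  False
    # 2  False   True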

## ML stuff

https://www.youtube.com/user/mathematicalmonk

https://www.coursera.org/specializations/mathematics-machine-learning

### scikit

- `MultinomialNB`: naive Bayes algorithm for multinomially distributed data (e.g. word counts in text classification)

## Regression

https://support.minitab.com/en-us/minitab/19/help-and-how-to/statistics/basic-statistics/supporting-topics/basics/linear-nonlinear-and-monotonic-relationships/

Plot 1: Strong positive linear relationship

When both variables increase or decrease concurrently and at a constant rate, a positive linear relationship exists. The points in Plot 1 follow the line closely, suggesting that the relationship between the variables is strong.

Plot 2: Strong negative linear relationship

When one variable increases while the other variable decreases, a negative linear relationship exists. The points in Plot 2 follow the line closely, suggesting that the relationship between the variables is strong.

Plot 3: Weak linear relationship

The data points in Plot 3 appear to be randomly distributed. They do not fall close to the line, indicating a very weak relationship, if one exists at all.

Plot 4: Nonlinear relationship

If a relationship between two variables is not linear, the rate of increase or decrease can change as one variable changes, causing a "curved pattern" in the data. This curved trend might be better modeled by a nonlinear function, such as a quadratic or cubic function, or be transformed to make it linear. Plot 4 shows a strong relationship between two variables.

Plot 5: Monotonic relationship

In a monotonic relationship, the variables tend to move in the same relative direction, but not necessarily at a constant rate. In a linear relationship, the variables move in the same direction at a constant rate. Plot 5 shows both variables increasing concurrently, but not at the same rate. This relationship is monotonic, but not linear.
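One way to see the linear vs. monotonic distinction numerically: Pearson correlation measures linear association, while Spearman correlation measures monotonic association. A small sketch (the data are invented for illustration):

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    x = np.linspace(1, 10, 50)
    y = x ** 3  # monotonic but clearly nonlinear

    r, _ = pearsonr(x, y)     # linear association
    rho, _ = spearmanr(x, y)  # monotonic association

    print(r)    # noticeably below 1 -> not a perfectly linear relationship
    print(rho)  # exactly 1.0 -> a perfectly monotonic relationship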

## Logistic regression

Coefficient attribute: `coef_`

The weight/coefficient matrix of a generalised linear model predictor, of shape (n_features,) for binary classification and single-output regression, (n_classes, n_features) for multiclass classification and (n_targets, n_features) for multi-output regression. Note this does not include the intercept (or bias) term, which is stored in intercept_.

A regression coefficient describes the size and direction of the relationship between a predictor and the response variable. Coefficients are the numbers by which the values of the term are multiplied in a regression equation.

Each of these corresponds formally to a test of the null hypothesis that the coefficient in question is zero while all the others are kept in the model (also known as the Wald test).

Null hypothesis: the coefficient is equal to zero (no effect).
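A small sketch of reading the fitted coefficients off a scikit-learn model (the toy data are invented for illustration):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy binary-classification data with two features.
    X = np.array([[0.5, 1.0], [1.5, 2.0], [3.0, 0.5], [4.0, 1.5]])
    y = np.array([0, 0, 1, 1])

    clf = LogisticRegression().fit(X, y)
    print(clf.coef_)       # LogisticRegression stores shape (1, n_features) for binary problems
    print(clf.intercept_)  # the bias term, kept separate from coef_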

## Confusion matrix

I'm assuming that M[i, j] stands for "an element of real class i was classified as class j". If it's the other way around, you will need to transpose everything I say. I'm also going to use the following matrix for concrete examples:

    1 2 3
    4 5 6
    7 8 9

There are essentially two things you can do.

**Finding how each class has been classified.** The first thing you can ask is what percentage of elements of real class i were classified as each class. To do so, take the row for class i and divide each element by the sum of the elements in that row. In our example, objects from class 2 are classified as class 1 four times, classified correctly as class 2 five times, and classified as class 3 six times. To find the percentages, we just divide everything by the sum 4 + 5 + 6 = 15:

- 4/15 of the class 2 objects are classified as class 1
- 5/15 of the class 2 objects are classified as class 2
- 6/15 of the class 2 objects are classified as class 3

**Finding what classes are responsible for each classification.** The second thing you can do is look at each result from your classifier and ask how many of those results originate from each real class. It's similar to the other case, but with columns instead of rows. In our example, the classifier returns "1" once when the original class is 1, 4 times when the original class is 2, and 7 times when the original class is 3. To find the percentages, we divide by the sum 1 + 4 + 7 = 12:

- 1/12 of the objects classified as class 1 were from class 1
- 4/12 of the objects classified as class 1 were from class 2
- 7/12 of the objects classified as class 1 were from class 3

Of course, both of the methods above apply to a single row or column at a time, and I'm not sure it would be a good idea to actually modify your confusion matrix in this form. However, this should give the percentages you are looking for (see the sketch below).
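A small sketch of both normalizations with NumPy, using the example matrix above:

    import numpy as np

    M = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])  # M[i, j]: real class i classified as j

    # Row-normalize: how each real class was classified.
    row_pct = M / M.sum(axis=1, keepdims=True)
    print(row_pct[1])  # [4/15, 5/15, 6/15] for real class 2

    # Column-normalize: which real classes produced each predicted class.
    col_pct = M / M.sum(axis=0, keepdims=True)
    print(col_pct[:, 0])  # [1/12, 4/12, 7/12] for objects classified as class 1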

## Resources