- ML is a technique that uses mathematics and statistics to create a model that can predict unknown values.
- For example, Cycle rental company could use historic data to train a model that predicts daily rental demand in order to make sure sufficient staff and cycles are available.
- To do this, create a ML model that takes information about a specific day (the day of week, weather, ...) as an input (x), the number of rentals for that day is the label (y), and predicts the expected number of rentals as an output.The function (f) that calculates the number of rentals based on the information about the day is encapsulated in a machine learning model. f(x) = y (ML models in mathematical terms).
- Let's explore the steps involved in training and inferencing.
-
- The training data consists of past observations. It contains, the observed attributes or features value (x) and the known value (y) to train a model to predict (known as the label).
- an observation consists of multiple feature values, so x is actually a vector (an array with multiple values), like this: [x1,x2,x3,...].
- Exampe 1 - Ice cream sales : predict the number of ice cream sales based on the weather. The weather (x), and the number of ice creams sold on each day (y).
- Example 2 - Medical : to predict whether or not a patient is at risk of diabetes based on their clinical measurements. The patient's measurements (x)) and the likelihood of diabetes (y for example, 1 for at risk, 0 for not at risk).
- Example 3 - Antarctic research : to predict the species of a penguin based on its physical attributes. The key measurements (x) of the penguin (length of its flippers, width of its bill, and so on) and the species (y) (for example, 0 for Adelie, 1 for Gentoo, or 2 for Chinstrap).
- An algorithm is applied to the data to try to determine a relationship between the features and the label.
- The result of the algorithm is a model that encapsulates : y = f(x)
- Now that the training phase is complete, the trained model can be used for inferencing.
- The training data consists of past observations. It contains, the observed attributes or features value (x) and the known value (y) to train a model to predict (known as the label).
- Supervised ML in which the training data includes both feature values and known label values. Supervised ML is used to train models by determining a relationship between the features and labels in past observations, so that unknown labels can be predicted for features in future cases.
- Regression - the label predicted by the model is a numeric value. For example:
- The number of ice creams sold on a given day, based on the temperature, rainfall, and windspeed.
- The selling price of a property based on its size in square feet, the number of bedrooms it contains, and socio-economic metrics for its location.
- The fuel efficiency (in miles-per-gallon) of a car based on its engine size, weight, width, height, and length.
- Classification - the label represents a categorization, or class.
- Binary classification - the model predicts a binary true/false or positive/negative prediction for a single possible class. For example:
- Whether a patient is at risk for diabetes based on clinical metrics like weight, age, blood glucose level, and so on.
- Whether a bank customer will default on a loan based on income, credit history, age, and other factors.
- Whether a mailing list customer will respond positively to a marketing offer based on demographic attributes and past purchases.
- Multiclass classification extends binary classification to predict a label that represents one of multiple possible classes. For example,
- The species of a penguin (Adelie, Gentoo, or Chinstrap) based on its physical measurements. A penguin can't be both a Gentoo and an Adelie. Use some algorithms to train multilabel classification models, in which there may be more than one valid label for a single observation.
- The genre of a movie (comedy, horror, romance, adventure, or science fiction) based on its cast, director, and budget. A movie could potentially be categorized as both science fiction and comedy.
- Binary classification - the model predicts a binary true/false or positive/negative prediction for a single possible class. For example:
- Regression - the label predicted by the model is a numeric value. For example:
- Unsupervised ML involves training models using data that consists only of feature values without any known labels. Unsupervised ML algorithms determine relationships between the features of the observations in the training data.
- Clustering algorithm identifies similarities between observations based on their features, and groups them into discrete clusters. For example:
- Group similar flowers based on their size, number of leaves, and number of petals.
- Identify groups of similar customers based on demographic attributes and purchasing behavior.
- Clustering algorithm identifies similarities between observations based on their features, and groups them into discrete clusters. For example:
- Regression models are trained to predict numeric label values based on training data that includes both features and known labels.
- The diagram shows four key elements of the training process for supervised ML models:
- Split the training data (randomly).
- Use an algorithm (linear regression) to fit the training data to a model.
- Use the validation data to test the model by predicting labels for the features.
- Compare the known actual labels in the validation dataset to the labels that the model predicted. Then aggregate the differences between the predicted and actual label values to calculate a metric that indicates how accurately the model predicted for the validation data.
- Example - the ice cream sales to predict is the number of ice creams sold that day.
Temperature (x) Ice cream sales (y) 51 1 52 0 67 14 65 14 70 23 69 20 72 23 75 26 73 22 81 30 78 26 83 36 - Training a regression model - splitting the data and using a subset of it to train a model. Here's the training dataset:
Temperature (x) Ice cream sales (y) 51 1 65 14 69 20 72 23 75 26 81 30 - Apply an algorithm (linear regression) to our training data
- Calculate the value of y for a given value of x. The line intercepts the x axis at 50, so when x is 50, y is 0. f(x) = x-50
- Use this function to predict the number of ice creams sold on a day with any given temperature.
- For example, suppose the weather forecast tells us that tomorrow it will be 77 degrees. We can apply our model to calculate 77-50 and predict that we'll sell 27 ice creams tomorrow.
- Evaluating a regression model - To validate the model and evaluate how well it predicts, we held back some data for which we know the label (y) value. Here's the data we held back:
Temperature (x) Ice cream sales (y) 52 0 67 14 70 23 73 22 78 26 83 36 - f(x) = x-50, results in the following predictions:
Temperature (x) Actual sales (y) Predicted sales (ŷ) 52 0 2 67 14 17 70 23 20 73 22 23 78 26 28 83 36 22
- f(x) = x-50, results in the following predictions:
- Regression evaluation metrics
- Mean Absolute Error (MAE): The average difference between predicted values and true values. The mean (average) of the absolute errors (2, 3, 3, 1, 2, and 3) is 2.33. The lower this value is, the better the model is predicting.
- Mean Squared Error (RMSE): which are 4, 9, 9, 1, 4, and 9 => 6.
- Root Mean Squared Error (RMSE): In this case √6, which is 2.45 (ice creams). Compared to the MAE, a larger difference indicates greater variance in the individual errors (for example, with some errors being very small, while others are large).
- Coefficient of Determination (R2): This metric is more commonly referred to as R-Squared, and summarizes how much of the variance between predicted and true values is explained by the model. The closer to 1 this value is, the better the model is performing. R2 = 1- ∑(y-ŷ)2 ÷ ∑(y-ȳ)2
-
Like regression, follows the same iterative process of training, validating, and evaluating models.
-
the algorithms used to train classification models calculate probability values (true or false, 1 or 0) for class or category assignment and the evaluation metrics used to assess model performance compare the predicted classes to the actual classes.
-
Example - Use the blood glucose level of a patient to predict whether or not the patient has diabetes.
Blood glucose (x) Diabetic? (y) 67 0 103 1 114 1 72 0 116 1 65 0 -
Training - calculates the probability of the class label being true (diabetes). Probability is measured as a value between 0.0 and 1.0, such that the total probability for all possible classes is 1.0. So for example, if the probability of a patient having diabetes is 0.7, then there's a corresponding probability of 0.3 that the patient isn't diabetic.
- There are many algorithms that can be used for binary classification, such as logistic regression, which derives a sigmoid (S-shaped) function with values between 0.0 and 1.0, like this:
- the probability of y being true (y=1) for a given value of x : f(x) = P(y=1 | x)
- 3 observations in the training data is true, y=1 is 1.0. Other hand, y is false, so the probability that y=1 is 0.0.
- The S-shaped curve describes the probability distribution so that plotting a value of x on the line identifies the corresponding probability that y is 1.
-
Evaluating - a random subset of data with which to validate the trained model:
Blood glucose (x) Diabetic? (y) 66 0 107 1 112 1 71 0 87 1 89 1 - compare the predicted class labels (ŷ) to the actual class labels (y), as shown here:
Blood glucose (x) Actual diabetes diagnosis (y) Predicted diabetes diagnosis (ŷ) 66 0 0 107 1 1 112 1 1 71 0 0 87 1 0 89 1 1
- compare the predicted class labels (ŷ) to the actual class labels (y), as shown here:
-
Binary classification evaluation metrics - create a matrix of the number of correct and incorrect predictions for each possible class label:
- This visualization is called a confusion matrix, and it shows the prediction totals where: ŷ=0 and y=0: True negatives (TN) ŷ=1 and y=0: False positives (FP) ŷ=0 and y=1: False negatives (FN) ŷ=1 and y=1: True positives (TP)
- Accuracy - The simplest metric you can calculate from the confusion matrix is accuracy - the proportion of predictions that the model got right. Accuracy is calculated as: (TN+TP) ÷ (TN+FN+FP+TP)
- the calculation is: (2+3) ÷ (2+1+0+3) = 5 ÷ 6 = 0.83
- So for our validation data, the diabetes classification model produced correct predictions 83% of the time.
- Recall - is a metric that measures the proportion of positive cases that the model identified correctly. In other words, compared to the number of patients who have diabetes, how many did the model predict to have diabetes?
- The formula for recall is: TP ÷ (TP+FN) -> 3 ÷ (3+1) => = 0.75
- So our model correctly identified 75% of patients who have diabetes as having diabetes.
- Precision - is a similar metric to recall, but measures the proportion of predicted positive cases where the true label is actually positive. In other words, what proportion of the patients predicted by the model to have diabetes actually have diabetes?
- The formula for precision is: TP ÷ (TP+FP) => 3 ÷ (3+0) = 1.0
- So 100% of the patients predicted by our model to have diabetes do in fact have diabetes.
- F1-score - is an overall metric that combined recall and precision. The formula for F1-score is: (2 x Precision x Recall) ÷ (Precision + Recall) => (2 x 1.0 x 0.75) ÷ (1.0 + 0.75) => = 0.86
-
Area Under the Curve (AUC) - Another name for recall is the true positive rate (TPR), and there's an equivalent metric called the false positive rate (FPR) that is calculated as FP÷(FP+TN).
-
Predict to which of multiple possible classes an observation belongs.
-
Example - Calculate probability values for multiple class labels, enabling a model to predict the most probable class for a given observation.
- some observations of penguins, in which the flipper length (x) of each penguin is recorded. For each observation, the data includes the penguin species (y), which is encoded as follows: 0: Adelie, 1: Gentoo, 2: Chinstrap
Flipper length (x) Species (y) 167 0 172 0 225 2 197 1 189 1 232 2 158 0
- some observations of penguins, in which the flipper length (x) of each penguin is recorded. For each observation, the data includes the penguin species (y), which is encoded as follows: 0: Adelie, 1: Gentoo, 2: Chinstrap
-
Training - calculates a probability value for each possible class. There are two kinds of algorithm you can use to do this:
- One-vs-Rest (OvR) algorithms
- Multinomial algorithms
-
One-vs-Rest (OvR) algorithms
- train a binary classification function for each class, each calculating the probability that the observation is an example of the target class.
- Each function calculates the probability of the observation being a specific class compared to any other class.
- For our penguin species classification model, the algorithm would essentially create three binary classification functions:
- f0(x) = P(y=0 | x)
- f1(x) = P(y=1 | x)
- f2(x) = P(y=2 | x)
- Each algorithm produces a sigmoid function that calculates a probability value between 0.0 and 1.0.
- A model trained using this kind of algorithm predicts the class for the function that produces the highest probability output.
-
Multinomial algorithms
- which creates a single function that returns a multi-valued output.
- The output is a vector (an array of values) that contains the probability distribution for all possible classes - with a probability score for each class which when totaled add up to 1.0: f(x) =[P(y=0|x), P(y=1|x), P(y=2|x)]
- An example of this kind of function is a softmax function, which could produce an output like the following example: [0.2, 0.3, 0.5]
- The elements in the vector represent the probabilities for classes 0, 1, and 2 respectively; so in this case, the class with the highest probability is 2.
-
Evaluating - calculate aggregate metrics that take all classes into account.
- Let's assume that we've validated our multiclass classifier, and obtained the following results:
Flipper length (x) Actual species (y) Predicted species (ŷ) 165 0 0 171 0 0 205 2 1 195 1 1 183 1 1 221 2 2 214 2 2 - The confusion matrix for a multiclass classifier is similar to that of a binary classifier, except that it shows the number of predictions for each combination of predicted (ŷ) and actual class labels (y):
- From this confusion matrix, we can determine the metrics for each individual class as follows:
Class TP TN FP FN Accuracy Recall Precision F1-Score 0 2 5 0 0 1.0 1.0 1.0 1.0 1 2 4 1 0 0.86 1.0 0.67 0.8 2 2 4 0 1 0.86 0.67 1.0 0.8 - To calculate the overall accuracy, recall, and precision metrics, you use the total of the TP, TN, FP, and FN metrics:
- Overall accuracy = (13+6)÷(13+6+1+1) = 0.90
- Overall recall = 6÷(6+1) = 0.86
- Overall precision = 6÷(6+1) = 0.86
- The overall F1-score is calculated using the overall recall and precision metrics: Overall F1-score = (2x0.86x0.86)÷(0.86+0.86) = 0.86
- Let's assume that we've validated our multiclass classifier, and obtained the following results:
-
Observations are grouped into clusters based on similarities in their data values, or features.
-
It doesn't make use of previously known label values to train a model.
-
In a clustering model, the label is the cluster to which the observation is assigned, based only on its features.
-
Example - A botanist observes a sample of flowers and records the number of leaves and petals on each flower:
-
Training - There are multiple algorithms you can use for clustering. One of the most commonly used algorithms is K-Means clustering, which consists of the following steps:
- The feature (x) values are vectorized to define n-dimensional coordinates (where n is the number of features).
- In the flower example, we have two features: number of leaves (x1) and number of petals (x2). So, the feature vector has two coordinates that we can use to conceptually plot the data points in two-dimensional space ([x1,x2])
- You decide how many clusters you want to use to group the flowers - call this value k.
- For example, to create three clusters, you would use a k value of 3. Then k points are plotted at random coordinates. These points become the center points for each cluster, so they're called centroids.
- Each data point (in this case a flower) is assigned to its nearest centroid.
- Each centroid is moved to the center of the data points assigned to it based on the mean distance between the points.
- After the centroid is moved, the data points may now be closer to a different centroid, so the data points are reassigned to clusters based on the new closest centroid.
- The centroid movement and cluster reallocation steps are repeated until the clusters become stable or a predetermined maximum number of iterations is reached.
- The following animation shows this process:
- The feature (x) values are vectorized to define n-dimensional coordinates (where n is the number of features).
-
Evaluating - No known label, it's separated from one another. There are multiple metrics that you can use to evaluate cluster separation, including:
- Average distance to cluster center: How close, on average, each point in the cluster is to the centroid of the cluster.
- Average distance to other center: How close, on average, each point in the cluster is to the centroid of all other clusters.
- Maximum distance to cluster center: The furthest distance between a point in the cluster and its centroid.
- Silhouette: A value between -1 and 1 that summarizes the ratio of distance between points in the same cluster and points in different clusters (The closer to 1, the better the cluster separation).Predict to which of multiple possible classes an observation belongs.
-
Deep learning is an advanced form of ML that tries to emulate the way the human brain learns.
-
The key to deep learning is the creation of an artificial neural network that simulates electrochemical activity in biological neurons by using mathematical functions, as shown here.
-
Artificial neural networks are made up of multiple layers of neurons - essentially defining a deeply nested function. This architecture is the reason the technique is referred to as deep learning and the models produced by it are often referred to as deep neural networks (DNNs). You can use deep neural networks for many kinds of machine learning problem, including regression and classification, as well as more specialized models for natural language processing and computer vision.
-
Example - define a classification model for penguin species. Diagram of a neural network used to classify a penguin species.
- The feature data (x) consists of some measurements of a penguin. Specifically, the measurements are:
- The length of the penguin's bill.
- The depth of the penguin's bill.
- The length of the penguin's flippers.
- The penguin's weight.
- In this case, x is a vector of four values, or mathematically, x=[x1,x2,x3,x4].
- The label we're trying to predict (y) is the species of the penguin, and that there are three possible species it could be: Adelie, Gentoo, Chinstrap
- This is an example of a classification problem, in which the ML model must predict the most probable class to which an observation belongs.
- A classification model accomplishes this by predicting a label that consists of the probability for each class. In other words, y is a vector of three probability values; one for each of the possible classes: [P(y=0|x), P(y=1|x), P(y=2|x)].
- The process for inferencing a predicted penguin class using this network is:
- The feature vector for a penguin observation is fed into the input layer of the neural network, which consists of a neuron for each x value. In this example, the following x vector is used as the input: [37.3, 16.8, 19.2, 30.0]
- The functions for the first layer of neurons each calculate a weighted sum by combining the x value and w weight, and pass it to an activation function that determines if it meets the threshold to be passed on to the next layer.
- Each neuron in a layer is connected to all of the neurons in the next layer (an architecture sometimes called a fully connected network) so the results of each layer are fed forward through the network until they reach the output layer.
- The output layer produces a vector of values; in this case, using a softmax or similar function to calculate the probability distribution for the three possible classes of penguin. In this example, the output vector is: [0.2, 0.7, 0.1]
- The elements of the vector represent the probabilities for classes 0, 1, and 2. The second value is the highest, so the model predicts that the species of the penguin is 1 (Gentoo).
-
How does a neural network learn?
- The weights in a neural network are central to how it calculates predicted values for labels.
- During the training process, the model learns the weights that will result in the most accurate predictions.
- Let's explore the training process in a little more detail to understand how this learning takes place.
-
- The training and validation datasets are defined, and the training features are fed into the input layer.
- The neurons in each layer of the network apply their weights (which are initially assigned randomly) and feed the data through the network.
- The output layer produces a vector containing the calculated values for ŷ. For example, an output for a penguin class prediction might be [0.3. 0.1. 0.6].
- A loss function is used to compare the predicted ŷ values to the known y values and aggregate the difference (which is known as the loss). For example, if the known class for the case that returned the output in the previous step is Chinstrap, then the y value should be [0.0, 0.0, 1.0]. The absolute difference between this and the ŷ vector is [0.3, 0.1, 0.4]. In reality, the loss function calculates the aggregate variance for multiple cases and summarizes it as a single loss value.
- Since the entire network is essentially one large nested function, an optimization function can use differential calculus to evaluate the influence of each weight in the network on the loss, and determine how they could be adjusted (up or down) to reduce the amount of overall loss. The specific optimization technique can vary, but usually involves a gradient descent approach in which each weight is increased or decreased to minimize the loss.
- The changes to the weights are backpropagated to the layers in the network, replacing the previously used values.
- The process is repeated over multiple iterations (known as epochs) until the loss is minimized and the model predicts acceptably accurately.
-
MS Azure ML is a cloud service for training, deploying, and managing ML models. Manage the end-to-end lifecycle of machine learning projects, including:
- Exploring data and preparing it for modeling.
- Training and evaluating machine learning models.
- Registering and managing trained models.
- Deploying trained models for use by applications and services.
- Reviewing and applying responsible AI principles and practices.
-
Features and capabilities of Azure Machine Learning
- Centralized storage and management of datasets for model training and evaluation.
- On-demand compute resources on which you can run machine learning jobs, such as training a model.
- Automated machine learning (AutoML), which makes it easy to run multiple training jobs with different algorithms and parameters to find the best model for your data.
- Visual tools to define orchestrated pipelines for processes such as model training or inferencing.
- Integration with common machine learning frameworks such as MLflow, which make it easier to manage model training, evaluation, and deployment at scale.
- Built-in support for visualizing and evaluating metrics for responsible AI, including model explainability, fairness assessment, and others.
-
Provisioning Azure Machine Learning resources
-
Azure Machine Learning studio
- After you've provisioned, use the studio.
- In Azure Machine Learning studio, you can (among other things):
- Import and explore data.
- Create and use compute resources.
- Run code in notebooks.
- Use visual tools to create jobs and pipelines.
- Use automated machine learning to train models.
- View details of trained models, including evaluation metrics, responsible AI information, and training parameters.
- Deploy trained models for on-request and batch inferencing.
- Import and manage models from a comprehensive model catalog.
- The screenshot shows the Metrics page for a trained model in studio, see the evaluation metrics for a trained multiclass classification model.
- Explore the automated machine learning capability in Azure ML, and use it to train and evaluate a machine learning model.
- Create an Azure Machine Learning workspace
- Azure Portal -> Create Resource -> new Azure Machine Learning -> Create
- Select Launch Studio (open new browser https://ml.azure.com) -> Sign -> See newly created workspace (if not, select All workspaces)
- Enable preview features
- Use automated machine learning to train a model
-
Automated ML enables you to try multiple algorithms and parameters to train multiple models, and identify the best one for your data.
-
Use a dataset of historical bicycle rental details to train a model that predicts the number of bicycle rentals that should be expected on a given day, based on seasonal and meteorological features. (https://capitalbikeshare.com/system-data)
-
Studio -> Automated ML page (under Authoring) > New Automated ML job
- Basic settings -> Job name: mslearn-bike-automl, New experiment name: mslearn-bike-rental
- Task type & data -> Select task type: Regression, Select dataset: Create -> Name: bike-rentals, Type: Tabular, Web URL: https://aka.ms/bike-rentals, Column headers: Only first file has headers, Review
- Select the bike-rentals dataset after you’ve created it.
- Task settings ->
- Target column: Rentals (integer)
- Additional configuration settings: Explain best model: Unselected, Use all supported models: Unselected. Allowed models: Select only RandomForest and LightGBM,
- Limits: Max trials: 3, Max concurrent trials: 3, Max nodes: 3, Metric score threshold: 0.85, Timeout: 15, Iteration timeout: 5, Enable early termination: Selected
- Validation and test: Validation type: Train-validation split
- Compute: -> Submit the training job. It starts automatically -> Wait for the job to finish.
-
Deploy and test the model -> select Deploy and use the Web service option: Name: predict-rentals, Compute type: Azure Container Instance, Enable authentication: Selected -> Deploy -> Wait for the Deploy status to change to Succeeded. This may take 5-10 minutes.
-
Test the deployed service
- Endpoints -> predict-rentals -> Test -> Input
{ "Inputs": { "data": [ { "day": 1, "mnth": 1, "year": 2022, "season": 2, "holiday": 0, "weekday": 1, "workingday": 1, "weathersit": 2, "temp": 0.3, "atemp": 0.3, "hum": 0.3, "windspeed": 0.3 } ] }, "GlobalParameters": 1.0 }
- Result
{ "Results": [ 444.27799000000000 ] }
-
-
You want to create a model to predict the cost of heating an office building based on its size in square feet and the number of employees working there. What kind of machine learning problem is this?
- Regression
- Classification
- Clustering
-
You need to evaluate a classification model. Which metric can you use?
- Mean squared error (MSE)
- Precision
- Silhouette
-
In deep learning, what is the purpose of a loss function?
- To remove data for which no known label values are provided
- To evaluate the aggregate difference between predicted and actual label values
- To calculate the cost of training a neural network rather than a statistical model
-
What does automated machine learning in Azure Machine Learning enable you to do?
- Automatically deploy new versions of a model as they're trained
- Automatically provision Azure Machine Learning workspaces for new data scientists in an organization
- Automatically run multiple training jobs using different algorithms and parameters to find the best model