Watson Machine Learning & AutoAI

Cross-Industry Standard Process for Data Mining overview

  • CRISP-DM stands for Cross-Industry Standard Process for Data Mining.
  • It is an open standard guide that describes common approaches that are used by data mining experts.
  • As a methodology, CRISP-DM includes descriptions of the typical phases of a project, including task details.
  • As a process model, CRISP-DM provides an overview of the data mining lifecycle.

CRISP-DM is an industry-proven way to guide your data mining efforts. It is the most widely used analytics model. It describes common approaches that are used by data mining experts. As a methodology, it includes descriptions of the typical phases of a project, the tasks that are involved with each phase, and an explanation of the relationships between these tasks. As a process model, CRISP-DM provides an overview of the data mining lifecycle. In a nutshell, it consolidates preferred practices.

CRISP-DM lifecycle

![[Pasted image 20230624124415.png]]

The lifecycle model consists of six phases, with arrows indicating the most important and frequent dependencies between phases. The sequence of the phases is not strict. In fact, most projects move back and forth between phases as necessary.

It starts with business understanding, and then moves to data understanding, data preparation, modelling, evaluation, and deployment.

The CRISP-DM model is flexible and can be customized easily. For example, if your organization aims to detect money laundering, it is likely that you examine large amounts of data without a specific modelling goal. Instead of modelling, your work focuses on data exploration and visualization to uncover suspicious patterns in financial data.

In such a situation, the modelling, evaluation, and deployment phases might be less relevant than the data understanding and preparation phases. However, it is still important to consider some of the questions that are raised during these later phases for long-term planning and future data mining goals.

Data preparation

The process of preparing data for a machine learning algorithm has the following phases:

  • Data selection
  • Data preprocessing
  • Data transformation

Machine learning algorithms depend highly on the quality and quantity of data. You must provide these algorithms with the correct data. Data preparation is a large subject that can involve many iterations, exploration, and analysis. Becoming proficient at data preparation will make you a master at machine learning.

Data selection

Think about:

  • What data is available, what data is missing, and what data can be removed?
  • Is the selected sample an accurate representation of the entire population?
  • Is more data better?

This step is concerned with selecting a subset of all the available data with which you are working. Consider what data is available, what data is missing, and what data can be removed.

Make some assumptions about the data that you require and record those assumptions. Ask yourself:

  • What data is available? For example, through which media, such as database tables or other systems? Is it structured or unstructured? Make sure that you are aware of everything that you can use.
  • What data is not available, and what data do you want to get? For example, data that is not or cannot be recorded. Can you develop or simulate this data?
  • What data can be removed (because it does not address the problem)? Document which data you excluded and why.

It is common to think that the more data we have, the better the results, but this is not necessarily true. According to the paper "Scaling to Very Very Large Corpora for Natural Language Disambiguation" by Microsoft researchers Banko and Brill [2001] [1], different algorithms performed virtually the same on the same problem, and accuracy increased as more data (in this case, words) was added. However, for very large amounts of data, the improvements started to become negligible.
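
One practical way to see this effect on your own problem is to plot a learning curve. The following is a minimal sketch, assuming scikit-learn and a synthetic data set; the data and the logistic regression model are illustrative only and are not related to the Banko and Brill study:

```python
# Minimal sketch: how validation accuracy changes as the training size grows.
# The synthetic data set and logistic regression are illustrative choices only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=10000, n_features=20, random_state=42)

train_sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% .. 100% of the training data
    cv=5,
    scoring="accuracy",
)

for size, score in zip(train_sizes, valid_scores.mean(axis=1)):
    print(f"{size:>6} training samples -> mean CV accuracy {score:.3f}")
```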

Data selection: Sample selection

Selecting the correct size of the sample is a key step in data preparation. Samples that are too large or too small might give skewed results.

  • Sampling noise
  • Sampling bias

Sampling noise: Samples that are too small cause sampling noise because, by chance, they might not represent the population well, so the model is trained on non-representative data.

Sampling bias: A sample is biased if certain members of the population are underrepresented or overrepresented relative to others. Larger samples work well only if there is no sampling bias, that is, when the correct data is picked.
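
A stratified split is one common way to guard against sampling bias because it preserves the proportions of a key attribute in the sample. A minimal sketch, assuming a hypothetical customer DataFrame with an income_band column:

```python
# Minimal sketch: stratified sampling keeps group proportions, reducing sampling bias.
# The DataFrame and its column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "age": [23, 45, 31, 52, 38, 27, 61, 44],
    "income_band": ["low", "high", "low", "high", "mid", "low", "mid", "high"],
})

# A purely random 50% sample may over- or underrepresent an income band
random_sample = df.sample(frac=0.5, random_state=0)

# A stratified split preserves the income_band proportions in both halves
sample, rest = train_test_split(
    df, test_size=0.5, stratify=df["income_band"], random_state=0
)
print(sample["income_band"].value_counts(normalize=True))
```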

Data Selection: Survey Sampling

A good sample is representative, meaning that each sample point represents the attributes of a known number of population elements. The bias that results from an unrepresentative sample is called selection bias. There are various types of bias in surveys, such as undercoverage, non-response bias, and voluntary response bias. To explain these types, assume that you want to conduct a survey about customers and their purchase preferences:

  • Undercoverage: Undercoverage occurs when some members of the population are inadequately represented in the sample. For example, when a minority is underrepresented in a sample, this situation can affect the validity of the sample. In this example, assume that you underrepresented the customers who have a low income.
  • Non-response bias: Sometimes, members that are chosen for the sample are unwilling or unable to participate in the survey. Non-response bias is the bias that results when respondents differ in meaningful ways from non-respondents. This bias can affect the validity of the sample. Non-response bias is a common problem with mail surveys because the response rate is often low, making mail surveys vulnerable to non-response bias.
  • Voluntary response bias: This bias occurs when sample members are self-selected volunteers, as in voluntary samples. When this happens, the resulting sample tends to overrepresent individuals who have strong opinions.

Data Preprocessing

Data challenges:

  • Noise and outliers
  • Missing values
  • Inconsistent values
  • Duplicate data

Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Why do we use data preprocessing? In real scenarios, data is generally incomplete:

  • Data might lack attribute values or be missing certain attributes of interest.
  • Data is noisy when it contains errors or outliers, for example, a human height recorded as a negative value.
  • Data might be inconsistent, containing discrepancies in codes or names, for example, two employee records that share the same ID.

Much of the raw data that we encounter is not fit to be readily processed by machine learning algorithms. We must preprocess the raw data before it is fed into machine learning algorithms. Organize your selected data by formatting, cleaning, and sampling it. Poor data quality negatively affects many data processing efforts.
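
In practice, many of these quality checks can be run quickly with pandas. The following is a minimal sketch; the columns and values are hypothetical examples of the problems listed above:

```python
# Minimal sketch: quick data-quality checks with pandas.
# The columns and values are hypothetical examples of the problems listed above.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "employee_id": [101, 102, 102, 104],             # duplicate ID (inconsistent data)
    "height_cm":   [175.0, -168.0, 181.0, np.nan],   # negative value (noise), missing value
    "department":  ["sales", "hr", "hr", None],
})

print(df.isna().sum())                                # missing values per column
print(df.duplicated(subset=["employee_id"]).sum())    # duplicate IDs
print(df[df["height_cm"] <= 0])                       # impossible (noisy) values

# Typical first-pass fixes
df = df.drop_duplicates(subset=["employee_id"])
df = df[df["height_cm"].isna() | (df["height_cm"] > 0)]
df["height_cm"] = df["height_cm"].fillna(df["height_cm"].median())
```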

Data Preprocessing: Steps Overview

  1. Data cleaning: Complete missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
  2. Data integration: Using multiple databases, other data sources, or files.
  3. Data sampling: Faster for exploring and prototyping.
  4. Data dimensionality reduction: Reducing the volume but producing the same or similar analytical results.
  5. Data formatting: The data that you selected might not be in a format that is suitable for you to use.

The following list summarizes the steps that are used in preprocessing data:

  1. Data cleaning: Complete missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.

  2. Data integration: Using multiple databases, other data sources, or files.

  3. Data sampling: You can take a smaller representative sample of the selected data that might be much faster for exploring and prototyping solutions before considering the whole data set.

  4. Data dimensionality reduction: Reducing the dimensions of the data while producing the same or similar analytical results. This is a kind of data compression that uses less memory or disk space and speeds up the learning algorithm. For example, assume that your data has three features, one of which is a size in centimeters and another the same size in inches. By removing one of these redundant features, you reduce the dimensions of your data from 3D to 2D.

  5. Data formatting: The data that you selected might not be in a format that is suitable for you to use. For example, the data might be in a relational database and you want it in a comma-separated file.
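
A minimal sketch of how several of these steps can be chained with scikit-learn; the synthetic data and the specific choices (median imputation, standard scaling, PCA) are illustrative only:

```python
# Minimal sketch of the preprocessing steps above with scikit-learn.
# The data is synthetic; the step choices are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
X[rng.random(X.shape) < 0.05] = np.nan        # inject ~5% missing values

# Data sampling: work on a smaller subset while prototyping
X_sample = X[rng.choice(len(X), size=200, replace=False)]

preprocess = Pipeline([
    ("clean", SimpleImputer(strategy="median")),   # data cleaning: fill missing values
    ("scale", StandardScaler()),                   # put features on a comparable scale
    ("reduce", PCA(n_components=5)),               # dimensionality reduction
])

X_ready = preprocess.fit_transform(X_sample)
print(X_ready.shape)   # (200, 5)
```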

Data Transformation

Data transformation:

  • Scaling
  • Aggregation
  • Decomposition

Data transformation (also called feature engineering) is a set of actions that covers transformation of the processed data. Engineering features from your data can take some time, but it can enhance machine learning performance. There are three common data transformations:

  • Scaling: The pre-processed data might contain attributes with a mixture of scales for various quantities, such as meters, grams, and dollars. The features should have the same scale, for example, 0 (smallest value) to 1 (largest value).
  • Aggregation: There might be features that can be aggregated into a single feature, which is more meaningful to the problem that you are trying to solve.
  • Decomposition: There might be complex features that are more useful when split into their parts. For example, a feature that represents a date and time stamp in a long format can be split down to only the hour of the day. Think about what your problem really needs.
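
A minimal sketch of the three transformations using pandas and scikit-learn; the orders DataFrame and its columns are hypothetical:

```python
# Minimal sketch of scaling, aggregation, and decomposition.
# The DataFrame and column names are hypothetical.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [20.0, 35.0, 5.0, 12.5, 8.0],
    "timestamp": pd.to_datetime([
        "2023-06-01 09:15", "2023-06-03 18:40", "2023-06-02 11:05",
        "2023-06-04 22:10", "2023-06-05 07:55",
    ]),
})

# Scaling: map amounts to the 0..1 range
orders["amount_scaled"] = MinMaxScaler().fit_transform(orders[["amount"]]).ravel()

# Aggregation: combine rows into a single, more meaningful feature per customer
spend_per_customer = orders.groupby("customer_id")["amount"].sum()

# Decomposition: split a timestamp into the part the problem actually needs
orders["hour_of_day"] = orders["timestamp"].dt.hour

print(orders)
print(spend_per_customer)
```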

Watson Machine Learning

  • Watson Machine Learning is a service on IBM Cloud with features for training and deploying machine learning models and neural networks.
  • Watson Machine Learning is integrated with IBM Watson Studio. 
  • Enables users to perform two fundamental operations of machine learning: training and scoring.
  • Training is the process of refining an algorithm so that it can learn from a data set.
  • Scoring is the operation of predicting an outcome by using a trained model.

Watson Machine Learning is a service on IBM Cloud with features for training and deploying machine learning models and neural networks. To design, train, and deploy machine learning models in IBM Watson Studio, you must associate a Watson Machine Learning service instance and supporting services (such as IBM Cloud Object Storage) with a project.

Watson Machine Learning enables users to perform two fundamental operations of machine learning: training and scoring.

  • Training is the process of refining an algorithm so that it can learn from a data set. The output of this operation is called a model. A model encompasses the learned coefficients of mathematical expressions.
  • Scoring is the operation of predicting an outcome by using a trained model. The output of the scoring operation is another data set containing predicted values.
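
A minimal, framework-agnostic sketch of the two operations using scikit-learn; the data set and algorithm are illustrative only:

```python
# Minimal sketch: training produces a model; scoring uses the trained model
# to predict outcomes for new records.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_new, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=0)

# Training: the algorithm learns its parameters from the data set
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Scoring: the trained model predicts outcomes for previously unseen records
predictions = model.predict(X_new)
print(predictions[:5])
```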

Watson Machine Learning features

Interfaces for building, training, and deploying models: Python client library, command-line interface (CLI), and REST API

  • Deployment infrastructure
  • Distributed deep learning
  • GPUs for faster training

The Watson Machine Learning service provides the following features:

  • Interfaces for building, training, and deploying models: a Python client library, a command-line interface (CLI), and a REST API.
  • Deployment infrastructure for hosting your trained models. Although training is a critical step in the machine learning process, Watson Machine Learning enables you to streamline the functioning of your models by deploying them and getting business value from them over time and through all their iterations.
  • Distributed deep learning for distributing training runs across multiple servers.
  • GPUs for faster training.
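
The following sketch shows how the Python client library is typically used to store, deploy, and score a model. Treat it as a hedged example rather than a definitive recipe: the credentials, deployment space ID, software specification name, and model type string are placeholders, and the exact metadata and method names vary between versions of the ibm-watson-machine-learning client.

```python
# Hedged sketch with the ibm-watson-machine-learning Python client.
# Placeholders: API key, deployment space ID. The software specification name,
# model type string, and some method/metadata names differ between client versions.
from ibm_watson_machine_learning import APIClient
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a small model locally (illustrative only)
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=500).fit(X, y)

wml_credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",   # region-specific endpoint
    "apikey": "<YOUR_IBM_CLOUD_API_KEY>",
}
client = APIClient(wml_credentials)
client.set.default_space("<YOUR_DEPLOYMENT_SPACE_ID>")

# Store the trained model in the Watson Machine Learning repository
model_meta = {
    client.repository.ModelMetaNames.NAME: "iris-classifier",
    client.repository.ModelMetaNames.TYPE: "scikit-learn_1.1",   # must match your runtime
    client.repository.ModelMetaNames.SOFTWARE_SPEC_UID:
        client.software_specifications.get_id_by_name("runtime-22.2-py3.10"),
}
stored_model = client.repository.store_model(model=model, meta_props=model_meta)
model_id = client.repository.get_model_id(stored_model)

# Deploy the stored model as an online endpoint, then score new records
deployment = client.deployments.create(
    model_id,
    meta_props={
        client.deployments.ConfigurationMetaNames.NAME: "iris-online",
        client.deployments.ConfigurationMetaNames.ONLINE: {},
    },
)
deployment_id = client.deployments.get_id(deployment)

payload = {"input_data": [{"values": [[5.1, 3.5, 1.4, 0.2]]}]}
print(client.deployments.score(deployment_id, payload))
```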

Watson Machine Learning personas

Data scientists:

  • Use data transformations and machine learning algorithms.
  • Use notebooks or external tools.
  • Often collaborate with data engineers to explore and understand the data.

Developers build intelligent applications that use the prediction output of machine learning models.

Data scientists create machine learning pipelines that use data transformations and machine learning algorithms. They typically use notebooks or external tools to train and evaluate their models. Data scientists often collaborate with data engineers to explore and understand the data.

Developers build intelligent applications that use the predictions that are output by machine learning models.

Watson Machine Learning: Building the model

You can build your model by using one of the following tools:

  1. Notebook
  2. SPSS Modeler flow (Flow Editor):
    • Use the Flow Editor to create a machine learning flow.
    • Use the Flow Editor to create a deep learning flow.
    • Use the Flow Editor to create a SparkML flow.

Data scientists can use Jupyter notebooks to load and process data, and to create and deploy Watson Machine Learning models.

You can use a notebook to create or use a machine learning model. In the notebook, you write the code and call the machine learning API. After the model is created, trained, and deployed, you can run the deployed model from a notebook.

With SPSS Modeler flows in Watson Studio, you can quickly develop predictive models by using business expertise and deploy them into business operations to improve decision making. These flows are designed around the long-established SPSS Modeler client software and the industry-standard CRISP-DM model.

You can create a machine learning flow, which is a graphical representation of a data model, or a deep learning flow, which is a graphical representation of a neural network design, by using the SPSS Modeler (Flow Editor). Use it to prepare or shape data, train or deploy a model, or transform data and export it back to a database table or file in IBM Cloud Object Storage.

You can also create a machine learning model with Apache Spark MLlib nodes by adding the Modeler flow asset type to your project and then selecting SparkML as the flow type.

AutoAI

AutoAI or AutoML generally refers to an algorithmic process, or set of processes, that creates or discovers the best pipelines for a given data set and prediction problem (problem type or metric).

Benefits:

  • Build models faster: Automate data preparation and model development.
  • Find signal from noise: Auto-feature engineering makes it easy to extract more predictive power from your data.
  • Rank and explore models: Quickly compare candidate pipelines to find the best model for the job.

AutoAI automatically runs the following tasks:

  • Data pre-processing
  • Automated model selection
  • Automated feature engineering
  • Hyperparameter optimization

The AutoAI graphical tool in Watson Studio automatically analyzes your data and generates candidate model pipelines that are customized for your predictive modeling problem. These model pipelines are created over time as AutoAI analyzes your data set and discovers data transformations, algorithms, and parameter settings that work best for your problem setting. Results are displayed on a leaderboard, showing the automatically generated model pipelines that are ranked according to your problem optimization objective.

![[Pasted image 20230624125717.png]]

AutoAI automatically runs the following tasks to build and evaluate candidate model pipelines:

Data pre-processing

AutoAI applies various algorithms, or estimators, to analyze, clean, and prepare your raw data for machine learning. It automatically detects and categorizes features based on data type, such as categorical or numerical. Depending on the categorization, it uses hyperparameter optimization to determine the best combination of strategies for missing value imputation, feature encoding, and feature scaling for your data.

Automated model selection

AutoAI uses a novel approach that enables testing and ranking candidate algorithms against small subsets of the data, gradually increasing the size of the subset for the most promising algorithms to arrive at the best match. This approach saves time without sacrificing performance. It enables ranking many candidate algorithms and selecting the best match for the data.
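
The exact selection algorithm is internal to AutoAI, but the idea of ranking candidates on growing data subsets can be illustrated with a simplified sketch (this is not AutoAI's implementation):

```python
# Simplified illustration (not AutoAI's actual algorithm): evaluate candidate
# estimators on growing data subsets and keep only the most promising ones.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=4000, n_features=20, random_state=1)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=1),
    "random_forest": RandomForestClassifier(random_state=1),
}

for subset_size in (500, 1000, 2000, 4000):
    scores = {
        name: cross_val_score(est, X[:subset_size], y[:subset_size], cv=3).mean()
        for name, est in candidates.items()
    }
    # Drop the weakest half of the remaining candidates before using more data
    keep = sorted(scores, key=scores.get, reverse=True)[:(len(scores) + 1) // 2]
    candidates = {name: candidates[name] for name in keep}
    print(subset_size, scores)

print("selected:", list(candidates))
```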

Automated feature engineering

Feature engineering attempts to transform the raw data into the combination of features that best represents the problem to achieve the most accurate prediction. AutoAI uses a novel approach that explores various feature construction choices in a structured, non-exhaustive manner while progressively maximizing model accuracy by using reinforcement learning. This process results in an optimized sequence of transformations for the data that best match the algorithms of the model selection step.

Hyperparameter optimization

A hyperparameter optimization step refines the best performing model pipelines. AutoAI uses a novel hyperparameter optimization algorithm that is optimized for costly function evaluations, such as model training and scoring, that are typical in machine learning. This approach enables fast convergence to a good solution despite long evaluation times of each iteration.
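
AutoAI's optimizer is its own algorithm, but the general idea of searching hyperparameter settings under a limited budget of costly train-and-score evaluations can be illustrated with a generic sketch using scikit-learn's RandomizedSearchCV:

```python
# Generic hyperparameter search sketch (not AutoAI's optimizer).
# Each search iteration is a costly train-and-score evaluation.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=2)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=2),
    param_distributions={
        "n_estimators": randint(50, 300),
        "max_depth": randint(3, 20),
        "min_samples_split": randint(2, 10),
    },
    n_iter=20,           # budget of costly evaluations
    cv=3,
    scoring="roc_auc",
    random_state=2,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```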

Terminologies

The IBM implementation of AutoAI in Watson Studio adopts open-source terminology.

  • Machine learning pipeline: A set of steps for creating a model (a workflow). Typical steps in a pipeline are: ingest, clean, transform, and model with hyperparameter optimization.
  • Estimators: Algorithms or models, for example, logistic regression and random forest.
  • Hyperparameter optimization (HPO): The process of training the models with different parameters (specific to each algorithm).
  • Model evaluation metrics: The various model evaluation metrics that data scientists use, for example, AUC-ROC and F1 score.
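
A minimal sketch that ties these terms together: a pipeline that cleans and transforms data before an estimator, evaluated with AUC-ROC and F1 score (synthetic data, illustrative choices):

```python
# Minimal sketch: a pipeline (clean, transform, model) wrapping an estimator,
# evaluated with the metrics mentioned above.
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

pipeline = Pipeline([
    ("impute", SimpleImputer()),          # clean
    ("scale", StandardScaler()),          # transform
    ("estimator", LogisticRegression()),  # model
])
pipeline.fit(X_train, y_train)

print("F1:", f1_score(y_test, pipeline.predict(X_test)))
print("AUC-ROC:", roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1]))
```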