Running XGBoost on Azure HDInsight Spark clusters

Introduction

This repository contains detailed steps on how to build, install, and run XGBoost on HDInsight, the managed Hadoop and Spark solution on Azure.

XGBoost

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.

It is not designed as a generic machine learning framework; it is a library specialized in boosted tree algorithms, and it is widely used in both production and experimental projects.
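To make the idea of tree boosting concrete, here is a minimal sketch in plain Python. It is not XGBoost's implementation or API: it assumes a single numeric feature, squared-error loss, and depth-1 trees (stumps), and all function names are hypothetical. Each round fits a stump to the current residuals and adds a scaled copy of it to the ensemble.

```python
# Minimal gradient boosting for squared-error regression with depth-1 trees.
# Illustrative only; XGBoost uses far more elaborate trees and optimizations.

def fit_stump(xs, residuals):
    """Pick the threshold on a 1-D feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # Only one distinct feature value: fall back to a constant prediction.
        mean = sum(residuals) / len(residuals)
        return xs[0], mean, mean
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted(xs, ys, n_rounds=50, learning_rate=0.3):
    """Return (base, learning_rate, stumps) fitted by gradient boosting."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

Calling `fit_boosted(xs, ys)` on a small list of feature values and targets returns a model that `predict` can evaluate. Increasing `n_rounds` lets the ensemble fit the training data more closely, which is the trade-off that the number-of-rounds and learning-rate parameters control in real boosting libraries.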

For more details on XGBoost, see the XGBoost GitHub page.

XGBoost with Spark

The following figure illustrates the new pipeline architecture with the latest XGBoost4J-Spark.

With XGBoost4J-Spark, users can work with both the low- and high-level memory abstractions in Spark, i.e. RDD and DataFrame/Dataset. The DataFrame/Dataset abstraction lets users manipulate structured datasets and use Spark's built-in routines or User Defined Functions (UDFs) to explore the value distribution in columns before feeding data into the machine learning phase of the pipeline. For example, structured sales records can be saved in a JSON file, parsed as a DataFrame through Spark's API, and fed to train an XGBoost model in two lines of Scala code.

High-level steps

You need to complete the following high-level steps:

Build XGBoost from source code
Start a Spark session with the XGBoost4J-Spark library loaded
Import the XGBoost and Spark packages
Train a simple XGBoost model
Tune hyperparameters for your XGBoost model using Spark
Explain the parameters of the XGBoost model
Save the model to Azure Storage

Please refer to the Jupyter Notebook for more detailed steps.
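As a rough illustration of the hyperparameter tuning step listed above, the sketch below enumerates every combination in a parameter grid and keeps the one with the lowest validation error. It is plain Python and not the notebook's code. The parameter names `eta` and `max_depth` are real XGBoost parameters, but the candidate values, the helper names, and the scoring function are hypothetical stand-ins; in a real run the scoring function would train a model and evaluate it on held-out data.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per combination of the candidate values."""
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(param_grid, evaluate):
    """Return (best_params, best_score); a lower score is better."""
    best_params, best_score = None, float("inf")
    for params in expand_grid(param_grid):
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical stand-in for "train with these parameters and return the
# validation error". A real run would train an XGBoost model here.
def toy_validation_error(params):
    return (params["eta"] - 0.1) ** 2 + (params["max_depth"] - 6) ** 2


grid = {"eta": [0.05, 0.1, 0.3], "max_depth": [4, 6, 8]}
best_params, best_score = grid_search(grid, toy_validation_error)
```

On a Spark cluster the same pattern is usually expressed with Spark ML's `ParamGridBuilder` and `CrossValidator`, which distribute the evaluation of each combination. Whether the notebook uses them is an assumption here.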

If you have any questions or feedback about this repo, feel free to email us (hdifeedback at microsoft dot com).
