OpenEmbedding is an open-source framework for TensorFlow distributed training acceleration.
Nowadays, many machine learning and deep learning applications are built on parameter servers, which are used to efficiently store and update model weights. When a model has a large number of sparse features (e.g., Wide&Deep and DeepFM for CTR prediction), the number of weights easily runs into billions to trillions. In such cases, traditional synchronization solutions (such as the Allreduce-based solution adopted by Horovod) cannot achieve high performance because of the massive communication overhead introduced by the tremendous number of sparse features. To achieve efficiency for such sparse models, we developed OpenEmbedding, which enhances the parameter server especially for sparse model training and inference.
Efficiency
- We propose an efficient customized sparse format to handle sparse features, together with fine-grained optimizations such as cache-conscious algorithms, asynchronous cache reads and writes, and lightweight locks that maximize parallelism. As a result, OpenEmbedding achieves a 3-8x speedup over the Allreduce-based solution for sparse model training on a single machine equipped with 8 GPUs.
Ease-of-use
- We have integrated OpenEmbedding into TensorFlow. Only three lines of code changes are required to utilize OpenEmbedding in TensorFlow for both training and inference, as sketched below.
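Concretely, the three changes have roughly the following shape, where `model` and `optimizer` are an existing Keras model and optimizer (the full sample program appears later in this document):

import openembedding.tensorflow as embed            # change 1: import OpenEmbedding

optimizer = embed.distributed_optimizer(optimizer)  # change 2: wrap the optimizer
model = embed.distributed_model(model)              # change 3: wrap the model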
Adaptability
- In addition to TensorFlow, it is straightforward to integrate OpenEmbedding into other popular frameworks. We have demonstrated the integration with DeepCTR and Horovod in the examples.
Models that contain many sparse features are difficult to accelerate with the Allreduce-based framework Horovod alone. Using OpenEmbedding together with Horovod yields better acceleration: on a single machine with 8 GPUs, the speedup ratio is 3 to 8x, and many models achieve 3 to 7 times the performance of Horovod alone.
You can install and run OpenEmbedding with the following steps. The examples show the whole process of training on the Criteo dataset with OpenEmbedding and predicting with TensorFlow Serving.
NVIDIA Docker is required to use GPUs inside the image. The OpenEmbedding image can be obtained from Docker Hub.
# The script "criteo_deepctr_standalone.sh" will train and export the model to the path "tmp/criteo/1".
# You can switch to one of the following scripts instead:
# "criteo_deepctr_horovod.sh" (multi-GPU training with Horovod),
# "criteo_deepctr_mirrored.sh" (multi-GPU training with MirroredStrategy),
# "criteo_deepctr_mpi.sh" (multi-GPU training with MultiWorkerMirroredStrategy and MPI).
docker run --rm --gpus all -v /tmp/criteo:/openembedding/tmp/criteo \
4pdosc/openembedding:latest examples/run/criteo_deepctr_standalone.sh
# Start TensorFlow Serving to load the trained model.
docker run --name serving-example -td -p 8500:8500 -p 8501:8501 \
-v /tmp/criteo:/models/criteo -e MODEL_NAME=criteo tensorflow/serving:latest
# Wait for the model server to start.
sleep 5
# Send requests and get prediction results.
docker run --rm --network host 4pdosc/openembedding:latest examples/run/criteo_deepctr_restful.sh
# Clean up the Docker container.
docker stop serving-example
docker rm serving-example
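For reference, the RESTful example queries TensorFlow Serving's standard REST predict endpoint. A minimal Python sketch of such a request follows; the feature names and values in the payload are illustrative placeholders, not the actual Criteo schema used by the script.

import json
import urllib.request

# TensorFlow Serving's standard REST predict endpoint for the model named "criteo".
url = "http://localhost:8501/v1/models/criteo:predict"

# Hypothetical payload: the real feature names and values depend on the exported model's signature.
payload = {"instances": [{"I1": 0.5, "I2": 1.2, "C1": 5, "C2": 14}]}

request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))  # e.g. {"predictions": [[0.03]]}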
# Install the dependencies required by OpenEmbedding.
apt update && apt install -y gcc-7 g++-7 python3 libpython3-dev python3-pip
pip3 install --upgrade pip
pip3 install tensorflow==2.5.1
pip3 install openembedding
# Install the dependencies required by examples.
apt install -y git cmake mpich curl
HOROVOD_WITHOUT_MPI=1 pip3 install horovod
pip3 install deepctr pandas scikit-learn mpi4py
# Download the examples.
git clone https://github.com/4paradigm/OpenEmbedding.git
cd OpenEmbedding
# The script "criteo_deepctr_standalone.sh" will train and export the model to the path "tmp/criteo/1".
# You can switch to one of the following scripts instead:
# "criteo_deepctr_horovod.sh" (multi-GPU training with Horovod),
# "criteo_deepctr_mirrored.sh" (multi-GPU training with MirroredStrategy),
# "criteo_deepctr_mpi.sh" (multi-GPU training with MultiWorkerMirroredStrategy and MPI).
examples/run/criteo_deepctr_standalone.sh
# Start TensorFlow Serving to load the trained model.
docker run --name serving-example -td -p 8500:8500 -p 8501:8501 \
-v `pwd`/tmp/criteo:/models/criteo -e MODEL_NAME=criteo tensorflow/serving:latest
# Wait for the model server to start.
sleep 5
# Send requests and get prediction results.
examples/run/criteo_deepctr_restful.sh
# Clean up the Docker container.
docker stop serving-example
docker rm serving-example
# Install the dependencies required by OpenEmbedding.
yum install -y centos-release-scl
yum install -y python3 python3-devel devtoolset-7
scl enable devtoolset-7 bash
pip3 install --upgrade pip
pip3 install tensorflow==2.5.1
pip3 install openembedding
# Install the dependencies required by examples.
yum install -y git cmake mpich curl
HOROVOD_WITHOUT_MPI=1 pip3 install horovod
pip3 install deepctr pandas scikit-learn mpi4py
# Download the examples.
git clone https://github.com/4paradigm/OpenEmbedding.git
cd OpenEmbedding
# The script "criteo_deepctr_standalone.sh" will train and export the model to the path "tmp/criteo/1".
# You can switch to one of the following scripts instead:
# "criteo_deepctr_horovod.sh" (multi-GPU training with Horovod),
# "criteo_deepctr_mirrored.sh" (multi-GPU training with MirroredStrategy),
# "criteo_deepctr_mpi.sh" (multi-GPU training with MultiWorkerMirroredStrategy and MPI).
examples/run/criteo_deepctr_standalone.sh
# Start TensorFlow Serving to load the trained model.
docker run --name serving-example -td -p 8500:8500 -p 8501:8501 \
-v `pwd`/tmp/criteo:/models/criteo -e MODEL_NAME=criteo tensorflow/serving:latest
# Wait for the model server to start.
sleep 5
# Send requests and get prediction results.
examples/run/criteo_deepctr_restful.sh
# Clean up the Docker container.
docker stop serving-example
docker rm serving-example
The installation usually requires g++ 7 or higher, or a compiler compatible with `tf.version.COMPILER_VERSION`. The compiler can be specified by the environment variables `CC` and `CXX`. Currently, OpenEmbedding can only be installed on Linux.
CC=gcc CXX=g++ pip3 install openembedding
If you upgrade TensorFlow, you need to reinstall OpenEmbedding.
pip3 uninstall openembedding && pip3 install --no-cache-dir openembedding
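After installing or reinstalling, a quick smoke test is to load the TensorFlow binding that the sample program below uses; if the import succeeds without errors, the native extension works with the installed TensorFlow.

# Verify the installation: this import loads the native extension.
import openembedding.tensorflow as embed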
A sample program for common usage is as follows. First, create the `Model` and the `Optimizer`.
import tensorflow as tf
from deepctr.models import WDL
optimizer = tf.keras.optimizers.Adam()
model = WDL(feature_columns, feature_columns, task='binary')
Next, transform them into a distributed `Model` and a distributed `Optimizer`. The `Embedding` layers will be stored on the parameter server.
import horovod.tensorflow.keras as hvd
import openembedding.tensorflow as embed
hvd.init()
optimizer = embed.distributed_optimizer(optimizer)
optimizer = hvd.DistributedOptimizer(optimizer)
model = embed.distributed_model(model)
Here, `embed.distributed_optimizer` converts the TensorFlow optimizer into an optimizer that supports the parameter server, so that the parameters on the parameter server can be updated. The function `embed.distributed_model` replaces the `Embedding` layers in the model and overrides their methods to support saving and loading with parameter servers. The method `Embedding.call` pulls the parameters from the parameter server, and its registered backpropagation function pushes the gradients to the parameter server.
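To illustrate the idea with a toy sketch (not OpenEmbedding's actual implementation), the pull/compute/push cycle looks like this; in OpenEmbedding the push is registered as the backpropagation function of the `Embedding` layer, whereas here it is written out explicitly.

import tensorflow as tf

class ToyParameterServer:
    # Illustrative in-process stand-in for a remote parameter server.
    def __init__(self, input_dim, output_dim):
        self.table = tf.Variable(tf.random.normal([input_dim, output_dim]))

    def pull(self, ids):
        # Fetch only the embedding rows that appear in the batch.
        return tf.gather(self.table, ids)

    def push(self, ids, grads, lr=0.01):
        # Sparse update: only the touched rows are modified.
        self.table.scatter_sub(tf.IndexedSlices(lr * grads, ids))

ps = ToyParameterServer(input_dim=1000, output_dim=8)
ids = tf.constant([3, 42, 3])  # duplicate ids are accumulated by scatter_sub

with tf.GradientTape() as tape:
    pulled = ps.pull(ids)                    # forward pass: pull from the server
    loss = tf.reduce_sum(tf.square(pulled))  # placeholder loss

grads = tape.gradient(loss, pulled)  # gradients w.r.t. the pulled rows only
ps.push(ids, grads)                  # backward pass: push sparse gradients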
Data parallelism is handled by Horovod.
model.compile(optimizer, "binary_crossentropy", metrics=['AUC'],
experimental_run_tf_function=False)
callbacks = [ hvd.callbacks.BroadcastGlobalVariablesCallback(0),
hvd.callbacks.MetricAverageCallback() ]
model.fit(dataset, epochs=10, verbose=2, callbacks=callbacks)
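On multiple GPUs, the training script is typically launched with Horovod's launcher, for example `horovodrun -np 8 python train.py` (the script name here is only illustrative).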
Finally, export a stand-alone SavedModel so that it can be loaded by TensorFlow Serving.
if hvd.rank() == 0:
# Must specify include_optimizer=False explicitly
model.save_as_original_model('model_path', include_optimizer=False)
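Because the export is a standard stand-alone SavedModel, it can also be inspected or loaded with vanilla TensorFlow, without the OpenEmbedding runtime. A minimal sketch, reusing the 'model_path' from above:

import tensorflow as tf

# The export is a plain SavedModel; no OpenEmbedding runtime is required to load it.
loaded = tf.saved_model.load('model_path')
print(list(loaded.signatures.keys()))  # typically ['serving_default']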
More examples are as follows.
- Replace the `Embedding` layer
- Transform the network model
- Custom subclass model
- With MirroredStrategy
- With MultiWorkerMirroredStrategy and MPI
docker build -t 4pdosc/openembedding-base:0.1.0 -f docker/Dockerfile.base .
docker build -t 4pdosc/openembedding:0.0.0-build -f docker/Dockerfile.build .
The compiler needs to be compatible with `tf.version.COMPILER_VERSION` (>= 7). Install all prpc dependencies to `tools` or `/usr/local`, and then run `build.sh` to complete the compilation. `build.sh` will automatically install prpc (pico-core) and parameter-server (pico-ps) to the `tools` directory.
git submodule update --init --checkout --recursive
pip3 install tensorflow
./build.sh clean && ./build.sh build
pip3 install ./build/openembedding-*.tar.gz
TensorFlow 2
- `dtype`: `float32`, `float64`.
- `tensorflow.keras.initializers`
  - `RandomNormal`, `RandomUniform`, `Constant`, `Zeros`, `Ones`.
  - The parameter `seed` is currently ignored.
- `tensorflow.keras.optimizers`
  - `Adadelta`, `Adagrad`, `Adam`, `Adamax`, `Ftrl`, `RMSprop`, `SGD`.
  - `decay` and `LearningRateSchedule` are not supported.
  - `Adam(amsgrad=True)` is not supported.
  - `RMSprop(centered=True)` is not supported.
  - The parameter server uses a sparse update method, which may cause different training results for optimizers with momentum (see the toy illustration after this list).
- `tensorflow.keras.layers.Embedding`
  - Supports an array for known `input_dim` and a hash table for unknown `input_dim` (2**63 range).
  - Can still be stored on workers and use the dense update method.
  - Should not use `embeddings_regularizer` or `embeddings_constraint`.
- `tensorflow.keras.Model`
  - Can be converted to a distributed `Model`, automatically ignoring or converting incompatible settings (such as `embeddings_constraint`).
  - Distributed `save`, `save_weights`, `load_weights` and `ModelCheckpoint`.
  - The distributed `Model` can be saved as a stand-alone SavedModel, which can be loaded by TensorFlow Serving.
  - Training multiple distributed `Model`s in one task is not supported.
- Can collaborate with Horovod. Training with `MirroredStrategy` or `MultiWorkerMirroredStrategy` is experimental.
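To see why momentum optimizers can differ under sparse updates, consider a toy example in plain Python (illustrative only, not OpenEmbedding code). With dense updates, a row's momentum keeps being applied even on steps where the row receives no gradient; sparse updates skip untouched rows entirely.

beta, lr = 0.9, 0.1
grads = [1.0, 0.0, 0.0]  # this embedding row receives a gradient only on the first step

# Dense momentum update: applied on every step, even with a zero gradient.
w_dense, m = 1.0, 0.0
for g in grads:
    m = beta * m + g
    w_dense -= lr * m

# Sparse momentum update: steps where the row has no gradient are skipped.
w_sparse, m = 1.0, 0.0
for g in grads:
    if g != 0.0:
        m = beta * m + g
        w_sparse -= lr * m

print(w_dense, w_sparse)  # ~0.729 vs 0.9: the two schemes diverge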
- Improve performance
- Support PyTorch training
- Support `tf.feature_column.embedding_column`
- Approximate `embedding_regularizer`, `LearningRateSchedule`, etc.
- Improve the support for `Initializer` and `Optimizer`
- Training multiple distributed `Model`s in one task
- Support ONNX
- Yiming Liu (liuyiming@4paradigm.com)
- Yilin Wang (wangyilin@4paradigm.com)
- Cheng Chen (chencheng@4paradigm.com)
- Guangchuan Shi (shiguangchuan@4paradigm.com)
- Zhao Zheng (zhengzhao@4paradigm.com)
Currently, the interface for persistent memory is experimental. PMem-based OpenEmbedding provides a lightweight checkpointing scheme with performance comparable to the DRAM version. For long-running deep learning recommendation model training, PMem-based OpenEmbedding provides not only an efficient but also a reliable training process.
- OpenEmbedding: A Distributed Parameter Server for Deep Learning Recommendation Models using Persistent Memory. Cheng Chen, Yilin Wang, Jun Yang, Yiming Liu, Mian Lu, Zhao Zheng, Bingsheng He, Weng-Fai Wong, Liang You, Penghao Sun, Yuping Zhao, Fenghua Hu, and Andy Rudoff. In the 39th IEEE International Conference on Data Engineering (ICDE), 2023.