十二、分布式 TensorFlow 和云深度学习

在本章中，我们将讨论使用分布式 TensorFlow 和云深度学习。我们将为您提供一些秘籍：

使用 TensorFlow 和 GPU
使用分布式 TensorFlow：多个 GPU 和一个 CPU
使用分布式 TensorFlow：多台服务器
训练分布式 TensorFlow MNIST 分类器
使用 TensorFlow Serving 和 Docker
使用 Compute Engine 在 Google Cloud（GCP）上运行分布式 TensorFlow
在 Google CloudML 上运行分布式 TensorFlow
在 Microsoft Azure 上运行分布式 TensorFlow
在 Amazon AWS 上运行分布式 TensorFlow

介绍

每个 TensorFlow 计算均以图描述。这允许在结构和操作的放置方面具有自然的灵活性，可以在计算的分布式节点之间拆分操作。该图可以拆分为多个子图，这些子图分配给服务器集群中的不同节点。

我强烈建议读者看一看大型分布式深度网络（Jeffrey Dean，Greg S.Corrado，Rajat Monga，Kai Chen，Matthieu Devin，Quoc V.Le，Mark Z.Mao，Marc'Aurelio Ranzato，Andrew Senior ，Paul Tucker，Ke Yang 和 Andrew Y. Ng。 NIPS，2012 年）

本文的一项关键结果是证明可以运行分布式随机梯度下降（SGD），其中多个节点在数据碎片上并行工作，并且通过将更新发送到参数服务器，来实现梯度的独立和异步更新。引用论文摘要：

我们的实验揭示了关于大规模非凸优化的几个令人惊讶的结果。首先，异步 SGD 很少用于非凸问题，对于训练深度网络非常有效，特别是与 Adagrad 自适应学习率结合使用时。

下图取自论文本身，很好地说明了这一点：

来自参数服务器的分布式梯度下降示例

您应该阅读的另一份文档是白皮书《TensorFlow：异构分布式系统上的大规模机器学习》（MartínAbadi 等人，2015 年 11 月）

考虑其中包含的一些示例，我们可以在下面的图片的左侧看到 TensorFlow 代码的片段，然后将其表示为右侧的图形：

TensorFlow 图的示例

通过进行本地计算并在需要时透明地将远程通信节点添加到图，可以在多个节点之间划分图。下图中对此作了很好的解释，该图仍取自前面提到的论文：

摘自以下内容的分布式 TensorFlow 图计算示例

可以通过集中方式（下图的左侧）或分布式方式（右侧）来计算梯度下降和所有主要的优化器算法。后者涉及一个主进程，该主进程与调配 GPU 和 CPU 的多个工作器进行对话：

摘自以下内容的单机和分布式系统结构示例和分布式 TensorFlow 图计算示例

分布式计算既可以是同步的（所有工作程序都在同时更新分片数据上的梯度），也可以是异步的（更新没有同时发生）。后者通常可以实现更高的可伸缩性，并且在收敛到最佳解决方案方面，较大的图计算仍然可以很好地工作。同样，这些图片均来自 TensorFlow 白皮书，如果您想了解更多信息，我强烈建议有兴趣的读者阅读本文：

同步和异步数据并行训练的示例

使用 TensorFlow 和 GPU

在本秘籍中，我们将学习如何在 GPU 上使用 TensorFlow：执行的操作是在 CPU 或 GPU 上的简单矩阵乘法。

准备

第一步是安装支持 GPU 的 TensorFlow 版本。正式的 TensorFlow 安装说明是您的起点。请记住，您需要一个通过 CUDA 或 CuDNN 支持 GPU 的环境。

操作步骤

我们按以下步骤进行：

首先导入一些模块

import sys
import numpy as np
import tensorflow as tf
from datetime import datetime

从命令行获取要使用的处理单元的类型（"gpu"或"cpu"）

device_name = sys.argv[1] # Choose device from cmd line. Options: gpu or cpu
shape = (int(sys.argv[2]), int(sys.argv[2]))
if device_name == "gpu":
  device_name = "/gpu:0"
else:
  device_name = "/cpu:0"

在 GPU 或 CPU 上执行矩阵乘法。按键说明与tf.device(device_name)一起使用。它创建一个新的上下文管理器，告诉 TensorFlow 在 GPU 或 CPU 上执行这些操作

with tf.device(device_name):
  random_matrix = tf.random_uniform(shape=shape, minval=0, maxval=1)
  dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix))
  sum_operation = tf.reduce_sum(dot_operation)

startTime = datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
  result = session.run(sum_operation)
  print(result)

4.打印一些调试时间，只是为了验证 CPU 和 GPU 有什么区别

print("Shape:", shape, "Device:", device_name)
print("Time taken:", datetime.now() - startTime)

工作原理

本秘籍说明了如何将 TensorFlow 计算分配给 CPU 或 GPU。该代码非常简单，它将用作下一个秘籍的基础。

使用分布式 TensorFlow：多个 GPU 和一个 CPU

我们将展示一个数据并行化的示例，其中数据被拆分到多个 GPU 中

准备

此秘籍的灵感来自尼尔·特内霍尔茨（Neil Tenenholtz）撰写的一篇不错的博客文章，该文章可在线获取

操作步骤

我们按以下步骤进行：

考虑这段在单个 GPU 上运行矩阵乘法的代码。

# single GPU (baseline)
import tensorflow as tf
# place the initial data on the cpu
with tf.device('/cpu:0'):
  input_data = tf.Variable([[1., 2., 3.],
    [4., 5., 6.],
    [7., 8., 9.],
    [10., 11., 12.]])
b = tf.Variable([[1.], [1.], [2.]])

# compute the result on the 0th gpu
with tf.device('/gpu:0'):
  output = tf.matmul(input_data, b)

# create a session and run
with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  print sess.run(output)

如图 2 所示，在 2 个不同的 GPU 之间使用代码复制对代码进行分区。请注意，CPU 充当主节点来分发图并收集最终结果。

# in-graph replication
import tensorflow as tf
num_gpus = 2
# place the initial data on the cpu
with tf.device('/cpu:0'):
  input_data = tf.Variable([[1., 2., 3.],
   [4., 5., 6.],
   [7., 8., 9.],
   [10., 11., 12.]])
   b = tf.Variable([[1.], [1.], [2.]])

# split the data into chunks for each gpu
inputs = tf.split(input_data, num_gpus)
outputs = []

# loop over available gpus and pass input data
for i in range(num_gpus):
  with tf.device('/gpu:'+str(i)):
    outputs.append(tf.matmul(inputs[i], b))

# merge the results of the devices
with tf.device('/cpu:0'):
  output = tf.concat(outputs, axis=0)

# create a session and run
with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  print sess.run(output)

工作原理

这是一个非常简单的方法，其中通过将 CPU 用作主对象并将其分配到两个用作分布式工作程序的 GPU，将图分为两部分。计算结果被收集回 CPU。

使用分布式 TensorFlow：多台服务器

在本秘籍中，我们将学习如何在多个服务器之间分配 TensorFlow 计算。关键假设是工作程序和参数服务器的代码都相同。因此，每个计算节点的角色都传递给命令行参数。

准备

同样，此秘籍的灵感来自尼尔·特纳霍尔茨（Neil Tenenholtz）撰写的一篇不错的博客文章，该文章可在线获取

操作步骤

我们按以下步骤进行：

考虑这段代码，在此我们指定集群架构，其中一个主服务器运行在192.168.1.1:1111上，两个工作服务器分别运行在192.168.1.2:1111和192.168.1.3:1111上。

import sys
import tensorflow as tf

# specify the cluster's architecture
cluster = tf.train.ClusterSpec({'ps': ['192.168.1.1:1111'],
     'worker': ['192.168.1.2:1111',
     '192.168.1.3:1111']
})

请注意，代码是在多台机器上复制的，因此了解当前执行节点的作用很重要。我们从命令行获得此信息。机器可以是工作程序，也可以是参数服务器（ps）。

# parse command-line to specify machine
job_type = sys.argv[1] # job type: "worker" or "ps"
task_idx = sys.argv[2] # index job in the worker or ps list
# as defined in the ClusterSpec

在给定群集的情况下运行训练服务器，我们为每个计算赋予一个角色（工作者或ps）和一个 ID。

# create TensorFlow Server. This is how the machines communicate.
server = tf.train.Server(cluster, job_name=job_type, task_index=task_idx)

根据特定计算节点的角色，计算是不同的：
- 如果角色是参数服务器，则条件是要加入服务器。请注意，在这种情况下，没有代码要执行，因为工作进程将不断推送更新，并且参数服务器唯一要做的就是等待。
- 否则，工作程序代码将在集群中的特定设备上执行。这部分代码类似于在一台机器上执行的代码，在该机器上我们首先构建模型，然后在本地对其进行训练。请注意，所有工作分配和更新结果的收集都是由 Tensoflow 透明完成的。请注意，TensorFlow 提供了方便的tf.train.replica_device_setter，可自动将操作分配给设备。

# parameter server is updated by remote clients.
# will not proceed beyond this if statement.
if job_type == 'ps':
  server.join()
else:
  # workers only
  with tf.device(tf.train.replica_device_setter(
    worker_device='/job:worker/task:'+task_idx,
    cluster=cluster)):
# build your model here as if you only were using a single machine

with tf.Session(server.target):
  # train your model here

工作原理

在本秘籍中，我们已经看到了如何创建具有多个计算节点的集群。节点既可以扮演参数服务器的角色，也可以扮演工作者的角色。

在这两种情况下，根据从命令行收集的参数，执行的代码相同，但是代码的执行不同。参数服务器仅需要等待，直到工作器发送更新。请注意，tf.train.replica_device_setter(..)的作用是将操作自动分配给可用设备，而tf.train.ClusterSpec(..)则用于集群设置。

训练分布式 TensorFlow MNIST 分类器

此秘籍用于以分布式方式训练完整的 MNIST 分类器。该秘籍的灵感来自这个页面中的博客文章，此处提供了在 TensorFlow 1.2 上运行的代码

准备

此秘籍基于上一个秘籍。因此，按顺序阅读它们可能会很方便。

操作步骤

我们按以下步骤进行：

导入一些标准模块并定义运行计算的 TensorFlow 集群。然后为特定任务启动服务器

import tensorflow as tf
import sys
import time
# cluster specification
parameter_servers = ["pc-01:2222"]
workers = [ "pc-02:2222",
"pc-03:2222",
"pc-04:2222"]
cluster = tf.train.ClusterSpec({"ps":parameter_servers, "worker":workers})
# input flags
tf.app.flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")FLAGS = tf.app.flags.FLAGS
# start a server for a specific task
server = tf.train.Server(
  cluster,
  job_name=FLAGS.job_name,
  task_index=FLAGS.task_index)

读取 MNIST 数据并定义用于训练的超参数

# config
batch_size = 100
learning_rate = 0.0005
training_epochs = 20
logs_path = "/tmp/mnist/1"
# load mnist data set
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

检查您的角色是参数服务器还是工作器。如果工作器随后定义了一个简单的密集神经网络，则定义一个优化器以及用于评估分类器的度量（例如准确率）。

if FLAGS.job_name == "ps":
  server.join()
elif FLAGS.job_name == "worker":
# Between-graph replication
with tf.device(tf.train.replica_device_setter(
  worker_device="/job:worker/task:%d" % FLAGS.task_index,
  cluster=cluster)):
# count the number of updates
  global_step = tf.get_variable( 'global_step', [], initializer =      tf.constant_initializer(0),
trainable = False)

  # input images
  with tf.name_scope('input'):
    # None -> batch size can be any size, 784 -> flattened mnist image
    x = tf.placeholder(tf.float32, shape=[None, 784], name="x-input")
    # target 10 output classes
    y_ = tf.placeholder(tf.float32, shape=[None, 10], name="y-input")

  # model parameters will change during training so we use tf.Variable
  tf.set_random_seed(1)
  with tf.name_scope("weights"):
    W1 = tf.Variable(tf.random_normal([784, 100]))
    W2 = tf.Variable(tf.random_normal([100, 10]))

  # bias
  with tf.name_scope("biases"):
    b1 = tf.Variable(tf.zeros([100]))
    b2 = tf.Variable(tf.zeros([10]))

  # implement model
  with tf.name_scope("softmax"):
    # y is our prediction
    z2 = tf.add(tf.matmul(x,W1),b1)
    a2 = tf.nn.sigmoid(z2)
    z3 = tf.add(tf.matmul(a2,W2),b2)
    y = tf.nn.softmax(z3)

  # specify cost function
  with tf.name_scope('cross_entropy'):
    # this is our cost
    cross_entropy = tf.reduce_mean(
-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

  # specify optimizer
  with tf.name_scope('train'):
    # optimizer is an "operation" which we can execute in a session
    grad_op = tf.train.GradientDescentOptimizer(learning_rate)
    train_op = grad_op.minimize(cross_entropy, global_step=global_step)

  with tf.name_scope('Accuracy'):
    # accuracy
    correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

  # create a summary for our cost and accuracy
  tf.summary.scalar("cost", cross_entropy)
  tf.summary.scalar("accuracy", accuracy)
  # merge all summaries into a single "operation" which we can execute in a session
  summary_op = tf.summary.merge_all()
  init_op = tf.global_variables_initializer()
  print("Variables initialized ...")

启动一个主管，该主管充当分布式设置的首席计算机。首要的是负责管理集群其余所有部分的工作机。会话由负责人维护，关键指令为sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0))。同样，使用prepare_or_wait_for_session(server.target)，主管将等待模型准备就绪可以使用。请注意，每个工作器将处理不同的批量模型，然后最终模型可用于主管。

sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
begin_time = time.time()
frequency = 100
with sv.prepare_or_wait_for_session(server.target) as sess:
  # create log writer object (this will log on every machine)
  writer = tf.summary.FileWriter(logs_path, graph=tf.get_default_graph())
  # perform training cycles
  start_time = time.time()
  for epoch in range(training_epochs):
    # number of batches in one epoch
    batch_count = int(mnist.train.num_examples/batch_size)
    count = 0
    for i in range(batch_count):
      batch_x, batch_y = mnist.train.next_batch(batch_size)
      # perform the operations we defined earlier on batch
      _, cost, summary, step = sess.run(
     [train_op, cross_entropy, summary_op, global_step],
     feed_dict={x: batch_x, y_: batch_y})
     writer.add_summary(summary, step)
     count += 1
     if count % frequency == 0 or i+1 == batch_count:
       elapsed_time = time.time() - start_time
       start_time = time.time()
       print("Step: %d," % (step+1),
        " Epoch: %2d," % (epoch+1), " Batch: %3d of %3d," % (i+1, batch_count),
       " Cost: %.4f," % cost,
       "AvgTime:%3.2fms" % float(elapsed_time*1000/frequency))
       count = 0
    print("Test-Accuracy: %2.2f" % sess.run(accuracy, feed_dict={x:  mnist.test.images, y_: mnist.test.labels}))
    print("Total Time: %3.2fs" % float(time.time() - begin_time))
    print("Final Cost: %.4f" % cost)
  sv.stop()
  print("done")

工作原理

本秘籍描述了分布式 MNIST 分类器的示例。在此示例中，TensorFlow 允许我们定义三台机器的集群。一台充当参数服务器，另外两台机器用作工作器，负责处理不同批次的训练数据。

使用 TensorFlow Serving 和 Docker

在本秘籍中，我们将展示如何为 TensorFlow Serving 运行 Docker 容器，这是一组组件以导出经过训练的 TensorFlow 模型并使用标准tensorflow_model_server为其提供服务。 TensorFlow Serving 服务器发现新导出的模型并运行 gRPC 服务来为其服务。

准备

我们将使用 Docker，并假设您熟悉该系统。如果没有，请确保查看并安装。我们要做的是构建 TF 服务版本。

操作步骤

这是我们进行秘籍的方法：

从这里下载Dockerfile.devel
通过运行以下来构建容器

docker build --pull -t $USER/tensorflow-serving-devel -f Dockerfile.devel

运行容器

docker run -it $USER/tensorflow-serving-devel

克隆 TensorFlow 服务，配置和测试服务器

git clone --recurse-submodules https://github.com/tensorflow/serving
cd serving/tensorflow
./configure
cd ..
bazel test tensorflow_serving/...

现在，让我们看一个保存模型的示例，以便服务器可以保存它。此步骤的灵感来自用于构建 MNIST 训练器和服务模型的示例。第一步是将构建器导入为saved_model_builder。然后，大部分工作由SavedModelBuilder()完成，该工作将已训练模型的快照保存到可靠的存储中。请注意，此处export_path是/tmp/mnist_model/

from tensorflow.python.saved_model import builder as saved_model_builder
...
export_path_base = sys.argv[-1]
export_path = os.path.join(
  compat.as_bytes(export_path_base),
  compat.as_bytes(str(FLAGS.model_version)))
print 'Exporting trained model to', export_path
builder = saved_model_builder.SavedModelBuilder(export_path)
builder.add_meta_graph_and_variables(
  sess, [tag_constants.SERVING],
  signature_def_map={
    'predict_images':
    prediction_signature,
    signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
    classification_signature,
  },
  legacy_init_op=legacy_init_op)
builder.save()

然后可以通过简单的命令为模型提供服务

tensorflow_model_server --port=9000 --model_name=mnist --model_base_path=/tmp/mnist_model/

工作原理

Google 在 2016 年 2 月发布了 TensorFlow Serving，这是一种针对机器学习模型的高性能服务系统，专为生产环境而设计。截至 2017 年 8 月，谷歌内部有 800 多个使用 TensorFlow Serving 的项目投入生产。

使用 Compute Engine 在 Google Cloud（GCP）上运行分布式 TensorFlow

在本秘籍中，我们将学习如何在 Google Cloud（GCP）上使用 Google Tensorflow。我们将要审查的示例是经典 MNIST。

准备

看看 GCP 在 GoogleCloud 中的工作方式会很好。请注意，GCP 提供 300 美元的免费赠金，以开始使用任何 GCP 产品。此外，在免费试用期间和之后，某些产品针对符合条件的客户可以免费使用。（优惠如有更改，请参见这个页面。

操作步骤

这是我们进行秘籍的方法：

通过网络控制台创建一个新的 Google Cloud 项目。

单击创建项目时，将显示以下屏幕：

通过选择控制台左侧栏上的相关语音，为该项目启用计费。然后为该项目启用 Compute Engine 和 Cloud Machine Learning API：

登录到网络 cloudshell。

从控制台运行以下命令，以配置将在其中执行计算的区域，下载示例代码以及创建用于运行代码的 VM。最后连接到机器：

gcloud config set compute/zone us-east1-c
gcloud config set project [YOUR_PROJECT_ID]
git clone https://github.com/GoogleCloudPlatform/cloudml-dist-mnist-example
cd cloudml-dist-mnist-example
gcloud compute instances create template-instance \
 --image-project ubuntu-os-cloud \
 --image-family ubuntu-1604-lts \
 --boot-disk-size 10GB \
 --machine-type n1-standard-1
gcloud compute ssh template-instance

现在，登录到机器后，我们需要通过使用这些命令安装 PIP 和 TensorFlow 来设置环境。

sudo apt-get update
sudo apt-get -y upgrade \
 && sudo apt-get install -y python-pip python-dev
sudo pip install tensorflow
sudo pip install --upgrade tensorflow

我们将有多个工作器对 MNIST 数据进行操作，因此最好的方法是创建一个在所有工作器之间共享的存储桶，并在该存储桶中复制 MNIST 数据

BUCKET="mnist-$RANDOM-$RANDOM"
gsutil mb -c regional -l us-east1 gs://${BUCKET}
sudo ./scripts/create_records.py
gsutil cp /tmp/data/train.tfrecords gs://${BUCKET}/data/
gsutil cp /tmp/data/test.tfrecords gs://${BUCKET}/data/

现在，我们将创建多个工作程序（worker-0，worker-1），它们是初始模板实例计算机的克隆。我们不希望计算机在关闭时删除磁盘，因此这就是我们拥有第一个命令的原因。

gcloud compute instances set-disk-auto-delete template-instance \
 --disk template-instance --no-auto-delete
gcloud compute instances delete template-instance
gcloud compute images create template-image \
 --source-disk template-instance
gcloud compute instances create \
 master-0 worker-0 worker-1 ps-0 \
 --image template-image \
 --machine-type n1-standard-4 \
 --scopes=default,storage-rw

最后一步是运行用于分布式训练的计算。

./scripts/start-training.sh gs://${BUCKET}

工作原理

演示脚本将代码移动到每个 VM 并开始分布式计算。这两个工作程序在公共存储桶上共享的相同 MNIST 数据上并行运行。计算结束后，脚本将打印训练后的模型的位置。

在 Google CloudML 上运行分布式 TensorFlow

CloudML 是 Google 运行的 Tensorflow 的托管版本。无需自己运行 TF，您只需使用 CloudML 即可，而无需考虑与基础架构和可伸缩性相关的所有问题。

准备

在这里，我们假设您已经创建了一个 Cloud Platform Project，为您的项目启用计费，并启用了 Google Compute Engine 和 Cloud Machine Learning API。这些步骤类似于先前秘籍中描述的步骤。此秘籍的灵感来自这个页面中提供的 MNIST 训练代码。

操作步骤

我们继续在 Google CloudML 上运行分布式 TensorFlow：

第一步只是下载示例代码

git clonehttps://github.com/GoogleCloudPlatform/cloudml-dist-mnist-example
cd cloudml-dist-mnist-example

然后，我们下载数据并将其保存在 GCP 存储桶中

PROJECT_ID=$(gcloud config list project --format "value(core.project)")
BUCKET="${PROJECT_ID}-ml"
gsutil mb -c regional -l us-central1 gs://${BUCKET}
./scripts/create_records.py
gsutil cp /tmp/data/train.tfrecords gs://${BUCKET}/data/
gsutil cp /tmp/data/test.tfrecords gs://${BUCKET}/data/

提交训练工作非常简单：我们可以使用 CloudML 引擎简单地调用训练步骤。在此示例中，训练人员代码在us-central1区域中运行 1000 次迭代。输入数据是从存储桶中提取的，而输出桶将被提交到其他存储桶中。

JOB_NAME="job_$(date +%Y%m%d_%H%M%S)"
gcloud ml-engine jobs submit training ${JOB_NAME} \
 --package-path trainer \
 --module-name trainer.task \
 --staging-bucket gs://${BUCKET} \
 --job-dir gs://${BUCKET}/${JOB_NAME} \
 --runtime-version 1.2 \
 --region us-central1 \
 --config config/config.yaml \
 -- \
 --data_dir gs://${BUCKET}/data \
 --output_dir gs://${BUCKET}/${JOB_NAME} \
 --train_steps 10000

如果需要，您可以通过访问这个页面中的 CloudML 控制台来控制训练过程。
训练结束后，就可以直接从 CloudML 服务模型了

MODEL_NAME=MNIST
gcloud ml-engine models create --regions us-central1 ${MODEL_NAME}
VERSION_NAME=v1
ORIGIN=$(gsutil ls gs://${BUCKET}/${JOB_NAME}/export/Servo | tail -1)
gcloud ml-engine versions create \
 --origin ${ORIGIN} \
 --model ${MODEL_NAME} \
${VERSION_NAME}
gcloud ml-engine versions set-default --model ${MODEL_NAME} ${VERSION_NAME}

在线提供模型后，便可以访问服务器并做出预测。通过使用脚本make_request.py创建request.json，该脚本从 MNIST 读取数据，执行一次热编码，然后使用格式正确的 json 模式编写功能。

gcloud ml-engine predict --model ${MODEL_NAME} --json-instances request.json

工作原理

CloudML 是使用 Google 运行的托管版本的 TensorFlow 的便捷解决方案。除了直接照顾基础设施和运营，还可以直接专注于开发机器学习模型。

在 Microsoft Azure 上运行分布式 TensorFlow

Microsoft Azure 提供了一项名为 Batch AI 的服务，该服务使我们能够在 Azure 虚拟机的群集上运行机器学习模型。

准备

第一步，我们需要一个 Azure 帐户：如果您还没有 Azure 帐户，则可以在此处免费创建一个帐户。 Azure 向新用户提供为期 30 天的 200 美元信用额度。此秘籍将遵循 Microsoft Azure 提供的示例，以使用 TensorFlow 分布式在两个 GPU 上运行 MNIST，相关代码可在 Github 上获得

操作步骤

我们按以下步骤进行：

第一步是安装 Azure CLI。可在此处获得在不同 OS 平台上安装 Azure CLI 的详细信息。
在创建群集之前，您需要使用命令az login登录 Azure。它将生成一个代码，并为您提供一个网站地址，将要求您在其中验证您的凭据，一旦网站上的所有步骤都结束，将要求您关闭并验证您的 az 凭据。
配置默认位置，创建和配置资源组。

az group create --name myResourceGroup --location eastus
az configure --defaults group=myResourceGroup
az configure --defaults location=eastus

接下来，我们将需要使用az命令创建存储，并根据您的操作系统设置环境变量，有关环境变量及其值的详细信息，请参见这里
下载并提取预处理的 MNIST 数据库

wget "https://batchaisamples.blob.core.windows.net/samples/mnist_dataset_original.zip?st=2017-09-29T18%3A29%3A00Z&se=2099-12-31T08%3A00%3A00Z&sp=rl&sv=2016-05-31&sr=b&sig=Qc1RA3zsXIP4oeioXutkL1PXIrHJO0pHJlppS2rID3I%3D" -O mnist_dataset_original.zip
unzip mnist_dataset_original.zip

下载mnist_replica

wget "https://raw.githubusercontent.com/Azure/BatchAI/master/recipes/TensorFlow/TensorFlow-GPU-Distributed/mnist_replica.py?token=AcZzrcpJGDHCUzsCyjlWiKVNfBuDdkqwks5Z4dPrwA%3D%3D" -O mnist_replica.py

接下来，创建一个 Azure 文件共享，在其中上传下载的 MNIST 数据集和mnist_replica.py文件。

az storage share create --name batchaisample
az storage directory create --share-name batchaisample --name mnist_dataset
az storage file upload --share-name batchaisample --source t10k-images-idx3-ubyte.gz --path mnist_dataset
az storage file upload --share-name batchaisample --source t10k-labels-idx1-ubyte.gz --path mnist_dataset
az storage file upload --share-name batchaisample --source train-images-idx3-ubyte.gz --path mnist_dataset
az storage file upload --share-name batchaisample --source train-labels-idx1-ubyte.gz --path mnist_dataset
az storage directory create --share-name batchaisample --name tensorflow_samples
az storage file upload --share-name batchaisample --source mnist_replica.py --path tensorflow_samples

现在，我们为该秘籍创建一个集群，该集群由具有 standard_NC6 大小的两个 GPU 节点组成，具有标准的 Ubuntu LTS 或 Ubuntu DVSM。可以使用 Azure CLI 命令创建群集：

对于 Linux：

az batchai cluster create -n nc6 -i UbuntuDSVM -s Standard_NC6 --min 2 --max 2 --afs-name batchaisample --afs-mount-path external -u $USER -k ~/.ssh/id_rsa.pub

对于 Windows：

az batchai cluster create -n nc6 -i UbuntuDSVM -s Standard_NC6 --min 2 --max 2 --afs-name batchaisample --afs-mount-path external -u <user_name> -p <password>

下一步是在job.json文件中创建作业创建参数：

{
  "properties": {
    "nodeCount": 2,
    "tensorFlowSettings": {
      "parameterServerCount": 1,
      "workerCount": 2,
      "pythonScriptFilePath": "$AZ_BATCHAI_INPUT_SCRIPT/mnist_replica.py",
      "masterCommandLineArgs": "--job_name=worker --num_gpus=1 --ps_hosts=$AZ_BATCHAI_PS_HOSTS --worker_hosts=$AZ_BATCHAI_WORKER_HOSTS --task_index=$AZ_BATCHAI_TASK_INDEX --data_dir=$AZ_BATCHAI_INPUT_DATASET --output_dir=$AZ_BATCHAI_OUTPUT_MODEL",
      "workerCommandLineArgs": "--job_name=worker --num_gpus=1 --ps_hosts=$AZ_BATCHAI_PS_HOSTS --worker_hosts=$AZ_BATCHAI_WORKER_HOSTS --task_index=$AZ_BATCHAI_TASK_INDEX --data_dir=$AZ_BATCHAI_INPUT_DATASET --output_dir=$AZ_BATCHAI_OUTPUT_MODEL",
      "parameterServerCommandLineArgs": "--job_name=ps --num_gpus=0 --ps_hosts=$AZ_BATCHAI_PS_HOSTS --worker_hosts=$AZ_BATCHAI_WORKER_HOSTS --task_index=$AZ_BATCHAI_TASK_INDEX --data_dir=$AZ_BATCHAI_INPUT_DATASET --output_dir=$AZ_BATCHAI_OUTPUT_MODEL"
},
     "stdOutErrPathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/external",
     "inputDirectories": [{
     "id": "DATASET",
     "path": "$AZ_BATCHAI_MOUNT_ROOT/external/mnist_dataset"
    }, {
    "id": "SCRIPT",
    "path": "$AZ_BATCHAI_MOUNT_ROOT/external/tensorflow_samples"
    }],
     "outputDirectories": [{
       "id": "MODEL",
       "pathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/external",
     "pathSuffix": "Models"
   }],
    "containerSettings": {
      "imageSourceRegistry": {
      "image": "tensorflow/tensorflow:1.1.0-gpu"
    }
  }
 }
}

最后，使用以下命令创建 Batch AI 作业：

az batchai job create -n distibuted_tensorflow --cluster-name nc6 -c job.json

工作原理

Batch AI 自行管理资源，您只需要指定作业，输入的位置以及存储输出的位置即可。如果在执行作业期间想要查看结果，则可以使用以下命令：

az batchai job stream-file --job-name myjob --output-directory-id stdouterr --name stderr.txt

作业结束后，您可以使用az batchai job delete和az batchai cluster delete命令删除作业和群集。

在 Amazon AWS 上运行分布式 TensorFlow

Amazon AWS 提供了配备 NVIDIA K8 GPU 的 P2.x 机器。为了能够使用，第一步再次涉及创建一个 Amazon AWS 账户。如果还没有，可以使用以下链接创建它。登录帐户后，仪表板将如下所示：

您可以看到 Amazon AWS 提供了许多服务，但是在这里，我们担心使用 Amazon AWS 进行深度学习。

GPU 仅在 P2 实例创建中可用，并且默认情况下不可用，要获得此服务，您必须通过 AWS 支持筹集更多资源，一旦获得支持，该支持就会显示在右上角，您将看到一个按钮，创建案例，选择按钮并进行以下选择：

您可以选择提供 p2.xlarge 实例的任何区域。新的限制值决定了您可以使用的实例的最大数量，请谨慎选择数量，因为该服务不是免费的，并且每小时的费用约为 1 美元。 AWS 大约需要 24 小时才能响应请求。

准备

现在您已经可以使用 AWS 账户和 p2.xlarge 实例，现在就可以从 EC2 仪表板创建自己的实例了。第一步是选择机器映像，到目前为止，Amazon 提供了预装有深度学习库的特殊机器映像，您可以选择 Ubuntu 或 Linux 版本。接下来，选择 GPU 以计算实例类型。

您可以使用默认参数查看并启动实例，也可以配置设置，选择存储并配置安全组。配置安全组很重要，默认情况下已设置 SSH 安全组，但是如果要使用 Jupyter 笔记本，则需要为端口 8888 添加自定义安全组，可以选择从中登录的源。对于实例，有三种可能性：“自定义”，“随处可见”或“我的 IP”：

最后，在启动实例时，将要求您创建一个密钥对，以允许您登录到特定实例，您将需要创建一个密钥对并下载相应的.pem文件，您将使用文件以便后续记录。

操作步骤

第一步是连接到实例，您可以使用命令行通过 SSH 或浏览器来实现。我们使用 CLI。
要进行连接，首先更改.pem文件的模式。

chmod 400 <security-keypair-filename>.pem

接下来，使用以下命令对实例进行 SSH。选择连接时，确切的地址将在仪表板上可用：

ssh -i " <security-keypair-filename>.pem" ubuntu@ec2-XXXXXXXXXXXXX.compute-1.amazonaws.com

我们选择的机器实例已经包含了所有深度学习库，包括 TensorFlow，因此我们不需要安装任何东西：

每个文件夹都包含一个自述文件，该文件描述了如何使用相应的库：

工作原理

您可以运行在我们创建的实例上学到的代码。工作结束后，不要忘记退出，并从仪表板停止实例。有关价格和使用情况的更多详细信息，请参见这里

Files

12.md

Latest commit

History

12.md

File metadata and controls

十二、分布式 TensorFlow 和云深度学习

介绍

使用 TensorFlow 和 GPU

准备

操作步骤

工作原理

使用分布式 TensorFlow：多个 GPU 和一个 CPU

准备

操作步骤

工作原理

使用分布式 TensorFlow：多台服务器

准备

操作步骤

工作原理

更多

训练分布式 TensorFlow MNIST 分类器

准备

操作步骤

工作原理

使用 TensorFlow Serving 和 Docker

准备

操作步骤

工作原理

更多

使用 Compute Engine 在 Google Cloud（GCP）上运行分布式 TensorFlow

准备

操作步骤

工作原理

更多

在 Google CloudML 上运行分布式 TensorFlow

准备

操作步骤

工作原理

更多

在 Microsoft Azure 上运行分布式 TensorFlow

准备

操作步骤

工作原理

更多

在 Amazon AWS 上运行分布式 TensorFlow

准备

操作步骤

工作原理

更多