# -*- Mode: sh; sh-basic-offset:2 ; indent-tabs-mode:nil -*-
#
# Copyright (c) 2017 Los Alamos National Security, LLC. All rights
# reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
Notes on using DataWarp on LANL Cray systems
============================================
Last Update 27 March 2017
Cornell Wright
DataWarp is a Cray product that provides a high speed storage device based on
SSD technologies that is more closely integrated with the compute nodes on an
HPC system than a traditional parallel file system (PFS).
A full description of DataWarp is available in the documents:
XC(TM) Series DataWarp(TM) User Guide (CLE 6.0.UP02) S-2558 (CLE 6.0.UP02.) and
XC(TM) Series DataWarp(TM) Installation and Administration Guide (CLE 6.0.UP02) S-2564 (CLE 6.0.UP02.)
both of which are available on the web site: https://pubs.cray.com.
A Brief Description
-------------------
DataWarp is implemented as a set of service nodes, each of which has about 6
TiB of SSD installed. When requested by a directive in a job script, a
DataWarp filesystem is initialized on some or all of the DataWarp nodes and
made available to the job's compute nodes via a mount on each compute node.
The DataWarp filesystem can operate in a variety of modes, including both
scratch and cache; this document focuses only on "striped scratch" mode,
which is suitable for checkpoints.
In addition to the job script directive that allocates the DataWarp filesystem,
further directives are supported to request pre-job stage-in of files or
post-job stage-out of files between DataWarp and the PFS.
Examples of those directives:
#DW jobdw type=scratch access_mode=striped capacity=100GiB
#DW stage_in destination=$DW_JOB_STRIPED/<dir> source=<pfs_dir> type=directory
#DW stage_out source=$DW_JOB_STRIPED/<dir> destination=<pfs_dir> type=directory
See the notes under Known Problems and Gotchas about #DW directives.
DataWarp Usage
--------------
When a DataWarp filesystem has been allocated, its mount point can be located
via environment variable DW_JOB_STRIPED. This variable is set both on the
internal login node and on all the compute nodes. DataWarp itself, however, is
accessible only from the compute nodes, so a job script that needs to access
DataWarp must do so via aprun.
For example, to make a directory in DataWarp, issue the command:
aprun -n 1 mkdir $DW_JOB_STRIPED/subdir
The compute nodes see DataWarp as a normal POSIX-compatible filesystem. In
addition to the POSIX functionality, files and directories can be transferred
between DataWarp and the PFS independent of the compute nodes. These transfers
can be invoked as mentioned above by job script directives or by DataWarp API
calls made by the application. They are documented in the User Guide.
When a DataWarp job is submitted, Moab automatically submits 3 additional
jobs, called stage-in, stage-out and coordinating. For a DataWarp job, the
message that msub prints to stdout, which normally contains just the job ID,
instead contains the stage-in, compute and stage-out job IDs, in that order.
Like this:
msub /lustre/ttscratch1/cornell/run_dw_simple_20170111.130847/dw_simple_job.sh
55983.datawarp-stagein 55984 55983.datawarp-stageout
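If a wrapper script needs those IDs, they can be captured from the msub
output. A minimal sketch (dw_job.sh and the variable names are placeholders
of mine; it assumes the three IDs appear on one line as shown above):

  # Capture the three job IDs that msub prints for a DataWarp job
  ids=$(msub dw_job.sh)
  stagein_id=$( echo $ids | awk '{print $1}')
  compute_id=$( echo $ids | awk '{print $2}')
  stageout_id=$(echo $ids | awk '{print $3}')
  echo "stage-in: $stagein_id  compute: $compute_id  stage-out: $stageout_id"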
The additional jobs are submitted by a Moab function, the msub DataWarp filter.
It logs its actions to the file ~/ac_datawarp.log in the user's home directory.
The state of DataWarp and all of a user's allocations can be displayed by the
dwstat command (documented in the User Guide). An example:
module load dws
dwstat most
    pool units quantity     free  gran
wlm_pool bytes 34.88TiB 34.69TiB 32GiB

sess state token     creator owner             created expiration nodes
  20 CA--- 45722 MOAB-TORQUE  4611 2016-12-05T13:42:20      never     0
  21 CA--- 45716 MOAB-TORQUE  4611 2016-12-05T13:42:20      never     0

inst state sess bytes nodes             created expiration intact label public confs
  20 CA---   20 32GiB     1 2016-12-05T13:42:20      never   true I20-0  false     1
  21 CA---   21 32GiB     1 2016-12-05T13:42:20      never   true I21-0  false     1

conf state inst    type activs
  22 CA---   20 scratch      0
  23 CA---   21 scratch      0
The libHIO package (described below) contains a script, dw_simple_sub.sh, which
will submit a simple DataWarp test job. This can be used to familiarize
oneself with the process of creating and submitting a DataWarp job.
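For orientation, a minimal DataWarp job script looks roughly like the sketch
below. The capacity, directory names, PFS paths and application command are
placeholders, and the usual #MSUB resource directives are omitted; see
dw_simple_sub.sh for a known-working version.

  #! /bin/bash
  #DW jobdw type=scratch access_mode=striped capacity=100GiB
  #DW stage_in destination=$DW_JOB_STRIPED/input source=<pfs_input_dir> type=directory
  #DW stage_out source=$DW_JOB_STRIPED/output destination=<pfs_output_dir> type=directory
  # All DataWarp access must go through aprun; create the output directory first
  aprun -n 1 mkdir $DW_JOB_STRIPED/output
  # Run the application against the DataWarp mount
  aprun -n 32 ./my_app --in $DW_JOB_STRIPED/input --out $DW_JOB_STRIPED/output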
An application team could choose to use DataWarp directly by issuing POSIX IO
calls and initiating stage-in and stage-out either via job script directives or
via DataWarp API calls. Alternatively, the LANL-developed, open-source package
libHIO could be used. See below for more information on libHIO.
libHIO
------
libHIO is a flexible, high-performance parallel IO package developed at LANL.
It has been released as open source and is available at:
https://github.com/hpc/libhio
libHIO supports IO to either a conventional PFS or to DataWarp, and it manages
DataWarp space as well as stage-in from and stage-out to the PFS.
For more information on using libHIO, see the github package, in particular:
README
libhio_api.pdf
hio_example.c
LANL Systems with DataWarp
--------------------------
System           DW Nodes Capacity Theoretical Speed
---------------- -------- -------- -----------------
Trinity Ph 1          300 1743 TiB        1589 GiB/S
Trinity Ph 2          234 1360 TiB        1239 GiB/S
Trinity Combined      576 3347 TiB        3050 GiB/S
Trinitite               6   35 TiB          32 GiB/S
As of the writing of this document, actual peak speeds are running
at about 90% of theoretical for N-N and at about 75% for N-1.
Using DWStat and DWCli:
-----------------------
The dwcli command-line tool is the easiest way to stage data out during the
job script (but not during the application). The path handling is a little
tricky, as you do not use real burst buffer paths; rather, you use relative
paths that begin with a slash. Here is an example of a dwcli invocation that
I think makes the entire concept clear:
# Retrieve only your own session id and configuration id
dwsessid=$(dwstat sessions |grep $((ACDW_JOBID+1)) |awk '{print $1}')
dwconfid=$(dwstat configurations |awk -v sess=$dwsessid '{if ($3 == sess) print $1}')
# Create a stageout directory, bb directory, and write stuff
mkdir -p /lustre/ttscratch1/bws/job.81111.tt-drm/jobdir/bbdat
aprun -n 1 mkdir -p $DW_JOB_STRIPED/lustre/ttscratch1/bws/job.81111.tt-drm/jobdir
< Write some data into $DW_JOB_STRIPED/lustre/ttscratch1/bws/job.81111.tt-drm/jobdir/*>
dwcli stage out --session $dwsessid --configuration $dwconfid --backing-path /lustre/ttscratch1/bws/job.81111.tt-drm/jobdir/bbdat/ --dir /lustre/ttscratch1/bws/job.81111.tt-drm/jobdir
Note 1: The stage out is asynchronous. To wait for the staging to complete,
use a loop similar to this:
status=""
count=0
while [ "-" != "$status" -a $count -lt 100 ]; do
  sleep 30
  # Print the full stage query output, then capture the files-remaining column
  dwcli stage query --session $dwsessid --configuration $dwconfid
  status=$(dwcli stage query --session $dwsessid --configuration $dwconfid |awk '{ print $8 }')
  count=$((count+1))
  echo "Files remaining to be staged: <$status>"
done
Note 2: The backing path must end with a slash. Note that the --dir is
relative to $DW_JOB_STRIPED, and must start with a slash. This is
non-intuitive, but does make some degree of sense.
Note 3: Stage in is similar; you do not specify $DW_JOB_STRIPED in --dir, as
that part is implied.
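A corresponding stage-in sketch, assuming "dwcli stage in" takes the same
--session, --configuration, --backing-path and --dir options as "stage out"
(check dwcli --help or the User Guide); the Lustre path is a placeholder:

  # --dir is again relative to $DW_JOB_STRIPED and must start with a slash
  dwcli stage in --session $dwsessid --configuration $dwconfid --backing-path /lustre/ttscratch1/bws/indat/ --dir /indat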
Known Problems and Gotchas:
---------------------------
Many of the problems seen by DataWarp users are related to the msub DataWarp
filter. This is a program that runs during the execution of the msub command.
The msub DataWarp filter is responsible for recognizing #DW directives in the
job script and using their information to launch the 3 auxiliary DataWarp jobs
that manage DataWarp allocation, deallocation, stage-in and stage-out. The
difficulties arise because the msub DataWarp filter runs in the user's context
and can be sensitive to user-specific factors such as the environment and SSH
configuration.
The msub DataWarp filter logs to a file in each user's home directory named
~/ac_datawarp.log. This log is a good starting point for debugging
msub-related DataWarp problems. It can also be deleted when not needed to free
up disk space.
1) Proxy Environment Variables
------------------------------
Proxy environment variables, e.g., HTTP_PROXY, HTTPS_PROXY, can cause problems
with job submission because they interfere with some of Moab and/or DataWarp's
communications. If you use these settings, unset them before submitting a
DataWarp job. This should be corrected soon.
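For example, before running msub (including the lower-case variants if you
set them):

  # Clear proxy settings that can interfere with Moab/DataWarp communications
  unset HTTP_PROXY HTTPS_PROXY http_proxy https_proxy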
2) SSH Communications Problems
------------------------------
Sometimes there is an ssh-based communication problem between the msub command
and the DataWarp service. Cray plans to move that interface to a RESTful
interface, but for the moment it operates over ssh. Sometimes, if keys
have changed, the underlying ssh fails. The (rather generic) symptom is an
error message from msub:
DataWarp commands in job script failed validation
To diagnose the problem, look in ~/ac_datawarp.log; the last 4 lines will be
something like:
2017-02-07 17:15:50,617 pid:45756 ac_dw_submitfilter common.py:run_cmd:178 DEBUG Executing Popen with args: ["/opt/cray/elogin/eswrap/default/bin/dw_wlm_cli", "-f", "job_process", "--job", "/users/cornell/pgm/hio/ga-gnu/libhio-1.4.0/test/run/job.20170206.094549.724988764.sh"]
2017-02-07 17:16:25,902 pid:45756 ac_dw_submitfilter dw_cli_commands.py:dw_job_process:304 ERROR DW dw_job_process command failed with rc = 1
2017-02-07 17:16:25,902 pid:45756 ac_dw_submitfilter ac_dw_submitfilter:main:664 ERROR DataWarp commands in job script failed validation in DW API job_process method
2017-02-07 17:16:25,903 pid:45756 ac_dw_submitfilter ac_dw_submitfilter:main:666 INFO END, rc = 0
The Popen command on the first line may be the source of the problem.
With the Python Popen argument quoting removed, it converts to:
/opt/cray/elogin/eswrap/default/bin/dw_wlm_cli -f job_process --job /users/cornell/pgm/hio/ga-gnu/libhio-1.4.0/test/run/job.20170206.094549.724988764.sh
Perform the equivalent conversion on the command from your ~/ac_datawarp.log
file, run it by hand, and see if there are any SSH-related errors. If it is a
key problem, it can be corrected by removing the incorrect entry from the
~/.ssh/known_hosts file.
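Rather than editing ~/.ssh/known_hosts by hand, the stale entry can usually be
removed with ssh-keygen; the host name is whatever host the ssh error names:

  # Remove the stale known_hosts entry for the host the ssh error complains about
  ssh-keygen -R <host named in the ssh error>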
3) $DW_JOB_STRIPED string in #DW stage_in and stage_out
-------------------------------------------------------
The string $DW_JOB_STRIPED in the #DW stage_in and stage_out directives is a
required literal; it is not a shell variable. All of the #DW directives are
processed during the msub command, and shell variables are not expanded there.
I speculate that the $DW_JOB_STRIPED string is a placeholder for a future
capability, but for now it must be coded exactly as shown in the DataWarp User
Guide.
4) #DW stage_in type=file destination
-------------------------------------
The "#DW stage_in type=file" directive has a required destination argument. As
in the type=dir format of the directive, the destination must start with the
literal string "$DW_JOB_STRIPED". It must be followed by a file name or a
subdirectory and file name. Examples:
#DW stage_in type=file source=<pfs path> destination=$DW_JOB_STRIPED/foo
will copy the specified source file into file name "foo" in the root of
the DataWarp allocation.
#DW stage_in type=file source=<pfs path> destination=$DW_JOB_STRIPED/sub/foo
will create subdirectory "sub" in the root of the DataWarp allocation and
will copy the specified source file into file name "foo" in that
subdirectory.
5) #DW stage_in type=list list file format
------------------------------------------
The list file specified on a "#DW stage_in type=list" directive contains a
source and a destination on each line. As with the destination argument of
the type=file stage_in directive described above, the destination must start
with the literal string "$DW_JOB_STRIPED", followed by a file name or a
subdirectory and file name. Example list file lines:
<pfs path> $DW_JOB_STRIPED/foo
will copy the specified source file into file name "foo" in the root of
the DataWarp allocation.
<pfs path> $DW_JOB_STRIPED/sub/foo
will create subdirectory "sub" in the root of the DataWarp allocation and
will copy the specified source file into file name "foo" in that
subdirectory.
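The list file itself is named by the directive's source argument; to the best
of my understanding the form is:

  #DW stage_in type=list source=<pfs path of list file>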
6) Stage_in errors poorly messaged
----------------------------------
When there is an error processing a stage_in directive, no indication is given
to the user. The datawarp-stagein job will run, the other 3 sub-jobs will be
cancelled, and there will be no job output file. From the user's perspective,
the job will simply have disappeared without producing an output file. A
generic indication of the failure is available from the checkjob -v command.
Example:
-->msub ga_home_1.sh
53044.datawarp-stagein 53045 53044.datawarp-stageout
-->checkjob -v 53045
job 53045
. . . .
Message[0] Scheduling DataWarp Storage
Message[1] DW Staging running
Message[2] DataWarp StageIn failed
7) JSON Error - eswrap module
-----------------------------
One user had added a "module load eswrap" to their .bashrc file. This caused
the following error message from msub:
ERROR: Error encountered: No JSON object could be decoded DataWarp request
recognized, but errors were encountered processing DW directives.
There were also other eswrap-related error messages when the dwstat command
was used. Removing the module load corrected the problem.
This is because the msub DataWarp filter uses the eswrap communications
mechanism and other paths similar to those used by the dwstat command. So, a
good debugging step is to make sure the dwstat command works correctly before
submitting a DataWarp job with msub.
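A minimal check along those lines:

  # Verify DataWarp communications before using msub; if this fails, fix the
  # environment (eswrap, proxy settings, ssh keys) before submitting the job
  module load dws
  dwstat pools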
8) Finding Python
-----------------
The msub DataWarp filter is written in Python 2 and runs in the user's context
while the msub command is running. Normally, msub is configured to invoke the
system Python interpreter via a fully qualified path name, typically
/usr/bin/python. There have been instances when instead it was configured to
run just the unqualified executable named python. In that case, changing the
default to Python 3, possibly via a "module load python" command will cause
problems. Additionally, having a file or directory named "python" on the
current PATH (particularly if the PATH contains ".") will also cause problems.
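A quick way to see what an unqualified "python" would resolve to in your
environment before running msub:

  # Show every python on the PATH and the version that would run first
  type -a python
  python --version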