forked from SquareandCompass/SystemsCapstone
-
Notifications
You must be signed in to change notification settings - Fork 0
/
notes.txt
274 lines (239 loc) · 15.8 KB
/
notes.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
================================================================================
Persistent User Programs: A Checkpoint / Restore Framework
Team SIGSYS(31): Savannah Amos and Matthew Zhong
================================================================================
-----------------------------------------------------------------------------
Q: What does this system need to accomplish? From the top-down, what are the
ways that someone can use this system, and why would they use it?
-----------------------------------------------------------------------------
In this project, our objective is to create a checkpoint-restore framework
which users can integrate into C applications. Using various forms of
persistent memory, this framework will enable a user-level application to
continue computation from the middle of a crash, as if the program never had
crashed in the first place.
Imagine if we had a program which was performing some form of complex
computation over a long period of time, like for example, a machine learning
and training algorithm. Normally, abrupt termination would spell disaster
for the algorithm - we would need to restart the algorithm from the very
beginning and lose all of this progress.
+----------------+ In the diagram on the left, we can see what happens
| bash ./ml_algo | to the execution of a simple program when a crash
| | occurs. The issue is that, after crashing at <1>,
| 0. <STARTED> | we would need to completely restart the program
| | | at <0>, and lose all of our progress.
| V |
| 1. <CRASHED> | A checkpoint-restore algorithm would allow us to
| | pick up the program from where the program crashed
| bash ./ml_algo | at point <2>, allowing us to continue execution
| | while maintaining all previous program state.
| 2. <RESTORED> |
| | | For scalable algorithms which run on thousands of
| V | threads or which naturally take a long time to run,
| 3. <FINISHED> | a checkpoint-restore framework increases the
| | stability and likelihood that we will get meaningful
+----------------+ results from the programs in a timely manner.
As a simple C library, we envision this program to have the following
interface, included in a single header file - checkpoint_restore.h:
/* initializes the entire checkpoint-restore framework */
int crinit(struct crattr *attr);
/* the meat of the checkpoint-restore framework, self explanatory */
int checkpoint(void);
int restore(void);
/* non-volatile memory allocation and management */
void *crmalloc(size_t size);
void *crrealloc(void *ptr, size_t size);
void crfree(void *ptr);
/* non-volatile thread library, based off of pthreads */
int crthread_create(crthread_t *thread,
void *(*start_routine), (void *), void *arg);
int crthread_join(crthread_t thread, void **retval);
int crthread_cancel(crthread_t thread);
int crthread_kill(crthread_t thread, int sig);
/* non-volatile semaphores, based off of POSIX semaphores */
int crsem_init(crsem_t *sem, unsigned int value);
int crsem_post(crsem_t *sem);
int crsem_wait(crsem_t *sem);
-----------------------------------------------------------------------------
Q: What gets checkpointed? What would we need to checkpoint, and how would we
restore program state after checkpointing?
-----------------------------------------------------------------------------
Overall, our strategy for creating a checkpoint-restore framework requires,
at the very minimum, three robust modules:
1. A non-volatile memory management system
2. A threading and synchronization system
3. A checkpoint decisions subsystem which controls <1> and <2>.
4. A restart routine which initializes or reinitializes <1>, <2>, and <3>
Because of the reliance on kernel-level solutions, we will not allow for
checkpoint-restore on the main thread. Modern solutions which can checkpoint
and restore the main thread each heavily rely on kernel-level solutions.
Our solution only allows for the checkpointing and restoration of all
user-spawned threads.
Below is an overall system architecture diagram for how this solution will
checkpoint and restore threads.
+--------------------+
| | spawns user threads
| +-----------------------------------------++
| | signals threads for manual checkpoint ||
| Main Application +------------------------++ ||
| | synchronization || ||
| +-------++ || ||
| | || || ||
+-----++-------------+ || || ||
|| spawned || || ||
|| upon crinit() || || ||
|| <restart routine> || || ||
|| || user-spawned || thread block ||
|| +--------||---------------||---------------||--------+
VV | VV VV VV |
+------------+ | +------------+ +------------+ +------------+ |
| CR Control | | | CRThread 1 | | CRThread 2 | | CRThread 3 | |
| Thread | | | \ | | \ | | \ | |
| \ | | | \ | | \ | | \ | |
| / | | | / | | / | | / | |
| / | | | / | | / | | / | |
| \ | CTRL | | \ | | \ | | \ | |
| \ +----->| | \ | | \ | | \ | |
| / | | | / | | / | | / | |
| / | | | / | | / | | / | |
| | | | | | | | | | | | | |
| V | | | V | | V | | V | |
| | | | | | | | | |
+---+----+---+ | +---+----+---+ +---+----+---+ +---+----+---+ |
| | | | | | | | | |
| | +------|----|-----------|----|-----------|----|------+
| | | | | | | |
| | V V V V V V
| | +-------------------------------------------------+
| +------------>| Volatile Heap |
| heap / cache +-------------------------------------------------+
| management ||
| VV +------------+
| +--------------+ | Heartbeat |
+---------------------->| CR Subsystem |<----+ Checkpoint |
internal C/R management +--------------+ | Timer |
|| +------------+
VV
+--------------------------------------------------------------------------+
| Non-Volatile Heap (Filesystem) |
+--------------------------------------------------------------------------+
In essence, what we will be doing is extending the regular user application
so that it spawns threads which each can be checkpointed. Initially, these
threads, spawned in our own user thread block module, run in volatile
memory - their stacks will actually be allocated from a heap of our design.
We specifically need to reimplement our heap in order to provide for a low
level method to decide what memory needs to be cached, and when.
The functions setjmp(jmp_buf env) and longjmp(jmp_buf env, int val) allow us
to set jump locations and then jump to them - we could put the jump buffer
into some form of non-volatile storage upon checkpoint, mmap() the addresses
so that they stay consistent between runs, and then longjmp() upon restoraton.
+--------------------+
| |
| Main Application |
| |
+--+-----------------+
|
V (Non-Volatile Store) CRStore
+----------+-----------+-------------------------------------------------+
| METADATA | jmp_buf[] | Thread-Local Store --> <-- Stack |
+----------+-----------+-------------------------------------------------+
------------------> increasing addresses ------------------>
When restoring the program, we mmap() the jmp_buf[] and then jump to the
currently executing point in our program. We need to make sure that the
mmap() addresses are coherent with before the checkpoint as part of our
implementation.
The process specified above will occur for each thread which was
checkpointed successfully. The threads will be loaded back into volatile
memory, meaning the volatile heap will be restored to its state prior to the
crash. Then, each thread's stack pointer will be readjusted to the last
selected checkpoint as set by setjmp(), and the program should continue
execution as normal.
-----------------------------------------------------------------------------
Q: What happens if we get a CRASH while in the midst of a checkpoint? Wouldn't
we still be in the middle of writing to a file?
-----------------------------------------------------------------------------
- Solution Idea:
Use two output files. The output file would have a memory layout which has
a starting METADATA section and a separate PAYLOAD section.
+----------+---------+
| METADATA | PAYLOAD |
+----+-----+---------+
|
+---> this metadata contains a flag called WRITE_LOCK at some
predefined offset
When we checkpoint, WRITE_LOCK is used like a semaphore or lock, in that
we first set WRITE_LOCK to 0, then write all of our data, then set
WRITE_LOCK back to 1. We know that we had a mid-write crash if WRITE_LOCK
is still zero when we try to restore that data segment.
To mitigate this issue, we should have a double-buffered output. Let's say
that we're just considering a single-threaded application, for now. We'd
have a layout that looks like this:
+--------------------+
| |
| Main Application |
| |
+--+--------------+--+
| | (CHP = Checkpoint)
CHP 1 | | CHP 2
CHP 3 +-----+ +-----+ CHP 4
CHP 5 | | CHP 6
V V
+-----------+ +-----------+
| CRStore 1 | | CRStore 2 |
+-----------+ +-----------+
Checkpoints can be made to opposite buffers at each write. Only one buffer
is ever written to at a checkpoint. While this approach can certainly have
issues with many slower writes, it DOES solve the crash-on-checkpoint
problem, in which we would otherwise have no way to recover from a crash
while our framework was performing a background checkpoint. Observe the
diagram and explanation below for a brief overview on the checkpointing
process:
Checkpoint Operation Volatile Memory State to CRStore
-------------------- --------------------------------
META PAYLOAD
+------+---------------------------+
1. Lock the CRStore Buffer and | WL=0 | X X XX XX |
write the lock into CRStore +--+---+---------------------------+
|
V
+------+---------------------------+
CRStore | WL=0 | |
+------+---------------------------+
META PAYLOAD
+------+---------------------------+
2. Deposit all updated memory | WL=0 | X X XX XX |
blocks in the payload to +------+-+-----+------++-------++--+
the CRStore | | || ||
V V VV VV
+------+---------------------------+
CRStore | WL=0 | + + ++ ++ |
+------+---------------------------+
META PAYLOAD
+------+---------------------------+
3. Clear dirty markers on all | WL=0 | |
previously written regions +------+---------------------------+
+------+---------------------------+
CRStore | WL=0 | + + ++ ++ |
+------+---------------------------+
META PAYLOAD
+------+---------------------------+
4. Unlock the CRStore Buffer and | WL=1 | |
write the unlock into CRStore +------+---------------------------+
|
V
+------+---------------------------+
CRStore | WL=1 | + + ++ ++ |
+------+---------------------------+
The theory behind why we're doing this is to be able to detect if a crash
which occurs mid-checkpoint has corrupted our data. We define a data
corruption to have occurred when a crash occurs anywhere within steps 2
and 3, inclusive.
When restoring the program state, we can observe the states of each of the
non-volatile stores. If both of the stores have a WRITE_LOCK of 1, we
choose the store with the most recent timestamp to restore the program.
If one of the stores has a WRITE_LOCK of 0, however, we can safely assume
that a crash-on-write has occurred for that store, which means that the
only safe restoration point is the store with a WRITE_LOCK of 1.
-----------------------------------------------------------------------------
Q: How will we test this system?
-----------------------------------------------------------------------------
Our first