Dhara map internals
Daniel Beer <dlbeer@gmail.com>
7 Dec 2013
The map data structure implements the top-level flash translation layer.
It provides a disk-like interface, where page-sized sectors can be
written (and rewritten, in arbitrary order) to logical sector numbers.
Given a logical sector number, we can read back the most recently
written data. Both of these operations are fairly efficient (O(log N),
where N is the maximum number of possible sector addresses).
The structure also has to handle, with the assistance of the journal,
bad block recovery and garbage collection.
The journal provides a persistent queue, with two basic operations:
- enqueue, which takes page data and a fixed-size tag block of
metadata, and adds the tagged page to the front of the queue.
- dequeue, which removes one page from the back of the queue.
The metadata is of a universally fixed size -- it is unrelated to the
page size, OOB size, or any other characteristic of the chip.
Once items are enqueued, they (and their associated metadata) can be
referred to by the page at which they were written. They remain
referable in this way until they exit the back of the queue via a dequeue
operation. Both enqueue and dequeue are constant-time operations, and so
is reading, once we know the page number we want to read from. But how
do we efficiently associate logical sector numbers to page numbers?
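To make the rest of this document concrete, here is a rough sketch of
that queue interface in C. All of the names below (page_t,
journal_enqueue() and so on) are hypothetical illustrations for this
document, not the library's actual API:

    #include <stdint.h>

    typedef uint32_t sector_t;              /* logical sector address */
    typedef uint32_t page_t;                /* physical page number */

    #define PAGE_NONE ((page_t)0xffffffff)  /* the "NULL" page pointer */

    /* Enqueue: add a tagged page to the front of the queue. Returns
     * the page number at which it was written, or PAGE_NONE. */
    page_t journal_enqueue(const uint8_t *data, const uint8_t *meta);

    /* Dequeue: remove one page from the back of the queue. */
    int journal_dequeue(void);

    /* Constant-time reads by page number, valid until dequeued. */
    int journal_read(page_t p, uint8_t *data);
    int journal_read_meta(page_t p, uint8_t *meta);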
Functional radix trees
======================
Once pages are written, they have a fixed page number (which we can't
choose in advance) which remains valid for their lifetime. Written pages
cannot be altered. What we want is a mutable mapping from sector
addresses (which we *can* choose) to page numbers, one which allows us
to rewrite sectors in arbitrary order.
We start by considering a functional radix tree. A radix tree is a
structure which maps fixed-size keys to values. In particular, we want
to consider a *binary* radix tree. For the remainder of this document,
we'll assume 4-bit sector addresses (they're 32 bits in practice, but
that makes the diagrams too hard to draw). The length of the addresses,
and hence the height of the tree, will be referred to as H.
With 4-bit sector addresses, we have a radix tree that's 4 non-leaf
levels deep, with two down-pointers per non-leaf node. Each level is
implicitly associated with a bit position in the key. To find the data
associated with a sector, start at the root, and consider the first bit
of the key. If the bit is 0, follow the left down-pointer. If 1, follow
right. Arriving at the next node, consider the 2nd bit of the key.
Repeat until we have simultaneously exhausted all key bits, and arrived
at the leaf level.
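As a sketch, the equivalent in-memory lookup is a short loop over the
key bits (struct node and H here are illustrative, not part of the
actual implementation):

    #define H 4     /* key length in bits (32 in practice) */

    struct node {
        struct node *child[2];  /* down-pointers: 0 = left, 1 = right */
    };

    /* Follow one key bit per level; returns the node reached at the
     * leaf level, or NULL if any pointer on the path is missing. */
    static struct node *tree_find(struct node *root, unsigned int key)
    {
        struct node *n = root;
        int i;

        for (i = 0; (i < H) && n; i++)
            n = n->child[(key >> (H - 1 - i)) & 1];

        return n;
    }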
Suppose we have stored in our tree data for sectors 0000, 0001, 1000 and
1010. Our tree would look something like this:
(root)
/ \
(0) (1)
/ \
(00) (10)
/ / \
(000) (100) (101)
/ \ / \
[0000] [0001] [1000] [1010]
What happens if we want to store new data? Our radix tree is functional,
which means that we can't modify any objects already present (except the
root pointer), but we can create new objects, and we can update the root
pointer.
Suppose we wanted to rewrite the data associated with key 1010. We would
allocate new objects and update the root as follows:
+---------------------+
| |
| (old-root) | (new-root)
| / \ +---------+ \
+--> (0) (1)* (1)
/ +--\-----------+ \
(00) | (10)* | (10)
/ v / \ +-----------+ \
(000) (100) (101)* (101)
/ \ / \ \
[0000] [0001] [1000] [1010]* [1010]
We have constructed a new version of the sector data, and new versions
of all the intermediate nodes which link the root to the rewritten
sector. At each intermediate node, where there is an alternative non-NULL
path, we've reused the same child as in the old version of the tree. The
objects marked with an asterisk (*) are not reachable from the new root;
they have become garbage.
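In code, such a path-copying update is a short recursion. This is a
conceptual sketch only -- the map never actually builds in-memory trees
like this -- with leaf data stored in the child slots at depth H:

    #include <stdlib.h>

    struct fnode {
        void *child[2];     /* subtrees; at depth H, sector data */
    };

    /* Return a new root equal to `old` except that `key` now maps to
     * `leaf`. Only the nodes on the path are copied; every untouched
     * sibling subtree is shared with the old version. */
    static void *tree_set(struct fnode *old, unsigned int key,
                          int depth, void *leaf)
    {
        struct fnode *n;
        int b;

        if (depth == H)
            return leaf;    /* leaf level: the new sector data */

        n = calloc(1, sizeof(*n));
        b = (key >> (H - 1 - depth)) & 1;

        n->child[!b] = old ? old->child[!b] : NULL;     /* share */
        n->child[b] = tree_set(old ? old->child[b] : NULL,
                               key, depth + 1, leaf);

        return n;
    }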
This is the conceptual basis of the Dhara map structure. But, we can
improve on the absolute space-efficiency of this, and reduce the average
number of reads necessary to find a given leaf.
Alt-pointer queue
=================
Note that when we update our radix tree, we always construct exactly
one new leaf node (the sector data) and exactly H = 4 (the number of
key bits) non-leaf nodes. We can therefore always pack these items
together in a fixed-size allocation unit.
Furthermore, note that in any update, the set of non-leaf nodes always
forms a path to the sector which has been updated. Therefore, at least
half of the down-pointers in the set can be omitted, since they can be
inferred by association (provided we know which level each non-leaf
belongs to). The only piece of information for each non-leaf which can't
be inferred is the pointer which *doesn't* point along the path to the
updated sector. We call these pointers "alt-pointers".
Instead of storing one leaf and H non-leaf nodes for each update, we
simply store together:
* {s0, s1, s2, ..., s(H-1)}: An array of bits, giving the sector
address of this update.
* {A0, A1, A2, ..., A(H-1)}: An array of H alt-pointers.
* The new data associated with this sector address
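Packed together, an update record -- the fixed-size tag block enqueued
alongside each page -- might look like the following C sketch, reusing
the hypothetical page_t from earlier (H is 32 in practice, matching the
metadata layout described at the end of this document):

    struct update_record {
        uint32_t id;        /* the sector address, bits s0..s(H-1) */
        page_t alt[32];     /* the alt-pointers A0..A(H-1) */
    };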
Given any update record, and some integer 0 <= k < H, the following
property holds:
Ak is either NULL, or it points to the most recent update record
whose sector address shares its first k bits with this one.
This is analogous to the prefix property which is implied by the
down-pointers in the radix tree. To give a concrete example, suppose we
reconstruct the radix tree of our earlier example by starting with a
single sector, 0001. We start with a single update record:
+---+
| 0 |
| 0 |
| 0 |
| 1 |
+---+
All the alt-pointers are NULL, because they have nothing to point to.
Now, we update the map by adding an update record for the sector address
0000:
+---+ +---+
| 0 | | 0 |
| 0 | | 0 |
| 0 | | 0 |
| 1 |<----0 |
+---+ +---+
The new record has a single non-NULL alt-pointer at position H-1 (= 3),
and this pointer points to the previous record. Note that the two
records have in common their first 3 sector address bits. Now we add
1010 and 1000, to give:
+---+ +---+ +---+ +---+
| 0 | | 0 |<----1--------1 |
| 0 | | 0 | | 0 | | 0 |
| 0 | | 0 | | 1 |<-----0 |
| 1 |<----0 | | 0 | | 0 |
+---+ +---+ +---+ +---+
Note that both the third and fourth records from the left have an
alt-pointer at position 0 which points to the second record. Now,
suppose we replace the data on page 1010, as in the previous example:
+---+ +---+ +---+ +---+ +---+
| 0 | | 0 |<----1--------1--------1 |
| 0 | | 0 | | 0 | | 0 | | 0 |
| 0 | | 0 | | 1 |<-----0 |<-----1 |
| 1 |<----0 | | 0 | | 0 | | 0 |
+---+ +---+ +---+ +---+ +---+
We now have two update records which both have the sector address 1010
-- but only one of them (the rightmost) is reachable by following the
lookup procedure. This procedure is:
* Start at the most recent (rightmost) record.
* For i = 0 .. H-1:
* If the current record is NULL:
* return NOT-FOUND
* If bit i of the target address differs from bit i of the current
record's sector address:
* Follow alt-pointer i to obtain a new current record.
* Return the current record.
The procedure for updating a record has a very similar structure. To
update a record, we must first look it up. But in the process of doing
so, we build up an array of alt-pointers. The inner part of the loop
changes so that it builds A[] as:
* If the current record is NULL:
* set Ai..A(H-1) to NULL
* return NOT-FOUND
* If bit i of the target address differs from bit i of the current
record's sector address:
* set Ai = address of the current record
* Follow alt-pointer i to obtain a new current record.
* Else:
* set Ai = alt-pointer i of the current record
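Combining the two procedures above, a single trace routine can serve
both lookup and update. The following sketch reuses the hypothetical
types from earlier; record_meta() stands in for fetching a page's
update record (id plus alt-pointers), and H would be 32 in practice:

    #define bit(x, i) (((x) >> (H - 1 - (i))) & 1)

    /* Hypothetical helper: fetch the update record stored at page p. */
    void record_meta(page_t p, struct update_record *r);

    /* Trace the path to `target`, starting from the most recent
     * record, `root`. On success, *loc receives the page holding the
     * current data. If `alt` is non-NULL, it is filled in with the
     * alt-pointer array for a new update record for this sector. */
    static int trace_path(sector_t target, page_t root,
                          page_t *loc, page_t *alt)
    {
        page_t p = root;
        int i;

        for (i = 0; i < H; i++) {
            struct update_record r;

            if (p == PAGE_NONE)
                goto not_found;

            record_meta(p, &r);     /* read id + alt-pointers at p */

            if (bit(target, i) != bit(r.id, i)) {
                if (alt)
                    alt[i] = p;     /* the path branches away here */
                p = r.alt[i];       /* follow alt-pointer i */
            } else if (alt) {
                alt[i] = r.alt[i];  /* reuse the existing pointer */
            }
        }

        *loc = p;
        return 0;

    not_found:
        if (alt)
            while (i < H)
                alt[i++] = PAGE_NONE;

        return -1;                  /* NOT-FOUND */
    }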
Garbage collection
==================
We now have a data structure which can be represented within the
confines of the journal, and requires only the enqueuing of immutable
records to maintain an efficient mutable mapping. So far, so good -- but
we can't allow the journal to grow without bound. At some point, we need
to dequeue records to free space.
Consider the following journal state:
+---+ +---+ +---+ +---+ +---+ +---+
| 0 | | 0 |<----1--------1--------1 |<----0 |
| 0 | | 0 | | 0 | | 0 | | 0 | | 0 |
| 0 | | 0 | | 1 |<-----0 |<-----1 | | 0 |
| 1 |<----0 |<+ | 0 | | 0 | | 0 | +---1 |
+---+ +---+ | +---+ +---+ +---+ | +---+
| |
+-------------------------+
By direct inspection of the last page of the journal, we see that it
contains an update record for sector 0001. However, following the lookup
procedure for sector 0001 lands us on a different (more recent) page.
This page is therefore not required for its leaf data.
But what about the alt-pointers? You can see in the diagram above that
at least one other page points to the leftmost page. Perhaps it's not
safe to remove this page after all?
Let L be the leftmost page, and let R be its most recent replacement (R
must exist, because otherwise we wouldn't be able to consider the
removal of L in the first place). Now let M be some page with an
alt-pointer referencing L, at some index (k). Will this alt-pointer ever
need to be followed by the lookup procedure?
Because M points at level k to L, it must be that M's sector address
shares its first k bits with that of L. It must also share the first k
bits with R (because R and L have the same address). Furthermore, R must
be more recent than M (if it were not so, then M would be pointing to R,
rather than L -- because alt-pointers always point to the most recent
prefix-sharing update).
Now suppose we're searching for some sector T, and at some point during
the lookup, we land on page M by following an alt-pointer at a depth of
i. By the properties of alt-pointers, page M shares a prefix of at least
(i+1) bits with the target address T, and is the most recent page to do
so. It follows from this that (i+1) > k, because otherwise R would be
the most recent page with the required prefix.
Since arriving at M requires following an alt-pointer at a level i >= k,
we will never follow an alt-pointer *from* M at level k. Therefore, the
alt-pointer in question is no longer required, and the page L can be
safely removed.
To summarize: if the last page in the journal is not the most recent
update of its sector address, it may be safely dequeued.
Dequeueing the page in question, we get:
+---+ +---+ +---+ +---+ +---+
| 0 |<----1--------1--------1 |<----0 |
| 0 | | 0 | | 0 | | 0 | | 0 |
| 0 | | 1 |<-----0 |<-----1 | | 0 |
X--0 |<+ | 0 | | 0 | | 0 | +---1 |
+---+ | +---+ +---+ +---+ | +---+
| |
+-------------------------+
This is a fine scheme for removing obsolete pages from the leftmost end
of the queue, but what if the queue contains garbage in the middle? In
the diagram above, the journal contains at least one garbage page (the
first occurrence of 1010), but the leftmost page isn't garbage.
There is an operation, which we call "repack", that we can perform which
preserves the mapping while obsoleting the leftmost page. We simply
perform an update which recopies the sector data of the leftmost page.
Performing this operation on the journal above yields:
+---+ +---+ +---+ +---+ +---+ +---+
| 0 |<----1--------1--------1 |<----0-------0 |
| 0 | | 0 | | 0 | | 0 | | 0 | | 0 |
| 0 | | 1 |<-----0 |<-----1 | | 0 | | 0 |
X--0 |<+ | 0 | | 0 | | 0 | +---1 |<----0 |
+---+ | +---+ +---+ +---+ | +---+ +---+
| |
+-------------------------+
Following the repack, the leftmost page becomes garbage, and can be
dequeued:
+---+ +---+ +---+ +---+ +---+
X--1--------1--------1 |<----0-------0 |
| 0 | | 0 | | 0 | | 0 | | 0 |
| 1 |<-----0 |<-----1 | | 0 | | 0 |
| 0 | | 0 | | 0 | X--1 |<----0 |
+---+ +---+ +---+ +---+ +---+
After this dequeue, the mid-journal garbage page is now exposed as the
leftmost page, and can be removed with a further dequeue operation, to
yield a completely garbage-free journal:
+---+ +---+ +---+ +---+
X--1--------1 |<----0-------0 |
| 0 | | 0 | | 0 | | 0 |
X--0 |<-----1 | | 0 | | 0 |
| 0 | | 0 | X--1 |<----0 |
+---+ +---+ +---+ +---+
Our incremental garbage collection step can be summarized as:
* Let S = the sector address of the leftmost page in the journal
* Look up S to find its page address, P
* If P is the leftmost page:
* Rewrite S, using the data stored on page P
* Dequeue the leftmost page
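In C, one such step might look like the following sketch, reusing the
hypothetical trace_path() and journal helpers from earlier
(page_sector() stands in for reading the sector id from the tail's
metadata, write_sector() for the normal map update path, and PAGE_SIZE
is assumed):

    /* Perform one incremental garbage collection step. */
    static int gc_step(void)
    {
        uint8_t data[PAGE_SIZE];
        page_t tail = journal_tail();   /* the leftmost page */
        sector_t s;
        page_t p;

        if (tail == PAGE_NONE)
            return 0;               /* the journal is empty */

        s = page_sector(tail);      /* S: sector address of the tail */

        if ((trace_path(s, journal_root(), &p, NULL) == 0) &&
            (p == tail)) {
            /* The tail is the most recent copy of S: repack it by
             * rewriting the same sector with the same data. */
            if (journal_read(tail, data) < 0)
                return -1;
            if (write_sector(s, data) < 0)
                return -1;
        }

        return journal_dequeue();   /* the tail is now garbage */
    }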
This might yield a reduction in journal size, but it's not guaranteed to
do so. The one thing it does guarantee is that any garbage in the
journal will be removed, given sufficient garbage collection steps.
Overflow avoidance
------------------
Our only means of reducing the size of the journal requires the ability
to enqueue new items if necessary. This works fine, but only if we make
sure that we never fill the journal completely. If this happens, we're
stuck, and can't make any progress.
We could avoid getting stuck by making sure we perform sufficient
garbage collection if we seem to be running low on capacity. But we
don't want to spend all our time doing this, because we can't predict
how many garbage collection steps will be required in order to reduce
the size of the journal.
Our strategy for avoiding overflow can be described as follows: let Cj
be the effective size of the journal, in pages (we subtract a certain
safety margin to allow for the possibility of blocks going bad). We
choose some integer R >= 1, which we call the "collection ratio". We
then declare the maximum *map* capacity to be:
Cm = Cj * R / (R+1)
A count is kept of allocated pages in the map, and it is prevented from
exceeding Cm. When we go to perform an update, we check the current size
of the journal, Sj, against the map capacity. If Sj >= Cm, then we
perform R steps of the garbage collection algorithm before performing
our write.
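As a sketch, with integer arithmetic (journal_size() and
journal_capacity() are hypothetical helpers, and Cj is assumed to
already have the safety margin subtracted):

    #define R 1     /* the collection ratio (illustrative) */

    /* Cm = Cj * R / (R + 1), computed as Cj minus a reserve so the
     * intermediate product can't overflow. */
    static page_t map_capacity(page_t cj)
    {
        return cj - cj / (R + 1);
    }

    /* Called before each map update. */
    static int prepare_write(void)
    {
        int i;

        if (journal_size() >= map_capacity(journal_capacity()))
            for (i = 0; i < R; i++)
                if (gc_step() < 0)
                    return -1;

        return 0;
    }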
Synchronization
---------------
The journal doesn't have an explicit synchronization algorithm, but
synchronization points occur periodically, once every so many enqueue
operations.
The map layer provides an explicit synchronization algorithm. When
requested, we perform garbage collection steps until we reach a
synchronization point. Of course, garbage collection may simply result
in the dequeueing of the tail without a corresponding enqueue.
If the journal is exhausted of collectable pages, we just repack the
front page as many times as necessary to pad to a synchronization
point.
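A simplified sketch of this loop (journal_is_clean() and repack_front()
are hypothetical helpers; repack_front() repacks the sector stored on
the front page):

    /* Run garbage collection, or padding, until the journal reaches
     * a synchronization point. */
    static int map_sync(void)
    {
        while (!journal_is_clean()) {
            if (journal_size() > 0) {
                if (gc_step() < 0)      /* dequeue, repacking if live */
                    return -1;
            } else if (repack_front() < 0) {
                /* nothing left to collect: pad to the next
                 * synchronization point by repacking */
                return -1;
            }
        }

        return 0;
    }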
Bad-block recovery
------------------
If a block goes bad during a journal enqueue attempt, the attempt fails
and the journal enters recovery mode. In this mode, we're able to
enumerate the pages located on the failed block, along with their
metadata.
We recover by applying the garbage collection procedure to each page in
the list: if the page is not the most recent copy of the sector it
represents, it is skipped. Otherwise, we repack it at the front of the
journal.
If there are no pages left in the bad block, and recovery is not yet
complete (i.e. the journal is not yet synchronized), we simply repack
the front of the queue to pad to a synchronization point.
This algorithm is stateless, so no special handling is required if
recovery needs to be restarted.
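A sketch of the recovery loop, assuming the journal exposes the failed
block's remaining pages one at a time through a hypothetical
next_recoverable(), and that gc_page() applies the same live-or-skip
test as gc_step(), but to an arbitrary page:

    /* Re-apply the GC test to every page of the failed block. */
    static int try_recover(void)
    {
        while (journal_in_recovery()) {
            page_t p = next_recoverable();
            int ret;

            if (p == PAGE_NONE)
                ret = repack_front();   /* pad to a sync point */
            else
                ret = gc_page(p);       /* repack p if live, skip
                                         * it otherwise */

            if (ret < 0)
                return -1;
        }

        return 0;
    }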
Deletion
========
If we want to delete a page from a radix tree, we take the non-leaf node
that points to the victim page, and update the pointer to be NULL. In
fact, since this may result in the parent of said node being empty, we
can often prune the tree at a higher level -- specifically, at the level
of the first node which points to a subtree containing nothing other
than the victim page. Call this node the "delete-root". Note that the
delete-root, by definition, has two non-NULL child pointers.
To do this on a functional radix tree, we'd have to update all nodes up
to and including the delete-root. How do we do this in our case? All
updates require the writing of a page. We could do this by writing a
placeholder page and associated metadata, but this is wasteful and we
might never be able to get rid of the placeholder through
garbage-collection (depending on subsequent updates).
What we do instead is rewrite a different page which contains the
delete-root in its path. From the delete-root, choose any page in the
"other" subtree, and repack it. But, update the metadata so that the
alt-pointer at the level of the delete-root is now NULL.
Having done this, we've performed an action equivalent to repacking,
which creates one page, but we have also removed one logical page from
the map.
Of course, there is one case that must be handled specially: deletion of
the tree root.
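One way to express this, as a sketch over the earlier hypothetical
trace_path(): the delete-root sits at the deepest level at which the
victim's alt-pointer array is non-NULL, and the alt-pointer at that
level already names a page in the "other" subtree. The helpers
repack_with_null_alt() and clear_root() are hypothetical:

    static int map_delete(sector_t s)
    {
        page_t alt[H];
        page_t victim;
        int level = H - 1;

        if (trace_path(s, journal_root(), &victim, alt) < 0)
            return 0;       /* the sector isn't mapped: nothing to do */

        /* Find the delete-root: the deepest branching point on the
         * victim's path. Below it, the victim is alone. */
        while ((level >= 0) && (alt[level] == PAGE_NONE))
            level--;

        if (level < 0)
            return clear_root();    /* special case: the last sector */

        /* Repack a page from the other subtree, forcing its
         * alt-pointer at the delete-root level to NULL. */
        return repack_with_null_alt(alt[level], level);
    }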
Memory layout
=============
This is a memory layout example with the following configuration:
* log2_ppc = 2, i.e. each checkpoint group contains 4 pages
* log2_ppb = 4, i.e. 16 pages per erase block
The layout of a single erase block, where each slot represents a physical
flash page:
+------+------+------+------+
| data | data | data | cp |
+------+------+------+------+
| data | data | data | cp |
+------+------+------+------+
| data | data | data | cp |
+------+------+------+------+
| data | data | data | cp |
+------+------+------+------+
* data: User data (len = (1 << log2_page_size))
* cp: Checkpoint metadata for the preceding 3 pages, i.e. (1 << log2_ppc) - 1
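Which pages serve as checkpoint pages follows directly from log2_ppc. A
minimal sketch of the classification (page_t is the hypothetical type
used throughout):

    /* A page is a checkpoint page iff it is the last page of its
     * checkpoint group. */
    static int is_checkpoint(page_t p, int log2_ppc)
    {
        const page_t mask = ((page_t)1 << log2_ppc) - 1;

        return (p & mask) == mask;
    }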
Checkpoint
----------
Each checkpoint contains a header, a cookie and N metadata structures, one for
each data page of the checkpoint group. The amount of metadata (the size of the
checkpoint group) depends on the dhara_journal.log2_ppc field:
N = (1 << log2_ppc) - 1.
|- DHARA_HEADER_SIZE -|- DHARA_COOKIE_SIZE -|---- N * DHARA_META_SIZE ----|
+---------------------+---------------------+-----------------------------+
| header | cookie | meta1 : meta2 : ... : metaN |
+---------------------+---------------------+-----------------------------+
Checkpoint header
~~~~~~~~~~~~~~~~~
The checkpoint header is 16 bytes long (DHARA_HEADER_SIZE), and has the
following layout:
+-------------------------------+
| 'D' | 'h' | 'a' | epoch |
+-------------------------------+
| tail |
+-------------------------------+
| bb_current |
+-------------------------------+
| bb_last |
+-------------------------------+
* "Dha": Magic number to identify the header
* epoch: Number of times the journal has passed the end and wrapped around
* tail: Number of the last used page
* bb_current: Number of bad blocks before the current head
* bb_last: Estimation of the total number of bad blocks
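An equivalent C view of these 16 bytes, as a sketch (the on-flash
encoding is byte-serialized; the exact byte order of the 32-bit fields
is an assumption here):

    #include <stdint.h>

    struct checkpoint_header {
        uint8_t  magic[3];      /* 'D', 'h', 'a' */
        uint8_t  epoch;         /* wrap-around counter */
        uint32_t tail;          /* number of the last used page */
        uint32_t bb_current;    /* bad blocks before the current head */
        uint32_t bb_last;       /* estimate of total bad blocks */
    };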
Checkpoint cookie
~~~~~~~~~~~~~~~~~
The cookie field is 4 bytes long (DHARA_COOKIE_SIZE), and it's used by the map
layer to store the current number of mapped pages.
Checkpoint metadata
~~~~~~~~~~~~~~~~~~~
This zone is 132 bytes long (DHARA_META_SIZE), and it's where the radix tree is
implemented -- that is, where the mapping between physical pages and logical
sectors is stored. It contains the alt-pointer vector, {A0, A1, A2, ..., A(31)},
and the logical sector number (id). There is one of these structures per mapped
physical page.
|---------- 32 bits ------------|
+-------------------------------+
| id |
+-------------------------------+
| ALT |
: POINTERS :
| 32x4B |
+-------------------------------+
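A sketch of accessors over this 132-byte layout; the little-endian byte
order is an assumption of this sketch, not a statement about the
on-flash format:

    /* Read a 32-bit little-endian value from a byte buffer. */
    static uint32_t r32(const uint8_t *b)
    {
        return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
               ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
    }

    static uint32_t meta_get_id(const uint8_t *meta)
    {
        return r32(meta);                   /* id lives at offset 0 */
    }

    static uint32_t meta_get_alt(const uint8_t *meta, int level)
    {
        return r32(meta + 4 + (level << 2));    /* A0..A31 from offset 4 */
    }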