notes.txt

================================================================================
Persistent User Programs: A Checkpoint / Restore Framework
Team SIGSYS(31): Savannah Amos and Matthew Zhong
================================================================================

   -----------------------------------------------------------------------------
Q: What does this system need to accomplish? From the top-down, what are the
   ways that someone can use this system, and why would they use it?
   -----------------------------------------------------------------------------
    In this project, our objective is to create a checkpoint-restore framework
    which users can integrate into C applications. Using various forms of 
    persistent memory, this framework will enable a user-level application to
    continue computation from the middle of a crash, as if the program never had
    crashed in the first place. 

    Imagine if we had a program which was performing some form of complex 
    computation over a long period of time, like for example, a machine learning
    and training algorithm. Normally, abrupt termination would spell disaster
    for the algorithm - we would need to restart the algorithm from the very 
    beginning and lose all of this progress.

        +----------------+  In the diagram on the left, we can see what happens 
        | bash ./ml_algo |  to the execution of a simple program when a crash 
        |                |  occurs. The issue is that, after crashing at <1>,
        | 0. <STARTED>   |  we would need to completely restart the program
        |        |       |  at <0>, and lose all of our progress.
        |        V       |
        | 1. <CRASHED>   |  A checkpoint-restore algorithm would allow us to
        |                |  pick up the program from where the program crashed
        | bash ./ml_algo |  at point <2>, allowing us to continue execution 
        |                |  while maintaining all previous program state.
        | 2. <RESTORED>  |
        |        |       |  For scalable algorithms which run on thousands of
        |        V       |  threads or which naturally take a long time to run,
        | 3. <FINISHED>  |  a checkpoint-restore framework increases the 
        |                |  stability and likelihood that we will get meaningful
        +----------------+  results from the programs in a timely manner. 
    
    As a simple C library, we envision this program to have the following 
    interface, included in a single header file - checkpoint_restore.h:

    /* initializes the entire checkpoint-restore framework */
    int crinit(struct crattr *attr);

    /* the meat of the checkpoint-restore framework, self explanatory */
    int checkpoint(void);
    int restore(void);

    /* non-volatile memory allocation and management */
    void *crmalloc(size_t size);
    void *crrealloc(void *ptr, size_t size);
    void crfree(void *ptr);

    /* non-volatile thread library, based off of pthreads */
    int crthread_create(crthread_t *thread, 
                        void *(*start_routine), (void *), void *arg);
    int crthread_join(crthread_t thread, void **retval);
    int crthread_cancel(crthread_t thread);
    int crthread_kill(crthread_t thread, int sig);

    /* non-volatile semaphores, based off of POSIX semaphores */
    int crsem_init(crsem_t *sem, unsigned int value);
    int crsem_post(crsem_t *sem);
    int crsem_wait(crsem_t *sem);

   -----------------------------------------------------------------------------
Q: What gets checkpointed? What would we need to checkpoint, and how would we
   restore program state after checkpointing?
   -----------------------------------------------------------------------------
    
    Overall, our strategy for creating a checkpoint-restore framework requires,
    at the very minimum, three robust modules:
        1. A non-volatile memory management system
        2. A threading and synchronization system
        3. A checkpoint decisions subsystem which controls <1> and <2>.
        4. A restart routine which initializes or reinitializes <1>, <2>, and <3>

    Because of the reliance on kernel-level solutions, we will not allow for 
    checkpoint-restore on the main thread. Modern solutions which can checkpoint
    and restore the main thread each heavily rely on kernel-level solutions.
    Our solution only allows for the checkpointing and restoration of all 
    user-spawned threads.

    Below is an overall system architecture diagram for how this solution will
    checkpoint and restore threads.

      +--------------------+
      |                    | spawns user threads
      |                    +-----------------------------------------++
      |                    | signals threads for manual checkpoint   ||
      |  Main Application  +------------------------++               ||
      |                    | synchronization        ||               || 
      |                    +-------++               ||               || 
      |                    |       ||               ||               ||
      +-----++-------------+       ||               ||               ||
            || spawned             ||               ||               ||
            || upon crinit()       ||               ||               ||
            || <restart routine>   ||               ||               ||
            ||                     ||  user-spawned || thread block  ||
            ||            +--------||---------------||---------------||--------+
            VV            |        VV               VV               VV        |
      +------------+      |  +------------+   +------------+   +------------+  |
      | CR Control |      |  | CRThread 1 |   | CRThread 2 |   | CRThread 3 |  |
      |   Thread   |      |  |     \      |   |     \      |   |     \      |  |
      |      \     |      |  |      \     |   |      \     |   |      \     |  |
      |      /     |      |  |      /     |   |      /     |   |      /     |  |
      |     /      |      |  |     /      |   |     /      |   |     /      |  |
      |     \      | CTRL |  |     \      |   |     \      |   |     \      |  |
      |      \     +----->|  |      \     |   |      \     |   |      \     |  |
      |      /     |      |  |      /     |   |      /     |   |      /     |  |
      |     /      |      |  |     /      |   |     /      |   |     /      |  |
      |     |      |      |  |     |      |   |     |      |   |     |      |  |
      |     V      |      |  |     V      |   |     V      |   |     V      |  |
      |            |      |  |            |   |            |   |            |  |
      +---+----+---+      |  +---+----+---+   +---+----+---+   +---+----+---+  |
          |    |          |      |    |           |    |           |    |      |
          |    |          +------|----|-----------|----|-----------|----|------+
          |    |                 |    |           |    |           |    |      
          |    |                 V    V           V    V           V    V
          |    |             +-------------------------------------------------+
          |    +------------>|                 Volatile Heap                   |
          |    heap / cache  +-------------------------------------------------+
          |    management                ||
          |                              VV            +------------+
          |                       +--------------+     | Heartbeat  |
          +---------------------->| CR Subsystem |<----+ Checkpoint |
          internal C/R management +--------------+     | Timer      |
                                         ||            +------------+
                                         VV
    +--------------------------------------------------------------------------+
    |                       Non-Volatile Heap (Filesystem)                     |
    +--------------------------------------------------------------------------+

    In essence, what we will be doing is extending the regular user application
    so that it spawns threads which each can be checkpointed. Initially, these
    threads, spawned in our own user thread block module, run in volatile 
    memory - their stacks will actually be allocated from a heap of our design. 
    We specifically need to reimplement our heap in order to provide for a low 
    level method to decide what memory needs to be cached, and when.

    The functions setjmp(jmp_buf env) and longjmp(jmp_buf env, int val) allow us
    to set jump locations and then jump to them - we could put the jump buffer 
    into some form of non-volatile storage upon checkpoint, mmap() the addresses
    so that they stay consistent between runs, and then longjmp() upon restoraton.

      +--------------------+
      |                    |
      |  Main Application  |
      |                    |
      +--+-----------------+
         |
         V                                          (Non-Volatile Store) CRStore
      +----------+-----------+-------------------------------------------------+
      | METADATA | jmp_buf[] | Thread-Local Store -->                <-- Stack |
      +----------+-----------+-------------------------------------------------+
             ------------------> increasing addresses ------------------>

    When restoring the program, we mmap() the jmp_buf[] and then jump to the 
    currently executing point in our program. We need to make sure that the 
    mmap() addresses are coherent with before the checkpoint as part of our
    implementation. 

    The process specified above will occur for each thread which was 
    checkpointed successfully. The threads will be loaded back into volatile
    memory, meaning the volatile heap will be restored to its state prior to the
    crash. Then, each thread's stack pointer will be readjusted to the last
    selected checkpoint as set by setjmp(), and the program should continue
    execution as normal.

   -----------------------------------------------------------------------------
Q: What happens if we get a CRASH while in the midst of a checkpoint? Wouldn't
   we still be in the middle of writing to a file?
   -----------------------------------------------------------------------------

    - Solution Idea: 
      Use two output files. The output file would have a memory layout which has
      a starting METADATA section and a separate PAYLOAD section.
        
        +----------+---------+
        | METADATA | PAYLOAD |
        +----+-----+---------+
             |
             +---> this metadata contains a flag called WRITE_LOCK at some
                   predefined offset

      When we checkpoint, WRITE_LOCK is used like a semaphore or lock, in that 
      we first set WRITE_LOCK to 0, then write all of our data, then set 
      WRITE_LOCK back to 1. We know that we had a mid-write crash if WRITE_LOCK
      is still zero when we try to restore that data segment.

      To mitigate this issue, we should have a double-buffered output. Let's say
      that we're just considering a single-threaded application, for now. We'd 
      have a layout that looks like this:

                             +--------------------+
                             |                    |
                             |  Main Application  |
                             |                    |
                             +--+--------------+--+
                                |              |              (CHP = Checkpoint)
                    CHP 1       |              |       CHP 2
                    CHP 3 +-----+              +-----+ CHP 4
                    CHP 5 |                          | CHP 6
                          V                          V
                    +-----------+              +-----------+
                    | CRStore 1 |              | CRStore 2 |
                    +-----------+              +-----------+

      Checkpoints can be made to opposite buffers at each write. Only one buffer
      is ever written to at a checkpoint. While this approach can certainly have
      issues with many slower writes, it DOES solve the crash-on-checkpoint 
      problem, in which we would otherwise have no way to recover from a crash 
      while our framework was performing a background checkpoint. Observe the
      diagram and explanation below for a brief overview on the checkpointing
      process:

          Checkpoint Operation              Volatile Memory State to CRStore
          --------------------              --------------------------------

                                             META   PAYLOAD
                                            +------+---------------------------+
          1. Lock the CRStore Buffer and    | WL=0 | X     X      XX       XX  |
             write the lock into CRStore    +--+---+---------------------------+
                                               |
                                               V  
                                            +------+---------------------------+
                                    CRStore | WL=0 |                           |
                                            +------+---------------------------+
                                            
                                             META   PAYLOAD
                                            +------+---------------------------+
           2. Deposit all updated memory    | WL=0 | X     X      XX       XX  |
              blocks in the payload to      +------+-+-----+------++-------++--+
              the CRStore                            |     |      ||       ||
                                                     V     V      VV       VV
                                            +------+---------------------------+
                                    CRStore | WL=0 | +     +      ++       ++  |
                                            +------+---------------------------+

                                             META   PAYLOAD
                                            +------+---------------------------+
          3. Clear dirty markers on all     | WL=0 |                           |
             previously written regions     +------+---------------------------+

                                            +------+---------------------------+
                                    CRStore | WL=0 | +     +      ++       ++  |
                                            +------+---------------------------+

                                             META   PAYLOAD
                                            +------+---------------------------+
          4. Unlock the CRStore Buffer and  | WL=1 |                           |
             write the unlock into CRStore  +------+---------------------------+
                                               |
                                               V  
                                            +------+---------------------------+
                                    CRStore | WL=1 | +     +      ++       ++  |
                                            +------+---------------------------+
      
      The theory behind why we're doing this is to be able to detect if a crash
      which occurs mid-checkpoint has corrupted our data. We define a data 
      corruption to have occurred when a crash occurs anywhere within steps 2 
      and 3, inclusive.

      When restoring the program state, we can observe the states of each of the
      non-volatile stores. If both of the stores have a WRITE_LOCK of 1, we 
      choose the store with the most recent timestamp to restore the program.
      If one of the stores has a WRITE_LOCK of 0, however, we can safely assume
      that a crash-on-write has occurred for that store, which means that the
      only safe restoration point is the store with a WRITE_LOCK of 1.

   -----------------------------------------------------------------------------
Q: How will we test this system?
   -----------------------------------------------------------------------------
    Our first