By Jon Ludlam - 2014-02-01
This article is part of a series documenting how MirageOS applications run under Xen. It covers suspend, resume and live migration.
One of the important advantages of using virtual machines (VMs) over physical machines to run your operating systems is that management of VMs is simpler and more powerful than managing physical computers. One new tool in the management toolkit is that of suspending and resuming to a state file. In many ways equivalent to shutting the lid on a laptop and having it go to sleep, a VM can be suspended such that it no longer consumes any memory or CPU resources on its host, and resumed at a later point when required. Unlike a laptop, the state of the VM is encapsulated in a state file on disk, which can be copied if you want to take a backup, replicated many times if you want to have multiple instances of your VM running, or copied to another physical host if you would like to run it elsewhere. This operation also forms the basis of live migration, where a running VM has its state replicated to another physical host in such a way that its execution can be stopped on the original host and almost immediately started on the destination, and users of the services provided by that VM are none the wiser.
For VMs using hardware virtualization instead of paravirtualization, doing this is relatively straightforward. The VM is stopped from executing, then the memory is saved to disk, and the state of any device emulator (qemu, in Xen's case) is also persisted to disk. To resume, load the qemu state back in, restore the memory, and unpause your domain. The OS within the VM continues running, unaware that anything has changed.
However, most operating systems inside VMs have software installed that is aware that it is running in a VM, and generally speaking, this is where work is required to ensure that these components survive a suspend and resume. In the case of the MirageOS Xen unikernels, it is mainly the IO devices that need to be aware of the changes that happen over the course of the operation. Since our unikernels are not fully virtualised but are paravirtualised kernels, there is also some infrastructure work that is required. The aim of this page is to document how these operations work.
The guiding principle in this work is to minimise the number of exceptional conditions that have to be handled. In some cases, applications must be made aware that they have gone through a suspend/resume cycle - for example, anything that is communicating with xenstore. However, in most cases, the application logic doesn't have to be aware of anything in particular happening. For example, the block and network layers can reissue requests that were in flight at the time of the suspend, and therefore any applications using these can carry on without any special logic required.
To explain the process of suspend and resume in MirageOS Xen guests, we will walk through the various operations in sequence.
The suspend example in the mirage-skeleton repository contains the control logic needed to get the guest to be able to suspend, and is therefore a good place to start looking. The first thing that happens when a suspend is requested is that the toolstack organising the operation will signal to the guest that it should begin the process. This can be done via several mechanisms, but the one supported in MirageOS today is by writing a particular key to xenstore:
/local/domain/<n>/control/shutdown = "suspend"
The code that watches for this path is here. The guest then acknowledges this by removing the key. It then jumps to the suspend code in sched.ml.
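The handshake described above can be modelled in a few lines. This is a toy Python sketch rather than the MirageOS code: `FakeXenstore` is a hypothetical in-memory stand-in for xenstore, and `handle_control_shutdown` and `on_suspend` are illustrative names. Only the `control/shutdown` path and the value `suspend` come from the text.

```python
class FakeXenstore:
    """Hypothetical in-memory stand-in for xenstore (a path -> value map)."""

    def __init__(self):
        self.keys = {}

    def write(self, path, value):
        self.keys[path] = value

    def read(self, path):
        return self.keys.get(path)

    def rm(self, path):
        self.keys.pop(path, None)


def handle_control_shutdown(xs, domid, on_suspend):
    """React to one change of the control/shutdown key.

    Returns True if a suspend request was seen and acknowledged.
    """
    path = f"/local/domain/{domid}/control/shutdown"
    if xs.read(path) != "suspend":
        return False      # no request, or a request this sketch ignores
    xs.rm(path)           # acknowledge by removing the key
    on_suspend()          # then enter the suspend path
    return True
```

In this model the toolstack side is just `xs.write(path, "suspend")`; seeing the key disappear is how it would learn that the guest has noticed the request.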
The first thing that happens there is that we call the Xenstore library to suspend Xenstore. This works by waiting for any in-flight requests to be responded to, then cancelling any threads that are waiting on watches. These have to be cancelled because watches rely on state in the xenstore daemon and therefore have to be reissued (potentially with different paths) when the VM resumes.
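As a rough illustration of "drain, then cancel", here is a sketch using Python's asyncio in place of Lwt. The names `suspend_xenstore`, `request` and `wait_on_watch` are invented for the example and do not correspond to the Xenstore library's API.

```python
import asyncio


async def suspend_xenstore(in_flight, watchers):
    """Drain in-flight requests, then cancel every task blocked on a watch."""
    if in_flight:
        await asyncio.gather(*in_flight)   # let pending replies arrive first
    for task in watchers:
        task.cancel()                      # watches must be reissued on resume
    # Wait for the cancellations to be delivered; collect instead of raising.
    return await asyncio.gather(*watchers, return_exceptions=True)


async def demo():
    async def request():
        await asyncio.sleep(0.01)          # pretend round trip to xenstored
        return "reply"

    async def wait_on_watch():
        await asyncio.Event().wait()       # never fires on its own

    in_flight = [asyncio.create_task(request())]
    watchers = [asyncio.create_task(wait_on_watch())]
    await suspend_xenstore(in_flight, watchers)
    return in_flight[0].result(), watchers[0].cancelled()
```

The ordering is the point of the sketch: requests that already have a reply on the way are allowed to complete normally, and only the waiters whose state lives in the daemon are cancelled.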
Then, the grant tables are suspended via the call to Gnt.suspend, which ends up calling a C function in the MirageOS kernel code. The main reason for calling this is that the grant code works via shared memory pages, and these pages are owned by Xen and not by the domain itself, which causes problems when suspending the VM as we will see shortly. Although the grant pages are mapped on demand, and thus could be remapped before we've finished, this is fine as we are now in a non-blocking part of the suspend code, and no other Lwt threads will be scheduled.
At this point we call the C function in sched_stubs.c. The first thing done there is to rewrite two fields in the start_info page: The MFNs of the xenstore page (store_mfn) and of the console page (console_mfn) are turned into PFNs. This is done so that when the guest is resumed, xenstored and xenconsoled can be given the pages that the guest is expecting to talk to them on. It is the restore code in libxc where the remapping takes place.
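The rewrite of the two start_info fields amounts to a lookup in the machine-to-physical mapping. A minimal sketch, assuming the start_info page is modelled as a dictionary and `m2p` is a hypothetical MFN-to-PFN map covering the guest's own pages (the real code works on a C struct):

```python
def canonicalise_start_info(start_info, m2p):
    """Return a copy of start_info with the two MFN fields replaced by PFNs.

    The field names are kept as store_mfn/console_mfn even though, after the
    rewrite, they hold PFNs; the restore side knows to translate them back.
    """
    out = dict(start_info)
    for field in ("store_mfn", "console_mfn"):
        out[field] = m2p[start_info[field]]
    return out
```

For example, with `m2p = {0x5000: 10, 0x5001: 11}`, a start_info holding `store_mfn = 0x5000` and `console_mfn = 0x5001` comes back with the values 10 and 11, and all other fields untouched.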
We then unmap the shared_info page. This is required because the shared_info page again belongs to Xen rather than to the guest, in a similar fashion to the grant pages. The page is allocated during domain creation.
We are now in a position to do the actual suspend hypercall. Interestingly, the suspend hypercall is defined in the header as a three parameter call, but the implementation in Xen ignores the third parameter, 'srec'. That parameter is instead used by libxc to locate the start_info page. Also of note is that Xen will always return success when the domain has suspended, but the hypercall has the notion of being 'cancelled', by which it means the guest has woken up in the same domain as it was when it called the hypercall. This is achieved by having libxc alter the VCPU registers on resume.
At this point, the domain will be shut down with reason 'suspend'. There is still work that needs to be done, however. PV guests have pagetables that reference the real MFNs rather than PFNs, so when the guest is resumed into a different area of a host's memory, these will need to be rewritten. This is done by canonicalizing the pagetables, which in this context means replacing the MFNs with PFNs. Since the function that maps MFNs to PFNs is only partial, this fails if any of the MFNs are outside of the domain's memory. This is the reason that all foreign pages such as the grant table pages and the shared info page needed to be unmapped before suspending.
We are now in a position to write the guest's memory to disk in the suspend image format. If a device emulator (qemu) was running, it would also have its state dumped at this point ready to be resumed later.
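The reason a partial mapping forces the unmapping of foreign pages is easy to see in a toy model. The sketch below treats a pagetable as a plain list of frame numbers (real entries also carry flag bits, which are ignored here) and uses invented names throughout; it is not the libxc implementation. The inverse step, which the restore side performs against the new host's mapping, is included for completeness.

```python
class ForeignFrame(Exception):
    """Raised when an entry points outside the guest's own memory."""


def canonicalise(entries, m2p):
    """Replace each MFN in a list of pagetable entries with its PFN."""
    pfns = []
    for mfn in entries:
        if mfn not in m2p:                 # m2p is partial: own pages only
            raise ForeignFrame(hex(mfn))
        pfns.append(m2p[mfn])
    return pfns


def uncanonicalise(pfns, p2m):
    """Inverse step on restore: map PFNs to the MFNs of the new host."""
    return [p2m[pfn] for pfn in pfns]
```

A pagetable that still referenced, say, a grant table frame would raise `ForeignFrame` here, because that frame has no entry in the guest's own mapping.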
When the VM is resumed, libxc loads the saved image back into memory. It then locates the pagetables, and 'uncanonicalizes' them back from PFNs to the new MFNs. The next task is to rewrite the VCPU registers to pass back the suspend return code as mentioned previously, and then we are ready to unpause the new domain. Control is then handed back to the MirageOS guest as if the hypercall has just returned. The domain is now close to the state of a cleanly started guest, and so we have to reinitialize many of the same things that are done on startup, including enabling event delivery, initialising the timers and so on.
We then return to the OCaml code, and increment the generation count of the event channels, which is explained below. Then, we resume the grant tables, which currently is a no-op as the table is mapped on first (re)use. The activations thread is then restored, and we then restore Xenstore. This is done in this order to satisfy interdependencies: activations need event channels working, and Xenstore needs grant tables and activations. Once this is done we can move on to a more generic set of resume items: we iterate through a list of other post-resume tasks, populated by other modules (such as mirage-block-xen), which are currently assumed to be dependency free.
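The ordering constraints above can be summarised in a small model. This is only a sketch: `ResumeSequence` and `add_resume_hook` are hypothetical names, and each step is reduced to a log entry so that the order is visible.

```python
class ResumeSequence:
    """Toy model of the resume order described above."""

    def __init__(self):
        self.generation = 0
        self.hooks = []   # post-resume callbacks, assumed mutually independent
        self.log = []

    def add_resume_hook(self, hook):
        self.hooks.append(hook)

    def resume(self):
        self.generation += 1              # invalidate old event channels
        self.log.append("grant tables")   # a no-op today: mapped on first use
        self.log.append("activations")    # needs event channels
        self.log.append("xenstore")       # needs grant tables and activations
        for hook in self.hooks:           # run in registration order
            hook()
        return self.log
```

Because the hooks are assumed to be dependency free, a flat list run in registration order is enough; if hooks ever depended on one another, the list would need an explicit ordering.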
An example of a resume hook can be seen in the block driver package, which is added when the module initialises. It registers a callback that iterates through the list of connected devices and re-plugs them. It then calls shutdown which wakes up every thread waiting for a response with an exception, and also any thread that is waiting for a free slot. These exceptions are handled back in mirage-block-xen, which simply retries the whole operation, being careful to use the refreshed information about the backend.
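The retry behaviour can be sketched as follows. The `Shutdown` exception, the `get_backend` callback and the `max_attempts` bound are assumptions made for this example; the text only says that the whole operation is retried with refreshed backend information.

```python
class Shutdown(Exception):
    """Hypothetical exception delivered to requests in flight at suspend."""


def with_retry(get_backend, operation, max_attempts=3):
    """Run operation(backend), retrying with fresh backend info on Shutdown."""
    for _ in range(max_attempts):
        backend = get_backend()   # re-read: the backend may have changed
        try:
            return operation(backend)
        except Shutdown:
            continue              # the ring was torn down; try again
    raise Shutdown(f"gave up after {max_attempts} attempts")
```

The important detail is that the backend is looked up inside the loop. Retrying against a handle captured before the suspend would simply fail again.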
The only thread that might possibly be running is the service thread that takes responses from the ring and demultiplexes them, and this thread will be killed when it attempts to wait on the event channel. Whenever an event channel is bound, we pair up the integer event channel number with a 'generation count' that is incremented on resume. Whenever the MirageOS guest attempts to wait for a signal from an event channel, the generation count is checked, and a stale generation results in an Lwt thread failure. The generation count is not checked when attempting to notify via an event channel, as this is a benign failure - it is only if we try to wait for a notification that the error occurs. Any threads that were already waiting at the point the domain suspended will be killed on resume by the activations logic. In the case of the block device, this error mode is handled by simply letting the thread die. A new one will have been set up during the resume as part of the replug.
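The generation-count scheme is small enough to model directly. In this sketch a bound channel is a `(port, generation)` pair; `EventChannels` and `StaleGeneration` are illustrative names, and `wait` returns immediately instead of blocking for an event.

```python
class StaleGeneration(Exception):
    """Raised when waiting on a channel bound before the last resume."""


class EventChannels:
    def __init__(self):
        self.generation = 0
        self.notified = []

    def bind(self, port):
        return (port, self.generation)    # remember when it was bound

    def resume(self):
        self.generation += 1              # every older channel is now stale

    def wait(self, channel):
        port, generation = channel
        if generation != self.generation:
            raise StaleGeneration(port)
        return port                       # a real wait would block here

    def notify(self, channel):
        port, _ = channel                 # generation deliberately unchecked
        self.notified.append(port)
```

A channel bound before `resume()` can still be passed to `notify` without error, but `wait` on it raises; rebinding the same port after the resume yields a fresh pair that waits normally.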
Live migration also uses this mechanism to move a running VM from one host to another with very little downtime. In this case, when the migration begins, the guest is switched to log-dirty mode, where the hypervisor starts to track which of the guest's pages have been written to. The toolstack can then iteratively go through these pages and send them to the destination using the same protocol as suspending to disk, but this time unmarshalling them straight back into memory. When it decides it has done enough iterations, it then invokes the suspend logic above and sends through only the last few dirty pages, which will be much faster than sending the entire memory image. The resume logic is then invoked and the domain starts running again.
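The iterative phase can be illustrated with a toy pre-copy loop. Everything here is a simplification: memory is a dictionary of page contents, the guest's writes during each round are supplied up front, and the stopping rule (`threshold`, `max_rounds`) is an assumption of this sketch, since the text does not say how the toolstack decides it has iterated enough.

```python
def live_migrate(memory, writes_per_round, threshold=2, max_rounds=10):
    """Toy pre-copy loop.

    memory:           page -> contents on the source (updated in place)
    writes_per_round: writes the guest makes while each round is sent
    Returns (dest_memory, rounds_while_running, pages_sent_while_paused).
    """
    dest = {}
    dirty = set(memory)                  # the first pass sends every page
    pending = iter(writes_per_round)
    rounds = 0
    while len(dirty) > threshold and rounds < max_rounds:
        to_send, dirty = dirty, set()
        for page in to_send:
            dest[page] = memory[page]
        # The guest keeps running; log-dirty mode records what it touches.
        for page, value in next(pending, {}).items():
            memory[page] = value
            dirty.add(page)
        rounds += 1
    # The guest is now suspended, so the remaining dirty set cannot grow.
    for page in dirty:
        dest[page] = memory[page]
    return dest, rounds, len(dirty)
```

With four pages and a guest that dirties three pages in the first round and one in the second, the loop runs two rounds while the guest is live and transfers a single page during the pause, which is where the short downtime comes from.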