Commit 251501a3 authored by Peter Maydell's avatar Peter Maydell
Browse files

Merge remote-tracking branch 'remotes/dgilbert/tags/pull-migration-20170228a' into staging



Migration pull

Note: The 'postcopy: Update userfaultfd.h header' is part of
Paolo's header update and will disappear if applied after it.

Signed-off-by: default avatarDr. David Alan Gilbert <dgilbert@redhat.com>

# gpg: Signature made Tue 28 Feb 2017 12:38:34 GMT
# gpg:                using RSA key 0x0516331EBC5BFDE7
# gpg: Good signature from "Dr. David Alan Gilbert (RH2) <dgilbert@redhat.com>"
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 45F5 C71B 4A0C B7FB 977A  9FA9 0516 331E BC5B FDE7

* remotes/dgilbert/tags/pull-migration-20170228a: (27 commits)
  postcopy: Add extra check for COPY function
  postcopy: Add doc about hugepages and postcopy
  postcopy: Check for userfault+hugepage feature
  postcopy: Update userfaultfd.h header
  postcopy: Allow hugepages
  postcopy: Send whole huge pages
  postcopy: Mask fault addresses to huge page boundary
  postcopy: Load huge pages in one go
  postcopy: Use temporary for placing zero huge pages
  postcopy: Plumb pagesize down into place helpers
  postcopy: Record largest page size
  postcopy: enhance ram_block_discard_range for hugepages
  exec: ram_block_discard_range
  postcopy: Chunk discards for hugepages
  postcopy: Transmit and compare individual page sizes
  postcopy: Transmit ram size summary word
  migration: fix use-after-free of to_dst_file
  migration: Update docs to discourage version bumps
  migration: fix id leak regression
  migrate: Introduce a 'dc->vmsd' check to avoid segfault for --only-migratable
  ...

Signed-off-by: default avatarPeter Maydell <peter.maydell@linaro.org>
parents c9fc677a 665414ad
Loading
Loading
Loading
Loading
+71 −0
Original line number Diff line number Diff line
@@ -161,6 +161,11 @@ include/hw/hw.h.

=== More about versions ===

Version numbers are intended for major incompatible changes to the
migration of a device, and using them breaks backwards-migration
compatibility; in general most changes can be made by adding Subsections
(see below) or _TEST macros (see below) which won't break compatibility.

You can see that there are several version fields:

- version_id: the maximum version_id supported by VMState for that device.
@@ -175,6 +180,9 @@ version_id. And the function load_state_old() (if present) is able to
load state from minimum_version_id_old to minimum_version_id.  This
function is deprecated and will be removed when no more users are left.

Saving state will always create a section with the 'version_id' value
and thus can't be loaded by any older QEMU.

===  Massaging functions ===

Sometimes, it is not enough to be able to save the state directly
@@ -292,6 +300,56 @@ save/send this state when we are in the middle of a pio operation
not enabled, the values on that fields are garbage and don't need to
be sent.

Using a condition function that checks a 'property' to determine whether
to send a subsection allows backwards migration compatibility when
new subsections are added.

For example;
   a) Add a new property using DEFINE_PROP_BOOL - e.g. support-foo and
      default it to true.
   b) Add an entry to the HW_COMPAT_ for the previous version
      that sets the property to false.
   c) Add a static bool  support_foo function that tests the property.
   d) Add a subsection with a .needed set to the support_foo function
   e) (potentially) Add a pre_load that sets up a default value for 'foo'
      to be used if the subsection isn't loaded.

Now that subsection will not be generated when using an older
machine type and the migration stream will be accepted by older
QEMU versions. pre-load functions can be used to initialise state
on the newer version so that they default to suitable values
when loading streams created by older QEMU versions that do not
generate the subsection.

In some cases subsections are added for data that had been accidentally
omitted by earlier versions; if the missing data causes the migration
process to succeed but the guest to behave badly then it may be better
to send the subsection and cause the migration to explicitly fail
with the unknown subsection error.   If the bad behaviour only happens
with certain data values, making the subsection conditional on
the data value (rather than the machine type) allows migrations to succeed
in most cases.  In general the preference is to tie the subsection to
the machine type, and allow reliable migrations, unless the behaviour
from omission of the subsection is really bad.

= Not sending existing elements =

Sometimes members of the VMState are no longer needed;
  removing them will break migration compatibility
  making them version dependent and bumping the version will break backwards
   migration compatibility.

The best way is to:
  a) Add a new property/compatibility/function in the same way for subsections
    above.
  b) replace the VMSTATE macro with the _TEST version of the macro, e.g.:
     VMSTATE_UINT32(foo, barstruct)
   becomes
     VMSTATE_UINT32_TEST(foo, barstruct, pre_version_baz)

  Sometime in the future when we no longer care about the ancient
versions these can be killed off.

= Return path =

In most migration scenarios there is only a single data path that runs
@@ -482,3 +540,16 @@ request for a page that has already been sent is ignored. Duplicate requests
such as this can happen as a page is sent at about the same time the
destination accesses it.

=== Postcopy with hugepages ===

Postcopy now works with hugetlbfs backed memory:
  a) The linux kernel on the destination must support userfault on hugepages.
  b) The huge-page configuration on the source and destination VMs must be
     identical; i.e. RAMBlocks on both sides must use the same page size.
  c) Note that -mem-path /dev/hugepages  will fall back to allocating normal
     RAM if it doesn't have enough hugepages, triggering (b) to fail.
     Using -mem-prealloc enforces the allocation using hugepages.
  d) Care should be taken with the size of hugepage used; postcopy with 2MB
     hugepages works well, however 1GB hugepages are likely to be problematic
     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
     and until the full page is transferred the destination thread is blocked.
+83 −0
Original line number Diff line number Diff line
@@ -45,6 +45,12 @@
#include "exec/address-spaces.h"
#include "sysemu/xen-mapcache.h"
#include "trace-root.h"

#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
#include <fcntl.h>
#include <linux/falloc.h>
#endif

#endif
#include "exec/cpu-all.h"
#include "qemu/rcu_queue.h"
@@ -1518,6 +1524,19 @@ size_t qemu_ram_pagesize(RAMBlock *rb)
    return rb->page_size;
}

/* Returns the largest size of page in use */
size_t qemu_ram_pagesize_largest(void)
{
    RAMBlock *block;
    size_t largest = 0;

    QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
        largest = MAX(largest, qemu_ram_pagesize(block));
    }

    return largest;
}

static int memory_try_enable_merging(void *addr, size_t len)
{
    if (!machine_mem_merge(current_machine)) {
@@ -3294,4 +3313,68 @@ int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque)
    rcu_read_unlock();
    return ret;
}

/*
 * Unmap pages of memory from start to start+length such that
 * they a) read as 0, b) Trigger whatever fault mechanism
 * the OS provides for postcopy.
 * The pages must be unmapped by the end of the function.
 * Returns: 0 on success, none-0 on failure
 *
 */
int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
{
    int ret = -1;

    uint8_t *host_startaddr = rb->host + start;

    if ((uintptr_t)host_startaddr & (rb->page_size - 1)) {
        error_report("ram_block_discard_range: Unaligned start address: %p",
                     host_startaddr);
        goto err;
    }

    if ((start + length) <= rb->used_length) {
        uint8_t *host_endaddr = host_startaddr + length;
        if ((uintptr_t)host_endaddr & (rb->page_size - 1)) {
            error_report("ram_block_discard_range: Unaligned end address: %p",
                         host_endaddr);
            goto err;
        }

        errno = ENOTSUP; /* If we are missing MADVISE etc */

        if (rb->page_size == qemu_host_page_size) {
#if defined(CONFIG_MADVISE)
            /* Note: We need the madvise MADV_DONTNEED behaviour of definitely
             * freeing the page.
             */
            ret = madvise(host_startaddr, length, MADV_DONTNEED);
#endif
        } else {
            /* Huge page case  - unfortunately it can't do DONTNEED, but
             * it can do the equivalent by FALLOC_FL_PUNCH_HOLE in the
             * huge page file.
             */
#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
            ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                            start, length);
#endif
        }
        if (ret) {
            ret = -errno;
            error_report("ram_block_discard_range: Failed to discard range "
                         "%s:%" PRIx64 " +%zx (%d)",
                         rb->idstr, start, length, ret);
        }
    } else {
        error_report("ram_block_discard_range: Overrun block '%s' (%" PRIu64
                     "/%zx/" RAM_ADDR_FMT")",
                     rb->idstr, start, length, rb->used_length);
    }

err:
    return ret;
}

#endif
+7 −0
Original line number Diff line number Diff line
@@ -37,6 +37,7 @@
#include "hw/boards.h"
#include "hw/sysbus.h"
#include "qapi-event.h"
#include "migration/migration.h"

int qdev_hotplug = 0;
static bool qdev_hot_added = false;
@@ -903,6 +904,7 @@ static void device_set_realized(Object *obj, bool value, Error **errp)
    Error *local_err = NULL;
    bool unattached_parent = false;
    static int unattached_count;
    int ret;

    if (dev->hotplugged && !dc->hotpluggable) {
        error_setg(errp, QERR_DEVICE_NO_HOTPLUG, object_get_typename(obj));
@@ -910,6 +912,11 @@ static void device_set_realized(Object *obj, bool value, Error **errp)
    }

    if (value && !dev->realized) {
        ret = check_migratable(obj, &local_err);
        if (ret < 0) {
            goto fail;
        }

        if (!obj->parent) {
            gchar *name = g_strdup_printf("device[%d]", unattached_count++);

+0 −19
Original line number Diff line number Diff line
@@ -8,7 +8,6 @@
#include "monitor/monitor.h"
#include "trace.h"
#include "qemu/cutils.h"
#include "migration/migration.h"

static void usb_bus_dev_print(Monitor *mon, DeviceState *qdev, int indent);

@@ -688,8 +687,6 @@ USBDevice *usbdevice_create(const char *cmdline)
    const char *params;
    int len;
    USBDevice *dev;
    ObjectClass *klass;
    DeviceClass *dc;

    params = strchr(cmdline,':');
    if (params) {
@@ -724,22 +721,6 @@ USBDevice *usbdevice_create(const char *cmdline)
        return NULL;
    }

    klass = object_class_by_name(f->name);
    if (klass == NULL) {
        error_report("Device '%s' not found", f->name);
        return NULL;
    }

    dc = DEVICE_CLASS(klass);

    if (only_migratable) {
        if (dc->vmsd->unmigratable) {
            error_report("Device %s is not migratable, but --only-migratable "
                         "was specified", f->name);
            return NULL;
        }
    }

    if (f->usbdevice_init) {
        dev = f->usbdevice_init(bus, params);
    } else {
+2 −0
Original line number Diff line number Diff line
@@ -64,6 +64,7 @@ void qemu_ram_set_idstr(RAMBlock *block, const char *name, DeviceState *dev);
void qemu_ram_unset_idstr(RAMBlock *block);
const char *qemu_ram_get_idstr(RAMBlock *rb);
size_t qemu_ram_pagesize(RAMBlock *block);
size_t qemu_ram_pagesize_largest(void);

void cpu_physical_memory_rw(hwaddr addr, uint8_t *buf,
                            int len, int is_write);
@@ -105,6 +106,7 @@ typedef int (RAMBlockIterFunc)(const char *block_name, void *host_addr,
    ram_addr_t offset, ram_addr_t length, void *opaque);

int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque);
int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length);

#endif

Loading