Guest RAM


Q&A

  • How can we convert a virt addr in QEMU to a (faked) physical addr?

Reference 1 2

Memory Regions

The entire physical memory is modelled as an acyclic graph of MemoryRegion objects 1. The sinks (leaves) are RAM and MMIO regions, while the other nodes represent buses, memory controllers, and memory regions that have been rerouted.

In addition to MemoryRegion objects, the memory API provides AddressSpace objects for every root and possibly for intermediate MemoryRegions too. These represent memory as seen from the CPU’s or a device’s viewpoint.

Types of Regions

All are C type MemoryRegion:

  • RAM. A range of host memory available to the guest. Init with
    • memory_region_init_ram(), or
    • memory_region_init_resizeable_ram(),
    • memory_region_init_ram_from_file(),
    • memory_region_init_ram_ptr()
  • MMIO. A range of guest memory that is implemented by host callbacks; each read or write causes a callback to be called on the host. Init with memory_region_init_io(), passing it a MemoryRegionOps structure describing the callbacks.
  • ROM. A ROM memory region works like RAM for reads (directly accessing a region of host memory), and forbids writes. Init with memory_region_init_rom().
  • ROM device. A memory region that can be read like RAM and written like MMIO (via callbacks). Init with memory_region_init_rom_device().
  • IOMMU. Translates the addresses of accesses made to it and forwards them to some other target memory region. Only needed for modelling an IOMMU, not for simple devices. Init with memory_region_init_iommu().
  • container. A container is a set of other memory regions, each at a different offset. Containers are useful for grouping several regions into one unit. For example, a PCI BAR may be composed of a RAM region and an MMIO region. A container can also hold overlapping regions: for example a memory controller that can overlay a subregion of RAM with MMIO or ROM, or a PCI controller that does not prevent cards from claiming overlapping BARs. Init with memory_region_init().
  • alias. A subsection of another region. Aliases allow a region to be split apart into discontiguous regions. Examples of uses are memory banks used when the guest address space is smaller than the amount of RAM addressed, or a memory controller that splits main memory to expose a “PCI hole”. Aliases may point to any type of region, including other aliases, but an alias may not point back to itself, directly or indirectly. You initialize these with memory_region_init_alias().
  • reservation region. A reservation region is primarily for debugging. It claims I/O space that is not supposed to be handled by QEMU itself. The typical use is to track parts of the address space which will be handled by the host kernel when KVM is enabled. You initialize these by passing a NULL callback parameter to memory_region_init_io().
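
The difference between RAM, ROM and MMIO regions can be sketched with a toy model. Everything below (ToyRegion, ToyOps, the counter device) is hypothetical and is not QEMU code; only the shape of the read/write callbacks follows MemoryRegionOps. The ROM case simply ignores writes, which is one possible reading of "forbids writes".

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model only: these types mimic the idea, not QEMU's real structs. */
typedef enum { TOY_RAM, TOY_ROM, TOY_MMIO } ToyKind;

typedef struct {
    /* Same shape as the read/write callbacks in MemoryRegionOps. */
    uint64_t (*read)(void *opaque, uint64_t addr, unsigned size);
    void (*write)(void *opaque, uint64_t addr, uint64_t data, unsigned size);
} ToyOps;

typedef struct {
    ToyKind kind;
    uint8_t *host;      /* backing host memory (RAM and ROM only) */
    const ToyOps *ops;  /* callbacks (MMIO only) */
    void *opaque;       /* device state handed to the callbacks */
} ToyRegion;

/* Read one byte at offset addr inside the region. */
uint64_t toy_read8(ToyRegion *r, uint64_t addr)
{
    if (r->kind == TOY_MMIO) {
        return r->ops->read(r->opaque, addr, 1);  /* host callback */
    }
    return r->host[addr];                         /* direct host memory access */
}

/* Write one byte at offset addr inside the region. */
void toy_write8(ToyRegion *r, uint64_t addr, uint8_t val)
{
    switch (r->kind) {
    case TOY_RAM:
        r->host[addr] = val;
        break;
    case TOY_ROM:
        break;                                    /* toy choice: drop the write */
    case TOY_MMIO:
        r->ops->write(r->opaque, addr, val, 1);   /* host callback */
        break;
    }
}

/* Hypothetical device: a register that counts how often it was written. */
uint64_t counter_read(void *opaque, uint64_t addr, unsigned size)
{
    (void)addr; (void)size;
    return *(uint64_t *)opaque;
}

void counter_write(void *opaque, uint64_t addr, uint64_t data, unsigned size)
{
    (void)addr; (void)data; (void)size;
    ++*(uint64_t *)opaque;
}

const ToyOps counter_ops = { counter_read, counter_write };
```

A guest store to the MMIO region therefore never touches host memory directly; it only runs counter_write, and a later load runs counter_read.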

Example Memory Map (simplified):

system_memory: container@0-2^48-1
 |
 +---- lomem: alias@0-0xdfffffff ---> #ram (0-0xdfffffff)
 |
 +---- himem: alias@0x100000000-0x11fffffff ---> #ram (0xe0000000-0xffffffff)
 |
 +---- vga-window: alias@0xa0000-0xbffff ---> #pci (0xa0000-0xbffff)
 |      (prio 1)
 |
 +---- pci-hole: alias@0xe0000000-0xffffffff ---> #pci (0xe0000000-0xffffffff)

pci (0-2^32-1)
 |
 +--- vga-area: container@0xa0000-0xbffff
 |      |
 |      +--- alias@0x00000-0x7fff  ---> #vram (0x010000-0x017fff)
 |      |
 |      +--- alias@0x08000-0xffff  ---> #vram (0x020000-0x027fff)
 |
 +---- vram: ram@0xe1000000-0xe1ffffff
 |
 +---- vga-mmio: mmio@0xe2000000-0xe200ffff

ram: ram@0x00000000-0xffffffff

Above is a (simplified) PC memory map. The 4GB RAM block is mapped into the system address space via two aliases: “lomem” is a 1:1 mapping of the first 3.5GB; “himem” maps the last 0.5GB at address 4GB. This leaves 0.5GB for the so-called PCI hole, that allows a 32-bit PCI bus to exist in a system with 4GB of memory.

The memory controller diverts addresses in the range 640K-768K to the PCI address space. This is modelled using the “vga-window” alias, mapped at a higher priority so it obscures the RAM at the same addresses. The vga window can be removed by programming the memory controller; this is modelled by removing the alias and exposing the RAM underneath.

The pci address space is not a direct child of the system address space, since we only want parts of it to be visible (we accomplish this using aliases). It has two subregions: vga-area models the legacy vga window and is occupied by two 32K memory banks pointing at two sections of the framebuffer. In addition the vram is mapped as a BAR at address e1000000, and an additional BAR, vga-mmio, containing MMIO registers is mapped after it.
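
The alias arithmetic in this map can be written out as a small sketch. It hard-codes the lomem and himem ranges from the example and treats the vga-window as always enabled; ToyAlias and sys_to_ram_offset are hypothetical names, not part of QEMU.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy description of one alias into the #ram block (not a QEMU type). */
typedef struct {
    uint64_t start;   /* first system address covered by the alias */
    uint64_t end;     /* last system address covered (inclusive) */
    uint64_t target;  /* offset inside #ram that 'start' maps to */
} ToyAlias;

const ToyAlias ram_aliases[] = {
    { 0x000000000ULL, 0x0dfffffffULL, 0x00000000ULL }, /* lomem */
    { 0x100000000ULL, 0x11fffffffULL, 0xe0000000ULL }, /* himem */
};

/*
 * Translate a system address into an offset inside the 4GB RAM block.
 * Returns false when the address does not resolve to RAM.
 */
bool sys_to_ram_offset(uint64_t addr, uint64_t *ram_off)
{
    /* vga-window (prio 1) obscures the RAM at 640K-768K while enabled. */
    if (addr >= 0xa0000ULL && addr <= 0xbffffULL) {
        return false;
    }
    for (size_t i = 0; i < sizeof(ram_aliases) / sizeof(ram_aliases[0]); i++) {
        const ToyAlias *a = &ram_aliases[i];
        if (addr >= a->start && addr <= a->end) {
            *ram_off = a->target + (addr - a->start);
            return true;
        }
    }
    return false;  /* e.g. the PCI hole at 0xe0000000-0xffffffff */
}
```

For instance, system address 0x100000000 (4GB) lands at RAM offset 0xe0000000 (3.5GB), which is the first byte of the last 0.5GB of the RAM block.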

Region Lifecycle

A region is created by one of the memory_region_init*() functions and attached to an object, which acts as its owner or parent.

A region can be added to an address space or a container with memory_region_add_subregion(), and removed with memory_region_del_subregion().

Overlapping Regions and Priority

Priority: 2 > 1 (the region with the larger priority value wins where regions overlap)

      0      1000   2000   3000   4000   5000   6000   7000   8000
      |------|------|------|------|------|------|------|------|
A:    [                                                      ]
C:    [CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC]
B:                  [                          ]
D:                  [DDDDD]
E:                                [EEEEE]

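Container A spans 0x0 – 0x8000. Region C is mapped into A at 0x0 – 0x6000 with priority 1.

Container B is mapped into A at 0x2000 – 0x6000 with priority 2. Inside B, region D occupies 0x0 – 0x1000 and region E occupies 0x2000 – 0x3000 (both relative to the start of B).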

The regions will be seen within this address range:

[CCCCCCCCCCCC][DDDDD][CCCCC][EEEEE][CCCCC]

Overlapping subregions are added with memory_region_add_subregion_overlap(), which takes the priority as an argument.

A priority can be set on any kind of region: RAM, containers, aliases, etc.

Visibility

Rules to select a memory region when the guest accesses an address:

  • all direct subregions of the root region are matched against the address, in descending priority order

    • if the address lies outside the region offset/size, the subregion is discarded
    • if the subregion is a leaf (RAM or MMIO), the search terminates, returning this leaf region
    • if the subregion is a container, the same algorithm is used within the subregion (after the address is adjusted by the subregion offset)
    • if the subregion is an alias, the search is continued at the alias target (after the address is adjusted by the subregion offset and alias offset)
    • if a recursive search within a container or alias subregion does not find a match (because of a “hole” in the container’s coverage of its address range), then if this is a container with its own MMIO or RAM backing the search terminates, returning the container itself. Otherwise we continue with the next subregion in priority order
  • if none of the subregions match the address then the search terminates with no match found
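
The lookup rules above can be sketched as a short recursive function. This is a toy model, not QEMU's implementation: MiniRegion and mini_lookup are hypothetical, and the sketch assumes each container's subregion array is already sorted by descending priority. The sample data reproduces the A/B/C/D/E layout from the overlap example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { MINI_LEAF, MINI_CONTAINER, MINI_ALIAS } MiniKind;

typedef struct MiniRegion MiniRegion;
struct MiniRegion {
    const char *name;
    MiniKind kind;
    uint64_t offset;        /* offset inside the parent container */
    uint64_t size;
    int priority;
    bool has_backing;       /* container with its own RAM/MMIO backing */
    MiniRegion *alias_target;
    uint64_t alias_offset;
    MiniRegion **subs;      /* assumed sorted by descending priority */
    size_t nsubs;
};

/* addr is relative to the start of mr; returns NULL if nothing matches. */
const MiniRegion *mini_lookup(const MiniRegion *mr, uint64_t addr)
{
    if (mr->kind == MINI_LEAF) {
        return mr;                               /* leaf: search terminates */
    }
    if (mr->kind == MINI_ALIAS) {
        uint64_t t = addr + mr->alias_offset;    /* adjust by alias offset */
        return t < mr->alias_target->size ? mini_lookup(mr->alias_target, t)
                                          : NULL;
    }
    for (size_t i = 0; i < mr->nsubs; i++) {
        const MiniRegion *sub = mr->subs[i];
        if (addr < sub->offset || addr - sub->offset >= sub->size) {
            continue;                            /* outside offset/size */
        }
        const MiniRegion *hit = mini_lookup(sub, addr - sub->offset);
        if (hit) {
            return hit;
        }
        /* hole inside sub: fall through to the next subregion */
    }
    return mr->has_backing ? mr : NULL;
}

/* The layout from the overlap example: B (prio 2) sits on top of C (prio 1). */
MiniRegion reg_D = { .name = "D", .kind = MINI_LEAF, .offset = 0x0000, .size = 0x1000 };
MiniRegion reg_E = { .name = "E", .kind = MINI_LEAF, .offset = 0x2000, .size = 0x1000 };
MiniRegion *subs_B[] = { &reg_D, &reg_E };
MiniRegion reg_B = { .name = "B", .kind = MINI_CONTAINER, .offset = 0x2000,
                     .size = 0x4000, .priority = 2, .subs = subs_B, .nsubs = 2 };
MiniRegion reg_C = { .name = "C", .kind = MINI_LEAF, .offset = 0x0000,
                     .size = 0x6000, .priority = 1 };
MiniRegion *subs_A[] = { &reg_B, &reg_C };
MiniRegion reg_A = { .name = "A", .kind = MINI_CONTAINER, .size = 0x8000,
                     .subs = subs_A, .nsubs = 2 };
```

Calling mini_lookup(&reg_A, 0x3000) first descends into B, finds a hole between D and E, and then falls back to C, which matches the [CCCCC] stretch between D and E in the diagram.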

From the Blog2:

Implementation:

Guest RAM consists of the memory backend (-m [size=]megs) plus hotpluggable guest memory (DIMMs modelled by pc-dimm, configured with slots=n and maxmem=size).

The “pc-dimm” and “memory-backend” objects are user-visible parts of guest RAM in QEMU. They can be managed using the QEMU command-line and QMP monitor interface.

Hotpluggable guest physical memory

Defined in hw/mem/pc-dimm.c. A pc-dimm device models a DIMM.

A pc-dimm must be associated with a “memory-backend” object.

Memory backends

Defined in backends/hostmem.c

Contains the actual host memory that backs guest RAM. This can be either anonymous mmapped memory or file-backed mmapped memory. (File-backed guest RAM allows Linux hugetlbfs to be used for huge pages on the host, and also allows shared memory so that other host applications can access guest RAM.)

RAMBlock:

Memory inside a “memory-backend” is actually mmapped by a RAMBlock through qemu_ram_alloc() in exec.c. Each RAMBlock has a pointer to the mmapped memory and also a ram_addr_t offset. The ram_addr_t offset is in a global namespace and is used to identify the RAMBlock.

However, the ram_addr_t namespace covers only part of the guest physical memory space. It is a tightly packed address space containing all RAMBlocks. Other guest physical memory regions, such as reserved memory and memory-mapped I/O, are not identified by a ram_addr_t.

All RAMBlocks are in a global list RAMList.
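
How a ram_addr_t identifies a RAMBlock can be sketched with a toy list walk. ToyRAMBlock and toy_ram_addr_to_host are hypothetical simplifications (a plain linked list, a single length field), not the QEMU data structures or functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for RAMBlock: only the fields needed for the lookup. */
typedef struct ToyRAMBlock {
    uint8_t *host;             /* start of the mmapped host memory */
    uint64_t offset;           /* start of the block in ram_addr_t space */
    uint64_t length;           /* size of the block in bytes */
    struct ToyRAMBlock *next;  /* next block in the global list */
} ToyRAMBlock;

/*
 * Walk the list, find the block whose [offset, offset + length) range
 * contains addr, and return the matching host pointer (NULL if none).
 */
void *toy_ram_addr_to_host(ToyRAMBlock *list, uint64_t addr)
{
    for (ToyRAMBlock *b = list; b != NULL; b = b->next) {
        if (addr >= b->offset && addr - b->offset < b->length) {
            return b->host + (addr - b->offset);
        }
    }
    return NULL;  /* addr is not backed by any block */
}
```

Because the blocks are packed back to back in this namespace, a ram_addr_t says nothing about where the memory appears in the guest physical address space; that mapping comes from the MemoryRegion graph.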

Definition of RAMBlock:

// include/exec/ramblock.h

struct RAMBlock {
    struct rcu_head rcu;
    struct MemoryRegion *mr;
    uint8_t *host;
    uint8_t *colo_cache; /* For colo, VM's ram cache */
    ram_addr_t offset; // Lele: offset used for dirty bitmap
    ram_addr_t used_length;
    ram_addr_t max_length;
    void (*resized)(const char*, uint64_t length, void *host);
    uint32_t flags;
    /* Protected by iothread lock.  */
    char idstr[256];
    /* RCU-enabled, writes protected by the ramlist lock */
    QLIST_ENTRY(RAMBlock) next;
    QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
    int fd;
    size_t page_size;
    /* dirty bitmap used during migration */
    unsigned long *bmap;
    /* bitmap of already received pages in postcopy */
    unsigned long *receivedmap;

    /* Bitmap of CHERI tag bits */
    struct CheriTagMem *cheri_tags;

    /*
     * bitmap to track already cleared dirty bitmap.  When the bit is
     * set, it means the corresponding memory chunk needs a log-clear.
     * Set this up to non-NULL to enable the capability to postpone
     * and split clearing of dirty bitmap on the remote node (e.g.,
     * KVM).  The bitmap will be set only when doing global sync.
     *
     * NOTE: this bitmap is different comparing to the other bitmaps
     * in that one bit can represent multiple guest pages (which is
     * decided by the `clear_bmap_shift' variable below).  On
     * destination side, this should always be NULL, and the variable
     * `clear_bmap_shift' is meaningless.
     */
    unsigned long *clear_bmap;
    uint8_t clear_bmap_shift;
};

Definition of MemoryRegion:

// include/exec/memory.h

/** MemoryRegion:
 *
 * A struct representing a memory region.
 */
struct MemoryRegion {
    Object parent_obj;

    /* private: */

    /* The following fields should fit in a cache line */
    bool romd_mode;
    bool ram;
    bool subpage;
    bool readonly; /* For RAM regions */
    bool nonvolatile;
    bool rom_device;
    bool flush_coalesced_mmio;
    bool global_locking;
    uint8_t dirty_log_mask;
    bool is_iommu;
    RAMBlock *ram_block;
    Object *owner;

    const MemoryRegionOps *ops;
    void *opaque;
    MemoryRegion *container;
    Int128 size;
    hwaddr addr;
    void (*destructor)(MemoryRegion *mr);
    uint64_t align;
    bool terminates;
    bool ram_device;
    bool enabled;
    bool warning_printed; /* For reservations */
    uint8_t vga_logging_count;
    MemoryRegion *alias;
    hwaddr alias_offset;
    int32_t priority;
    QTAILQ_HEAD(, MemoryRegion) subregions;
    QTAILQ_ENTRY(MemoryRegion) subregions_link;
    QTAILQ_HEAD(, CoalescedMemoryRange) coalesced;
    const char *name;
    unsigned ioeventfd_nb;
    MemoryRegionIoeventfd *ioeventfds;
};

Code

Q&A

  • Where is the virt to physical address translated?
  • What is the fast path without tlb?

Addr Translation

tlb_vaddr_to_host

In accel/tcg/cputlb.c: tlb_vaddr_to_host.

  • If tlb hit: return the host vaddr as the guest physical address.
  • If tlb miss: tlb_fill is called. tlb_fill can trigger a resize of the TLB, so all of the caller’s prior references to the TLB table must be discarded and looked up again via tlb_entry().

// accel/tcg/cputlb.c:

void *tlb_vaddr_to_host(CPUArchState *env, abi_ptr addr,
                        MMUAccessType access_type, int mmu_idx)
{
    CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
    target_ulong tlb_addr, page;
    //.

Created Jun 20, 2020 // Last Updated Aug 12, 2020
