
Raise VIF limit from 7 to 16 by calculating max_grant_frames on domain creation #6577


Open
wants to merge 5 commits into master

Conversation

last-genius
Contributor

xenopsd currently hardcodes max_grant_frames to 64 for all domains. This limits how many grants a domain can allocate, and thus the number of VIFs and VBDs it can have. max_grant_frames can be overridden for individual VMs by setting a value in the platform field, but it's much better to estimate how many grant frames a VM will need based on the number of VBDs and VIFs it has.

Implement this and add some notes on the difficulties and imprecisions of such an estimation.
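
To make the idea concrete, here is a rough sketch of the shape of such an estimate (this is not the actual xenopsd code in this PR; `GRANTS_PER_VIF`, `GRANTS_PER_VBD` and the hotplug allowance are illustrative placeholders, and a grant-table frame is assumed to hold 512 v1 entries, i.e. 4 KiB / 8 bytes):

```
// Rough sketch only - GRANTS_PER_VIF/GRANTS_PER_VBD are illustrative
// placeholders, not the constants xenopsd uses.
#include <stdio.h>

#define GRANT_ENTRIES_PER_FRAME 512 /* 4096-byte frame / 8-byte v1 entry */
#define GRANTS_PER_VIF 2048         /* placeholder: rings + rx/tx buffers across queues */
#define GRANTS_PER_VBD 1100         /* placeholder: ring entries, segments, indirect pages */
#define HOTPLUG_ALLOWANCE 1         /* budget for one extra hotplugged VIF and VBD each */
#define MIN_GRANT_FRAMES 64         /* keep the old hard-coded value as a floor */

static unsigned int estimate_max_grant_frames(unsigned int vifs, unsigned int vbds)
{
	unsigned long grants =
		(unsigned long)(vifs + HOTPLUG_ALLOWANCE) * GRANTS_PER_VIF +
		(unsigned long)(vbds + HOTPLUG_ALLOWANCE) * GRANTS_PER_VBD;
	/* round up to whole grant-table frames, but never go below the old value */
	unsigned int frames =
		(grants + GRANT_ENTRIES_PER_FRAME - 1) / GRANT_ENTRIES_PER_FRAME;
	return frames < MIN_GRANT_FRAMES ? MIN_GRANT_FRAMES : frames;
}

int main(void)
{
	printf("16 VIFs, 4 VBDs -> %u grant frames\n",
	       estimate_max_grant_frames(16, 4));
	return 0;
}
```

The real calculation additionally has to deal with details like multi-queue VIFs and indirect block requests, which is where the difficulties and imprecisions mentioned above come in.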

This allows raising the number of supported VIFs from the current limit of 7 - we've picked 16 as a sweet spot, though this could be discussed further. (Note that it's the allowed_vifs limit that changes; VIFs can be created in arbitrary numbers on the CLI and by clients that don't honour the allowed_vifs advice.)

Given the current behaviour of the XenServer/XCP-ng system (hypervisor+drivers), where more VIFs allow for higher overall networking throughput, this is highly beneficial: in testing, overall throughput with 16 VIFs was 18-27% higher than with 8 VIFs (measured with multiple iperf3 instances running on all interfaces simultaneously). The following table shows per-VIF and per-VM performance on a host with 16 pCPUs, with 8 vCPUs each for domU and dom0:

[Image: table of per-VIF and per-VM throughput with 8 vs 16 VIFs]

Moreover, some users coming from VMware are used to networking setups with dozens of VIFs, and this is a step towards allowing that without encountering any other bottlenecks in the system.

Most of this work (except the last commit) was tested very thoroughly at XenServer, since it was initially intended to raise the supported limit of VBDs (it passed Ring3 BST+BVT multiple times and ran through the config limits test several times). Raising the number of supported VBDs has other issues, so it was abandoned, but this PR solves all the issues for VIFs.
The new limit was tested with XCP-ng, with a new test being added to our CI (xcp-ng/xcp-ng-tests#338), which will verify the ability to hit the new limit and perform well at it.

Andrii Sultanov and others added 5 commits July 4, 2025 15:13
…n creation

max_maptrack_frames should only be >0 if there are reasons for other domains to
grant pages to the domain being created, which should only happen for Dom0
(not handled by the toolstack) and driver domains (of which none currently exist).

Signed-off-by: Andrii Sultanov <[email protected]>
Suggested-by: Andrew Cooper <[email protected]>
Plumb the expected number of VBDs and VIFs through to xenopsd's VM_create micro-op,
allowing the domain creation code to estimate how big the max_grant_frames
value needs to be. Previously it was hardcoded to 64, which was found lacking
in situations with many VBDs and VIFs under load: they would run out of grant
table entries and the VM would crash.

Now, when fewer VIFs and VBDs are expected, fewer grant table entries are
allocated, and vice versa. An allowance for one hotplugged VIF and one
hotplugged VBD is included as well; this could be increased in the future
if we decide that, say, 2 hotplugged devices are common enough.

The VBD calculation is complex and approximate, as explained in the code.
The size of the ring was calculated with the following program (important
because struct sizes can change; the code comment also outlines how this
could be improved in the future with feature detection, in which case more
constants would need to be considered). It is attached here because the
BLK_RING_SIZE macro is otherwise hard to decipher by hand:

```
// Can be compiled with just gcc test.c
// # ./test
// 64
// 112
// 32
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

struct tester {
    unsigned int req_prod, req_event;
    unsigned int rsp_prod, rsp_event;
    uint8_t __pad[48];
    int ring;
};

typedef uint32_t grant_ref_t;
typedef uint16_t blkif_vdev_t;
typedef uint64_t blkif_sector_t;
#define BLKIF_MAX_SEGMENTS_PER_REQUEST 11
#define BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST 8

struct blkif_request_segment {
		grant_ref_t gref;
		uint8_t     first_sect, last_sect;
};

struct blkif_request_rw {
	uint8_t        nr_segments;
	blkif_vdev_t   handle;
	uint32_t       _pad1;
	uint64_t       id;
	blkif_sector_t sector_number;
	struct blkif_request_segment seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
} __attribute__((__packed__));

struct blkif_request_discard {
	uint8_t        flag;
	blkif_vdev_t   _pad1;
	uint32_t       _pad2;
	uint64_t       id;
	blkif_sector_t sector_number;
	uint64_t       nr_sectors;
	uint8_t        _pad3;
} __attribute__((__packed__));

struct blkif_request_other {
	uint8_t      _pad1;
	blkif_vdev_t _pad2;
	uint32_t     _pad3;
	uint64_t     id;
} __attribute__((__packed__));

struct blkif_request_indirect {
	uint8_t        indirect_op;
	uint16_t       nr_segments;
	uint32_t       _pad1;
	uint64_t       id;
	blkif_sector_t sector_number;
	blkif_vdev_t   handle;
	uint16_t       _pad2;
	grant_ref_t    indirect_grefs[BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST];
	uint32_t      _pad3;
} __attribute__((__packed__));

struct blkif_request {
	uint8_t        operation;
	union {
		struct blkif_request_rw rw;
		struct blkif_request_discard discard;
		struct blkif_request_other other;
		struct blkif_request_indirect indirect;
	} u;
} __attribute__((__packed__));

union blkif_sring_entry {
	struct blkif_request req;
	struct blkif_request rsp;
};

#define __RD2(_x)  (((_x) & 0x00000002) ? 0x2                  : ((_x) & 0x1))
#define __RD4(_x)  (((_x) & 0x0000000c) ? __RD2((_x)>>2)<<2    : __RD2(_x))
#define __RD8(_x)  (((_x) & 0x000000f0) ? __RD4((_x)>>4)<<4    : __RD4(_x))
#define __RD16(_x) (((_x) & 0x0000ff00) ? __RD8((_x)>>8)<<8    : __RD8(_x))
#define __RD32(_x) (((_x) & 0xffff0000) ? __RD16((_x)>>16)<<16 : __RD16(_x))

int main () {
	size_t offset = offsetof(struct tester, ring);
	size_t size = sizeof(union blkif_sring_entry);
	printf("%zu\n", offset);
	printf("%zu\n", size);
	printf("%zu\n", __RD32((4096 - offset) / size));
}
```

Signed-off-by: Andrii Sultanov <[email protected]>
64 was the old hard-coded value for max_grant_frames, so play it safe here and
keep it as the lower bound: users might be used to being able to hotplug
several VIFs to a VM with fewer than 7, for example, which would otherwise
have broken, since the current algorithm only budgets for a single hotplug.

Signed-off-by: Andrii Sultanov <[email protected]>
With the previously hardcoded value of max_grant_frames, we could only support
7 VIFs at most (and fewer if there were also many VBDs), since 9 or more VIFs
would result in grant allocation errors.

Now that max_grant_frames is dynamically estimated on domain creation given the
number of VIFs and VBDs a VM has, we can easily support 16 VIFs.

Given the current behaviour of the XenServer/XCP-ng system
(hypervisor+drivers), where more VIFs allow for higher overall networking
throughput, this is highly beneficial: in testing, overall throughput with 16
VIFs was 18-27% higher than with 8 VIFs (tested with multiple iperf3 instances
running on all interfaces simultaneously).

Moreover, some users coming from VMware are used to networking setups with
dozens of VIFs, and this is a step towards allowing that without encountering
any other bottlenecks in the system.

NOTE: We are currently only allocating enough grants for 1 hotplugged VIF
above 7. Therefore, technically we shouldn't allow creating more than one VIF
while the VM is running (or paused), but as far as I know there is currently
no way in xapi to check how many and which VIFs were hotplugged, and
allowed_VIFs is honoured by clients on VIF creation, not on hotplug.
Since this is in keeping with the previous behaviour of this field as
"advice", not a "guarantee" or a "limit" (if a VM had many VBDs, it wouldn't
have been able to allocate 7 VIFs before either), I've decided to keep it
as-is. A more detailed technical explanation of the supported limit should be
described elsewhere in support statements.

Signed-off-by: Andrii Sultanov <[email protected]>
@psafont
Member

psafont commented Jul 7, 2025

I think this needs to be reflected in the estimated memory overhead per VM that xapi calculates, to keep it pessimistic: https://github.com/xapi-project/xen-api/blob/master/ocaml/xapi/memory_check.ml#L241

@edwintorok
Contributor

edwintorok commented Jul 7, 2025

I think there was also a problem in general even with the current limits: if you had too many VBDs then your VIFs stopped working, or the other way around. We tried to work around that by calculating the number of grant frames dynamically, but that ran into a problem with a particular GFS2 test that kept failing (for seemingly unrelated reasons). I forgot where we left off with that test though - was it basically working?

@last-genius
Contributor Author

I think there was also a problem in general even with the current limits: if you had too many VBDs then your VIFs stopped working, or the other way around.

That's true - I note a similar case in the last commit.

We tried to work around that by calculating the number of grant frames dynamically, but that ran into a problem with a particular GFS2 test that kept failing (for seemingly unrelated reasons). I forgot where we left off with that test though - was it basically working?

It was basically working, but xenopsd timed out on plugging VBDs and we never got to the bottom of it.

@edwintorok
Contributor

supported VIFs from the current limit of 7

There is a limit of 8 in netback, but checking the source code, that is a limit on the number of queues (on the same interface), not the number of interfaces.
The number of queues is in turn limited by the number of vCPUs. If you give a guest 16 vCPUs and 2 VIFs, it should already be able to use 16 connections in parallel (more is unlikely to help since we only have 16 vCPUs in Dom0).

So I assume in your tests the guests had <16 vCPUs?

It'd be useful if your table contained the effective number of network queues (you can see this in xenstore; there are entries like vif/0/queue-0, vif/0/queue-1, etc.).
You don't need to rerun the tests, but I'd hope you have some logs/bugtools from the tests that you've already run and can check the number of vCPUs and the number of queues (in xenstore-ls.out or xenstored-access.log).

@last-genius
Contributor Author

last-genius commented Jul 7, 2025

So I assume in your tests the guests had <16 vCPUs?

I tested with 8 vCPUs, yes

It'd be useful if your table contained the effective number of network queues (you can see this in xenstore; there are entries like vif/0/queue-0, vif/0/queue-1, etc.).

There are 8 queues (0-7) for every VIF in my tests.

@last-genius
Contributor Author

last-genius commented Jul 14, 2025

I think this needs to be reflected in the estimated memory overhead per VM that xapi calculates, to keep it pessimistic: https://github.com/xapi-project/xen-api/blob/master/ocaml/xapi/memory_check.ml#L241

I'm not quite sure how that number was calculated initially (extra_internal_mib is 1 for all models). By my calculations, the current value of max_grant_frames means a domain can allocate up to 64 frames * 512 grants each = 32768 grants, i.e. 128 MiB of granted 4 KiB pages, so the estimate is currently wrong.
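
For reference, a minimal sketch of that arithmetic, assuming v1 grant entries (8 bytes each, so 512 per 4 KiB grant-table frame) and one 4 KiB page granted per entry:

```
#include <stdio.h>

/* Upper bound on memory a guest can grant out, given max_grant_frames.
 * Assumes v1 grant entries: 4096-byte frame / 8-byte entry = 512 entries,
 * and one 4 KiB page granted per entry. */
int main(void)
{
	unsigned long frames = 64;            /* old hard-coded max_grant_frames */
	unsigned long grants = frames * 512;  /* 32768 grant entries */
	unsigned long bytes  = grants * 4096; /* one 4 KiB page per grant */
	printf("%lu frames -> %lu grants -> %lu MiB grantable\n",
	       frames, grants, bytes >> 20);  /* 64 -> 32768 -> 128 MiB */
	return 0;
}
```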

With the new approach, there is no inherent limit: the more VBDs/VIFs a domain has, the more it will be able to allocate. We can't get a pessimistic estimate here; memory_check would need to go through the same algorithm xenopsd uses, so it should be made generic.

@psafont
Member

psafont commented Jul 14, 2025

I would start by adding in the assumption of 16 VIFs used per domain, and account for the grant frames needed to use them.

I agree with your point about the current memory model not being fit for purpose. In the long term, an extension to make it aware of the grant frames that each domain will be able to use would be best.

@last-genius
Contributor Author

Although - the granted pages should be counted towards the domain's normal memory usage, not the overhead, unless I'm confusing things here.

@psafont
Member

psafont commented Jul 14, 2025

It is? Then disregard my comment, I thought it was counted as overhead.

@last-genius
Contributor Author

These constants come from the upstream xl, I think (extra internal is #define LIBXL_MAXMEM_CONSTANT 1024) - but they aren't exactly the same (e.g. extra external for HVM should be 2, not 1, since it's defined as #define LIBXL_HVM_EXTRA_MEMORY 2048 in Xen)...
