Raise VIF limit from 7 to 16 by calculating max_grant_frames on domain creation #6577
Conversation
…n creation

max_maptrack_frames should only be >0 if there are reasons for other domains to grant pages to the domain being created, which should only happen for Dom0 (not handled by the toolstack) and driver domains (currently none exist).

Signed-off-by: Andrii Sultanov <[email protected]>
Suggested-by: Andrew Cooper <[email protected]>
Plumb the expected number of VBDs and VIFs to xenopsd's VM_create micro-op, allowing the domain creation code to estimate how big the max_grant_frames value needs to be.

Previously it was hardcoded to 64, but that was found lacking in situations with a lot of VBDs and VIFs under load, as they'd run out of grant table entries and the VM would crash. Now, when fewer VIFs and VBDs are expected, fewer grant table entries are needed, and the other way around. An expectation of 1 hotplugged VIF and 1 hotplugged VBD is accounted for as well; this can possibly be incremented in the future if we decide that, say, 2 hotplugged devices are common enough.

The VBD calculation is complex and approximate, as explained in the code. The size of the ring was calculated with the following program (important because struct sizes can change, and the code comment also outlines how this can be improved in the future with feature detection, in which case more constants would need to be considered). Attaching it here because the BLK_RING_SIZE macro is otherwise hard to decipher by hand:

```
// Can be compiled with just: gcc test.c
// # ./test
// 64
// 112
// 32
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

struct tester {
    unsigned int req_prod, req_event;
    unsigned int rsp_prod, rsp_event;
    uint8_t __pad[48];
    int ring;
};

typedef uint32_t grant_ref_t;
typedef uint16_t blkif_vdev_t;
typedef uint64_t blkif_sector_t;

#define BLKIF_MAX_SEGMENTS_PER_REQUEST 11
#define BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST 8

struct blkif_request_segment {
    grant_ref_t gref;
    uint8_t first_sect, last_sect;
};

struct blkif_request_rw {
    uint8_t nr_segments;
    blkif_vdev_t handle;
    uint32_t _pad1;
    uint64_t id;
    blkif_sector_t sector_number;
    struct blkif_request_segment seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
} __attribute__((__packed__));

struct blkif_request_discard {
    uint8_t flag;
    blkif_vdev_t _pad1;
    uint32_t _pad2;
    uint64_t id;
    blkif_sector_t sector_number;
    uint64_t nr_sectors;
    uint8_t _pad3;
} __attribute__((__packed__));

struct blkif_request_other {
    uint8_t _pad1;
    blkif_vdev_t _pad2;
    uint32_t _pad3;
    uint64_t id;
} __attribute__((__packed__));

struct blkif_request_indirect {
    uint8_t indirect_op;
    uint16_t nr_segments;
    uint32_t _pad1;
    uint64_t id;
    blkif_sector_t sector_number;
    blkif_vdev_t handle;
    uint16_t _pad2;
    grant_ref_t indirect_grefs[BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST];
    uint32_t _pad3;
} __attribute__((__packed__));

struct blkif_request {
    uint8_t operation;
    union {
        struct blkif_request_rw rw;
        struct blkif_request_discard discard;
        struct blkif_request_other other;
        struct blkif_request_indirect indirect;
    } u;
} __attribute__((__packed__));

union blkif_sring_entry {
    struct blkif_request req;
    struct blkif_request rsp;
};

#define __RD2(_x)  (((_x) & 0x00000002) ? 0x2 : ((_x) & 0x1))
#define __RD4(_x)  (((_x) & 0x0000000c) ? __RD2((_x)>>2)<<2 : __RD2(_x))
#define __RD8(_x)  (((_x) & 0x000000f0) ? __RD4((_x)>>4)<<4 : __RD4(_x))
#define __RD16(_x) (((_x) & 0x0000ff00) ? __RD8((_x)>>8)<<8 : __RD8(_x))
#define __RD32(_x) (((_x) & 0xffff0000) ? __RD16((_x)>>16)<<16 : __RD16(_x))

int main(void)
{
    size_t offset = offsetof(struct tester, ring);
    size_t size = sizeof(union blkif_sring_entry);
    printf("%zu\n", offset);
    printf("%zu\n", size);
    printf("%zu\n", (size_t)__RD32((4096 - offset) / size));
}
```

Signed-off-by: Andrii Sultanov <[email protected]>
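For context, here is a rough sketch of the kind of estimate the commit message describes - the function name, parameters, and per-device grant costs are illustrative assumptions, not the exact code added to xenopsd:

```ocaml
(* Sketch only: turn expected device counts into a max_grant_frames value.
   Per-device grant costs are taken as parameters, since the real calculation
   (see the commit above) is approximate and driver-dependent. *)
let grants_per_frame = 4096 / 8   (* a 4 KiB frame of 8-byte v1 grant entries *)

let estimate_max_grant_frames ~vifs ~vbds ~grants_per_vif ~grants_per_vbd =
  (* Budget for one hotplugged VIF and one hotplugged VBD, as described. *)
  let vifs = vifs + 1 and vbds = vbds + 1 in
  let grants = (vifs * grants_per_vif) + (vbds * grants_per_vbd) in
  (* Round up to whole frames and keep the old hard-coded 64 as a floor. *)
  max 64 ((grants + grants_per_frame - 1) / grants_per_frame)
```

Plugging in per-VBD costs derived from the ring layout above (32 ring entries of up to 11 segments each, plus indirect descriptors) and per-VIF costs that scale with the number of queues yields a value that grows with the device count instead of staying pinned at 64.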
64 was the old hard-coded value for max_grant_frames, so play it safe here and keep it as the lower bound - users might be used to being able to hotplug several VIFs when a VM has fewer than 7, for example, which would otherwise have been broken, as we only estimate for a single hotplug with the current algorithm.

Signed-off-by: Andrii Sultanov <[email protected]>
Signed-off-by: Andrii Sultanov <[email protected]>
With the previously hardcoded value of max_grant_frames, we could only support at most 7 VIFs (and fewer if there were also many VBDs), since having 9 or more VIFs would result in grant allocation errors. Now that max_grant_frames is dynamically estimated on domain creation given the number of VIFs and VBDs a VM has, we can easily support 16 VIFs.

Given the current behaviour of the XenServer/XCP-ng system (hypervisor + drivers), where more VIFs allow for higher overall networking throughput, this is highly beneficial - in testing, overall throughput with 16 VIFs was 18-27% higher than with 8 VIFs (tested with multiple iperf3 instances running on all interfaces simultaneously).

Moreover, some users coming from VMware are used to networking setups with dozens of VIFs, and this is a step towards allowing that without encountering any other bottlenecks in the system.

NOTE: We are currently only allocating enough grants for 1 hotplugged VIF above 7. Therefore, technically we shouldn't allow creating more than one VIF while the VM is running (or paused), but there is currently no way in xapi to check how many and which VIFs were hotplugged, as far as I know, and allowed_VIFs is honoured by clients on VIF creation, not hotplug. Since this is in keeping with the previous behaviour of this field (if a VM has many VBDs, it wouldn't have been able to allocate 7 VIFs before) as an "advice" rather than a "guarantee" or "limit", I've decided to keep it as-is. A more detailed technical explanation of the supported limit should be described elsewhere in support statements.

Signed-off-by: Andrii Sultanov <[email protected]>
I think this needs to be reflected in the estimated memory overhead per VM that xapi calculates, to keep it pessimistic: https://github.com/xapi-project/xen-api/blob/master/ocaml/xapi/memory_check.ml#L241
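A minimal sketch of the kind of accounting this comment suggests, assuming one 4 KiB page per grant/maptrack frame; the names below are hypothetical and not xapi's actual memory_check.ml code:

```ocaml
(* Hypothetical sketch only: fold grant-table memory into a per-VM overhead
   estimate. Names are illustrative, not from memory_check.ml. *)
let bytes_per_frame = 4096L  (* one grant or maptrack frame is a 4 KiB page *)

let grant_table_overhead_bytes ~max_grant_frames ~max_maptrack_frames =
  Int64.mul (Int64.of_int (max_grant_frames + max_maptrack_frames)) bytes_per_frame

(* With the old fixed value of 64 grant frames and no maptrack frames this
   comes to 256 KiB per VM; a dynamic estimate would have to assume the
   pessimistic (largest) device count instead. *)
let _ = grant_table_overhead_bytes ~max_grant_frames:64 ~max_maptrack_frames:0
```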
I think there was also a problem in general even with the current limits: if you had too many VBDs, then your VIFs stopped working, or the other way around. We tried to work around that by calculating the number of grant frames dynamically, but that ran into a problem with a particular GFS2 test that kept failing (for seemingly unrelated reasons). I forget where we left off with that test though - was it basically working?
That's true - I note a similar case in the last commit.
It was basically working, but xenopsd timed out on plugging VBDs and we never got to the bottom of it.
There is a limit of 8 in netback, but checking the source code, that is a limit on the number of queues (on the same interface), not the number of interfaces. So I assume in your tests the guests had <16 vCPUs? It'd be useful if your table contained the effective number of network queues (you can see this in xenstore, there are entries like …).
I tested with 8 vCPUs, yes
There are 8 queues (0-7) for every VIF in my tests.
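As an illustration of why the queue count matters here (a sketch under assumed constants, not code from this PR): each netfront queue has its own TX and RX ring, and every ring slot can hold a granted page, so per-VIF grant usage scales with min(vCPUs, netback's per-interface queue limit).

```ocaml
(* Sketch with assumed constants: per-VIF grant usage as a function of vCPUs. *)
let ring_slots = 256          (* slots in one 4 KiB netif ring, assumed *)
let netback_queue_limit = 8   (* the per-interface queue limit mentioned above *)

let grants_per_vif ~vcpus =
  let queues = min vcpus netback_queue_limit in
  (* per queue: one grant each for the TX and RX ring pages,
     plus up to ring_slots granted buffer pages on each ring *)
  queues * (2 + (2 * ring_slots))
```

Under these assumptions, an 8-queue VIF can consume on the order of 4,000 grant entries, so a table fixed at 64 frames (32,768 v1 entries) runs out quickly as the VIF count grows.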
I'm not quite sure how that number was calculated initially (…). With the new approach, there is no inherent limit: the more VBDs/VIFs a domain has, the more it would be able to allocate, so we can't get a pessimistic estimate here …
I would start by adding in the assumption of 16 VIFs used per domain, and account for the grant frames needed to use them. I agree with your point about the current memory model not being fit for purpose. In the long term, an extension to make it aware of the grant frames that each domain will be able to use would be best.
Although - the granted pages should be counted towards the domain's normal memory usage, not the overhead, unless I'm confusing things here.
It is? Then disregard my comment, I thought it was counted as overhead.
These constants come from the upstream …
xenopsd currently hardcodes 64 as the value for `max_grant_frames` for all domains. This limits how many grants a domain can allocate, and thus the number of VIFs and VBDs a domain can have. `max_grant_frames` can be changed for individual VMs by specifying a particular value in `platform`, but it's much better to be able to estimate how many grant frames a VM would need based on its number of VBDs and VIFs. Implement this and add some notes on the difficulties and imprecisions of such an estimation.

This allows raising the number of supported VIFs from the current limit of 7 - we've picked 16 as a sweet spot, though this could be discussed further. (Note that it's the `allowed_vifs` limit that's changed; VIFs can be created in arbitrary numbers on the CLI and by clients not honouring the `allowed_vifs` advice.)

Given the current behaviour of the XenServer/XCP-ng system (hypervisor + drivers), where more VIFs allow for higher overall networking throughput, this is highly beneficial - in testing, overall throughput with 16 VIFs was 18-27% higher than with 8 VIFs (tested with multiple iperf3 instances running on all interfaces simultaneously). The following table shows the performance per-VIF and per-VM on a host with 16 pCPUs, with 8 vCPUs each for domU and dom0:
Moreover, some users coming from VMware are used to networking setups with dozens of VIFs, and this is a step towards allowing that without encountering any other bottlenecks in the system.
Most of this work (except the last commit) was tested very thoroughly at XenServer, since it was initially intended to raise the supported limit of VBDs (it passed Ring3 BST+BVT multiple times, and ran through the config limits test several times). Raising the number of supported VBDs has other issues, so that effort was abandoned, but this PR solves all the issues for VIFs.
The new limit was tested with XCP-ng, with a new test being added to our CI (xcp-ng/xcp-ng-tests#338), which will verify the ability to hit the new limit and perform well at it.