Description
Creating a VM on a hypervisor node (MHV) causes VPP to become unresponsive after a period of time. This was tested against 17.01-rc23_g2d7e163b15.x86_64 in an OpenStack environment using the Neutron ML2 plugin for VPP (https://github.com/openstack/networking-vpp) and virtio for the NIC.
Reproduction steps:
1. Install VPP from the stable/1701 repo, along with the networking-vpp agent:
-bash-4.2# rpm -qa|grep vpp
vpp-devel-17.01-rc23_g2d7e163b15.x86_64
vpp-17.01-rc23_g2d7e163b15.x86_64
networking-vpp-0.0.1.dev121-2.noarch
vpp-plugins-17.01-rc23_g2d7e163b15.x86_64
vpp-python-api-17.01-rc23_g2d7e163b15.x86_64
vpp-lib-17.01-rc23_g2d7e163b15.x86_64
2. Create a network and a VM on that network.
The namespace/port created on the MCP node:
-bash-4.2# vppctl show bridge-domain 4 detail
ID Index Learning U-Forwrd UU-Flood Flooding ARP-Term BVI-Intf
4 1 on on on on off N/A
Interface Index SHG BVI TxFlood VLAN-Tag-Rewrite
BondEthernet0.2066 4 0 - * pop-1
tap-0 5 0 - * none
The vhost interface created on the MHV:
-bash-4.2# vppctl show vhost
Virtio vhost-user interfaces
Global:
coalesce frames 32 time 1e-3
Interface: VirtualEthernet0/0/0 (ifindex 4)
virtio_net_hdr_sz 12
features mask (0xffffffffffffffff):
features (0x50008000):
VIRTIO_NET_F_MRG_RXBUF (15)
VIRTIO_F_INDIRECT_DESC (28)
VHOST_USER_F_PROTOCOL_FEATURES (30)
protocol features (0x3)
VHOST_USER_PROTOCOL_F_MQ (0)
VHOST_USER_PROTOCOL_F_LOG_SHMFD (1)
socket filename /tmp/7853dd37-6796-476d-bb8b-ae2267acb59e type client errno "Success"
rx placement:
thread 0 on vring 1
tx placement: lock-free
thread 0 on vring 0
Memory regions (total 2)
region fd guest_phys_addr memory_size userspace_addr mmap_offset mmap_addr
====== ===== ================== ================== ================== ================== ==================
0 21 0x0000000000000000 0x00000000000a0000 0x00002aaaaac00000 0x0000000000000000 0x00002aaaaac00000
1 22 0x00000000000c0000 0x000000001ff40000 0x00002aaaaacc0000 0x00000000000c0000 0x00002aaaaaec0000
Virtqueue 0 (TX)
qsz 256 last_avail_idx 81 last_used_idx 81
avail.flags 0 avail.idx 256 used.flags 1 used.idx 81
kickfd 23 callfd 24 errfd -1
Virtqueue 1 (RX)
qsz 256 last_avail_idx 118 last_used_idx 118
avail.flags 1 avail.idx 118 used.flags 1 used.idx 118
kickfd 19 callfd 25 errfd -1
3. After a while (5-10 minutes), VPP stops responding on the MHV (all VPP-related commands hang):
2017-01-10T17:39:10.406206+00:00 mhv2.paslab015001.mc.metacloud.in libvirtd[25986]: 25986: error : qemuMonitorIORead:586 : Unable to read from monitor: Connection reset by peer
2017-01-10T17:39:10.408508+00:00 mhv2.paslab015001.mc.metacloud.in libvirtd[25986]: 25986: error : qemuProcessReportLogError:1810 : internal error: qemu unexpectedly closed the monitor: QEMU waiting for connection on: disconnected:unix:/tmp/fdef6234-82ae-4fc5-94da-944ba5122b9e,server
2017-01-10T17:39:10.602566+00:00 mhv2.paslab015001.mc.metacloud.in libvirtd[25986]: 25991: error : qemuProcessReportLogError:1810 : internal error: process exited while connecting to monitor: QEMU waiting for connection on: disconnected:unix:/tmp/fdef6234-82ae-4fc5-94da-944ba5122b9e,server
2017-01-10T17:39:49.027951+00:00 mhv2.paslab015001.mc.metacloud.in vnet[168332]: unix_signal_handler:118: received signal SIGCONT, PC 0x7f56e7785590
2017-01-10T17:39:49.027979+00:00 mhv2.paslab015001.mc.metacloud.in vnet[168332]: received SIGTERM, exiting...
2017-01-10T17:39:49.139973+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: Probing VFIO support...
2017-01-10T17:40:00.407293+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
2017-01-10T17:40:00.407587+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: Initializing pmd_bond for eth_bond0
2017-01-10T17:40:00.407881+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: Create bonded device eth_bond0 on port 0 in mode 1 on socket 0.
2017-01-10T17:40:00.408138+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: PCI device 0000:00:14.0 on NUMA socket -1
2017-01-10T17:40:00.408147+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: probe driver: 1af4:1000 net_virtio
2017-01-10T17:40:00.412447+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: PCI device 0000:00:15.0 on NUMA socket -1
2017-01-10T17:40:00.412457+00:00 mhv2.paslab015001.mc.metacloud.in vnet[193762]: EAL: probe driver: 1af4:1000 net_virtio
Further investigation shows the process is stuck in a FUTEX_WAIT:
-bash-4.2# strace -p 696966
Process 696966 attached
futex(0x305e7f6c, FUTEX_WAIT, 245, NULL
This also happens on nodes running VPP which are not hosting VMs, but are tapping into a Linux namespace for DHCP assignment ("MCP nodes"). In that case it is much more intermittent and has been much more difficult to reproduce.
Assignee
Ole Trøan
Reporter
Kevin Bringard
Comments
- kevinbringard (Mon, 23 Jan 2017 14:52:26 +0000): This issue appears to be resolved by commit 7104f93 on the stable/1701 branch:
commit 7104f93
Author: Ole Troan [email protected]
Date: Thu Jan 19 09:44:44 2017 +0100
Python API: Missing locking of results data structure.
The wrong assumption that the GIL combined with CPython's "mostly"
thread safe assurance does not hold. The combination of a slow
event handler for notification and calling the API at the same
time led to contention on the results data structure.
Added suitable locking.
Also added an atexit() to attempt a VPP disconnect on shutdown.
Also: lots more comments, docstrings, duplicated code removed.
Some of the problem here was a disagreement between caller
and author as to how the API should be used; the comments should
help.
Change-Id: I0cb7d0026db660ec141425c5ad474f14bacea36e
Signed-off-by: Ole Troan <[email protected]>
Co-Authored-By: Ian Wells <[email protected]>
Signed-off-by: Ole Troan <[email protected]>
Thanks for all the hard work everyone!
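To make the commit message above concrete, here is a minimal, hypothetical Python sketch of the class of race it describes; this is not the actual vpp-papi code, and the names (ResultsTable, store, wait_for) are illustrative only. Replies arrive on a receive/notification thread while the application thread waits for and consumes them from the same table; the fix is to guard that shared structure with a lock.
import threading

class ResultsTable:
    # Hypothetical stand-in for the shared results structure the commit refers to.
    def __init__(self):
        self._results = {}                        # context id -> reply
        self._lock = threading.Lock()             # the locking the fix adds
        self._ready = threading.Condition(self._lock)

    def store(self, context, reply):
        # Runs on the receive/notification thread (possibly slow).
        with self._ready:
            self._results[context] = reply
            self._ready.notify_all()

    def wait_for(self, context, timeout=5.0):
        # Runs on the caller's thread, concurrently with store(). Without the
        # shared lock, the check-then-pop below races with store().
        with self._ready:
            while context not in self._results:
                if not self._ready.wait(timeout):
                    raise RuntimeError('no reply for context %r' % context)
            return self._results.pop(context)
The GIL only makes individual dict operations atomic; a lookup followed by a pop spans several interpreter operations, so it is not atomic on its own, and a slow notification handler simply widens the race window, which matches the behaviour described in the commit message.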
- kevinbringard (Tue, 17 Jan 2017 15:19:52 +0000): If it's useful at all, this looks to be the point at which it hangs:
cat /proc/169981/stack
[] futex_wait_queue_me+0xc4/0x120
[] futex_wait+0x179/0x280
[] do_futex+0xfe/0x5b0
[] SyS_futex+0x80/0x180
[] system_call_fastpath+0x16/0x1b
[] 0xffffffffffffffff
I have a theory about what's going on. How hard would it be to try a private image?
- kevinbringard (Thu, 12 Jan 2017 17:29:57 +0000):
(gdb) info threads
Id Target Id Frame
3 Thread 0x7fe9614fc700 (LWP 24843) "eal-intr-thread" 0x00007fe9e46ac7a3 in epoll_wait () from /lib64/libc.so.6
2 Thread 0x7fe960cfb700 (LWP 24844) "vpp_stats" 0x00007fe9e4f7096d in nanosleep () from /lib64/libpthread.so.0
* 1 Thread 0x7fe9e6fd2900 (LWP 24738) "vpp_main" 0x00007fe9e4f6d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) thread 1
[Switching to thread 1 (Thread 0x7fe9e6fd2900 (LWP 24738))]
#0 0x00007fe9e4f6d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00007fe9e4f6d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fe9e69ac0a2 in unix_shared_memory_queue_add () from /lib64/libvlibmemory.so.0
#2 0x00000000005bca26 in vl_api_control_ping_t_handler ()
#3 0x00007fe9e6bb9b73 in vl_msg_api_handler_with_vm_node () from /lib64/libvlibapi.so.0
#4 0x00007fe9e69a5763 in memclnt_process () from /lib64/libvlibmemory.so.0
#5 0x00007fe9e6536e26 in vlib_process_bootstrap () from /lib64/libvlib.so.0
#6 0x00007fe9e54abd20 in clib_calljmp () from /lib64/libvppinfra.so.0
#7 0x00007fe9a45a0e30 in ?? ()
#8 0x00007fe9e6537ec1 in dispatch_process.constprop.14 () from /lib64/libvlib.so.0
#9 0x0000001700000010 in ?? ()
#10 0x736c706d0000003b in ?? ()
#11 0x636e692d6a64612d in ?? ()
#12 0x206574656c706d6f in ?? ()
#13 0x4552472d534c504d in ?? ()
#14 0x70207475706e6920 in ?? ()
#15 0x64207374656b6361 in ?? ()
#16 0x616c757370616365 in ?? ()
#17 0x006425203a646574 in ?? ()
#18 0x0000000000000000 in ?? ()
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fe960cfb700 (LWP 24844))]
#0 0x00007fe9e4f7096d in nanosleep () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00007fe9e4f7096d in nanosleep () from /lib64/libpthread.so.0
#1 0x000000000059e026 in ip46_fib_stats_delay.isra.5 ()
#2 0x00000000005a4d5a in stats_thread_fn ()
#3 0x00007fe9e54abd20 in clib_calljmp () from /lib64/libvppinfra.so.0
#4 0x00007fe960cfad60 in ?? ()
#5 0x00007fe9e4f69dc5 in start_thread () from /lib64/libpthread.so.0
#6 0x0000000000000000 in ?? ()
(gdb) thread 3
[Switching to thread 3 (Thread 0x7fe9614fc700 (LWP 24843))]
#0 0x00007fe9e46ac7a3 in epoll_wait () from /lib64/libc.so.6
(gdb) bt
#0 0x00007fe9e46ac7a3 in epoll_wait () from /lib64/libc.so.6
#1 0x00000000004f9094 in eal_intr_thread_main ()
#2 0x00007fe9e4f69dc5 in start_thread () from /lib64/libpthread.so.0
#3 0x00007fe9e46ac1cd in clone () from /lib64/libc.so.6
Original issue: https://jira.fd.io/browse/VPP-593