Commit 61f8d06

H-Huang authored and facebook-github-bot committed
Allow ports to be reused in gloo (#97677)
Summary:
X-link: pytorch/pytorch#97677
Pull Request resolved: #353

ProcessGroupGloo and gloo seem to open and close sockets without allowing the port to be reused. We see this issue surface in larger training jobs as "Address already in use", and we assume it is because all the ephemeral ports are exhausted. This diff allows ports to be reused; with it, we see fewer ports in the `TIME_WAIT` state.

context: https://fb.workplace.com/groups/319878845696681/permalink/5988899781205532/
another issue: https://fb.workplace.com/groups/319878845696681/permalink/958768178474408/

Differential Revision: D44029927
fbshipit-source-id: 4a1483d0eceda01ffd02c7747282129f7f4a2efe
1 parent 56b221c commit 61f8d06

1 file changed: +9 -0 lines changed

gloo/transport/tcp/device.cc

Lines changed: 9 additions & 0 deletions
@@ -101,6 +101,15 @@ static void lookupAddrForHostname(struct attr& attr) {
   struct addrinfo* rp;
   for (rp = result; rp != nullptr; rp = rp->ai_next) {
     auto fd = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
+
+    // Set SO_REUSEADDR to signal that reuse of the listening port is OK.
+    int on = 1;
+    rv = setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, reinterpret_cast<const char*>(&on), sizeof(on));
+    if (rv == -1) {
+      close(fd);
+      GLOO_ENFORCE_NE(rv, -1);
+    }
+
     if (fd == -1) {
       continue;
     }
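
One thing worth noting about the hunk above: `setsockopt` runs before the existing `fd == -1` check. If `socket()` fails for one address, `setsockopt` on the invalid descriptor would also be expected to fail, so the enforce would fire instead of the loop continuing to the next address. A minimal standalone sketch of the same idea with the order swapped is shown below. The helper names `makeReusableSocket` and `reuseAddrEnabled` are hypothetical and are not part of gloo; error handling is reduced to return codes rather than `GLOO_ENFORCE_NE`.

```cpp
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>

// Hypothetical helper (not part of gloo): create a socket and enable
// SO_REUSEADDR on it. Returns the descriptor, or -1 with errno set.
int makeReusableSocket(int family, int socktype, int protocol) {
  int fd = socket(family, socktype, protocol);
  if (fd == -1) {
    // Check the descriptor first, so the caller can move on to the
    // next candidate address instead of failing inside setsockopt.
    return -1;
  }

  // Set SO_REUSEADDR to signal that reuse of the listening port is OK.
  int on = 1;
  if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) == -1) {
    int saved = errno;  // close() may overwrite errno
    close(fd);
    errno = saved;
    return -1;
  }
  return fd;
}

// Hypothetical helper: read the option back to confirm it is enabled.
bool reuseAddrEnabled(int fd) {
  assert(fd >= 0);
  int value = 0;
  socklen_t len = sizeof(value);
  if (getsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &value, &len) == -1) {
    return false;
  }
  return value != 0;
}
```

In a loop over `getaddrinfo` results, a caller could `continue` when this helper returns -1, which keeps the original fall-through behavior for addresses whose socket cannot be created.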
