-
-
Notifications
You must be signed in to change notification settings - Fork 33
Retry on libssh SSH_AGAIN return code #756
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: devel
Are you sure you want to change the base?
Retry on libssh SSH_AGAIN return code #756
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
1 similar comment
This comment was marked as outdated.
This comment was marked as outdated.
``ssh_userauth_password`` are now retried when libssh returns SSH_AGAIN. | ||
:user:`justin-stephenson`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
``ssh_userauth_password`` are now retried when libssh returns SSH_AGAIN. | |
:user:`justin-stephenson`. | |
``ssh_userauth_password`` are now retried when ``libssh`` returns ``SSH_AGAIN`` | |
-- by :user:`justin-stephenson`. |
@@ -0,0 +1,3 @@ | |||
The ``Channel`` class calls to libssh ``ssh_channel_open_session`` and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure classes don't call things. They represent states. Could you rephrase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this could result in an infinite loop..
Also, I don't think there's a proof that this works as intended without tests. Add them.
Yes, this would go into infinite loop if the server dies and does not properly disconnect. And timeouts are to handle this issue. Retrying unconditionally and infinitely is ok for tests, but for real-world application, the pylibssh should do at very least some check with Or setting some limit how many times you could retry. But in this case, why not raise the timeout itself? |
Thank you for your comments and review.
I tried to add a test for this but unfortunately I couldn't reproduce the scenario we see in the test environment. I'll try again.
I can change this PR to add a call to Whichever you prefer is acceptable for us, just let me know and i'll make those changes. |
I am actually wondering how you are getting the SSH_AGAIN in these two places with pylibssh. The sessions in libssh are blocking by default. The only way to change the session to non-blocking mode is to use But there might be the oddness that setting low timeout might actually return the SSH_AGAIN in places where it should not, according to the documentation, which would be a bug in libssh that needs to be fixed. What brought you initially to set smaller timeouts? Is a viable workaround to raise the timeouts? |
The error we currently see in our PRCI is specific to
I added the
In our code we set
If I understand correctly, setting this low -- for reference https://github.com/next-actions/pytest-mh/blob/master/pytest_mh/conn/ssh.py |
Workaround ansible pylibssh issue which causes test failures pylibsshext.errors.LibsshChannelException: Failed to open_session: [-2] PR ansible/pylibssh#756 is under review but workaround it in the meantime.
Workaround ansible pylibssh issue which causes test failures pylibsshext.errors.LibsshChannelException: Failed to open_session: [-2] PR ansible/pylibssh#756 is under review but workaround it in the meantime.
51b9a9b
to
7afce6d
Compare
Workaround ansible pylibssh issue which causes test failures pylibsshext.errors.LibsshChannelException: Failed to open_session: [-2] PR ansible/pylibssh#756 is under review but workaround it in the meantime.
Workaround ansible pylibssh issue which causes test failures pylibsshext.errors.LibsshChannelException: Failed to open_session: [-2] PR ansible/pylibssh#756 is under review but workaround it in the meantime.
Workaround ansible pylibssh issue which causes test failures pylibsshext.errors.LibsshChannelException: Failed to open_session: [-2] PR ansible/pylibssh#756 is under review but workaround it in the meantime.
Workaround ansible pylibssh issue which causes test failures pylibsshext.errors.LibsshChannelException: Failed to open_session: [-2] PR ansible/pylibssh#756 is under review but workaround it in the meantime.
Ok, setting the libssh timeout is the timeout you are giving to the libssh to return to you. but if you are setting the low timeout to get the signals delivered, then either pylibssh or the caller needs to retry. The pylibssh code is really not written to support the retries around here so my proposal would be to create some pylibssh timeout/retry counter to avoid infinite cycle when stuff will go wrong. What do you think? It can be either separate pyblissh option, or it can be somehow intercepted when we set the libssh timeout to set it to some multiply of the user specified value to return the handling to the python code. Or the second option by default with possible override. And obviously we need some tests with this option, otherwise its untested broken code. I bet we can get some slow CI runners where this would demonstrate from time to time. |
7afce6d
to
571272e
Compare
I went ahead and updated the PR with your suggestion, by adding a new option
I added some test scaffolding that is not done yet. In the test environment I always see |
The libssh has an option https://api.libssh.org/master/group__libssh__session.html#ga7a801b85800baa3f4e16f5b47db0a73d |
55f23f6
to
94a3c16
Compare
I clicked "rebase" so this PR pulls in the CI fixes. |
The change note was lost in your last force-push. |
94a3c16
to
d3528df
Compare
Thank you. I was able to add 2 tests for this with subsecond timeout: 1 which reproduces the |
It is added now. |
Improve pylibssh handling when libssh ssh_channel_open_session() returns SSH_AGAIN. Add a new 'open_session_retries' session connect() parameter to allow a configurable number of retries. SSH_AGAIN may be returned when setting a low SSH options timeout value. The default option value is 0, no retries will be attempted.
9f10b46
to
0c23524
Compare
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is conceptually going in the right direction. It will need some polishing to make the CI and happy. I would also try to consider if other places where this might happen (from my understanding this could happen almost everywhere, but I might be wrong. If I am not, we will need more generic description than we have now).
At least for what we have a coverage could be tried by setting small timeout in the test fixture ssh_session_connect()
to see where it will fail.
SUMMARY
When a low SSH options timeout value is set, we see sometimes that calls to
new_channel()
andssh_channel_open_session
fail when libssh returnsSSH_AGAIN
. Currently, pylibssh returns an exception:SSH_AGAIN return code is documented https://api.libssh.org/master/group__libssh__channel.html#gaf051dd30d75bf6dc45d1a5088cf970bd
It is not clearly stated this but SSH_AGAIN also happens due to timeout.
ISSUE TYPE
ADDITIONAL INFORMATION
This issue happens in our https://github.com/next-actions/pytest-mh project.
CC @pbrezina