Retry on libssh SSH_AGAIN return code #756

Open
wants to merge 4 commits into
base: devel
from

justin-stephenson

@justin-stephenson justin-stephenson commented Jul 30, 2025

SUMMARY

When a low SSH options timeout value is set, calls to new_channel() and ssh_channel_open_session() sometimes fail because libssh returns SSH_AGAIN. Currently, pylibssh raises an exception:


../../../../pytest-mh/pytest_mh/conn/ssh.py:285: in _run
    self.__channel = self.__conn.new_channel()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^
src/pylibsshext/session.pyx:514: in pylibsshext.session.Session.new_channel
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>   ???
E   pylibsshext.errors.LibsshChannelException: Failed to open_session: [-2]
src/pylibsshext/channel.pyx:71: LibsshChannelException

The SSH_AGAIN return code is documented at https://api.libssh.org/master/group__libssh__channel.html#gaf051dd30d75bf6dc45d1a5088cf970bd

The documentation does not state this clearly, but SSH_AGAIN is also returned when the call times out.

ssh_channel_open_session()

Returns
    SSH_OK on success, SSH_ERROR if an error occurred, SSH_AGAIN if in nonblocking mode and call has to be done again.
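
For reference, the `[-2]` in the exception message above is the numeric value of SSH_AGAIN. A minimal sketch of the mapping follows, with the values as defined in libssh's public header; `describe_rc` is a hypothetical helper for illustration and is not part of pylibssh:

```python
# Return codes as defined in libssh's public header (libssh.h).
SSH_OK = 0
SSH_ERROR = -1
SSH_AGAIN = -2
SSH_EOF = -127

_RC_NAMES = {
    SSH_OK: "SSH_OK",
    SSH_ERROR: "SSH_ERROR",
    SSH_AGAIN: "SSH_AGAIN",
    SSH_EOF: "SSH_EOF",
}


def describe_rc(rc: int) -> str:
    """Return a readable name for a libssh return code (hypothetical helper)."""
    return _RC_NAMES.get(rc, f"unknown ({rc})")
```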

ISSUE TYPE

  • Bugfix Pull Request

ADDITIONAL INFORMATION

This issue happens in our https://github.com/next-actions/pytest-mh project.

CC @pbrezina

@psf-chronographer psf-chronographer bot added the bot:chronographer:provided There is a change note present in this PR label Jul 30, 2025
Comment on lines 2 to 3
``ssh_userauth_password`` are now retried when libssh returns SSH_AGAIN.
:user:`justin-stephenson`.
Member

Suggested change
- ``ssh_userauth_password`` are now retried when libssh returns SSH_AGAIN.
- :user:`justin-stephenson`.
+ ``ssh_userauth_password`` are now retried when ``libssh`` returns ``SSH_AGAIN``
+ -- by :user:`justin-stephenson`.

@@ -0,0 +1,3 @@
The ``Channel`` class calls to libssh ``ssh_channel_open_session`` and
Member

I'm pretty sure classes don't call things. They represent states. Could you rephrase?

Member

@webknjaz webknjaz left a comment

I wonder if this could result in an infinite loop..

Also, I don't think there's a proof that this works as intended without tests. Add them.

@KB-perByte KB-perByte self-requested a review August 6, 2025 09:42
@Jakuje
Contributor

Jakuje commented Aug 6, 2025

I wonder if this could result in an infinite loop..

Yes, this would go into an infinite loop if the server dies and does not properly disconnect. Timeouts are there to handle this issue.

Retrying unconditionally and infinitely is OK for tests, but for a real-world application, pylibssh should at the very least do some check with ssh_is_connected() or something similar.

Or set some limit on how many times you can retry. But in this case, why not raise the timeout itself?
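
A rough sketch of the two safeguards mentioned here, a connectivity check and a retry limit, might look like the following. It is an illustration under stated assumptions, not the PR's implementation: `open_session` and `is_connected` are plain callables standing in for ssh_channel_open_session() and ssh_is_connected(), and `OpenSessionError` and `open_with_retries` are hypothetical names.

```python
SSH_OK, SSH_ERROR, SSH_AGAIN = 0, -1, -2


class OpenSessionError(Exception):
    """Hypothetical stand-in for pylibssh's LibsshChannelException."""


def open_with_retries(open_session, is_connected, max_retries=3):
    """Call open_session() until it stops returning SSH_AGAIN.

    Returns the number of retries that were needed. Both arguments are
    callables standing in for the corresponding libssh functions.
    """
    retries = 0
    while True:
        rc = open_session()
        if rc == SSH_OK:
            return retries
        if rc != SSH_AGAIN:
            # A real error: never retried.
            raise OpenSessionError(f"Failed to open_session: [{rc}]")
        # SSH_AGAIN: retry only while the peer is alive and budget remains.
        if not is_connected():
            raise OpenSessionError("connection lost while retrying")
        if retries >= max_retries:
            raise OpenSessionError(f"Failed to open_session: [{rc}]")
        retries += 1
```

With `max_retries=0` the function makes a single attempt, so the existing behavior is preserved unless the caller opts in.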

@justin-stephenson
Author

Thank you for your comments and review.

I wonder if this could result in an infinite loop..

Also, I don't think there's a proof that this works as intended without tests. Add them.

I tried to add a test for this, but unfortunately I couldn't reproduce the scenario we see in the test environment. I'll try again.

I wonder if this could result in an infinite loop..

Yes, this would go into an infinite loop if the server dies and does not properly disconnect. Timeouts are there to handle this issue.

Retrying unconditionally and infinitely is OK for tests, but for a real-world application, pylibssh should at the very least do some check with ssh_is_connected() or something similar.

Or set some limit on how many times you can retry. But in this case, why not raise the timeout itself?

I can change this PR to add a call to ssh_is_connected() to avoid an infinite loop, or I can raise a different exception when SSH_AGAIN is returned (like LibsshChannelAgain), which we would then handle in our calls to pylibssh methods.

Whichever you prefer is acceptable for us; just let me know and I'll make those changes.
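
The second option, a dedicated exception that the caller handles, could be sketched along these lines. Everything here is hypothetical: the classes are local stand-ins rather than the real pylibsshext.errors types, making `LibsshChannelAgain` a subclass is only one possible design, and `check_open_session` and `caller_side_retry` are illustrative names.

```python
SSH_OK, SSH_ERROR, SSH_AGAIN = 0, -1, -2


class LibsshChannelException(Exception):
    """Local stand-in for pylibsshext.errors.LibsshChannelException."""


class LibsshChannelAgain(LibsshChannelException):
    """Hypothetical subclass raised only when libssh returns SSH_AGAIN."""


def check_open_session(rc):
    """Translate a return code into success or the matching exception."""
    if rc == SSH_AGAIN:
        raise LibsshChannelAgain(f"Failed to open_session: [{rc}]")
    if rc != SSH_OK:
        raise LibsshChannelException(f"Failed to open_session: [{rc}]")


def caller_side_retry(open_session, max_attempts=5):
    """Caller-owned retry policy: only the 'again' case is retried.

    Returns the attempt number that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            check_open_session(open_session())
            return attempt
        except LibsshChannelAgain:
            if attempt == max_attempts:
                raise
```

Because the new exception derives from the existing one in this sketch, callers that already catch LibsshChannelException would keep working unchanged.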

@Jakuje
Contributor

Jakuje commented Aug 7, 2025

I am actually wondering how you are getting SSH_AGAIN in these two places with pylibssh. Sessions in libssh are blocking by default. The only way to change a session to non-blocking mode is to use ssh_set_blocking() or some variation of ssh_channel_read_nonblocking(), but I see your changes are somewhere else entirely, so this should not come into effect, and I do not see these functions exposed in pylibssh either.

But there might be an oddity where setting a low timeout actually returns SSH_AGAIN in places where, according to the documentation, it should not, which would be a bug in libssh that needs to be fixed.

What brought you to set smaller timeouts initially? Is raising the timeouts a viable workaround?

@justin-stephenson
Author

I am actually wondering how you are getting SSH_AGAIN in these two places with pylibssh. Sessions in libssh are blocking by default. The only way to change a session to non-blocking mode is to use ssh_set_blocking() or some variation of ssh_channel_read_nonblocking(), but I see your changes are somewhere else entirely, so this should not come into effect, and I do not see these functions exposed in pylibssh either.

The error we currently see in our PRCI is specific to an ssh_channel_open_session failure:

FAILED tests/test_authentication.py::test_authentication__user_login_with_overriding_home_directory[domain] (ldap) - pylibsshext.errors.LibsshChannelException: Failed to open_session: [-2]

I added the session: commit just as a nice-to-have because ssh_userauth_password() can return SSH_AGAIN per the libssh API docs, but the channel.pyx commit addresses the main issue we are currently hitting.

But there might be an oddity where setting a low timeout actually returns SSH_AGAIN in places where, according to the documentation, it should not, which would be a bug in libssh that needs to be fixed.

What brought you to set smaller timeouts initially? Is raising the timeouts a viable workaround?

In our code we set .set_ssh_options("timeout", 1) because in our pytest-mh code we allow users to execute commands over SSH on hosts with an arbitrary timeout value set, such as:

client.host.conn.run(..., timeout=X)

If I understand correctly, setting this low set_ssh_options("timeout") value is necessary for the above to work as expected, because Python will not deliver a signal while the code is blocked in a C library. The signal is delivered only after we get back to the Python code. @pbrezina can correct me here.

-- for reference https://github.com/next-actions/pytest-mh/blob/master/pytest_mh/conn/ssh.py
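
The pattern described above, a short libssh timeout repeated under a longer caller-level deadline, can be sketched roughly as follows. This is an illustration only: `run_with_deadline` and `step` are hypothetical names, not pytest-mh or pylibssh APIs, and a plain deadline check stands in for the signal-based timeout.

```python
import time


def run_with_deadline(step, overall_timeout, clock=time.monotonic):
    """Repeat a short blocking step until it finishes or the deadline passes.

    step() stands in for a libssh call bounded by a short SSH timeout; it
    returns True when the work is done and False when it merely timed out.
    Control returns to Python between steps, which is when pending Python
    signal handlers get a chance to run.
    """
    deadline = clock() + overall_timeout
    while True:
        if step():
            return True
        if clock() >= deadline:
            return False
```

The `clock` parameter exists only so the loop can be exercised with a fake clock; by default it uses `time.monotonic`.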

justin-stephenson added a commit to justin-stephenson/sssd that referenced this pull request Aug 11, 2025
Workaround ansible pylibssh issue which causes test failures

   pylibsshext.errors.LibsshChannelException: Failed to open_session: [-2]

PR ansible/pylibssh#756 is under review
but workaround it in the meantime.
justin-stephenson added a commit to justin-stephenson/sssd that referenced this pull request Aug 12, 2025
Workaround ansible pylibssh issue which causes test failures

   pylibsshext.errors.LibsshChannelException: Failed to open_session: [-2]

PR ansible/pylibssh#756 is under review
but workaround it in the meantime.
@Jakuje
Contributor

Jakuje commented Aug 14, 2025

OK, the libssh timeout is the time you are giving libssh before it returns to you. But if you are setting a low timeout to get signals delivered, then either pylibssh or the caller needs to retry. The pylibssh code is really not written to support retries here, so my proposal would be to create some pylibssh timeout/retry counter to avoid an infinite cycle when things go wrong. What do you think?

It can be either a separate pylibssh option, or it can somehow be intercepted when we set the libssh timeout, to set it to some multiple of the user-specified value, to return the handling to the Python code. Or the second option by default, with a possible override.

And obviously we need some tests with this option, otherwise it's untested, broken code. I bet we can get some slow CI runners where this would show up from time to time.

@justin-stephenson
Author

OK, the libssh timeout is the time you are giving libssh before it returns to you. But if you are setting a low timeout to get signals delivered, then either pylibssh or the caller needs to retry. The pylibssh code is really not written to support retries here, so my proposal would be to create some pylibssh timeout/retry counter to avoid an infinite cycle when things go wrong. What do you think?

It can be either a separate pylibssh option, or it can somehow be intercepted when we set the libssh timeout, to set it to some multiple of the user-specified value, to return the handling to the Python code. Or the second option by default, with a possible override.

I went ahead and updated the PR with your suggestion by adding a new option, open_session_retries, which can be provided to the connect() method in src/pylibsshext/session.pyx. To make it less invasive, I set the default value to 0, so the default libssh behavior will not change. Please take a look.

And obviously we need some tests with this option, otherwise it's untested, broken code. I bet we can get some slow CI runners where this would show up from time to time.

I added some test scaffolding that is not done yet. In the test environment I always see ssh_channel_open_session() return SSH_OK instead of SSH_AGAIN, even when setting a low timeout, so I don't see how to test the retries properly in the test environment.

@Jakuje
Contributor

Jakuje commented Aug 20, 2025

I added some test scaffolding that is not done yet. In the test environment I always see ssh_channel_open_session() return SSH_OK instead of SSH_AGAIN, even when setting a low timeout, so I don't see how to test the retries properly in the test environment.

libssh has an option, SSH_OPTIONS_TIMEOUT_USEC, to set subsecond timeouts. If this could help reproduce the issue (or spend less time in C code), we can either expose this option, or again do some mangling to allow float input on the Python side and then convert it to SSH_OPTIONS_TIMEOUT or SSH_OPTIONS_TIMEOUT_USEC. I am not sure which would be nicer or easier from the API point of view, but I think the second approach sounds mostly transparent for users.

https://api.libssh.org/master/group__libssh__session.html#ga7a801b85800baa3f4e16f5b47db0a73d
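
The float-input idea could be sketched as a small conversion step on the Python side. This is a minimal illustration, assuming the whole seconds would go to SSH_OPTIONS_TIMEOUT and the remainder, in microseconds, to SSH_OPTIONS_TIMEOUT_USEC; `split_timeout` is a hypothetical helper and not an existing pylibssh function.

```python
def split_timeout(seconds: float) -> tuple[int, int]:
    """Split a float timeout into (whole seconds, microseconds).

    The first value is meant for SSH_OPTIONS_TIMEOUT and the second for
    SSH_OPTIONS_TIMEOUT_USEC (assumption: both options are set together).
    """
    if seconds < 0:
        raise ValueError("timeout must be non-negative")
    # Round to the nearest microsecond to absorb float representation noise.
    total_usec = round(seconds * 1_000_000)
    return divmod(total_usec, 1_000_000)
```

For example, `split_timeout(1.5)` yields `(1, 500000)`, and an integer input such as `2` passes through as `(2, 0)`, so existing callers would not notice the change.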

@webknjaz
Member

I clicked "rebase" so this PR pulls in the CI fixes.

@webknjaz webknjaz removed the bot:chronographer:provided There is a change note present in this PR label Aug 21, 2025
@webknjaz
Member

The change note was lost in your last force-push.

@psf-chronographer psf-chronographer bot added the bot:chronographer:provided There is a change note present in this PR label Aug 21, 2025
@justin-stephenson
Author

I added some test scaffolding that is not done yet. In the test environment I always see ssh_channel_open_session() return SSH_OK instead of SSH_AGAIN, even when setting a low timeout, so I don't see how to test the retries properly in the test environment.

libssh has an option, SSH_OPTIONS_TIMEOUT_USEC, to set subsecond timeouts. If this could help reproduce the issue (or spend less time in C code), we can either expose this option, or again do some mangling to allow float input on the Python side and then convert it to SSH_OPTIONS_TIMEOUT or SSH_OPTIONS_TIMEOUT_USEC. I am not sure which would be nicer or easier from the API point of view, but I think the second approach sounds mostly transparent for users.

https://api.libssh.org/master/group__libssh__session.html#ga7a801b85800baa3f4e16f5b47db0a73d

Thank you. I was able to add two tests for this using a subsecond timeout: one that reproduces the LibsshChannelException: Failed to open_session: exception, and one that fixes it by setting the new open_session_retries value. I updated the PR; it is now ready for review.

@justin-stephenson
Author

The change note was lost in your last force-push.

It is added now.

Improve pylibssh handling when libssh ssh_channel_open_session()
returns SSH_AGAIN. Add a new 'open_session_retries' session connect()
parameter to allow a configurable number of retries. SSH_AGAIN may be
returned when setting a low SSH options timeout value.

The default option value is 0, no retries will be attempted.
Contributor

@Jakuje Jakuje left a comment

I think this is conceptually going in the right direction. It will need some polishing to make the CI happy. I would also consider whether there are other places where this might happen (from my understanding this could happen almost everywhere, but I might be wrong; if I am not, we will need a more generic description than we have now).
At least for what we have, coverage could be checked by setting a small timeout in the test fixture ssh_session_connect() to see where it fails.

Labels
bot:chronographer:provided There is a change note present in this PR
3 participants