Docker profiles, container omission, fixed job cancellation #43

Open
wants to merge 18 commits into
base: main
from

Conversation

micoleaoo
Collaborator

Changes made:

  1. Added Docker profiles.
  2. Container omission if input_confs or output_confs is not present.
  3. Fixed job cancellation issues causing Galaxy errors, lingering jobs, and TESP API error loops.
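
As a rough illustration of point 2, the omission could be decided while the job's command list is built. This is a minimal sketch, not the code from this PR: `build_commands`, `join_commands`, the `stage-in`/`stage-out` placeholder strings, and the shape of the conf lists are assumptions made for the example. Only the names `input_confs` and `output_confs` come from the description above.

```python
from typing import Sequence


def build_commands(
    input_confs: Sequence[dict],
    output_confs: Sequence[dict],
    executor_command: str,
) -> list[str]:
    """Return the commands to run, omitting staging steps that have no work.

    Hypothetical helper: the "stage-in"/"stage-out" strings stand in for
    whatever container invocation the real code generates.
    """
    commands: list[str] = []
    if input_confs:  # no inputs -> no stage-in container
        commands.append(f"stage-in {len(input_confs)} input(s)")
    commands.append(executor_command)
    if output_confs:  # no outputs -> no stage-out container
        commands.append(f"stage-out {len(output_confs)} output(s)")
    return commands


def join_commands(commands: Sequence[str]) -> str:
    """Join commands so that a failing step stops the chain."""
    if not commands:
        raise ValueError("no commands to run")  # empty command check
    return " && ".join(commands)
```

With empty `input_confs` and `output_confs`, only the executor command remains, so no staging container is started.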

Problems & Fixes regarding 3rd point:

  1. Galaxy JSONDecodeError on Cancel:

    • Problem: TESP API's /cancel endpoint sent an empty response body; Galaxy expected JSON ({}).
    • Fix: Updated TESP API's /cancel endpoint (task_endpoints.py) to return JSONResponse(content={}), sending "{}".
  2. TESP API Kept Polling Cancelled Jobs:

    • Problem: handle_run_task (in event_actions.py) continued polling Pulsar for jobs already cancelled by the API, leading to LookupErrors and incorrectly changing task state from CANCELED to SYSTEM_ERROR.
    • Fixes (event_actions.py):
      • Error Handling: If polling Pulsar fails (e.g., LookupError), handle_run_task now checks the DB. If task is CANCELED, it exits gracefully (as the cancel API handled it).
      • Early Exit: Added checks to handle_run_task to stop processing if a task is found to be CANCELED early in its execution or immediately after the Pulsar job finishes/fails.
      • Cleaned Executor Errors: Ensured Pulsar jobs are erased if a task fails due to an executor error (not just cancellation).
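
The first problem can be reproduced with the standard library alone: an empty body is not valid JSON, while `{}` is. The sketch below only illustrates the client-side failure mode; `decodes_as_json` is a hypothetical helper, not Galaxy or TESP API code.

```python
import json


def decodes_as_json(body: str) -> bool:
    """Return True if the response body parses as JSON."""
    try:
        json.loads(body)
    except json.JSONDecodeError:
        return False
    return True


# An empty body (the old /cancel response) fails to decode.
empty_body_ok = decodes_as_json("")
# The body "{}" (what JSONResponse(content={}) sends, per the fix) decodes fine.
json_body_ok = decodes_as_json("{}")
```

Here `empty_body_ok` is `False` and `json_body_ok` is `True`, which is why returning an explicit empty JSON object resolves the decode error on the client side.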

Outcome:

  • Galaxy cancellations are now smooth.
  • TESP API correctly stops processing and polling for cancelled jobs.
  • Task states are managed more accurately, especially preventing CANCELED tasks from becoming SYSTEM_ERROR.
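
The polling behaviour described above could look roughly like the following. This is a sketch under stated assumptions, not the actual `handle_run_task`: `wait_for_job`, `poll_job`, and `get_task_state` are hypothetical names, and task states are plain strings standing in for `TesTaskState`.

```python
import asyncio
from typing import Awaitable, Callable

CANCELED = "CANCELED"  # stand-in for TesTaskState.CANCELED


async def wait_for_job(
    task_id: str,
    poll_job: Callable[[str], Awaitable[bool]],
    get_task_state: Callable[[str], Awaitable[str]],
    interval: float = 0.0,
) -> str:
    """Poll until the job is done; stop quietly if the task was canceled."""
    while True:
        # Early exit: do not poll a job the cancel API has already handled.
        if await get_task_state(task_id) == CANCELED:
            return "canceled"
        try:
            if await poll_job(task_id):
                return "finished"
        except LookupError:
            # The job may have vanished because it was canceled mid-poll.
            if await get_task_state(task_id) == CANCELED:
                return "canceled"
            raise  # a genuine lookup failure still surfaces as an error
        await asyncio.sleep(interval)
```

The design choice is to consult the database before treating a failed poll as a system error, so a `LookupError` caused by cancellation never overwrites the CANCELED state.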

BorisYourich and others added 8 commits May 5, 2025 10:49
…ogic/rewrote code for those conditions in event_actions.py: initialization, conditional command building, command joining, empty command check, singularity placeholders, error handling/logging
@micoleaoo micoleaoo requested a review from BorisYourich May 26, 2025 12:01
@micoleaoo micoleaoo self-assigned this May 26, 2025
@@ -5,7 +5,8 @@
from pymonad.promise import Promise
from fastapi.params import Depends
from fastapi import APIRouter, Body
from fastapi.responses import Response
# MODIFIED: Import JSONResponse
Member

@martenson martenson May 26, 2025

Please drop all LLM descriptive comments and guides. Instead, you can include short inline documentation and comments where you deem necessary, and put longer descriptions in the docs for methods and classes.

@micoleaoo micoleaoo linked an issue May 27, 2025 that may be closed by this pull request
Collaborator

@BorisYourich BorisYourich left a comment

Please reply to the comments @micoleaoo, thank you :)

print("Volumes:")
print(volumes)
output_confs, volume_confs = map_volumes(str(job_id), volumes, outputs)
mapped_outputs, mapped_volumes = map_volumes(str(job_id), volumes, outputs)
Collaborator

Unnecessary temporary variables, keep the original, or what was the reason for this change?

Collaborator Author

There was an incorrect assumption that the list already contained data. I'll keep the original.

await Promise(lambda resolve, reject: resolve(None))\
.then(lambda nothing: task_repository.update_task_state(
task_id,
TesTaskState.QUEUED,
TesTaskState.INITIALIZING
)).map(lambda updated_task: get_else_throw(
updated_task, TaskNotFoundError(task_id, Just(TesTaskState.QUEUED))
)).then(lambda updated_task: setup_data(
)).then(lambda updated_task_val: setup_data(
Collaborator

explain this change

Collaborator Author

Stylistic mishap; I'll discard this change.

task = await task_repository.update_task_state(
task_id,
TesTaskState.RUNNING,
TesTaskState.EXECUTOR_ERROR
Collaborator

The TesTaskState.EXECUTOR_ERROR change is not included in the latest changes.


if command_status.get('returncode', -1) != 0:
print(f"Task {task_id} executor error (return code: {command_status.get('returncode', -1)}). Setting state to EXECUTOR_ERROR.")
await task_repository.update_task_state(task_id, TesTaskState.RUNNING, TesTaskState.EXECUTOR_ERROR)
Collaborator

Or does it now work logically the same way as before with TesTaskState.EXECUTOR_ERROR?

Collaborator Author

Yes, the logic should be preserved and made more robust:

  1. After the Pulsar job finishes (or job_status_complete returns), we check command_status.get('returncode', -1) != 0.

  2. Before declaring an EXECUTOR_ERROR, we check whether the task has been canceled by the API. A canceled job might also produce a non-zero exit code from Pulsar's perspective (e.g., if it was killed), so we prioritize the CANCELED state set by the user.

  3. If it's not CANCELED and the return code is non-zero, then we explicitly update the task state to TesTaskState.EXECUTOR_ERROR (from TesTaskState.RUNNING).

  4. We also now explicitly call pulsar_operations.erase_job(task_id) in this path to ensure the failed Pulsar job is cleaned up.

  5. Then, we return to prevent the code from falling through to the logic that sets the state to COMPLETE.
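
The decision in steps 1 to 5 can be condensed into a small pure function. This is only a sketch of the described logic, not the code in event_actions.py: `TaskState` stands in for `TesTaskState`, `resolve_final_state` is a hypothetical name, and returning "do not erase" for the CANCELED case is an assumption (the cancel API is taken to have cleaned up already). The sketch also checks CANCELED before the return code, which is a simplification of the ordering in steps 1 and 2.

```python
from enum import Enum


class TaskState(Enum):  # stand-in for TesTaskState
    RUNNING = "RUNNING"
    CANCELED = "CANCELED"
    EXECUTOR_ERROR = "EXECUTOR_ERROR"
    COMPLETE = "COMPLETE"


def resolve_final_state(
    current: TaskState, command_status: dict
) -> tuple[TaskState, bool]:
    """Return (final_state, erase_failed_job) once the Pulsar job has ended."""
    # A missing return code is treated as a failure, mirroring .get(..., -1).
    returncode = command_status.get("returncode", -1)
    if current is TaskState.CANCELED:
        # User cancellation wins, even if the killed job exited non-zero.
        return TaskState.CANCELED, False
    if returncode != 0:
        # Genuine executor failure: mark it and clean up the Pulsar job.
        return TaskState.EXECUTOR_ERROR, True
    return TaskState.COMPLETE, False
```

For example, a canceled task whose job was killed with a non-zero code stays CANCELED, while a running task with the same code becomes EXECUTOR_ERROR and is flagged for cleanup.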

@micoleaoo micoleaoo requested a review from BorisYourich June 13, 2025 08:08
README.md Outdated

#### All services (default):
```
docker compose --profile all up -d
```
Collaborator

@BorisYourich BorisYourich Jul 14, 2025

Fix the README; this no longer applies.

Development

Successfully merging this pull request may close these issues.

Docker compose profiles for pulsar omission
3 participants