Add Galaxy wrapper for BiaPy tool #7168
base: main
Conversation
A few first comments.
tools/biapy/biapy.xml
Outdated
</param>
<when value="biapy_pretrained">
    <!-- Q4 -->
    <param name="biapy_model_path" type="data" format="data" label="Select the model trained with BiaPy before" help="This question is about setting the path to a pre-trained model trained before using BiaPy. The model should be stored in the folder you previously selected for saving results. Specifically, in the 'checkpoints' folder, there will be a file with a '.pth' extension, which represents the checkpoint of the pre-trained model"/>
Do we have a more specific format than data in Galaxy?
Same here, I'm not sure how I need to handle this.
The model checkpoint expected here is in state_dict format (.pth extension). Do you have something for that, or can I leave it as data as it is? I suppose that I would need to do the ln -s as it is done with images, no?
Can someone confirm this please? @bgruening
We do not have a datatype for these pytorch models. It seems that under the hood these are just pickled files. We try to avoid these files since pickle seems to be unsafe: https://stackoverflow.com/questions/25353753/python-can-i-safely-unpickle-untrusted-data
Are there alternative formats that can be used?
My question was more related to how to handle the file, i.e. whether I need to proceed as with the training/test data.
Can you be more specific?
As there is no specific format for this file extension, I assume I can just keep the generic data format I was using.
I need to stress that processing pth files (i.e. pickle files) is problematic. There is no problem if you work on private infrastructure and only have trustworthy users, but on public infrastructure it may become one, because pickle files may allow arbitrary code execution.
It would be really cool if BiaPy supported a safer format for saving the models.
@danifranco if onnx is not possible, do you think you can support https://github.com/huggingface/safetensors ?
Can you be more specific?
In the same way you explained here: #7168 (comment). Because this is the way Galaxy handles the input files, no?
Regarding the checkpoint, we do not support any extension other than .pth. We load the weights as follows:
torch.load(checkpoint_path, weights_only=True, map_location="cpu")
This restricts deserialization to tensors only, blocking arbitrary Python object execution and greatly reducing pickle-related risks. We also load these weights into a predefined, trusted model architecture (never from user-supplied code).
In the same way you explained here: #7168 (comment). Because this is the way Galaxy handles the input files, no?
If biapy requires a specific file extension for any of the files you need to create a symlink before calling the main program, e.g. ln -s '$mode_selection.selected_mode.cfg_config_process.pretrained_model.biapy_model_path' biapy_model.pth and then use the symlink.
I have just made a new release of BiaPy supporting safetensors. In the code here I made changes to always load and generate .safetensors checkpoints. I'm pushing the changes now.
Wuha! This is cool!
There are no paths in such files, are there?
You can use https://docs.galaxyproject.org/en/latest/dev/schema.html#tool-configfiles-configfile for this; it might make the wrapper easier.
The tool help is the best place. You can write this tool help in markdown or rst.
Yes! :)
There is no automatic detection in place? If so, use an environment variable. Something like
I think you can, I would not do that but rather use the tool-help.
You can have two nested conditionals.
On which data should this operate? The input data? If so you can add a separate tool that validates the input and then you create a workflow that runs 1. validation and 2. your tool.
Leave them out for now. @matuskalas has promised to merge that BIII stuff into EDAM at some point :(
It is acceptable, but I think it's rather cumbersome for you, I can imagine. Please look at https://docs.galaxyproject.org/en/latest/dev/schema.html#tool-configfiles-configfile ... you can just "fill out" a YAML file (see the sketch after these answers).
What is your output? You can put
Can you provide us with a listing of a typical output file and mark what is relevant for users? Thanks for the tool, that looks very cool!
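For the configfile point, a minimal sketch of what is meant (the YAML keys and the Cheetah variable names below are only placeholders, not BiaPy's actual configuration schema):

<configfiles>
    <configfile name="biapy_config"><![CDATA[
PROBLEM:
    TYPE: '$workflow_type'
DATA:
    TRAIN_PATH: '$train_images'
]]></configfile>
</configfiles>

Galaxy writes the filled-in template to a file before the job runs, and its path is available in the command section as '$biapy_config', so it can be passed (or symlinked) to whatever filename BiaPy expects.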
Here I go:
Oh, true. The paths should be updated as they must match the local ones. I will redo this part, as a config file will always need to be stored.
Thanks for the hint. Will take a look!
Oh, perfect!
It's been accepted in conda, so everything is ready to go! (https://anaconda.org/conda-forge/biapy)
It's not automatic in BiaPy, it is up to the user. I don't know if I'm understanding correctly how the GPU is selected here, but maybe we can add some check before calling BiaPy to detect whether a GPU is available and, if so, call BiaPy with '--gpu 0'. Something like this (assuming the shell is bash):
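A rough sketch of that kind of check, assuming nvidia-smi is available on the node for the detection (inside a Galaxy <command> block the shell dollar signs would need to be escaped as \$):

if command -v nvidia-smi > /dev/null 2>&1 && nvidia-smi -L > /dev/null 2>&1; then
    GPU_ARG='--gpu 0';
else
    GPU_ARG='';
fi;
# ...and then append $GPU_ARG to the BiaPy call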
What do you think?
I wrote it in markdown, so I think it will look better now.
I think I cannot use them (I just checked again with ChatGPT). I'm referring to having nested conditionals not inside the
Yeah, the validation would be done on the provided input paths. I mean, that check is always done inside BiaPy anyway; the idea was just to let the user verify the data before running anything else. I'd like to have something that runs right after the path is given, so if that is not possible, I don't know whether creating a separate tool to check the data would add more complexity. If the only option is to create another tool, maybe we can leave this out for now and see if it would be a good thing to add in the future.
ok, no problem!
Mmmm, I don't fully understand how things are filled in there. My approach should work too, no?
The output is a folder with different files depending on the workflow. For example, for a semantic segmentation workflow the output is something like this:
Some of the files are optional, so I was thinking of creating a test that just checks that the main files are generated, such as the predictions from the deep learning model. Thanks a lot for the support and the quick response!
@danifranco please let us know if you need any help. The linter complains that you are missing a .shed.yml file. This will then also trigger additional tests. If you have any questions about the config-file handling, let me know.
Hi, I still need help with points 3 and 9. The third one is maybe quick, but for point 9 I would need a bit more guidance on how to test the output.
For 3) just assume there is a variable (maybe call it BIAPY_GPU_ENABLED) and it's set to "true". If this is there, just enable the GPU. The Galaxy admin can then set this for this tool if needed. For 9) I would do the following: add a user-input option (type="select" multiple="true") and let the user specify which outputs they want, e.g. charts, binarized images, post-processed images ... And then you define the output collections like this: tools-iuc/tools/hybpiper/hybpiper.xml Line 220 in bb37578
Every collection can contain as many files as you want. I hope that makes sense. Let me know if we can help.
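A sketch of how that could look (the option values and the train_charts directory are taken from the current wrapper; the rest is illustrative):

<param name="outs" type="select" multiple="true" label="Outputs to keep">
    <option value="raw" selected="true">Raw predictions</option>
    <option value="tcharts">Training charts</option>
    <option value="tlogs">Training logs</option>
    <option value="checkpoint">Model checkpoint</option>
</param>

<collection name="train_charts" type="list" label="${tool.name} on ${on_string}: training charts">
    <filter>"tcharts" in outs</filter>
    <discover_datasets pattern="__name_and_ext__" directory="train_charts"/>
</collection>

The param lives in the inputs section and the collection in the outputs section; with one filter per collection, only the outputs the user selected show up in the history.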
tools/biapy/biapy.xml
Outdated
ln -s '$mode_selection["biapy_model_path"]' ${checkpoint_file} 2>/dev/null || true &&
#set $mpath = $mode_selection.get('biapy_model_path')
#if $mpath and str($mpath) not in ['None', '']
    ln -s '$mpath' ${checkpoint_file} 2>/dev/null || true &&
Why are you hiding the errors? Would the following be an option for all the symlink creations:
ln -s '$mpath' ${checkpoint_file} 2>/dev/null || true &&
ln -fs '$mpath' ${checkpoint_file} &&
You are right! I have just added some changes.
…puts, only those that the user selected
…he files exists previously in bash. This will prevent crashing when the users check some outputs even when they are not created
I think I have finally corrected everything. What do you think?
Some more comments. The biggest problem I still have is with the pickle files.
tools/biapy/biapy.xml
Outdated
<option value="torchvision">TorchVision</option>
</param>
<when value="bmz">
    <param name="bmz_model_name" type="text" format="data" value="sensible-cat" label="BioImage Model Zoo model name" help="Enter the name of a pre-trained model from the BioImage Model Zoo (https://bioimage.io/#/models); filter by the BiaPy icon and ensure the model matches your dimensionality (2D/3D) and task (e.g. semantic segmentation)."/>
type="data"?
It would be cool to have tests for more of the ways this tool can be used.
If you use name="model_name" here and in the torchvision case you can simplify the command section (because the parameter can be accessed in the same way in the parameter dict).
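For example, something along these lines (the conditional and select names are made up here; only the bmz/torchvision values and the shared model_name come from the discussion):

<conditional name="pretrained_source">
    <param name="source" type="select" label="Source of the pre-trained model">
        <option value="bmz">BioImage Model Zoo</option>
        <option value="torchvision">TorchVision</option>
    </param>
    <when value="bmz">
        <param name="model_name" type="text" value="sensible-cat" label="BioImage Model Zoo model name"/>
    </when>
    <when value="torchvision">
        <param name="model_name" type="text" label="TorchVision model name"/>
    </when>
</conditional>

The command section can then read $pretrained_source.model_name regardless of which branch was chosen.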
I guess you want to add a no_value validator in order to require users to enter something here.
Do we know anything about the names? Could we add a regexp validator to avoid users entering malicious text?
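If the BMZ nicknames are limited to letters, digits, dots, dashes and underscores (an assumption that would need checking), something like this could do it:

<param name="bmz_model_name" type="text" value="sensible-cat" label="BioImage Model Zoo model name">
    <validator type="regex" message="Only letters, digits, '.', '_' and '-' are allowed">^[A-Za-z0-9._-]+$</validator>
</param>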
Thanks for the comments, I think I have addressed everything I can in the commits that I will push soon.
Regarding the tests: I have added one more test, although it is quite similar to the second one. At the moment, I am somewhat limited, as I cannot train models on CPU; that would take too much time. Because of this, the tests are restricted to inference only.
In the first test, I demonstrated the branch that allows loading a custom configuration file. The results are not good, since no pretrained weights are being loaded.
In the second test, I showcased the branch for creating a custom configuration file and loading a pretrained model from the BioImage Model Zoo. In this case, the results are good, as the pretrained model I used was originally trained on data similar to my input.
Finally, I added a third test using a different workflow in BiaPy to denoise images. The results here are also not good, again because I am not loading a pretrained model.
The main challenge is that I cannot include a local .pth (model weights) file in the test-data folder, as these files exceed 5 MB and would cause the Galaxy tool linting to fail. The only workaround is to configure the tests to load pretrained models directly from the BioImage Model Zoo (BMZ). However, I would prefer not to rely exclusively on this setup, since it depends on the BMZ website, which is not always consistently available, and BiaPy has much more to offer than always loading pretrained models from the zoo.
tools/biapy/biapy.xml
Outdated
#if 'raw' in $outs
    ########
    ## RAW #
    ########
    && mkdir -p raw && chmod -R 777 raw && {
    ## Instance segmentation
    if [ -d "output/my_experiment/results/my_experiment_1/per_image_instances" ]; then
        cp output/my_experiment/results/my_experiment_1/per_image_instances/* raw/;
    ## Instance segmentation
    elif [ -d "output/my_experiment/results/my_experiment_1/full_image_instances" ]; then
        cp output/my_experiment/results/my_experiment_1/full_image_instances/* raw/;
    ## Semantic segmentation
    elif [ -d "output/my_experiment/results/my_experiment_1/per_image_binarized" ]; then
        cp output/my_experiment/results/my_experiment_1/per_image_binarized/* raw/;
    ## Semantic segmentation
    elif [ -d "output/my_experiment/results/my_experiment_1/full_image_binarized" ]; then
        cp output/my_experiment/results/my_experiment_1/full_image_binarized/* raw/;
    ## I2I
    elif [ -d "output/my_experiment/results/my_experiment_1/full_image" ]; then
        cp output/my_experiment/results/my_experiment_1/full_image/* raw/;
    ## Detection
    elif [ -d "output/my_experiment/results/my_experiment_1/per_image_local_max_check" ]; then
        cp output/my_experiment/results/my_experiment_1/per_image_local_max_check/* raw/;
    ## Detection, Denoising, I2I, SSL, SR
    elif [ -d "output/my_experiment/results/my_experiment_1/per_image" ]; then
        cp output/my_experiment/results/my_experiment_1/per_image/* raw/;
Would it be an option to discover_datasets from output/my_experiment/results/my_experiment_1/ and use recursive? Then we can save all this copying. Otherwise, since you only use h5 and tiff files, we might want to copy only those.
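A sketch of that idea (collection name and label are illustrative):

<collection name="raw" type="list" label="${tool.name} on ${on_string}: results">
    <discover_datasets pattern="__name_and_ext__" directory="output/my_experiment/results/my_experiment_1" recurse="true"/>
</collection>

With recurse="true" every file below the results folder would be picked up without the copy block.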
We cannot really do it that way. Each experiment generates additional tiff/h5 files inside subfolders, which is why I originally proposed using the entire results folder as the output. That approach avoids having to reproduce all this logic and also keeps the configuration much simpler. It is also important to note that, in this case with the current if-else commands, the order in which the data is copied must be preserved.
tools/biapy/biapy.xml
Outdated
&& mkdir -p post_proc && chmod -R 777 post_proc && {
## Instance segmentation
if [ -d "output/my_experiment/results/my_experiment_1/per_image_post_processing" ]; then
    cp output/my_experiment/results/my_experiment_1/per_image_post_processing/* post_proc/;
## Instance segmentation
elif [ -d "output/my_experiment/results/my_experiment_1/full_image_post_processing" ]; then
    cp output/my_experiment/results/my_experiment_1/full_image_post_processing/* post_proc/;
## Detection
elif [ -d "output/my_experiment/results/my_experiment_1/per_image_local_max_check_post_proc" ]; then
    cp output/my_experiment/results/my_experiment_1/per_image_local_max_check_post_proc/* post_proc/;
fi;
}
#end if
Same question: discover from the longest common prefix using recursive?
Same answer as before.
tools/biapy/biapy.xml
Outdated
#if $phase in ['train_test', 'train']:
    ## Error redirection was necessary because chart creation does not always occur at the default interval of every 5 epochs
    #if 'tcharts' in $outs
        && mkdir -p train_charts && chmod -R 777 train_charts &&
        cp output/my_experiment/results/my_experiment_1/charts/*.png train_charts/ 2>/dev/null || true
    #end if
    #if 'tlogs' in $outs
        && mkdir -p train_logs && chmod -R 777 train_logs &&
        cp output/my_experiment/train_logs/*.txt train_logs/ 2>/dev/null || true
    #end if
#end if

#if 'checkpoint' in $outs
    && mkdir -p checkpoints && chmod -R 777 checkpoints &&
    cp output/my_experiment/checkpoints/*.pth checkpoints/ 2>/dev/null || true
#end if
Can these copies be saved? Just discover from these paths?
For these ones I think we can apply discover_datasets, yes. I'm pushing all the changes soon.
Co-authored-by: M Bernt <[email protected]>
Hi all,
I'm submitting this initial version of a Galaxy wrapper for BiaPy, a deep learning framework for bioimage analysis.
This tool is designed to run a complete BiaPy workflow with two main usage modes:
I’ve been working on this for a few days and, as this is my first Galaxy tool, I have some questions and doubts that I would really appreciate your feedback on:
1. Parameter help text. Is there a better way to provide detailed help for parameters, other than using long help="..." strings directly in the XML? The inputs have quite a bit of context, so I'm wondering if there's a cleaner way to manage this.
2. Conda packaging. I'm in the process of adding BiaPy to conda-forge (PR here). Once it's accepted, can I declare it in the requirements section using the conda-forge channel?
3. GPU support. BiaPy supports specifying a GPU index as a parameter (e.g. --gpu 0). Training without a GPU is very slow, but inference is fine on CPU. What's the best practice in Galaxy for handling GPU support? Should I remove this parameter or keep it optional?
4. Clickable links in parameter help. Can we include URLs in help="" so they are rendered as clickable links in the Galaxy interface? I did not test how they look, because there are still things to do in this PR before it passes the checks and I can try it locally, so maybe they are already displayed as links.
5. Conditional logic depending on two parameters. I'd like to show or hide certain inputs based on the combination of two user choices. From this issue, it seems that's not yet supported. Is that still the case? If so, I've repeated some inputs unnecessarily to work around this, but it would be good to confirm.
6. GUI-side "check consistency" button. In our current GUI version of BiaPy (outside Galaxy), we allow users to click a button to validate dataset consistency after selecting input folders. Is there a Galaxy-native way to do this (e.g. a button, a validation script or a pre-run check)?
7. EDAM terms. I'm not 100% sure the EDAM operations I selected are correct. Some aren't even defined in EDAM, but are common DL tasks (like classification, image-to-image translation, etc.). Would it be better to refer to EDAM-bioimaging terms? Or just leave out undefined terms entirely?
8. Creating the YAML config programmatically. In the second usage mode, I need to generate the YAML file first, then pass it to the BiaPy command. Currently I use a Python script to generate it. Is that acceptable for the Galaxy tool execution flow (i.e. intermediate file creation within the command section)?
9. Testing output folders. Since Galaxy doesn't support a type="directory" output, I was advised (by ChatGPT) to compress the output folder into a .tar.gz file. Is this the correct way to handle directory outputs? I had two test scenarios in mind; one of them uses the provided configuration file (test_data/input_test_data/example.yaml) and a pretrained model (just to check basic output). I wanted to validate the test by asserting the presence of specific files in the tar archive (e.g. checkpoints, predictions). The full output structure is in test_data/my_experiment_example.tar.gz, and the test data is in test-data/input_test_data/, but I wasn't sure how best to structure this in the test block. In general, I have doubts on how to configure the output and the test cases, so maybe you can guide me a little bit.
Thanks a lot!
@bgruening @iuc-member