
Conversation

Idriss-Malek
Contributor

This pull request adds support for custom rewards in the GFlowNet implementations, enabling, for example, the use of intrinsic rewards (https://openreview.net/forum?id=HH4KWP8RP5).

Changes

  • Added a log_rewards argument to GFlowNet implementations.

    • By default, it is set to None, in which case the environment's reward is used. This preserves compatibility with existing code.
    • When provided, log_rewards is expected to be a tensor of the same shape as the container (e.g., (n_trajectories,), (n_transitions,), etc.), and it is used instead of the environment reward (see the usage sketch after this list).
  • In the DB and SubTB losses, I did not modify the forward-looking setup, except to rename the argument to fl_log_rewards to avoid naming conflicts.

    • I am not sure what this forward-looking setup does exactly (would appreciate any pointers to the original reference).
    • For now, I recommend caution when combining custom rewards with the forward-looking setup.
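
Here is a rough sketch of the intended usage. The names env, gfn, sampler, and intrinsic_log_rewards are placeholders, and the exact sampler/loss signatures may differ from the real torchgfn API; only the log_rewards keyword is the feature added in this PR.

```python
import torch

# Placeholder setup: `env`, `gfn`, and `sampler` stand in for a torchgfn
# environment, GFlowNet loss module, and trajectory sampler.
trajectories = sampler.sample_trajectories(env, n=16)

# Default behaviour (log_rewards=None): the environment's reward is used.
loss_default = gfn.loss(env, trajectories)

# Custom rewards: one log-reward per element of the container,
# here a tensor of shape (n_trajectories,).
custom_log_rewards = torch.logaddexp(
    trajectories.log_rewards,  # extrinsic term from the env
    intrinsic_log_rewards,     # placeholder: intrinsic term, same shape
)
loss_custom = gfn.loss(env, trajectories, log_rewards=custom_log_rewards)
```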

Idriss-Malek changed the title from "Added custom log_reward option to GFlowNets." to "Quality of life: Added custom log_reward option to GFlowNets." on May 19, 2025
@hyeok9855
Collaborator

Hi @Idriss-Malek, thank you for submitting this PR!

Your current implementation adds arguments for the intrinsic rewards, which is clear, but I wonder if this is the best approach. What about creating a wrapper around the env that adds the intrinsic reward to the original reward?

@Idriss-Malek
Contributor Author

Hi @hyeok9855,
Currently, GFNs use trajectories.log_reward to compute the loss. To use an intrinsic reward, I see three possible solutions:

  • Since, during sampling, the reward assigned to a trajectory comes from the environment, changing the env's reward is one possible solution. However, the loss is still computed with trajectories.log_reward, which equals env.log_rewards(trajectories.terminal_state), so if we use 10 GFlowNets with 10 different intrinsic rewards, we end up with 10 custom envs.
  • Directly changing the reward assigned to the trajectory, but this feels awkward.
  • Directly specifying which reward to use at the moment we compute the loss (what I did).

Correct me if I misunderstood your message!

@hyeok9855
Collaborator

If we use 10 GFlowNets with 10 different intrinsic rewards, we will have 10 custom envs.

If we implement this as a wrapper that takes the original environment as an argument, then I think this would simply be 10 wrappers with a single custom environment?
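
Something like the following sketch, where the log_reward method name and the delegation pattern are assumptions about the env interface rather than the actual torchgfn API:

```python
import torch

class IntrinsicRewardWrapper:
    """Sketch of an env wrapper that adds an intrinsic term to the reward.

    `log_reward` and the attribute delegation are assumed interfaces,
    not the real torchgfn Env API.
    """

    def __init__(self, env, intrinsic_log_reward_fn):
        self.env = env
        # Maps final states to a tensor of intrinsic log-rewards.
        self.intrinsic_log_reward_fn = intrinsic_log_reward_fn

    def __getattr__(self, name):
        # Delegate everything else to the wrapped environment.
        return getattr(self.env, name)

    def log_reward(self, final_states):
        extrinsic = self.env.log_reward(final_states)
        intrinsic = self.intrinsic_log_reward_fn(final_states)
        # Sum the rewards in linear space: log(R_ext + R_int).
        return torch.logaddexp(extrinsic, intrinsic)


# Ten intrinsic rewards then become ten thin wrappers around one env:
# wrapped_envs = [IntrinsicRewardWrapper(env, fn) for fn in intrinsic_fns]
```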

Maybe we should gather more opinions to make a decision on this. @josephdviviano, @younik any thoughts?

@younik
Collaborator

younik commented May 28, 2025

I also favor the wrapper approach, assuming that the intrinsic reward depends only on the state of the environment, and not on the state of the net.

@Idriss-Malek
Contributor Author

The intrinsic reward doesn't always depend only on the env. In https://openreview.net/forum?id=HH4KWP8RP5, for example, the intrinsic reward used to train a secondary GFN depends on the primary GFN.

@younik
Collaborator

younik commented May 28, 2025

If this is fixed during the second training, i.e., you train one GFlowNet first and then create a wrapper based on it to train a second GFlowNet, the wrapper approach is still fine.

However, it is hard to decide the best implementation without a concrete use case, so maybe we should create an example/tutorial for a common use case?
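
As a rough sketch of that two-stage flow, assuming a wrapper like the one sketched above (all names below are placeholders, not real torchgfn functions):

```python
# Hypothetical two-stage flow: `train(gfn, env)` is some training loop, and
# `intrinsic_from(primary_gfn)` builds a function mapping final states to
# intrinsic log-rewards computed from the (now fixed) primary model.

primary_gfn = train(primary_gfn, env)          # stage 1: ordinary training

for p in primary_gfn.parameters():             # freeze the primary model
    p.requires_grad_(False)

wrapped_env = IntrinsicRewardWrapper(env, intrinsic_from(primary_gfn))
secondary_gfn = train(secondary_gfn, wrapped_env)  # stage 2: wrapped reward
```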

@josephdviviano
Collaborator

Forward looking GFN reference: https://arxiv.org/pdf/2302.01687

@josephdviviano
Collaborator

From Eqn. 2 in your paper, it appears that the intrinsic and extrinsic rewards are simply two independent terms that are summed.

Is it possible to always compute this intrinsic reward independently of the environment? I would hope so, because torchgfn environments are stateless. Ideally, the policy would take in the current state of the environment and compute the intrinsic reward from that. So trajectories.log_rewards would simply accept as arguments both the policy (optionally, for intrinsic reward computation) and the environment (for extrinsic reward computation).
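
Something like this sketch, where env.log_reward and policy.intrinsic_log_reward are assumed interfaces rather than the real torchgfn API:

```python
import torch

def combined_log_reward(env, policy, final_states):
    # Extrinsic term from the (stateless) environment.
    log_r_ext = env.log_reward(final_states)
    # Optional intrinsic term computed by the policy from the same states.
    log_r_int = policy.intrinsic_log_reward(final_states)
    # The two rewards are summed in linear space, so combine in log space:
    # log(R_ext + R_int) = logaddexp(log R_ext, log R_int).
    return torch.logaddexp(log_r_ext, log_r_int)
```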

This completely avoids the Cartesian product of intrinsic × extrinsic reward types.

LMK if I misunderstand something.

@hyeok9855
Collaborator

However, it is hard to decide the best implementation without a concrete use case, so maybe we should create an example/tutorial for a common use case?

I agree. @Idriss-Malek, could you share an example that requires intrinsic reward if you have one?

@saleml
Collaborator

saleml commented Jun 27, 2025

Thanks @Idriss-Malek for the PR. I like the added flexibility. Before merging, it would be better to have a small example, as a new script or as an added argument to the existing scripts, that showcases the utility of this new feature. Can you think of some minimal example?

@josephdviviano
Collaborator

@saleml What's the status of this PR? Can we help in any way?
