feat: Add checkpoint forking functionality #253

Merged
merged 6 commits into from
Jul 17, 2025

@corbt corbt commented Jul 17, 2025

Summary

This PR adds the ability to fork checkpoints from existing models, which is useful when training goes off the rails and you want to restart from a previous checkpoint with different parameters.

Changes

  • Added _experimental_fork_checkpoint method to Backend and LocalBackend classes
  • Changed S3 pulling API from latest_only boolean to only_step parameter that accepts either an int or "latest"
  • Added debugging output to help diagnose S3 sync issues

Example Usage

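# In project_types.py, configure fork settings
from pydantic import BaseModel  # assumed: pydantic, given the model_copy(deep=True) call further down

class ProjectPolicyConfig(BaseModel):
    fork_from_model: str | None = None
    fork_from_project: str | None = None
    fork_not_after_step: int | None = None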

# In all_experiments.py, create a forked model
models["229"] = models["224"].model_copy(deep=True)
models["229"].name = "email-agent-229"
models["229"].config.fork_from_model = "email-agent-224"
models["229"].config.fork_not_after_step = 1381

# In train.py, fork will happen automatically if configured
import os  # needed for the os.environ lookup

if model.config.fork_from_model:
    await backend._experimental_fork_checkpoint(
        model,
        from_model=model.config.fork_from_model,
        from_s3_bucket=os.environ["BACKUP_BUCKET"],
        not_after_step=model.config.fork_not_after_step,
        verbose=True,
    )
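
To illustrate what `not_after_step` means, here is a minimal sketch of the selection rule, based on the `<=` comparison mentioned in the commit notes. The helper name `pick_fork_step` and the list of available steps are hypothetical and not part of the PR. Treating `None` as "use the newest checkpoint" is also an assumption.

```python
from typing import Iterable, Optional


def pick_fork_step(
    available_steps: Iterable[int],
    not_after_step: Optional[int] = None,
) -> int:
    """Return the newest checkpoint step that is <= not_after_step.

    Hypothetical helper for illustration only. If not_after_step is None,
    the newest available step is returned (an assumption, not confirmed
    by the PR).
    """
    candidates = [
        step
        for step in available_steps
        if not_after_step is None or step <= not_after_step
    ]
    if not candidates:
        raise ValueError("no checkpoint at or before the requested step")
    return max(candidates)


# The boundary step itself is eligible because the comparison is inclusive.
chosen = pick_fork_step([100, 1381, 1400], not_after_step=1381)
```

With an inclusive bound, `fork_not_after_step = 1381` can select step 1381 itself when that checkpoint exists, and otherwise falls back to the closest earlier one.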

Key API Changes

The only_step parameter on the S3 pulling API accepts:

  • only_step=None - Pull all checkpoints (default)
  • only_step="latest" - Pull only the latest checkpoint
  • only_step=1234 - Pull only checkpoint 1234
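
The three `only_step` modes above can be sketched as a small filter over the checkpoint steps found in a bucket. This is an illustrative sketch only: `steps_to_pull` and the `OnlyStep` alias are hypothetical names, and the real S3 pulling code in the PR may behave differently (for example, on a missing step).

```python
from typing import List, Literal, Optional, Union

# Hypothetical type alias mirroring the documented values of only_step.
OnlyStep = Optional[Union[int, Literal["latest"]]]


def steps_to_pull(available_steps: List[int], only_step: OnlyStep = None) -> List[int]:
    """Decide which checkpoint steps to download (illustrative only)."""
    if only_step is None:
        # Default: pull every checkpoint.
        return sorted(available_steps)
    if only_step == "latest":
        # Pull only the newest checkpoint, if any exist.
        return [max(available_steps)] if available_steps else []
    # A specific step: pull it only if it is actually present.
    return [only_step] if only_step in available_steps else []


all_steps = steps_to_pull([10, 1234, 20])
latest_only = steps_to_pull([10, 1234, 20], only_step="latest")
single = steps_to_pull([10, 1234, 20], only_step=1234)
```

Folding the old `latest_only` boolean into a single parameter lets one argument express all three cases without combining flags.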

corbt commented Jul 17, 2025

The good news is that forking seems to work. (The bad news is that my forked run failed in almost the same place, but you can't win 'em all).

[Image: Screenshot 2025-07-16 at 5 48 22 PM]

@corbt corbt force-pushed the fork-checkpoints branch 2 times, most recently from acef5c8 to 2d34eba, July 17, 2025 02:02
corbt and others added 3 commits July 16, 2025 20:33
This PR adds the ability to fork checkpoints from existing models, which is useful
when training goes off the rails and you want to restart from a previous checkpoint
with different parameters.

Key changes:
- Added `_experimental_fork_checkpoint` method to fork from existing models
- Changed S3 pulling API from `latest_only` to `only_step` parameter for cleaner interface
- Changed `before_step` to `not_after_step` with <= comparison for more intuitive behavior
- Updated to only support new checkpoint structure (checkpoints/ subdirectory)
- Added debugging output to help diagnose S3 sync issues

Example usage:
```python
await backend._experimental_fork_checkpoint(
    model,
    from_model="email-agent-224",
    from_s3_bucket=os.environ["BACKUP_BUCKET"],
    not_after_step=1381,
    verbose=True,
)
```

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Created new Features section in documentation
- Added comprehensive checkpoint forking guide
- Added model 230 that forks from model 206 at step 90

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Moved additional-histories.mdx from Fundamentals to Features section
- Updated sidebar navigation to show both features

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@corbt corbt force-pushed the fork-checkpoints branch from 2d34eba to 15f6db4, July 17, 2025 03:33
corbt and others added 2 commits July 16, 2025 21:06
- Add forked-run.webp image showing run recovery example
- Move experimental note below introduction for better flow
- Update section title for clarity

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@corbt corbt marked this pull request as ready for review July 17, 2025 13:10
corbt commented Jul 17, 2025

@bradhilton I think this is ready for review!

@corbt corbt requested a review from bradhilton July 17, 2025 13:15
- Remove unused Literal import from project_types.py
- Auto-formatted by ruff

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@corbt corbt merged commit 5ad7c6e into main Jul 17, 2025
2 checks passed
@bradhilton bradhilton deleted the fork-checkpoints branch July 17, 2025 16:52

2 participants