@vsoch vsoch commented Aug 1, 2025

Ping @milroy: here is what still needs to be done or thought about:

  • certificate provisioning (this should be easy to run with certbot)
  • share AMI (hub and worker) with HPCIC account
  • check quota on accounts for t4g.2xlarge
  • tutorial content (most important)
  • remove sudo privileges on startup (no)
  • possibly also remove the ec2_launcher.py
  • test the startup consistency and timing
  • we might want to consider better volume type if it could speed up pulling (done)
  • not sure we have a use case for EFA - the problem is that t4g does not support it. We could upgrade to hpc7g to get it, but that is more expensive.
  • exposed ports will always need to be defined in the docker-compose.yaml - we will need a way to customize that in the future.
  • Bring up the instance a little before the end?

Updates:

  • The entire setup has been moved to RADIUSS. I had to rebuild everything because the AMIs were encrypted, which does not allow moving them.
  • We are just targeting 1 MuMMI component (Createsims) but include other manifests for the interested user
  • We are now using hpc7g.16xlarge with g3. The pull time for createsims is a little over 2 minutes (acceptable)
  • We aren't using EFA - it wouldn't make sense for one node, and I don't want to spend more time working on it.
  • Culling is re-enabled. It's set to 1 hour, which seems right to me.
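
For reference, a one-hour cull could be expressed roughly as below. This is a sketch under a loud assumption: that culling here is handled by the `jupyterhub-idle-culler` package running as a JupyterHub service, which this thread does not confirm. The helper name `idle_culler_service` is hypothetical and exists only for illustration.

```python
import sys


def idle_culler_service(timeout_seconds=3600):
    """Build a JupyterHub service entry that runs jupyterhub-idle-culler.

    Assumption: the deployment uses the jupyterhub-idle-culler package.
    The default of 3600 seconds matches the 1 hour mentioned above.
    """
    return {
        "name": "jupyterhub-idle-culler-service",
        "command": [
            sys.executable,
            "-m",
            "jupyterhub_idle_culler",
            f"--timeout={timeout_seconds}",
        ],
    }


# In jupyterhub_config.py, where JupyterHub provides the `c` object:
# c.JupyterHub.services = [idle_culler_service(3600)]
```

Recent JupyterHub versions also expect a role that grants the service the scopes it needs to list and stop servers; that part is omitted here.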

Updated Todo Items

  • We are ready for testing Friday - we will need to bring up this setup, and enable SSL like before. nginx and certbot are already installed.
  • We should also run it with nohup and ensure that it keeps running (decided to not do)
  • We need to remove the logic that deletes instances on stopping the server (decided to not do)

@vsoch vsoch force-pushed the add-hpcic-2025 branch 2 times, most recently from f71001b to f148c5f Compare August 4, 2025 03:44
vsoch added 4 commits August 8, 2025 19:21
I have also gone through all the notebooks for chapters
1,2,3 and assumed we have a side by side terminal and
notebook.

Signed-off-by: vsoch <[email protected]>
I am also adding a JournalConsumer example.

Signed-off-by: vsoch <[email protected]>
The issue with the workspace starting in ch4 is
because there was a previously saved workspace
in /home/ubuntu/.jupyter/workspaces on the VM.
We need to clean this up on startup.

Signed-off-by: vsoch <[email protected]>

vsoch commented Aug 10, 2025

The issue with starting in ch4 is fixed:

[screenshot: JupyterLab opening at the launcher]

We need to rm -rf /home/ubuntu/.jupyter/workspaces and then recreate the directory and change its permissions. Jupyter was detecting a workspace there (a single one) for user-space Kubernetes and assuming the user wanted to open it. When we remove it (just tested) we no longer have the issue - it starts with the launcher, as shown above.
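
A minimal sketch of that cleanup step, in case it ends up in a Python startup script. The function name `reset_workspaces` and the `0o755` mode are assumptions for illustration, not what the PR actually does.

```python
import os
import shutil
from pathlib import Path


def reset_workspaces(workspaces_dir, mode=0o755):
    """Delete any saved JupyterLab workspaces and recreate an empty directory.

    `mode` is an assumed permission setting; adjust it to the deployment.
    """
    path = Path(workspaces_dir)
    # ignore_errors=True so a missing directory on first boot is not a failure
    shutil.rmtree(path, ignore_errors=True)
    path.mkdir(parents=True, exist_ok=True)
    # mkdir's mode is filtered by the umask, so set the permissions explicitly
    os.chmod(path, mode)
    return path
```

On the VM this would be called as `reset_workspaces("/home/ubuntu/.jupyter/workspaces")`. If the startup script runs as root, the recreated directory will be owned by root, so an ownership change (for example with `shutil.chown`) would likely be needed as well.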

@vsoch vsoch merged commit 2237c30 into master Aug 12, 2025