[DRAFT] Process wide worker heartbeat #962

Closed

Conversation

@yuandrew (Contributor) commented Jul 23, 2025

What was changed

Why?

Checklist

  1. Closes

  2. How was this tested:

  3. Any docs updates needed?

core/src/lib.rs Outdated
if runtime.heartbeat_worker.get().is_none() {
let process_key = Uuid::new_v4();
// TODO: set max_concurrent_nexus_polls to 1?
// let nexus_config = WorkerConfig {
Contributor Author

It seems like we want a separate config for this worker. Or do we want to group it with the existing WorkerConfig? But then what if different workers on the same namespace have two different configs?

Also, I'm not sure which config options need to be unique here. I'm thinking potentially max_concurrent_nexus_polls, and heartbeat_interval at least?

Member

So, yeah, the config for these workers should be entirely in our control, not derived from the other workers' configs.

For example, if you look at how we initialize replay workers, we set a lot of values to 1 or disable things; it's the same story here, really.

Off the top of my head we probably want:

  • Disable workflow and activity polling - this will happen by just not calling the poll APIs, but we can also explicitly set no_remote_activities: true.
  • Probably use poller autoscaling for the nexus poller with minimum/initial set to 1 and max set fairly low like 10.
  • Fixed size nexus slots. Honestly not sure what the value should be here, but we can start with something low to begin with.

Pretty much everything else can just be default or off or something, whatever makes sense.

The only option that needs to be derived from an existing worker here is the newly added heartbeat_interval. The problem is that workers sharing a heartbeat could specify conflicting values, so we can just always pick the smallest value among them. That should be noted in the docstring.
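As a rough sketch of what that shared worker's config could look like (illustrative types only; these are not the real WorkerConfig fields or builder API):

use std::time::Duration;

#[derive(Debug, Clone)]
enum PollerBehavior {
    /// Autoscaling poller: start at `initial`, stay between `minimum` and `maximum`.
    Autoscaling { minimum: usize, maximum: usize, initial: usize },
}

#[derive(Debug, Clone)]
struct SharedHeartbeatWorkerConfig {
    namespace: String,
    /// Workflow/activity polling is simply never started; this just makes it explicit.
    no_remote_activities: bool,
    /// Nexus is the only poller this worker runs.
    nexus_task_poller_behavior: PollerBehavior,
    /// Fixed-size nexus slot count; start low until we know more.
    max_outstanding_nexus_tasks: usize,
    /// Smallest heartbeat_interval among the workers sharing this heartbeat worker.
    heartbeat_interval: Duration,
}

fn shared_heartbeat_config(namespace: String, member_intervals: &[Duration]) -> SharedHeartbeatWorkerConfig {
    SharedHeartbeatWorkerConfig {
        namespace,
        no_remote_activities: true,
        nexus_task_poller_behavior: PollerBehavior::Autoscaling { minimum: 1, maximum: 10, initial: 1 },
        max_outstanding_nexus_tasks: 5,
        // Conflicting intervals across workers resolve to the smallest one.
        heartbeat_interval: member_intervals.iter().copied().min().unwrap_or(Duration::from_secs(60)),
    }
}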

}
_ = reset_notify.notified() => {
ticker.reset();
}
// TODO: handle nexus tasks
res = manager.next_nexus_task() => {
Contributor Author

Currently working with Yuri on this

core/src/lib.rs Outdated
@@ -218,6 +263,9 @@ pub struct CoreRuntime {
telemetry: TelemetryInstance,
runtime: Option<tokio::runtime::Runtime>,
runtime_handle: tokio::runtime::Handle,
heartbeat_worker: OnceLock<Worker>,
heartbeat_fn_map: Arc<Mutex<HeartbeatMap>>,
process_key: Uuid,
Member

Probably this name isn't so great any more now that the grouping isn't necessarily per-process.

Honestly we could just call it task_queue_key since ultimately that's what it is. We'll want to change it in the API too.

Comment on lines 501 to 502
// Process-wide nexus worker
let worker_heartbeat = if let Some(ref details) = heartbeat_details {
Member

The worker itself should no longer need a heartbeat manager at all. Like we talked about on the call, the only variant of WorkerHeartbeatDetails should be the callback (and then it'll just go away).

In fact, I don't think you need to pass in anything to the Worker at all. Workers can just expose a pub(crate) fn capture_heartbeat_details and when you are registering a new worker with the shared heartbeat manager, you just pass in a callback that calls that.
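Roughly this shape, as a sketch (SharedHeartbeatRegistry and the stand-in types here are illustrative, not the real code):

use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Stand-in for the WorkerHeartbeat proto message.
#[derive(Default, Clone)]
struct WorkerHeartbeat {}

/// Callback the shared heartbeat worker invokes to capture one worker's details.
type HeartbeatCallback = Arc<dyn Fn() -> WorkerHeartbeat + Send + Sync>;

#[derive(Default)]
struct SharedHeartbeatRegistry {
    // Keyed by worker instance key.
    callbacks: Mutex<HashMap<String, HeartbeatCallback>>,
}

impl SharedHeartbeatRegistry {
    fn register(&self, worker_key: String, cb: HeartbeatCallback) {
        self.callbacks.lock().unwrap().insert(worker_key, cb);
    }

    /// On each heartbeat tick, capture details from every registered worker.
    fn collect(&self) -> Vec<WorkerHeartbeat> {
        self.callbacks.lock().unwrap().values().map(|cb| cb()).collect()
    }
}

struct Worker {}

impl Worker {
    /// The capture fn described above; it builds details fresh on each call.
    fn capture_heartbeat_details(&self) -> WorkerHeartbeat {
        WorkerHeartbeat::default()
    }
}

fn register_worker(registry: &SharedHeartbeatRegistry, key: String, worker: Arc<Worker>) {
    // The only thing the registry needs from the Worker is this closure.
    registry.register(key, Arc::new(move || worker.capture_heartbeat_details()));
}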

}
}

pub(crate) struct WorkerHeartbeatManager {
Member

I think this goes away after the other changes? All we really need is the map on the runtime, or the shared worker on a client.

Comment on lines 37 to 38
#[builder(default = "Arc::new(AtomicUsize::new(0))")]
pub max_cached_workflows: Arc<AtomicUsize>,
Member

The Arc shouldn't be in the config itself - users (lang) shouldn't need to create an Arc or know about it.

Rather, just use the Arc'd atomic everywhere the config value was being referenced.

Contributor Author

AtomicUsize doesn't implement Clone, which WorkerConfig derives

Member

Sorry, I mean the config itself shouldn't change at all. It should stay a normal usize. Then you create the arc'd atomic based on it in init.
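In other words, a sketch (surrounding types made up for illustration):

use std::sync::Arc;
use std::sync::atomic::AtomicUsize;

// The lang-facing config stays plain data; no Arc or atomics exposed.
#[derive(Clone)]
struct WorkerConfig {
    max_cached_workflows: usize,
}

// Internal state created during worker init; this is what the rest of the code references.
struct WorkerState {
    max_cached_workflows: Arc<AtomicUsize>,
}

fn init(config: &WorkerConfig) -> WorkerState {
    WorkerState {
        // Wrap the plain value here so worker commands can mutate it later.
        max_cached_workflows: Arc::new(AtomicUsize::new(config.max_cached_workflows)),
    }
}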

Contributor Author

ahhh, yeah that makes sense. ty

)
}

pub(crate) fn mock_worker_with_heartbeat(mock: MockWorkerClient, config: WorkerConfig) -> Worker {
Contributor Author

Not used right now; I'll use it for the test I need to fix up. Feel free to ignore for now.

@yuandrew (Contributor Author) commented Aug 5, 2025

New changes in the latest revision:

  1. Instead of using heartbeat_callback, now passing WorkerHeartbeatData into SharedNamespaceWorker, both to let it collect heartbeat data and to give SharedNamespaceWorker access to the config fields the server tells it to change. (I'm thinking we can later use traits to limit WorkerHeartbeatData to just what it needs.)
  2. A new WorkerConfigInner that mirrors WorkerConfig but keeps atomic values for worker commands (right now only max_cached_workflows, but this gives us the ability to onboard other config settings later).
  3. For shutdown, when a worker is registered with SharedNamespaceWorker, it is given a callback that removes its own entry from that SharedNamespaceWorker's map, so it can clean itself out of the parent map. The same applies when SharedNamespaceWorker shuts down and removes itself from the CoreRuntime map (see the sketch below).
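For point 3, a minimal sketch of that removal-callback pattern (types trimmed down to just the map bookkeeping; names are illustrative):

use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type RemoveCallback = Box<dyn FnOnce() + Send>;

#[derive(Default)]
struct SharedNamespaceWorker {
    // Value type elided; the real map holds per-worker heartbeat data.
    workers: Arc<Mutex<HashMap<String, ()>>>,
}

impl SharedNamespaceWorker {
    /// Registering hands the worker back a callback that deletes its own entry,
    /// so the worker can clean itself out of the parent map on shutdown.
    fn register(&self, key: String) -> RemoveCallback {
        self.workers.lock().unwrap().insert(key.clone(), ());
        let map = Arc::clone(&self.workers);
        Box::new(move || {
            map.lock().unwrap().remove(&key);
        })
    }
}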

@yuandrew requested a review from Sushisource on August 5, 2025 23:07
Comment on lines +135 to +138
/// Mirrors `WorkerConfig`, but with atomic structs to allow Worker Commands to make config changes
/// from the server
#[derive(Clone)]
pub(crate) struct WorkerConfigInner {
Member

My first reaction was that I didn't like this at all, but I get why you did it. That said I think we can accomplish the same goal with less repetition.

I think: keep the "new/mutable" fields at the top like you've done here, and then have an original_config field which is an Arc<WorkerConfig> (Arc purely to make copying cheaper). Docstring can make it clear that the contents in there can't change (one rub might be the Tuners, if/when we make that remotely changeable).
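Roughly like this, as a sketch (the mutable field is from the diff; the rest is assumed):

use std::sync::Arc;
use std::sync::atomic::AtomicUsize;

#[derive(Clone)]
pub struct WorkerConfig {
    // ...the full, user-facing config, unchanged...
}

/// Server-mutable fields up front; everything else lives in the immutable original.
#[derive(Clone)]
pub(crate) struct WorkerConfigInner {
    /// Changeable via worker commands.
    pub(crate) max_cached_workflows: Arc<AtomicUsize>,
    /// The rest of the config, fixed after construction; Arc purely to make cloning cheap.
    pub(crate) original_config: Arc<WorkerConfig>,
}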

Comment on lines +83 to +84
features = ["history_builders", "serde_serialize"]
#features = ["history_builders"]
Member

This shouldn't have needed to change, I think.

Comment on lines +33 to +34
pub(crate) type HeartbeatCallback = Arc<dyn Fn() -> WorkerHeartbeat + Send + Sync>;
pub(crate) type WorkerDataMap = HashMap<String, Arc<Mutex<WorkerHeartbeatData>>>;
Member

Short docstrings would be good on these
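Something like this would do (wording is only a suggestion):

/// Callback a registered worker provides so the shared namespace worker can capture that
/// worker's current heartbeat details on each tick.
pub(crate) type HeartbeatCallback = Arc<dyn Fn() -> WorkerHeartbeat + Send + Sync>;

/// Per-namespace map from worker instance key to that worker's heartbeat data, shared with
/// the SharedNamespaceWorker that reports on its behalf.
pub(crate) type WorkerDataMap = HashMap<String, Arc<Mutex<WorkerHeartbeatData>>>;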

Comment on lines +43 to +45
/// SharedNamespaceWorker is responsible for polling worker commands and sending worker heartbeat
/// to the server. This communicates with all workers in the same process that share the same
/// namespace.
Member

Suggested change
/// SharedNamespaceWorker is responsible for polling worker commands and sending worker heartbeat
/// to the server. This communicates with all workers in the same process that share the same
/// namespace.
/// SharedNamespaceWorker is responsible for polling nexus-delivered worker commands and sending worker heartbeats
/// to the server. This communicates with all workers in the same process that share the same
/// namespace.

Comment on lines +281 to +283
// Worker commands
workflow_cache_size: Arc<AtomicUsize>,
workflow_poller_behavior: PollerBehavior,
Member

Why are these in the heartbeat data?

}
}

fn fetch_config(&self) -> fetch_worker_config_response::WorkerConfigEntry {
Member

I don't think this belongs on data. Semantically this is more like another callback.
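For instance, another callback alias registered next to HeartbeatCallback could carry it (shape assumed, not from the diff):

/// Assumed shape: passed in alongside the heartbeat callback rather than living on the
/// data struct itself.
pub(crate) type FetchConfigCallback =
    Arc<dyn Fn() -> fetch_worker_config_response::WorkerConfigEntry + Send + Sync>;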

@@ -121,8 +126,206 @@ pub struct Worker {
local_activities_complete: Arc<AtomicBool>,
/// Used to track all permits have been released
all_permits_tracker: tokio::sync::Mutex<AllPermitsTracker>,
/// Used to shutdown the worker heartbeat task
worker_heartbeat: Option<WorkerHeartbeatManager>,
worker_heartbeat_data: Option<Arc<Mutex<WorkerHeartbeatData>>>,
Member

Similar issue here - WorkerHeartbeatData is mixing a lot of concerns. Leave data as just data, and separate out the behavior. My comments on capture_heartbeat / fetch_config get at that.

Like I said in my comment on the last review round - we can just have get_worker_heartbeat_data(), which you do have, but we don't need to actually store the data; you can just construct a new one every time it's called.

}
}

// TODO: rename
// TODO: impl trait so entire struct doesn't need to be passed to Worker and SharedNamespaceWorker
Member

Shouldn't be necessary anyway, per my other comments.

@@ -404,13 +621,15 @@ impl Worker {
};

let np_metrics = metrics.with_new_attrs([nexus_poller()]);
// This starts the poller thread.
Member

Not really a thread - these are tasks. But it's kind of an uninteresting comment anyway.

Comment on lines +39 to +40
/// there will need to be some associated refactoring. // TODO: sounds like we'll need to do this
max_permits: Option<Arc<AtomicUsize>>,
Member

Potentially, yes. Let's leave proper implementations of the commands for later though, since this PR is already quite large, and they won't be working server side for a while anyway.

@yuandrew (Contributor Author)

Closing this PR in favor of separating this out into a process-wide heartbeat PR and a worker-commands PR that will come in the future.

@yuandrew closed this on Aug 18, 2025