Runtime-wide worker heartbeat #983

yuandrew · 2025-08-18T22:34:48Z

NOTE: this targets a worker-heartbeat feature branch to merge into, that way I can make incremental progress and not hit main until the whole feature is ready for both server and SDK.

What was changed

Breaking change: Moved heartbeat_interval from a per-worker config setting into a per-runtime config setting.
WorkerHeartbeatData is now gone, HeartbeatFn callback is enough to grab heartbeat data whenever needed.
A new SharedNamespaceWorker is spawned with the first worker for each namespace, and will handle heartbeating for all workers in the same namespace (on the same process) via nexus.

Why?

Worker heartbeat. I separated WorkerHeartbeat out of #962. Some of the design for this is with Worker Commands in mind, but that will come at a later time.

Checklist

Closes
How was this tested:

fixed up worker_heartbeat unit test

Any docs updates needed?

cursor · 2025-08-18T22:37:35Z

core/src/worker/heartbeat.rs

+                            last_heartbeat_time_map.insert(instance_key.clone(), now);
+                        }
+                        if let Err(e) = client_clone.record_worker_heartbeat(client_identity.namespace.clone(), client_identity.endpoint.clone(), hb_to_send
+                            ).await {


Bug: Heartbeat Identity Error

The record_worker_heartbeat call uses client_identity.endpoint for the identity parameter. This is incorrect; the method expects the worker's identity string, not the endpoint URL, leading to heartbeats being sent with an inaccurate client identity.

cursor · 2025-08-18T22:37:35Z

core/src/worker/heartbeat.rs

+                    // remove this worker from the Runtime map
+                    shutdown_callback();
+                    return;
+                }


Bug: Worker Premature Shutdown Due to Heartbeat Race Condition

The SharedNamespaceWorker may prematurely shut down. A race condition exists where the heartbeat loop checks for an empty heartbeat_map and exits, potentially before callbacks are registered or during concurrent registration.

Nice, this caught a race condition, although it might have just seen my TODO comment right above this mentioning the race condition.

I was thinking I could use a channel to signal once the first heartbeatFn got added, is there a cleaner/rustier way to do this?

cretz

PR titled "process-wide" but actually it's "runtime-wide" it seems. But IMO it should be at the client level, where you have the connection to actually make the invocation and is the thing workers share to communicate to server.

cretz · 2025-08-19T12:31:57Z

core-c-bridge/src/runtime.rs

@@ -31,6 +32,7 @@ use url::Url;
 #[repr(C)]
 pub struct RuntimeOptions {
    pub telemetry: *const TelemetryOptions,
+    pub heartbeat_duration_millis: u64,


Suggested change

pub heartbeat_duration_millis: u64,

pub worker_heartbeat_duration_millis: u64,

Pedantic, but we have multiple forms of heartbeat, so need to quality if a bit (same for the CoreRuntime option IMO)

cretz · 2025-08-19T12:32:42Z

core-c-bridge/src/runtime.rs

@@ -238,9 +240,13 @@ impl Runtime {
        } else {
            CoreTelemetryOptions::default()
        };
+        let core_runtime_options = CoreRuntimeOptions::new(
+            telemetry_options,
+            Some(Duration::from_millis(options.heartbeat_duration_millis)),


Shouldn't 0 turn this off?

cretz · 2025-08-19T12:34:46Z

core/src/lib.rs

+        let heartbeat_callback = worker
+            .get_heartbeat_callback()
+            .expect("Worker heartbeat data should exist for non-shared namespace worker");
+        let remove_worker_callback = runtime.register_heartbeat_callback(


Why not put the heartbeat stuff like registration and interval duration and such on the client instead of on runtime? What benefit are you getting by doing it at the runtime level? At least at the client level, you have the exact client to use. While it doesn't always seem like there's a use case for multiple clients to the same endpoint/namespace, there may be, especially as we get more fine-grained permissions (a client that can support a worker is not the same as a client that may be used to start a workflow, they may have different API keys on purpose).

a client that can support a worker is not the same as a client that may be used to start a workflow, they may have different API keys on purpose

Can you explain this more? Are you saying it's possible for a client to not have permissions to heartbeat to the server?

To me it feels like there isn't a need for client-level granularity of heartbeat settings. And even if there is a scenario where there's multiple clients on the same endpoint/namespace, wouldn't they be configured on different runtimes/processes? To me it feels easier to think, for this process I'm creating, i have this heartbeat duration

Can you explain this more? Are you saying it's possible for a client to not have permissions to heartbeat to the server?

I am saying you cannot know how a user configured their client for a worker, so you should not blindly assume that they will only have one client per endpoint/namespace. You should use the client that the worker is configured with to make the heartbeat, there is no reason that I can see to make this runtime level.

To me it feels like there isn't a need for client-level granularity of heartbeat settings

It's not just about granularity of settings, it's about clarity of lifetimes and associating workers to clients not to runtimes. If you want the worker heartbeat settings at a runtime level, ok (seems no good reason to though). But that is unrelated to where this shared heartbeater is, which should be the client. Is there a good reason to use runtime instead of client here for the shared namespace/heartbeating?

I can't see a reason to do anything with CoreRuntime at all for this feature that is very client and workers-on-client specific. But if we have to put these settings at the runtime level even though it's really client level, we can, though would rather not, but still the collection of workers to heartbeat can still be on the client.

cretz · 2025-08-19T12:42:49Z

core/src/lib.rs

+        if let Some(ref heartbeat_interval) = self.heartbeat_interval {
+            let mut shared_namespace_map = self.shared_namespace_map.lock();
+            let worker = shared_namespace_map
+                .entry(client_identity.clone())


We should not use this "client identity" mechanism IMO. Instead, the state of registered workers and the heartbeater and such should just live on the client. Is there a good reason why not and a good reason why to have a map at the runtime level and potentially not use the client the user configured the worker with?

How does this potentially use a client that the user configured the worker with? We use the same client as is used to start the worker

We use the same client as is used to start the worker

Not true, you only use the first one that matches this "identity". And if I have two workers on separate clients that happen to share the same endpoint and namespace, but maybe I've given a different set of headers to one in a certain way for my proxy? The worker heartbeat should only ever be called on the literal client the worker was registered with. That's what should hold the group of workers.

Regardless of how rare it may be to have multiple clients on the runtime to the same endpoint/namespace, is there a good reason why not to make the worker heartbeat registration/collection stuff on client and leave runtime alone? It seems like such a better lifetime management approach to put the workers related to a client on a client. Is there any good reason to do this at the runtime level?

Sushisource

As for the client/runtime thing, I think this is pretty preferential and an internal detail that won't matter much to users.

The thing I think that does matter is clients don't "own" workers (and in fact can't, due to circular deps). They are used when creating a worker, but the relationship is more like a worker "has" a client, and more than one worker may have the same client.

That's why I don't really want the heartbeating duration to be a client property. Right now our client options in the lang layer don't directly reference anything worker specific (they do indirectly with interceptors/plugins), and I think it should really stay that way. Specifying a heartbeating duration for a client that is never used with a worker, for example, is weird.

So, that means either this option stays on a Runtime (seems perfectly fine to me) or needs to be passed in when initting the worker, but then there's some last-write-wins problem if users use the same client with different values when passing to workers. So, more reason to put it on the Runtime.

As for literally where this map lives or not - I don't have a huge preference either way. Changing the trait that was used to get slot suppliers for eager to be more generic & used for this I can see. I'm also fine with the current setup.

Sushisource · 2025-08-20T16:54:48Z

core/src/worker/heartbeat.rs

+    /// Adds `WorkerHeartbeatData` to the `SharedNamespaceWorker` and adds a `HeartbeatCallback` to
+    /// the client that Worker


Docstrings can use [TypeName] to make links, rather than it being in single quotes

Also this should describe what the return type is

Sushisource · 2025-08-20T16:55:56Z

core/src/worker/heartbeat.rs

-        mock.expect_record_worker_heartbeat()
-            .times(2)
-            .returning(move |heartbeat| {
+        let heartbeat_count = Arc::new(Mutex::new(0));


Test so it doesn't matter much, but, an atomic usize would be easier than a mutex

Sushisource · 2025-08-20T17:01:13Z

core/src/worker/mod.rs

+    /// Instance key used to identify this worker in worker heartbeating
+    worker_instance_key: String,
+    /// Collects data to send to server via worker heartbeat
+    worker_heartbeat_callback: Option<HeartbeatFn>,
+    /// Used to remove this worker from the parent map used to track this worker for
+    /// worker heartbeat
+    worker_heartbeat_shutdown_callback: OnceCell<Arc<dyn Fn() + Send + Sync>>,


Probably would be nice to wrap these up in a little struct (can just be private to this file)

Sushisource · 2025-08-20T17:02:45Z

core/src/worker/mod.rs

-    worker_heartbeat: Option<WorkerHeartbeatManager>,
+    /// Instance key used to identify this worker in worker heartbeating
+    worker_instance_key: String,
+    /// Collects data to send to server via worker heartbeat


Should say when this isn't set

Sushisource · 2025-08-20T17:06:59Z

core/src/lib.rs

+        let worker_instance_key = worker.worker_instance_key();
+        let heartbeat_callback = worker
+            .get_heartbeat_callback()
+            .expect("Worker heartbeat data should exist for non-shared namespace worker");


I'd probably make this a dbg panic that just skips registration if it's not set.

Sushisource · 2025-08-20T17:08:17Z

core/src/lib.rs

+}
+
+impl RuntimeOptions {
+    /// Creates new runtime options.


Since this is non_exhaustive it should probably use a builder pattern rather than a new

cretz · 2025-08-20T17:56:20Z

The thing I think that does matter is clients don't "own" workers (and in fact can't, due to circular deps)

Right, they are associated with them. That's the point. Same for eager workflow start. A worker is associated with a single client, not some other client in the process/runtime.

@Sushisource - to confirm, you don't believe that we need to make sure a worker's heartbeat should occur on the worker's client? I think that is much simpler to understand for those that want to understand what calls use what clients. This is important because users control lifetimes of clients and handle explicit client replacement to update auth and are the ones that control which clients are for which workers.

I still can't really see a reason not use a worker's client to make calls relating to that worker (or set of workers that share the client).

cretz · 2025-08-20T18:02:41Z

core/src/worker/mod.rs

@@ -295,15 +302,18 @@ impl Worker {
        // Unregister worker from current client, register in new client at the end
        let mut worker_key = self.worker_key.lock();
        let slot_provider = (*worker_key).and_then(|k| self.client.workers().unregister(k));
-        self.client
-            .replace_client(super::init_worker_client(&self.config, new_client));
+        self.client.replace_client(super::init_worker_client(


Can you confirm if I replace a worker's client, it is this new client that will be used for its successive heartbeats? For most use cases of this call, the older client's auth/mTLS is invalid or will be at some point, and if you keep using it, you'll start hitting permission denials and failures. Also, you should not assume if someone replaces a single worker's client, that it can be the replacement for all worker's clients that happen to share the same endpoint and namespace.

This is another reason that workers should use their clients to make heartbeat calls.

Sushisource · 2025-08-20T23:28:31Z

I still can't really see a reason not use a worker's client to make calls relating to that worker (or set of workers that share the client).

Sure, I think that's fair (though the actual scenario in the current design where that wouldn't happen seems to be exceptionally rare). I maintain the config option shouldn't live in the client options, though.

worker heartbeat

7790c46

yuandrew requested a review from a team as a code owner August 18, 2025 22:34

cursor bot reviewed Aug 18, 2025

View reviewed changes

cretz reviewed Aug 19, 2025

View reviewed changes

yuandrew changed the title ~~Process-wide worker heartbeat~~ Runtime-wide worker heartbeat Aug 19, 2025

Sushisource reviewed Aug 20, 2025

View reviewed changes

cretz reviewed Aug 20, 2025

View reviewed changes

	pub heartbeat_duration_millis: u64,
	pub worker_heartbeat_duration_millis: u64,

		/// Adds `WorkerHeartbeatData` to the `SharedNamespaceWorker` and adds a `HeartbeatCallback` to
		/// the client that Worker

Runtime-wide worker heartbeat #983

Are you sure you want to change the base?

Runtime-wide worker heartbeat #983

Uh oh!

Conversation

yuandrew commented Aug 18, 2025

What was changed

Why?

Checklist

Uh oh!

cursor bot Aug 18, 2025

Choose a reason for hiding this comment

Bug: Heartbeat Identity Error

Uh oh!

cursor bot Aug 18, 2025

Choose a reason for hiding this comment

Bug: Worker Premature Shutdown Due to Heartbeat Race Condition

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cretz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cretz Aug 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cretz Aug 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cretz Aug 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Sushisource left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cretz commented Aug 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cretz Aug 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Sushisource commented Aug 20, 2025

Uh oh!

Uh oh!

cretz Aug 19, 2025 •

edited

Loading

cretz Aug 19, 2025 •

edited

Loading

cretz Aug 19, 2025 •

edited

Loading

cretz commented Aug 20, 2025 •

edited

Loading

cretz Aug 20, 2025 •

edited

Loading