Feature: Support spinning a single metaflow step (Rebased) #2506


Open · talsperre wants to merge 4 commits into master

Conversation

talsperre (Collaborator):

To test spin on a new flow you can do the following:

Simple case:

python <flow_name.py> --environment=conda spin <step_name>

Pass in specific pathspec:

python runtime_dag_flow.py --environment=conda spin --spin-pathspec RuntimeDAGFlow/13/step_c/275232971

Pass in custom artifacts via module (a sketch of such a module appears at the end of this description):

python runtime_dag_flow.py spin --spin-pathspec RuntimeDAGFlow/13/step_d/275233082 --artifacts-module ./my_artifacts.py

Skip decorators (including the whitelisted ones):

python complex_dag_flow.py --environment=conda spin step_d --skip-decorators

Use with Runner API:

    with Runner('complex_dag_flow.py', environment="conda").spin(
        step_name,
        spin_pathspec="<Some Val>",
        artifacts_module='./artifacts/complex_dag_step_d.py',
    ) as spin:
        print("-" * 50)
        print(f"Running test for step: step_a")
        spin_task = spin.task
        print(f"my_output: {spin_task['my_output']}")
        assert spin_task['my_output'].data == [-1]

See the tests for more examples of how to use this command.
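
For reference, a hypothetical my_artifacts.py for the --artifacts-module example above. The exact contract that read_artifacts_module expects is not shown in this PR, so the module-level-names convention below is an assumption:

    # my_artifacts.py -- hypothetical artifacts module. Assumed convention:
    # artifacts are exposed as module-level names that override the
    # corresponding artifacts of the original task.

    my_output = [-1]          # overrides the 'my_output' artifact
    some_parameter = "debug"  # any other artifact the spun step should see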

@@ -250,6 +254,70 @@ def init_task(self):
"""
self.save_metadata({self.METADATA_ATTEMPT_SUFFIX: {"time": time.time()}})

@only_if_not_done
@require_mode("w")
def transfer_artifacts(self, other_datastore, names=None):

Collaborator:
minor nit: can we add types here for the args, i.e. other_datastore and names?
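
For illustration, a rough sketch of an annotated signature; the concrete types are assumptions inferred from the surrounding datastore code, not taken from this diff:

    from typing import Iterable, Optional

    class TaskDataStore:  # stand-in for the real class, illustration only
        def transfer_artifacts(
            self,
            other_datastore: "TaskDataStore",  # assumed: another TaskDataStore
            names: Optional[Iterable[str]] = None,  # None means "all artifacts"
        ) -> None:
            """Copy the named artifacts from other_datastore into this one."""
            ...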

@@ -65,6 +65,9 @@ def save_blobs(self, blob_iter, raw=False, len_hint=0):
Whether to save the bytes directly or process them, by default False

Collaborator:
can we add "default False" for raw instead of "optional"?
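
i.e. something along these lines in the docstring (numpydoc style, assuming that is the convention used in the rest of the file):

    raw : bool, default False
        Whether to save the bytes directly or process them.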

@@ -265,3 +374,11 @@ def load_data(self, keys, force_raw=False):
"""
for key, blob in self.ca_store.load_blobs(keys, force_raw=force_raw):
yield key, blob


class MetadataCache(object):

Collaborator:
can this be made an abstract class instead, with the methods raising NotImplementedError?
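
A minimal sketch of the abstract-base-class variant; the method names below are placeholders, since MetadataCache's actual methods are not visible in this hunk:

    from abc import ABC, abstractmethod

    class MetadataCache(ABC):
        @abstractmethod
        def load(self, key):
            ...

        @abstractmethod
        def store(self, key, value):
            ...

One benefit over raising NotImplementedError: an incomplete subclass fails at instantiation time rather than at first call.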

@@ -178,6 +179,45 @@ def resolve_identity():
return "%s:%s" % (identity_type, identity_value)


def get_latest_task_pathspec(flow_name: str, step_name: str) -> (str, str):

Collaborator:
I guess this returns a Task instance, so the function can be renamed since it doesn't just return the pathspec
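
e.g. roughly (the new name is only a suggestion):

    from metaflow import Task

    def get_latest_task(flow_name: str, step_name: str) -> Task:
        """Return the latest Task for the step, not just its pathspec."""
        ...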

import importlib.util

try:
spec = importlib.util.spec_from_file_location("artifacts_module", file_path)

Collaborator:
"artifacts_module" can be replaced with actual name of the file such as

os.path.splitext(os.path.basename(file_path))[0]
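
Putting the suggestion together, a sketch of the loader using standard importlib machinery (the helper name is made up):

    import importlib.util
    import os

    def load_module_from_path(file_path):
        # Name the module after the file itself instead of a fixed string.
        module_name = os.path.splitext(os.path.basename(file_path))[0]
        spec = importlib.util.spec_from_file_location(module_name, file_path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module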

if self.orig_flow_datastore:
# We filter only the whitelisted decorators in case of spin step.
decorators = [
deco for deco in decorators if deco.name in whitelist_decorators

Collaborator:
maybe an additional check that whitelist_decorators is not None or empty would be helpful here?
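
i.e. roughly:

    if self.orig_flow_datastore and whitelist_decorators:
        # Filter only the whitelisted decorators when spinning a step.
        decorators = [
            deco for deco in decorators if deco.name in whitelist_decorators
        ]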

@@ -63,8 +65,8 @@ def __init__(self, cache_dir=None, max_size=None):
# when querying for sizes of artifacts. Once we have queried for the size
# of one artifact in a TaskDatastore, caching this means that any
# queries on that same TaskDatastore will be quick (since we already
# have all the metadata)
self._task_metadata_caches = OrderedDict()
# have all the metadata). We keep track of this in a file so it persists

Collaborator:
I wonder if the same should be done for self._store_caches, which is currently an OrderedDict?

# Get the parent steps
steps = []
for node_name, attributes in graph_info["steps"].items():
if step_name in attributes["next"]:

Collaborator:
I wonder if in_funcs can be used here somehow...


yield from self._iter_matching_tasks(steps, "foreach-execution-path", pattern)
metadata_key = "foreach-execution-path"

Collaborator:
this can be declared much earlier and then re-used, instead of current_path = metadata_dict.get("foreach-execution-path")
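
i.e. roughly:

    metadata_key = "foreach-execution-path"
    current_path = metadata_dict.get(metadata_key)
    ...
    yield from self._iter_matching_tasks(steps, metadata_key, pattern)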

target_depth = current_depth - 1
pattern = ",".join(current_path.split(",")[:target_depth])

metadata_key = "foreach-execution-path"

Collaborator:
same comment as before about declaring earlier


self._step_func = step_func

# Verify whether the use has provided step-name or spin-pathspec

Collaborator:
nit: the use --> the user

default=True,
show_default=True,
help="Whether to persist the artifacts in the spun step. If set to False, "
"the artifacts will notbe persisted and will not be available in the spun step's "

Collaborator:
nit: notbe --> not be

echo = echo_always

if opt_namespace is not None:
namespace(opt_namespace or None)

Collaborator:
we already checked for None in the if statement, so we can avoid opt_namespace or None, I guess
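
i.e. simply:

    if opt_namespace is not None:
        namespace(opt_namespace)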

spin_artifacts = read_artifacts_module(artifacts_module) if artifacts_module else {}
from_start("SpinStep: read artifacts module")

ds_type, ds_root = orig_flow_datastore.split("@")

Collaborator:
can this be None too? because that's the default value when defining this arg
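
e.g. a defensive sketch (what the right fallback is remains an open question):

    if orig_flow_datastore:
        ds_type, ds_root = orig_flow_datastore.split("@")
    else:
        ds_type, ds_root = None, None  # or fall back to the configured default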

) -> None:
"""
Create a new ExecutingTask -- this should not be done by the user directly but
instead user Runner.spin()

Collaborator:
nit: user --> use

) -> None:
"""
Create a new ExecutingRun -- this should not be done by the user directly but
instead user Runner.run()

Collaborator:
same nit: user --> use

"""

def __init__(
self, runner: "Runner", command_obj: CommandManager, task_obj: "metaflow.Task"

Collaborator:
can import Task directly (from metaflow import Task) and use that as the type instead of "metaflow.Task", to be consistent with how it's done for ExecutingRun


# Set the correct metadata from the runner_attribute file corresponding to this run.
metadata_for_flow = content.get("metadata")
from metaflow import Task

Collaborator:
can be moved to the top, and thus the type can also be changed to Task instead of "metaflow.Task" in the __init__ of ExecutingTask
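
i.e. roughly (sketch; the surrounding class is abbreviated):

    from metaflow import Task  # imported once at module top

    class ExecutingTask:
        def __init__(self, runner, command_obj, task_obj: Task) -> None:
            self.runner = runner
            self.command_obj = command_obj
            self.task = task_obj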

@@ -57,7 +63,6 @@ def __init__(
"""
self.runner = runner
self.command_obj = command_obj
self.run = run_obj

def __enter__(self) -> "ExecutingRun":

Collaborator:
shouldn't the return type now be ExecutingProcess instead?


Collaborator:
same for other functions such as async def wait


Collaborator:
maybe some functions specific to ExecutingRun (if there are any) will have to be moved to the new subclass; would be good to check
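
A rough sketch of the hierarchy these comments point toward; everything beyond the ExecutingRun name is assumed:

    class ExecutingProcess:
        """Assumed common base class for anything the Runner executes."""

        def __enter__(self) -> "ExecutingProcess":
            return self

        def __exit__(self, exc_type, exc_value, tb) -> None:
            pass

        async def wait(self) -> "ExecutingProcess":
            # run/task-agnostic waiting would live here
            return self

    class ExecutingRun(ExecutingProcess):
        pass  # run-specific helpers stay on this subclass

    class ExecutingTask(ExecutingProcess):
        pass  # task-specific helpers (e.g. a .task accessor) stay here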

# Set the correct metadata from the runner_attribute file corresponding to this run.
metadata_for_flow = content.get("metadata")

from metaflow import Task

Collaborator:
same comment, can be imported at the top
