Task Parameters¶

Parameters define the behaviour of a tasks run method, what the task does. Since Stardag tasks are pydantic BaseModels, we can use all pydantic features and patterns/best practices to declare a task's parameters.

As covered in previous sections, we can also pass other (arbitrarily nested) task instances as parameters; since they are also pydantic BaseModels, this nesting is natural and results in a well-defined JSON schema.

Polymorphism and `TaskLoads[...]`¶

A central feature that Stardag adds on top of standard pydantic is support for generalized polymorphism. Consider the example below:

class TrainedModel(sd.AutoTask[MyModel]):
    config: MyModelConfig  # A regular pydantic model
    dataset: Dataset  # A specific Stardag Task

    def requires(self):
        return self.dataset

    def run(self):
        training_data = self.dataset.output().load()
        model = MyModel(config)
        model.fit(training_data)
        self.output().save(model)
        # ...

Here, we have declared that the dataset must be a specific task of type Dataset. This could be fine, but we typically want to be able to compare different training and test datasets from different sources with different pre-processing etc. and this is typically best reflected by differently composed tasks/DAGs.

Looking closer at the run method, we actually only care about the data type of training_data in:

training_data = self.dataset.output().load()

We can express this by instead using:

MyDataType = ...  # For example a pandas DataFrame with a pandera schema

class TrainedModel(sd.AutoTask[MyModel]):
    config: MyModelConfig  # A regular pydantic model
    dataset: sd.TaskLoads[MyDataType]  # *Any* task, which output().load() -> MyDataType.

TaskLoads[<Type>] is short for any Stardag task for which the return type of output().load() is <Type>.

Parameter Hashing¶

Parameter hashing gives each task instance a unique, deterministic identifier based on its parameters.

Parameter hashing solves several problems:

Deterministic IDs: Same parameters always produce the same task ID
Unique paths: Each configuration gets its own output location
Caching: Re-running with same parameters reuses existing outputs
Composition: Upstream task IDs are included in downstream hashes

The Task ID¶

Every task has an id property:

from uuid import UUID

@sd.task
def add(a: int, b: int) -> int:
    return a + b

task = add(a=1, b=2)
assert task.id == UUID("fa9b74b1-1cde-5676-8650-dbcf755a2699")  # UUID-5

The task ID is derived from:

Task name (class name or function name, unless overridden)
Task namespace
Task version
All parameter values (recursively hashed)

This recursive hashing ensures that:

Changes to upstream parameters change downstream IDs
The full DAG lineage is captured in the hash

Output URIs¶

The task ID should typically determine the output URI, and does so automatically when using the Decorator API or AutoTask:

task = add(a=1, b=2)
print(task.output().uri)
# /path/to/.stardag/local-target-roots/default/add/fa/9b/fa9b74b1-1cde-5676-8650-dbcf755a2699.json

The default path structure is:

<target_root>/[<namespace>/]<name>/<id[0:2]>/<id[2:4]>/<id>.json

The id[0:2]/id[2:4] directory structure prevents having too many files in a single directory (facilitate file browsing in some filesystems).

🚧 Work in progress 🚧

This documentation is still taking shape. It should soon cover:

How (and when) to exclude parameters from hashing -> task ID
How Task ID is obtained in more detail
Customizing hash behaviour
Compatibility mode validation
Task versioning
Best practices (examples for experimental ML and model hyperparameters)

Task Parameters¶

Polymorphism and TaskLoads[...]¶

Parameter Hashing¶

The Task ID¶

Output URIs¶

Polymorphism and `TaskLoads[...]`¶