Core Concepts¶
Understanding these concepts will help you get the most out of Stardag.
🚧 Work in progress 🚧
This section is still taking shape. Questions, feedback, and suggestions are very welcome; feel free to email us or open an issue on GitHub if anything is unclear.
Overview¶
Stardag is built around a few key abstractions:
| Concept | Description |
|---|---|
| Tasks | Units of work that produce outputs and declare dependencies |
| Targets | Where and how outputs are stored |
| Dependencies | How task dependencies are declared |
| Parameters | How task parameters are declared, and how they are hashed to derive the task ID |
| AsyncIO | How to implement and execute tasks and targets with asyncio |
| Build & Execution | How DAGs are executed |
The Big Picture¶
```
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│   Task A    │─────▶│   Task B    │─────▶│   Task C    │
│ (upstream)  │      │  (middle)   │      │ (downstream)│
└─────────────┘      └─────────────┘      └─────────────┘
       │                    │                    │
       ▼                    ▼                    ▼
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│  Target A   │      │  Target B   │      │  Target C   │
│ (persisted) │      │ (persisted) │      │ (persisted) │
└─────────────┘      └─────────────┘      └─────────────┘
```
- Tasks define what to compute and how their output depends on inputs
- Dependencies create the DAG structure by linking task outputs to inputs
- Parameter hashing gives each unique task configuration a deterministic ID
- Targets persist outputs at paths determined by the task ID
- Build traverses the DAG bottom-up, executing only incomplete tasks
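The cycle above can be sketched in a few lines of plain Python. All names and paths here are invented for illustration; this is not Stardag's actual API:

```python
import hashlib
import json


class Task:
    """Toy task: parameters define identity; deps are other tasks."""

    def __init__(self, name, params, deps=()):
        self.name = name
        self.params = params
        self.deps = list(deps)
        self.ran = False

    @property
    def task_id(self):
        # Deterministic ID from the name plus canonicalized parameters.
        payload = json.dumps(
            {"name": self.name, "params": self.params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

    def output_path(self, root="/data"):
        # The target location is a pure function of the task id.
        return f"{root}/{self.name}/{self.task_id}"


STORE = set()  # stand-in for persisted target paths


def complete(task):
    return task.output_path() in STORE


def build(task):
    # Bottom-up traversal: dependencies first, skip persisted targets.
    for dep in task.deps:
        build(dep)
    if not complete(task):
        task.ran = True
        STORE.add(task.output_path())


a = Task("a", {"x": 1})
b = Task("b", {"x": 2}, deps=[a])
build(b)  # runs a, then b
build(b)  # no-op: both targets already exist
```

Note how nothing here is imperative orchestration: `build` only needs the declared structure and a completeness check.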
Design Philosophy¶
Declarative Over Imperative¶
Tasks are specifications of what to compute, not (only) instructions to execute. This separation enables:
- Inspection before execution
- Serialization of the full DAG
- Efficient caching and skip logic
- Data as Code (DaC)
Moreover, especially in experimental Machine Learning workflows, a human-readable and searchable specification of every produced asset can be extremely valuable. Each task is a self-contained specification of the complete provenance of its persistently stored target. Done right, this allows inspecting the "diff" between the specifications of, say, two different instances of ML-model performance metrics: Why is one better than the other? Which hyper-parameters changed? Were the same training dataset and filtering used?
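As a minimal illustration of that "diff" idea (plain dataclasses with invented field names, not Stardag code), comparing two fully declarative specifications reduces the provenance questions above to a dictionary diff:

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainModel:
    """A declarative spec: the fields fully describe the asset."""

    dataset: str
    min_samples: int
    learning_rate: float
    epochs: int


def spec_diff(a, b):
    # Return only the fields whose values differ between the two specs.
    da, db = asdict(a), asdict(b)
    return {k: (da[k], db[k]) for k in da if da[k] != db[k]}


run_a = TrainModel("sales-2024", 100, 1e-3, 10)
run_b = TrainModel("sales-2024", 100, 1e-4, 10)

print(spec_diff(run_a, run_b))  # {'learning_rate': (0.001, 0.0001)}
```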
Composition (Over Inheritance and/or Static DAG Topology)¶
Tasks are composed by passing task instances as parameters. This promotes:
- Loose coupling
- Reusability
- Testability
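A hypothetical sketch of the pattern (class and method names invented for illustration): the downstream task receives its upstream task as a constructor parameter, so any compatible producer can be swapped in, including a tiny fixture for tests.

```python
class MakeDataset:
    def __init__(self, source: str):
        self.source = source

    def output(self) -> str:
        return f"dataset from {self.source}"


class TrainModel:
    # The dependency is a parameter, not a hard-coded class reference.
    def __init__(self, dataset_task: MakeDataset, epochs: int):
        self.dataset_task = dataset_task
        self.epochs = epochs

    def run(self) -> str:
        return (
            f"model trained {self.epochs} epochs "
            f"on {self.dataset_task.output()}"
        )


# Swap producers without touching TrainModel (loose coupling, testability).
prod = TrainModel(MakeDataset("s3://bucket/raw"), epochs=5)
test = TrainModel(MakeDataset("fixtures/tiny"), epochs=1)
```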
Determinism¶
Given the same parameters, a task always:
- Has the same ID (via parameter hashing)
- Writes to the same output location
- Produces the same result (assuming pure functions)
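One common way to obtain such an ID, shown here as a sketch of the general technique rather than Stardag's exact scheme, is to hash a canonical serialization of the parameters:

```python
import hashlib
import json


def task_id(family: str, params: dict) -> str:
    # sort_keys + compact separators yield byte-identical JSON for equal
    # parameters, so equal parameters always hash to the same id.
    canonical = json.dumps(
        {"family": family, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Key order in the input dict does not matter; values do.
id1 = task_id("TrainModel", {"lr": 0.001, "seed": 42})
id2 = task_id("TrainModel", {"seed": 42, "lr": 0.001})
id3 = task_id("TrainModel", {"lr": 0.001, "seed": 43})
```

Because the ID also determines the output location, changing any parameter automatically produces (and caches) a distinct target.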
The Right Tool for the Job¶
Stardag happily acknowledges that the declarative DAG abstraction is not suitable for every data-processing workflow. That's why it aims to be interoperable with other modern data workflow frameworks, such as Prefect, that lack the declarative DAG abstraction (both at the SDK and at the execution layer).
Mental Model¶
Think of Stardag like a Makefile for Python:
- Each task is like a Make target
- Dependencies define the build order
- If an output file exists, the task is considered complete
- Building starts from the requested target and works backward
The key difference: parameter hashing makes targets automatically unique based on their inputs.