Orchestration Capabilities

1. DAG Generation

Starlake generates orchestration DAGs automatically from the dependency graph of load and transform jobs. No manual DAG writing is required.

```bash
# Generate all DAGs
starlake dag-generate

# Generate with options
starlake dag-generate --clean --domains --tasks --tags tag1

# Custom output directory
starlake dag-generate --outputDir /path/to/dags
```

2. Supported Orchestrators

| Orchestrator | Execution Environments |
| --- | --- |
| Apache Airflow (2.4.0+) | Bash, Cloud Run (GCP), Dataproc (GCP), Fargate (AWS) |
| Dagster (1.6.0+) | Shell, Cloud Run (GCP), Dataproc (GCP), Fargate (AWS) |
| Snowflake Tasks | Snowpark SQL (native) |

3. Automatic Dependency Resolution

Starlake parses FROM and JOIN clauses in SQL transform files to build a directed acyclic graph (DAG). Upstream tables always execute before downstream ones. Both load-to-transform and transform-to-transform dependencies are detected.
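Conceptually, this is a parse-then-sort problem. The sketch below is a minimal illustration of the idea, not Starlake's actual parser; the regex, task names, and SQL bodies are assumptions for the example:

```python
import re
from graphlib import TopologicalSorter

def extract_upstream(sql: str) -> set[str]:
    """Collect table names that appear after FROM or JOIN keywords."""
    return set(re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", sql, flags=re.IGNORECASE))

# Hypothetical transform files: task name -> SQL body
transforms = {
    "kpi.order_summary": "SELECT * FROM sales.orders JOIN sales.customers ON o.id = c.id",
    "sales.orders": "SELECT * FROM raw.orders_load",
    "sales.customers": "SELECT * FROM raw.customers_load",
}

# node -> set of predecessors; a topological sort yields a valid run order
graph = {task: extract_upstream(sql) for task, sql in transforms.items()}
order = list(TopologicalSorter(graph).static_order())
# Load tables (raw.*) come before the transforms that read them;
# kpi.order_summary, which depends on everything else, runs last.
```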

4. Dependency Strategies

Two strategies control how dependencies are executed:

Inline Dependencies

All upstream loads and transforms are included in the same DAG. This is simpler and provides a single point of execution.

```yaml
options:
  run_dependencies_first: "true"
```

Data-Aware Scheduling (default)

Each DAG runs independently. Downstream DAGs are triggered when upstream datasets are updated:

  • Airflow: uses native Airflow Datasets.
  • Dagster: uses Multi Asset Sensors to monitor materializations.

```yaml
options:
  run_dependencies_first: "false"
```

The dataset_triggering_strategy option controls when downstream DAGs fire:

| Value | Behavior |
| --- | --- |
| `any` (default) | Any single upstream dataset update triggers the DAG |
| `all` | All upstream datasets must be updated before triggering |
| Custom expression | Boolean expression, e.g. `dataset1 & (dataset2 \| dataset3)` |
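A custom expression behaves like a boolean formula over "was this dataset updated since the last run". A conceptual sketch of that evaluation (the helper name and namespace trick are assumptions, not Starlake or Airflow internals):

```python
# Evaluate a triggering expression against the set of datasets updated since
# the last run. Python's & and | operate on bools, so eval with a
# name -> bool namespace mirrors the expression semantics.
def should_trigger(expression: str, updated: set[str], all_upstream: set[str]) -> bool:
    namespace = {name: (name in updated) for name in all_upstream}
    return bool(eval(expression, {"__builtins__": {}}, namespace))

upstream = {"dataset1", "dataset2", "dataset3"}
should_trigger("dataset1 & (dataset2 | dataset3)", {"dataset1", "dataset3"}, upstream)  # True
should_trigger("dataset1 & (dataset2 | dataset3)", {"dataset2"}, upstream)              # False
```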

5. Lineage Visualization

```bash
# Print dependency tree as text
starlake lineage --task kpi.order_summary --print

# Generate SVG (requires GraphViz)
starlake lineage --task kpi.order_summary --svg --output lineage.svg

# Other formats
starlake lineage --png --output lineage.png
starlake lineage --json --output lineage.json

# Column-level lineage
starlake col-lineage --task kpi.order_summary
```

Options: --all (include all tasks), --objects task,table,view (filter by type), --verbose (extra properties).

6. DAG Configuration Hierarchy

DAG references (dagRef) can be set at three levels. Each level overrides the one above:

Project Level

```yaml
# metadata/application.sl.yml
application:
  dagRef:
    load: airflow_bash_load
    transform: airflow_bash_transform
```

Domain Level

```yaml
# metadata/load/{domain}/_config.sl.yml
load:
  metadata:
    dagRef: airflow_bash_load
```

Table / Task Level

```yaml
# metadata/load/{domain}/{table}.sl.yml
table:
  metadata:
    dagRef: airflow_bash_load
```

```yaml
# metadata/transform/{domain}/{task}.sl.yml
task:
  dagRef: airflow_bash_transform
```
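The override order reduces to "most specific level wins". A minimal sketch of that resolution (function and argument names are illustrative, not Starlake internals):

```python
# Resolve a dagRef across the three levels: table/task beats domain,
# which beats the project-wide default.
def resolve_dag_ref(project_ref, domain_ref, table_ref):
    for ref in (table_ref, domain_ref, project_ref):  # most specific first
        if ref is not None:
            return ref
    return None

resolve_dag_ref("airflow_bash_load", None, None)          # 'airflow_bash_load' (project default)
resolve_dag_ref("airflow_bash_load", None, "custom_dag")  # 'custom_dag' (table wins)
```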

7. DAG Definition Structure

Each DAG definition lives in metadata/dags/ as a YAML file referencing a Jinja2 template:

```yaml
dag:
  comment: "Daily sales pipeline"
  template: "load/airflow_scheduled_table_bash.py.j2"
  filename: "dag_{{domain}}.py"
  options:
    schedule: "0 2 * * *"
    start_date: "2024-01-01"
    tags: "sales production"
```

Filename Variables

The filename property controls DAG granularity:

| Pattern | Result |
| --- | --- |
| `dag_{{domain}}.py` | One DAG per domain |
| `dag_{{domain}}_{{table}}.py` | One DAG per table |
| `dag_all.py` | Single DAG for everything |
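Starlake renders these patterns with Jinja2; the sketch below imitates just the `{{variable}}` substitution with the standard library, so the example stays dependency-free (the helper is hypothetical):

```python
import re

# Expand {{variable}} placeholders in a filename pattern.
def render_filename(pattern: str, **variables: str) -> str:
    return re.sub(r"\{\{(\w+)\}\}", lambda m: variables[m.group(1)], pattern)

render_filename("dag_{{domain}}.py", domain="sales")                    # 'dag_sales.py'
render_filename("dag_{{domain}}_{{table}}.py", domain="sales", table="orders")  # 'dag_sales_orders.py'
```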

8. Scheduling Options

| Option | Description | Default |
| --- | --- | --- |
| `schedule` | Cron expression or preset name | |
| `start_date` | Start date (YYYY-MM-DD) | |
| `end_date` | End date (Airflow only) | |
| `timezone` | Scheduling timezone | UTC |
| `cron_period_frequency` | Granularity: day, week, month, year | week |

Data Cycle Management

| Option | Description | Default |
| --- | --- | --- |
| `data_cycle_enabled` | Enable data cycle validation | |
| `data_cycle` | Frequency: hourly, daily, weekly, monthly, yearly, or cron | |
| `beyond_data_cycle_enabled` | Allow runs outside the cycle window | true |
| `min_timedelta_between_runs` | Minimum seconds between runs | 900 (15 min) |
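The `min_timedelta_between_runs` guard reduces to a simple elapsed-time check. A sketch under the assumption that the orchestrator records the previous run's timestamp (function and variable names are illustrative):

```python
from datetime import datetime, timedelta

# Skip a run that would start too soon after the previous one
# (900 seconds is the default from the table above).
def run_allowed(last_run: datetime, now: datetime, min_seconds: int = 900) -> bool:
    return now - last_run >= timedelta(seconds=min_seconds)

last = datetime(2024, 1, 1, 2, 0, 0)
run_allowed(last, datetime(2024, 1, 1, 2, 5, 0))   # False: only 300 s elapsed
run_allowed(last, datetime(2024, 1, 1, 2, 15, 0))  # True: 900 s elapsed
```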

9. Pre-Load Strategies

Controls when domain tables are loaded within the DAG:

| Strategy | Behavior |
| --- | --- |
| `NONE` (default) | Unconditional load |
| `IMPORTED` | Checks for files in the landing area, calls sl_import to stage them; skips if none are found |
| `PENDING` | Checks for files in the pending datasets area; skips if none are found |
| `ACK` | Waits for an acknowledgment file before loading (configurable path and timeout) |

```yaml
options:
  pre_load_strategy: "ack"
  global_ack_file_path: "${SL_ROOT}/datasets/pending/{domain}/{date}.ack"
  ack_wait_timeout: "3600"
```
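The ACK strategy is essentially a poll-until-timeout loop on the acknowledgment file. A conceptual sketch, assuming a fixed poll interval (the interval and function name are assumptions, not Starlake's implementation):

```python
import time
from pathlib import Path

# Poll for the acknowledgment file until it appears or the timeout elapses.
def wait_for_ack(ack_path: str, timeout_seconds: int, poll_seconds: float = 5.0) -> bool:
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if Path(ack_path).exists():
            return True   # file found: proceed with the load
        time.sleep(poll_seconds)
    return False          # timed out: skip or fail the DAG run
```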

10. Retry and Failure Handling

| Option | Description | Default |
| --- | --- | --- |
| `retries` | Number of retry attempts | 1 |
| `retry_delay` | Delay between retries (seconds) | 300 |
| `retry_on_failure` | Retry on failure (Cloud Run / Fargate) | false |
| `retry_delay_in_seconds` | Retry delay for Cloud Run | 10 |

Airflow Default DAG Args

```yaml
options:
  default_dag_args: '{"depends_on_past": false, "email_on_failure": false, "retries": 1, "retry_delay": 300}'
  max_active_runs: "3"
```
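The `default_dag_args` value is plain JSON embedded in a YAML string; parsing it yields the dictionary an Airflow DAG's `default_args` would receive:

```python
import json

# Parse the JSON string from the YAML option above into a Python dict.
raw = '{"depends_on_past": false, "email_on_failure": false, "retries": 1, "retry_delay": 300}'
default_args = json.loads(raw)

default_args["retries"]      # 1
default_args["retry_delay"]  # 300 (seconds)
```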

Async Execution (Cloud Run / Fargate)

| Option | Description | Default |
| --- | --- | --- |
| `cloud_run_async` | Enable async with completion sensor | true |
| `cloud_run_async_poke_interval` | Polling interval (seconds) | 10 |
| `fargate_async_poke_interval` | Polling interval (seconds) | 30 |

11. Execution Environment Options

Bash / Shell

| Option | Description |
| --- | --- |
| `SL_STARLAKE_PATH` | Path to the starlake CLI (default: starlake) |
| `sl_env_var` | JSON-encoded environment variables |
| `sl_include_env_vars` | Comma-separated OS env vars to forward (or `*` for all) |

Cloud Run (GCP)

| Option | Description |
| --- | --- |
| `cloud_run_project_id` | GCP project ID |
| `cloud_run_job_name` | Cloud Run job name (required) |
| `cloud_run_region` | Region (default: europe-west1) |
| `cloud_run_service_account` | Service account |

Dataproc (GCP)

| Option | Description |
| --- | --- |
| `dataproc_project_id` | GCP project ID |
| `dataproc_name` | Cluster name (default: dataproc-cluster) |
| `dataproc_region` | Region (default: europe-west1) |
| `dataproc_idle_delete_ttl` | Delete idle cluster after N seconds (default: 3600) |

Fargate (AWS)

| Option | Description |
| --- | --- |
| `aws_cluster_name` | ECS cluster name (required) |
| `aws_task_definition_name` | Task definition (required) |
| `aws_task_definition_container_name` | Container name (required) |
| `aws_task_private_subnets` | JSON array of subnet IDs (required) |
| `aws_task_security_groups` | JSON array of security group IDs (required) |
| `cpu` | CPU units (default: 1024) |
| `memory` | Memory in MB (default: 2048) |

Snowflake Tasks

| Option | Description |
| --- | --- |
| `stage_location` | Snowflake stage path (required, e.g., `@my_stage/path`) |
| `warehouse` | Warehouse name |
| `packages` | Python packages (default: croniter,python-dateutil) |
| `allow_overlapping_execution` | Allow backfill (default: false) |

12. Template Customization

DAG templates are Jinja2 files. Starlake ships built-in templates for every orchestrator × environment combination. Custom templates can be placed in metadata/dags/templates/.

Template resolution order:

  1. Absolute path
  2. Relative to metadata/dags/templates/
  3. Built-in from starlake resources
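The three-step lookup above can be sketched as follows (a conceptual illustration: the function name is hypothetical, and the built-in lookup is stubbed with a plain set rather than starlake's resource loading):

```python
from pathlib import Path

# Stand-in for templates shipped inside starlake's resources.
BUILTIN_TEMPLATES = {"load/airflow_scheduled_table_bash.py.j2"}

def resolve_template(ref: str, project_root: str):
    candidate = Path(ref)
    if candidate.is_absolute() and candidate.exists():
        return str(candidate)                          # 1. absolute path
    local = Path(project_root) / "metadata" / "dags" / "templates" / ref
    if local.exists():
        return str(local)                              # 2. project-local template
    if ref in BUILTIN_TEMPLATES:
        return f"<builtin>/{ref}"                      # 3. shipped with starlake
    return None                                        # unknown template
```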

Transform Parameters

```python
jobs = {
    "domain.transform": {
        "options": "param1=value1,param2=value2"
    }
}
```

Airflow User-Defined Macros

```python
user_defined_macros = {
    "days_interval": custom_function
}
```

13. Deployment Workflow

  1. Generate — `starlake dag-generate` analyzes YAML configs and SQL dependencies, produces Python/SQL files.
  2. Deploy — copy generated files to the orchestrator (Airflow dags/ folder, Dagster repository, or Snowflake stage).
  3. Backfill (optional) — replay historical intervals with correct sl_start_date / sl_end_date values.

Summary

| Capability | Category |
| --- | --- |
| Automatic DAG generation from YAML + SQL dependencies | Core |
| Airflow, Dagster, Snowflake Tasks support | Orchestrators |
| Bash, Cloud Run, Dataproc, Fargate execution | Environments |
| SQL-based dependency resolution (FROM / JOIN parsing) | Dependencies |
| Inline vs. data-aware scheduling strategies | Dependencies |
| Dataset triggering (any / all / custom expression) | Dependencies |
| Lineage visualization (text, SVG, PNG, JSON, column-level) | Observability |
| Three-level dagRef hierarchy (project, domain, table) | Configuration |
| Filename variables for DAG granularity | Configuration |
| Cron scheduling and data cycle management | Scheduling |
| Pre-load strategies (NONE, IMPORTED, PENDING, ACK) | Scheduling |
| Retry, timeout, and async completion sensors | Reliability |
| Custom Jinja2 templates | Extensibility |
| Transform parameters and user-defined macros | Extensibility |