
# Options Reference

This page documents every option you can pass in the options dictionary of your DAG configuration. Options are organized by scope: common options apply to all orchestrators, while backend-specific options only apply to a given orchestrator or execution environment.

All options are string values in YAML:

```yaml
dag:
  options:
    option_name: "value"
```

## Common options

These options are recognized by all orchestrators and execution environments.

### sl_env_var

Starlake environment variables passed as a JSON-encoded string. These variables are injected into the execution environment of every task.

| Type | Default | Required |
|---|---|---|
| JSON string | `{}` | No |

```yaml
dag:
  options:
    sl_env_var: '{"SL_ROOT": "/opt/starlake", "SL_DATASETS": "/opt/starlake/datasets", "SL_ENV": "PROD", "SL_TIMEZONE": "Europe/Paris"}'
```

Common variables inside `sl_env_var`:

| Variable | Description |
|---|---|
| `SL_ROOT` | Root directory of the Starlake project |
| `SL_DATASETS` | Datasets directory (defaults to `${SL_ROOT}/datasets`) |
| `SL_ENV` | Environment name (e.g., `SNOWFLAKE`, `BIGQUERY`, `SPARK`) |
| `SL_TIMEZONE` | Timezone for scheduling |
| `SL_LOG_LEVEL` | Logging level |

### pre_load_strategy

Defines the strategy used to conditionally load the tables of a domain.

| Type | Default | Required |
|---|---|---|
| `none` \| `imported` \| `ack` \| `pending` | `none` | No |

```yaml
dag:
  options:
    pre_load_strategy: "imported"
```

#### NONE

No pre-load check. The domain loads unconditionally.

#### IMPORTED

Checks that at least one file exists in the landing area (`${SL_ROOT}/incoming/{domain}` by default). If files are found, `sl_import` is called to import the domain before loading. Otherwise, loading is skipped silently.

```yaml
pre_load_strategy: "imported"
```

#### PENDING

Checks that at least one file exists in the pending datasets area (`${SL_ROOT}/datasets/pending/{domain}` by default). Otherwise, loading is skipped.

```yaml
pre_load_strategy: "pending"
```

#### ACK

Checks that an acknowledgment file exists at the specified path (`${SL_ROOT}/datasets/pending/{domain}/{date}.ack` by default). Otherwise, loading is skipped.

```yaml
pre_load_strategy: "ack"
```

Related options for the ACK strategy:

| Option | Default | Description |
|---|---|---|
| `global_ack_file_path` | `${SL_DATASETS}/pending/{domain}/{date}.ack` | Path to the acknowledgment file |
| `ack_wait_timeout` | `3600` (1 hour) | Timeout in seconds to wait for the ACK file |
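For example, an ACK-gated load with an explicit acknowledgment path and a shorter timeout might look like this (the path and timeout values are illustrative):

```yaml
dag:
  options:
    pre_load_strategy: "ack"
    global_ack_file_path: "/opt/starlake/datasets/pending/{domain}/{date}.ack"
    ack_wait_timeout: "600"
```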

### run_dependencies_first

When set to `true`, all dependencies (upstream tables and tasks) for each transformation are generated as tasks within the same DAG. When `false` (the default), the orchestrator's native data-aware scheduling mechanism is used instead.

| Type | Default | Required |
|---|---|---|
| `true` \| `false` (string) | `false` | No |

```yaml
dag:
  options:
    run_dependencies_first: "true"
```

See the Dependencies section for a detailed explanation of both strategies.

### retries

Number of times to retry a failed task.

| Type | Default | Required |
|---|---|---|
| integer (as string) | `1` | No |

### retry_delay

Delay in seconds between retries.

| Type | Default | Required |
|---|---|---|
| integer (as string) | `300` | No |
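A minimal sketch combining the two retry options (values are illustrative):

```yaml
dag:
  options:
    retries: "2"
    retry_delay: "60"
```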

### start_date

The start date for DAG scheduling, in `YYYY-MM-DD` format.

| Type | Default | Required |
|---|---|---|
| date string | File modification date of the DAG | No |

### timezone

Timezone used for scheduling.

| Type | Default | Required |
|---|---|---|
| string | `UTC` | No |
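For instance, to pin scheduling to a fixed start date in the Paris timezone (both values are illustrative):

```yaml
dag:
  options:
    start_date: "2024-01-01"
    timezone: "Europe/Paris"
```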

### tags

Tags applied to the generated DAG, visible in the orchestrator's UI.

| Type | Default | Required |
|---|---|---|
| string (space-separated) | (none) | No |

```yaml
dag:
  options:
    tags: "starlake production finance"
```

### cron_period_frequency

The frequency granularity for cron-based scheduling.

| Type | Default | Required |
|---|---|---|
| `day` \| `week` \| `month` \| `year` | `week` | No |
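For example, to evaluate cron periods at daily rather than weekly granularity:

```yaml
dag:
  options:
    cron_period_frequency: "day"
```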

### dataset_triggering_strategy

Controls how dataset dependencies trigger a DAG run when using data-aware scheduling (`run_dependencies_first: "false"`).

| Type | Default | Required |
|---|---|---|
| `any` \| `all` \| custom boolean expression | `any` | No |

- `any` — Any single upstream dataset update triggers the DAG
- `all` — All upstream datasets must be updated before the DAG triggers
- Custom expression — A boolean expression combining dataset names with `&` (AND) and `|` (OR), e.g. `dataset1 & (dataset2 | dataset3)`
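A sketch using a custom boolean expression (the dataset names are hypothetical):

```yaml
dag:
  options:
    dataset_triggering_strategy: "sales.orders & (crm.customers | crm.leads)"
```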

### optional_dataset_enabled

Whether datasets can be optional in dependency resolution.

| Type | Default | Required |
|---|---|---|
| `true` \| `false` | `false` | No |

### data_cycle_enabled

Enables data cycle management for verifying data dependencies.

| Type | Default | Required |
|---|---|---|
| `true` \| `false` | `false` | No |

### data_cycle

The data cycle frequency. Only used when `data_cycle_enabled` is `true`.

| Type | Default | Required |
|---|---|---|
| `hourly` \| `daily` \| `weekly` \| `monthly` \| `yearly` \| cron expression | (none) | No (only relevant when `data_cycle_enabled` is `true`) |

### beyond_data_cycle_enabled

Whether to allow runs beyond the data cycle window.

| Type | Default | Required |
|---|---|---|
| `true` \| `false` | `true` | No |

### min_timedelta_between_runs

Minimum time in seconds between two consecutive DAG runs.

| Type | Default | Required |
|---|---|---|
| integer (as string) | `900` (15 minutes) | No |
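Putting the data-cycle options together, a daily cycle confined to its window with a tighter minimum spacing between runs might look like this (values are illustrative):

```yaml
dag:
  options:
    data_cycle_enabled: "true"
    data_cycle: "daily"
    beyond_data_cycle_enabled: "false"
    min_timedelta_between_runs: "600"
```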

## Airflow options

These options only apply when using Airflow as the orchestrator.

### default_pool

The Airflow pool to use for all tasks in the DAG.

| Type | Default | Required |
|---|---|---|
| string | `default_pool` | No |

```yaml
dag:
  options:
    default_pool: "starlake_pool"
```

### max_active_runs

Maximum number of concurrent active DAG runs.

| Type | Default | Required |
|---|---|---|
| integer (as string) | `3` | No |

### end_date

The end date for DAG scheduling, in `YYYY-MM-DD` format.

| Type | Default | Required |
|---|---|---|
| date string | (none — runs indefinitely) | No |
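For example, capping concurrency at a single active run and stopping scheduling at a fixed date (values are illustrative):

```yaml
dag:
  options:
    max_active_runs: "1"
    end_date: "2025-12-31"
```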

### default_dag_args

Overrides for the default Airflow DAG arguments, as a JSON-encoded string.

| Type | Default | Required |
|---|---|---|
| JSON string | `{"depends_on_past": false, "email_on_failure": false, "email_on_retry": false, "retries": 1, "retry_delay": 300}` | No |

```yaml
dag:
  options:
    default_dag_args: '{"depends_on_past": true, "email_on_failure": true}'
```

## Execution environment options

Each orchestrator supports different execution environments. The execution environment is determined by the template you choose. The following sections document the options specific to each execution environment.

### Shell / Bash

Used for on-premise execution of the starlake CLI command directly.

Applies to: `StarlakeAirflowBashJob`, `StarlakeDagsterShellJob`

| Option | Default | Required | Description |
|---|---|---|---|
| `SL_STARLAKE_PATH` | `starlake` | No | Path to the starlake executable |
| `sl_include_env_vars` | `GOOGLE_APPLICATION_CREDENTIALS,AWS_KEY_ID,AWS_SECRET_KEY` | No | Comma-separated list of OS environment variables to forward to the bash command. Use `*` or `_` to forward all. |

#### Airflow Bash example

```yaml
dag:
  template: "load/airflow_scheduled_table_bash.py.j2"
  options:
    SL_STARLAKE_PATH: "/usr/local/bin/starlake"
    sl_include_env_vars: "GOOGLE_APPLICATION_CREDENTIALS,AWS_KEY_ID,AWS_SECRET_KEY"
```

### GCP Cloud Run

Executes starlake commands by running a Cloud Run job.

Applies to: `StarlakeAirflowCloudRunJob`, `StarlakeDagsterCloudRunJob`

| Option | Default | Required | Description |
|---|---|---|---|
| `cloud_run_project_id` | `$GCP_PROJECT` env var | No | GCP project ID |
| `cloud_run_job_name` | (none) | Yes | Name of the Cloud Run job to execute |
| `cloud_run_job_region` | `$GCP_REGION` env var | No | Region where the Cloud Run job is deployed |
| `cloud_run_service_account` | `""` | No | Service account for the Cloud Run job |
| `cloud_run_async` | `true` | No | Run the job asynchronously (Airflow only) |
| `cloud_run_async_poke_interval` | `10` | No | Polling interval in seconds when async (Airflow only) |
| `retry_on_failure` | `false` | No | Retry the Cloud Run job on failure |
| `retry_delay_in_seconds` | `10` | No | Delay in seconds before retrying |

#### Airflow Cloud Run example

```yaml
dag:
  template: "load/airflow_scheduled_table_cloud_run.py.j2"
  options:
    cloud_run_project_id: "my-project"
    cloud_run_job_name: "starlake-transform"
    cloud_run_job_region: "europe-west1"
    cloud_run_async: "true"
```

When asynchronous execution is enabled (the default), the job is submitted and a completion sensor polls every `cloud_run_async_poke_interval` seconds until it finishes. When synchronous execution is used, the task blocks until the Cloud Run job completes.

### GCP Dataproc

Submits starlake commands as Spark jobs on a Dataproc cluster.

Applies to: `StarlakeAirflowDataprocJob`, `StarlakeDagsterDataprocJob`

#### Cluster options

| Option | Default | Required | Description |
|---|---|---|---|
| `cluster_config_name` | DAG filename (lowercased) | No | Cluster configuration name identifier |
| `dataproc_project_id` | `$GCP_PROJECT` env var | No | GCP project ID |
| `dataproc_region` | `europe-west1` | No | GCP region for the Dataproc cluster |
| `dataproc_subnet` | `default` | No | VPC subnet for the cluster |
| `dataproc_service_account` | Auto-generated from project ID | No | Service account for the cluster |
| `dataproc_image_version` | `2.2-debian12` | No | Dataproc image version |
| `dataproc_name` | `dataproc-cluster` | No | Name of the Dataproc cluster |
| `dataproc_idle_delete_ttl` | `3600` | No | TTL in seconds before an idle cluster is deleted |
| `dataproc_cluster_metadata` | `{}` | No | Cluster metadata as JSON |

#### Master node options

| Option | Default | Description |
|---|---|---|
| `dataproc_master_machine_type` | `n1-standard-4` | Machine type for the master node |
| `dataproc_master_disk_type` | `pd-standard` | Disk type for the master node |
| `dataproc_master_disk_size` | `1024` | Disk size in GB for the master node |

#### Worker node options

| Option | Default | Description |
|---|---|---|
| `dataproc_num_workers` | `4` | Number of worker instances |
| `dataproc_worker_machine_type` | `n1-standard-4` | Machine type for worker nodes |
| `dataproc_worker_disk_type` | `pd-standard` | Disk type for worker nodes |
| `dataproc_worker_disk_size` | `1024` | Disk size in GB for worker nodes |

#### Spark options

| Option | Default | Description |
|---|---|---|
| `spark_jar_list` | (none) | Comma-separated list of JAR files to include |
| `spark_job_main_class` | `ai.starlake.job.Main` | Main class for the Spark job |
| `spark_bucket` | (none) | GCS bucket for Spark event logs and temporary storage |
| `spark_executor_memory` | (none) | Spark executor memory (e.g., `11g`) |
| `spark_executor_cores` | (none) | Spark executor cores (e.g., `4`) |
| `spark_executor_instances` | (none) | Number of Spark executor instances |
| `spark_config_name` | `{domain}.{table}` or `{transform_name}` | Spark config identifier for per-task configuration |

#### Starlake environment options for Dataproc

These are set as Spark properties on the cluster:

| Option | Default | Description |
|---|---|---|
| `SL_HIVE` | `false` | Enable Hive support |
| `SL_GROUPED` | `true` | Enable grouped execution |
| `SL_AUDIT_SINK_TYPE` | `BigQuerySink` | Audit sink type |
| `SL_SINK_REPLAY_TO_FILE` | `false` | Replay to file (disabled for performance) |
| `SL_MERGE_OPTIMIZE_PARTITION_WRITE` | `true` | Optimize partition writes |
| `SL_SPARK_SQL_SOURCES_PARTITION_OVERWRITE_MODE` | `dynamic` | Partition overwrite mode |

#### Dataproc example

```yaml
dag:
  template: "load/airflow_scheduled_table_dataproc.py.j2"
  options:
    dataproc_project_id: "my-project"
    dataproc_region: "europe-west1"
    dataproc_master_machine_type: "n1-standard-8"
    dataproc_num_workers: "4"
    spark_jar_list: "gs://my-bucket/jars/starlake.jar"
    spark_bucket: "my-spark-bucket"
```

### AWS Fargate

Executes starlake commands as ECS tasks on AWS Fargate.

Applies to: `StarlakeAirflowFargateJob`, `StarlakeDagsterFargateJob`

| Option | Default | Required | Description |
|---|---|---|---|
| `aws_conn_id` | `aws_default` | No | Airflow connection ID for AWS |
| `aws_profile` | `default` | No | AWS profile name |
| `aws_region` | `eu-west-3` | No | AWS region |
| `aws_cluster_name` | (none) | Yes | ECS cluster name |
| `aws_task_definition_name` | (none) | Yes | ECS task definition name |
| `aws_task_definition_container_name` | (none) | Yes | Container name in the task definition |
| `aws_task_private_subnets` | `[]` | Yes | JSON array of private subnet IDs |
| `aws_task_security_groups` | `[]` | Yes | JSON array of security group IDs |
| `cpu` | `1024` | No | CPU units for the container override |
| `memory` | `2048` | No | Memory in MB for the container override |
| `AWS_SDK` | `/usr/local/aws-cli` | No | Path to the AWS SDK |
| `fargate_async_poke_interval` | `30` | No | Polling interval in seconds for task completion |
| `retry_on_failure` | `false` | No | Retry the Fargate task on failure |

#### Fargate example

```yaml
dag:
  template: "load/airflow_scheduled_table_fargate.py.j2"
  options:
    aws_cluster_name: "starlake-cluster"
    aws_task_definition_name: "starlake-transform"
    aws_task_definition_container_name: "starlake"
    aws_task_private_subnets: '["subnet-abc123", "subnet-def456"]'
    aws_task_security_groups: '["sg-abc123"]'
    aws_region: "eu-west-1"
    cpu: "2048"
    memory: "4096"
```

### Snowflake Tasks (SQL)

Executes starlake commands as native Snowflake stored procedures within Snowflake DAGs.

Applies to: `StarlakeSnowflakeJob`

| Option | Default | Required | Description |
|---|---|---|---|
| `stage_location` | (none) | Yes | Snowflake stage location for stored procedure code (e.g., `@my_stage/path`) |
| `warehouse` | (none) | No | Snowflake warehouse name |
| `packages` | `croniter,python-dateutil` | No | Comma-separated list of Python packages available in stored procedures |
| `sl_incoming_file_stage` | (none) | No | Snowflake stage for incoming files (required for load tasks) |
| `allow_overlapping_execution` | `false` | No | Allow overlapping DAG executions for backfill support |

#### Snowflake example

```yaml
dag:
  template: "transform/snowflake_scheduled_transform_sql.py.j2"
  options:
    stage_location: "@starlake_stage/code"
    warehouse: "COMPUTE_WH"
    packages: "croniter,python-dateutil,requests"
    allow_overlapping_execution: "true"
```

## Complete examples

### Airflow + Bash (on-premise)

`metadata/dags/airflow_bash_load.sl.yml`:

```yaml
dag:
  comment: "Load all tables using bash on Airflow"
  template: "load/airflow_scheduled_table_bash.py.j2"
  filename: "airflow_all_tables.py"
  options:
    sl_env_var: '{"SL_ROOT": "/opt/starlake", "SL_DATASETS": "/opt/starlake/datasets"}'
    SL_STARLAKE_PATH: "/usr/local/bin/starlake"
    pre_load_strategy: "imported"
    retries: "2"
    retry_delay: "60"
    tags: "starlake load"
    default_pool: "starlake_pool"
```
`metadata/dags/airflow_bash_transform.sl.yml`:

```yaml
dag:
  comment: "Run all transforms using bash on Airflow"
  template: "transform/airflow_scheduled_task_bash.py.j2"
  filename: "airflow_all_tasks.py"
  options:
    sl_env_var: '{"SL_ROOT": "/opt/starlake", "SL_DATASETS": "/opt/starlake/datasets"}'
    SL_STARLAKE_PATH: "/usr/local/bin/starlake"
    run_dependencies_first: "true"
    retries: "2"
    retry_delay: "60"
    tags: "starlake transform"
```

### Airflow + Cloud Run (GCP)

`metadata/dags/airflow_cloud_run_load.sl.yml`:

```yaml
dag:
  comment: "Load tables using Cloud Run on Airflow"
  template: "load/airflow_scheduled_table_cloud_run.py.j2"
  filename: "{{domain}}_cloud_run_load.py"
  options:
    sl_env_var: '{"SL_ROOT": "gs://my-bucket/starlake"}'
    cloud_run_project_id: "my-project"
    cloud_run_job_name: "starlake-load"
    cloud_run_job_region: "europe-west1"
    cloud_run_async: "true"
    pre_load_strategy: "imported"
    tags: "starlake cloud_run load"
```

### Airflow + Dataproc (GCP)

`metadata/dags/airflow_dataproc_transform.sl.yml`:

```yaml
dag:
  comment: "Run transforms on Dataproc via Airflow"
  template: "transform/airflow_scheduled_task_dataproc.py.j2"
  filename: "{{domain}}_dataproc_tasks.py"
  options:
    dataproc_project_id: "my-project"
    dataproc_region: "europe-west1"
    dataproc_master_machine_type: "n1-standard-8"
    dataproc_num_workers: "4"
    spark_jar_list: "gs://my-bucket/jars/starlake.jar"
    spark_bucket: "my-spark-bucket"
    run_dependencies_first: "true"
    tags: "starlake dataproc transform"
```

### Airflow + Fargate (AWS)

`metadata/dags/airflow_fargate_transform.sl.yml`:

```yaml
dag:
  comment: "Run transforms on Fargate via Airflow"
  template: "transform/airflow_scheduled_task_fargate.py.j2"
  filename: "{{domain}}_fargate_tasks.py"
  options:
    aws_cluster_name: "starlake-cluster"
    aws_task_definition_name: "starlake-transform"
    aws_task_definition_container_name: "starlake"
    aws_task_private_subnets: '["subnet-abc123"]'
    aws_task_security_groups: '["sg-abc123"]'
    aws_region: "eu-west-1"
    run_dependencies_first: "true"
    tags: "starlake fargate transform"
```

### Dagster + Shell (on-premise)

`metadata/dags/dagster_shell_load.sl.yml`:

```yaml
dag:
  comment: "Load tables using shell on Dagster"
  template: "load/dagster_scheduled_table_shell.py.j2"
  filename: "dagster_all_load.py"
  options:
    sl_env_var: '{"SL_ROOT": "/opt/starlake"}'
    SL_STARLAKE_PATH: "/usr/local/bin/starlake"
    pre_load_strategy: "imported"
```

### Dagster + Cloud Run (GCP)

`metadata/dags/dagster_cloud_run_transform.sl.yml`:

```yaml
dag:
  comment: "Run transforms using Cloud Run on Dagster"
  template: "transform/dagster_scheduled_task_cloud_run.py.j2"
  filename: "dagster_all_tasks.py"
  options:
    sl_env_var: '{"SL_ROOT": "gs://my-bucket/starlake"}'
    cloud_run_project_id: "my-project"
    cloud_run_job_name: "starlake-transform"
    cloud_run_job_region: "europe-west1"
    run_dependencies_first: "true"
```

### Snowflake Tasks

`metadata/dags/snowflake_transform.sl.yml`:

```yaml
dag:
  comment: "Run transforms as Snowflake Tasks"
  template: "transform/snowflake_scheduled_transform_sql.py.j2"
  filename: "snowflake_{{domain}}_tasks.py"
  options:
    sl_env_var: '{"SL_ROOT": ".", "SL_ENV": "SNOWFLAKE"}'
    stage_location: "@starlake_stage/code"
    warehouse: "COMPUTE_WH"
    packages: "croniter,python-dateutil"
    run_dependencies_first: "true"
    allow_overlapping_execution: "true"
```
`metadata/dags/snowflake_load.sl.yml`:

```yaml
dag:
  comment: "Load tables as Snowflake Tasks"
  template: "load/snowflake_load_sql.py.j2"
  filename: "snowflake_{{domain}}_{{table}}.py"
  options:
    sl_env_var: '{"SL_ROOT": ".", "SL_ENV": "SNOWFLAKE"}'
    stage_location: "@starlake_stage/code"
    warehouse: "COMPUTE_WH"
    sl_incoming_file_stage: "@incoming_stage"
```