

note

Starlake is not an orchestrator. It generates DAGs for you to run on your orchestrator of choice.

Now that our load and transform are working, we can run them on our orchestrator.

Starlake can generate DAGs for different orchestrators like Airflow, Dagster, and Snowflake Tasks.

Prerequisites

Make sure you have run the Transform step first so that the data is already in the database.

On Snowflake, you need to log in with a role that has the CREATE TASK and USAGE privileges.

Running the DAG

Using the starlake dag-generate command, we can generate the DAG files that will run our load and transform tasks.

starlake dag-generate --clean

This will generate your DAG files in the root of the dags/generated directory.
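
To check what was generated, you can list that directory. The file names below assume the default configuration shown in the Configuration section of this tutorial:

ls dags/generated/
# airflow_all_tables.py   (load DAG, filename set in metadata/dags/airflow_scheduled_table_bash.sl.yml)
# airflow_all_tasks.py    (transform DAG, filename set in metadata/dags/airflow_scheduled_task_bash.sl.yml)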

You may also 'dry-run' the DAGs to see if they are working as expected.

python -m ai.starlake.orchestration --file {{file}} dry-run

where {{file}} is the path to the DAG file you want to run (dags/generated in our example).
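
For example, to dry-run the generated load DAG (assuming the default airflow_all_tables.py filename from the configuration shown later in this tutorial):

python -m ai.starlake.orchestration --file dags/generated/airflow_all_tables.py dry-run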

note
  • In Snowflake: This will display all the SQL commands that will be run.
  • In Airflow: This will run the DAG in test mode.
  • In Dagster: This will run the DAG in local mode.

We are now ready to deploy the DAGs directly on our orchestrator.

python -m ai.starlake.orchestration --file {{file}} deploy

This will deploy the DAGs to your orchestrator.
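
For example, still assuming the default generated file names:

python -m ai.starlake.orchestration --file dags/generated/airflow_all_tables.py deploy
python -m ai.starlake.orchestration --file dags/generated/airflow_all_tasks.py deploy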

Backfilling

You can backfill the DAGs to run them on a specific date range.

python -m ai.starlake.orchestration \
  --file {{file}} backfill \
  --start-date {{start_date}} \
  --end-date {{end_date}}

where {{start_date}} and {{end_date}} are the start and end dates of the range over which you want to run the DAGs.
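
For example, to backfill the transform DAG over the first week of January 2024 (the file name and dates are illustrative; adjust the date format to what your orchestrator expects):

python -m ai.starlake.orchestration \
  --file dags/generated/airflow_all_tasks.py backfill \
  --start-date 2024-01-01 \
  --end-date 2024-01-07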

Configuration

DAG generation is driven by the configuration files located in the metadata/dags directory. Put the configuration files for the DAGs you want to generate there, then reference them either globally in the metadata/application.sl.yml file or individually for each load or transform task through the dagRef attribute.
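
For illustration, a global reference in metadata/application.sl.yml could look like the sketch below. Only the dagRef attribute itself is documented above; the exact nesting of the keys and the convention of referencing a configuration by its file name (without the .sl.yml extension) are assumptions to check against your Starlake version.

application:
  dagRef:
    load: "airflow_scheduled_table_bash"      # assumed: default DAG configuration for loads
    transform: "airflow_scheduled_task_bash"  # assumed: default DAG configuration for transforms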

Load configuration: metadata/dags/airflow_scheduled_table_bash.sl.yml
dag:
  comment: "default Airflow DAG configuration for load"
  template: "load/airflow_scheduled_table_bash.py.j2"
  filename: "airflow_all_tables.py"
  options:
    pre_load_strategy: "imported"

Transform configuration: metadata/dags/airflow_scheduled_task_bash.sl.yml
dag:
  comment: "default Airflow DAG configuration for transform"
  template: "transform/airflow_scheduled_task_bash.py.j2"
  filename: "airflow_all_tasks.py"
  options:
    load_dependencies: "true"

pre_load_strategy (none by default): an optional preload strategy used to conditionally load a domain. Valid values are imported, ack, pending, or none.

  • imported: Loads files from the stage directory.
  • ack: Loads files only if an acknowledgment (ack) file is present in the stage directory.
  • pending: Loads files from the pending directory.
  • none: Disables preloading.

The preload strategy acts as a lightweight pre-check. It simply checks for the presence of files based on the selected strategy. If no matching files are found, the process exits silently—without raising an error or consuming additional resources.
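
For example, to load a domain only when an acknowledgment file is present, the load configuration shown above could use the ack strategy instead:

dag:
  comment: "Airflow DAG configuration loading only when an ack file is present"
  template: "load/airflow_scheduled_table_bash.py.j2"
  filename: "airflow_all_tables.py"
  options:
    pre_load_strategy: "ack"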

load_dependencies: when set to "true", this flag tells the DAG generator to also include a transformation's dependent tables or tasks in the generated DAG file.

template is the template file that will be used to generate the DAG file. It may reference:

  • an absolute path on the filesystem
  • a relative path to the metadata/dags/templates/ directory
  • a built-in template shipped with the starlake library, located in the load and transform resource directories.

The load and transform tasks will be run as bash commands on the orchestrator. To run them on a different executor, you can switch to a different template or build your own; Starlake comes with several templates out of the box.
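
For instance, to generate the transform DAG from your own template stored under metadata/dags/templates/, point the template attribute at it by relative path (my_custom_task.py.j2 is a hypothetical file name):

dag:
  comment: "transform DAG generated from a custom template"
  template: "my_custom_task.py.j2"
  filename: "airflow_all_tasks.py"
  options:
    load_dependencies: "true"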

On Snowflake, the tasks will be run as native Snowpark tasks.