SQL Transforms

Starlake SQL transforms let you materialize any SELECT statement into a target table using a configurable write strategy. This page covers SELECT syntax, column documentation with attributesDesc, custom MERGE/INSERT via parseSQL: false, and incremental modelling with the sl_start_date / sl_end_date environment variables. All SQL is standard -- no Jinja templating is involved.

Variable substitution syntax

The {{variable}} syntax used in SQL files (e.g. {{sl_start_date}}) is Starlake's environment variable substitution, not Jinja templating. These variables are resolved at runtime before the SQL is sent to the database engine. You can use any environment variable with this syntax.
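As an illustration, assuming an environment variable named `region` is defined in your environment files (the variable name and value here are hypothetical), a filter written with this syntax is resolved before the query reaches the engine:

```sql
-- 'region' is a hypothetical environment variable defined in your env files
SELECT order_id, amount
FROM sales.orders
WHERE country = '{{region}}'
-- resolved at runtime to e.g.: WHERE country = 'EU'
```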

SELECT Statement Materialization

A SQL transform is a SELECT statement that Starlake materializes into a target table using the write strategy defined in the companion YAML file. The SELECT can be as simple or as complex as needed -- single table queries, multi-table joins, subqueries, and window functions all work.

Schema matching rules:

  • If the target table exists, the SELECT result must match its schema.
  • If the target table does not exist, Starlake infers the schema (column names and types) from the SELECT result.
  • Column documentation from source tables propagates automatically to the target.
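As a minimal sketch, a transform pairs a SELECT with a YAML task that names the write strategy. The table names, the inline `sql` key, and the `OVERWRITE` strategy value below are assumptions to verify against your Starlake version's reference:

```yaml
# metadata/transform/sales/top_products.sl.yml -- illustrative sketch
task:
  # The SQL may also live in a companion top_products.sql file next to this YAML
  sql: |
    SELECT product_id, SUM(amount) AS total_revenue
    FROM sales.transactions
    GROUP BY product_id
  writeStrategy:
    type: OVERWRITE   # assumed strategy name; check the write-strategy reference
```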

Document Calculated Columns

Starlake infers column metadata from source tables. For calculated columns that do not exist in any source, add documentation using attributesDesc in the YAML configuration:

metadata/transform/<domain>/<table>.sl.yml
task:
  ...
  attributesDesc:
    - name: <column_name>
      comment: <column_description>
  ...

This metadata appears in the data catalog and helps downstream consumers understand derived fields.
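For instance, assuming a transform computes a daily_revenue column that exists in no source table (the column name and wording are illustrative), the documentation could look like:

```yaml
task:
  attributesDesc:
    - name: daily_revenue   # calculated column, absent from all sources
      comment: "Total sales amount aggregated per product and per day"
```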

Custom Write Strategy with parseSQL

If none of the predefined write strategies fits your use case, write your own MERGE, INSERT, or UPDATE statement directly in the SQL file. Set parseSQL to false so Starlake executes the SQL as-is instead of wrapping the SELECT in a materialization statement:

metadata/transform/<domain>/<table>.sl.yml
task:
  ...
  parseSQL: false
  ...

With parseSQL: false, you have full control over the DML statement. Starlake does not modify or wrap the SQL.
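As a sketch, with parseSQL: false the SQL file can hold a complete MERGE statement. The table and column names below are illustrative, and the exact MERGE dialect depends on your target engine:

```sql
-- metadata/transform/sales/product_totals.sql (illustrative; parseSQL: false in the YAML)
MERGE INTO sales.product_totals AS target
USING (
  SELECT product_id, SUM(amount) AS total_revenue
  FROM sales.transactions
  GROUP BY product_id
) AS source
ON target.product_id = source.product_id
WHEN MATCHED THEN
  UPDATE SET total_revenue = source.total_revenue
WHEN NOT MATCHED THEN
  INSERT (product_id, total_revenue) VALUES (source.product_id, source.total_revenue);
```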

Incremental Modelling

When data arrives periodically, you often need to process only the new data for each interval. Starlake supports incremental transforms through two reserved environment variables:

  • sl_start_date -- start of the current processing interval
  • sl_end_date -- end of the current processing interval

Use these variables in a WHERE clause to filter input data:

metadata/transform/<domain>/daily_sales.sql
SELECT
  product_id,
  SUM(amount) AS daily_revenue,
  COUNT(*) AS transaction_count
FROM
  sales.transactions
WHERE
  transaction_date >= '{{sl_start_date}}'
  AND transaction_date < '{{sl_end_date}}'
GROUP BY
  product_id

The orchestrator (Airflow, Dagster, or Snowflake Tasks) passes the correct values for each execution interval. The computed results are appended to the existing table, and because only the current interval's input is scanned, the volume of data processed on each run is greatly reduced.

Handle Missed Intervals with Catchup

If an incremental job fails or is skipped for a given interval, Starlake integrates with Airflow's catchup mechanism. Add the catchup flag to your YAML DAG definition, and the orchestrator replays the missed intervals with the correct sl_start_date and sl_end_date values.
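The flag might be set as follows. The exact key name and its location depend on your Starlake version's DAG template, so treat this snippet as an assumption to verify:

```yaml
dag:
  comment: "Daily incremental load"
  options:
    catchup: true   # assumed option name; when enabled, Airflow replays missed intervals
```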

Frequently Asked Questions

Does the SELECT result have to match the target table schema?

If the target table already exists, the SELECT result must match its schema. Otherwise, Starlake automatically infers the schema from the result.

How do you document calculated columns in a SQL transform?

Use the attributesDesc property in the transform YAML file to add a name and comment for each calculated column.

Can you write custom MERGE or INSERT instead of a SELECT?

Yes. Set parseSQL: false in the YAML configuration. Starlake executes the SQL as-is without converting it into a write statement.

What are the sl_start_date and sl_end_date variables?

They are reserved environment variables representing the current time interval. The orchestrator (Airflow, Dagster, or Snowflake Tasks) passes them automatically for incremental executions.

What happens if an incremental job fails for a day?

Starlake integrates with Airflow's Catchup mechanism. Missed intervals are automatically replayed with the correct sl_start_date and sl_end_date values.

Does Starlake support Jinja templating in SQL files?

No. Starlake SQL files use standard SQL. Environment variables are injected directly without a Jinja layer.

How does schema inference work when the target table does not exist?

Starlake executes the SELECT and infers the schema (types, column names) from the result. Source column documentation is automatically propagated.