Transformation Capabilities for SQL

1. SELECT Materialization

A transform is a standard SQL SELECT statement stored in a .sql file. Starlake materializes the query result into the target table. If the target table exists, the result schema must match; if it does not exist, Starlake infers the schema from the result.

SELECT
  o.order_id,
  SUM(ol.quantity * ol.sale_price) AS total_revenue
FROM
  starbake.orders o
  JOIN starbake.order_lines ol ON o.order_id = ol.order_id
GROUP BY
  o.order_id

2. Custom DML Statements

When parseSQL is set to false, Starlake executes the SQL as-is without conversion. This allows MERGE, INSERT, UPDATE, and other DML statements to be used directly.

task:
  parseSQL: false
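With parsing disabled, the .sql file can contain any DML the target engine accepts. As an illustrative sketch (the table and column names here are hypothetical, not from a real project):

```sql
-- Executed verbatim because parseSQL is false
MERGE INTO kpi.revenue_summary AS t
USING staging.daily_revenue AS s
  ON t.order_id = s.order_id
WHEN MATCHED THEN
  UPDATE SET t.total_revenue = s.total_revenue
WHEN NOT MATCHED THEN
  INSERT (order_id, total_revenue)
  VALUES (s.order_id, s.total_revenue)
```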

3. Incremental Modelling

Two reserved variables support time-based incremental processing:

  • {{sl_start_date}} — start of the current processing interval.
  • {{sl_end_date}} — end of the current processing interval.
SELECT
  product_id,
  SUM(amount) AS daily_revenue
FROM sales.transactions
WHERE
  transaction_date >= '{{sl_start_date}}'
  AND transaction_date < '{{sl_end_date}}'
GROUP BY product_id

A catchup mechanism in the orchestrator (Airflow, Dagster, Snowflake Tasks) replays missed intervals with the correct variable values.

4. Variable Substitution

Any environment variable can be referenced with {{variable}} syntax. Variables are resolved before the SQL is sent to the database engine. This is direct substitution, not Jinja templating.

Built-in variables include: SL_CURRENT_TIMESTAMP, SL_CURRENT_DATE, SL_CURRENT_TIME.
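For example, a built-in variable can stamp the run date into the result. The substitution happens before the query reaches the engine, so the engine only ever sees a literal:

```sql
SELECT
  product_id,
  '{{SL_CURRENT_DATE}}' AS load_date  -- replaced with the actual date before execution
FROM sales.transactions
```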

5. SQL Dialect Transpilation

Starlake can transpile SQL across database dialects, allowing portable transforms between engines (BigQuery, Snowflake, Spark, DuckDB, JDBC databases).
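As a sketch of the idea (column names are illustrative and the exact rendering may differ), a date-difference expression written in BigQuery dialect could be rewritten for Snowflake:

```sql
-- BigQuery dialect, as written in the transform
SELECT TIMESTAMP_DIFF(delivered_at, ordered_at, DAY) AS lead_time
FROM starbake.orders;

-- A possible Snowflake rendering after transpilation
SELECT DATEDIFF(DAY, ordered_at, delivered_at) AS lead_time
FROM starbake.orders;
```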

6. Pre/Post SQL Hooks

SQL statements can be executed before and after the main transform via presql and postsql.

task:
  presql:
    - "TRUNCATE TABLE staging.summary"
  postsql:
    - "ANALYZE TABLE kpi.revenue_summary COMPUTE STATISTICS"

7. Automatic Dependency Detection

Starlake parses FROM and JOIN clauses to build a directed acyclic graph (DAG) of table dependencies. Upstream transforms always run before downstream ones.

# View dependency graph
starlake lineage --task kpi.order_summary --print

# Generate SVG visualization (requires GraphViz)
starlake lineage --task kpi.order_summary --svg --output lineage.svg

8. Recursive Execution

Running a transform with --recursive automatically executes all upstream dependencies first.

starlake transform --recursive --name kpi.order_summary

9. Write Strategies

The writeStrategy.type property controls how results are written to the target table:

| Strategy | Description |
|---|---|
| APPEND | Insert all new rows (default) |
| OVERWRITE | Replace all existing rows |
| UPSERT_BY_KEY | Merge by key (update existing, insert new) |
| UPSERT_BY_KEY_AND_TIMESTAMP | Merge by key, update only if incoming timestamp is newer |
| OVERWRITE_BY_PARTITION | Overwrite only partitions present in the result |
| DELETE_THEN_INSERT | Delete matching keys, then insert all rows |
| SCD2 | Slowly Changing Dimension Type 2 (preserve history) |

Additional write strategy properties:

  • key — column(s) forming the merge/upsert key.
  • timestamp — timestamp column for ordering (required for UPSERT_BY_KEY_AND_TIMESTAMP, optional for SCD2).
  • startTs / endTs — SCD2 effective date range columns.
  • queryFilter — SQL WHERE clause for merge source selection. Supports latest (the most recent BigQuery partition) and in last(N) (a range of recent partitions).
  • on — merge condition target: TARGET or SOURCE_AND_TARGET.
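Putting these properties together, a timestamp-ordered upsert might be configured as follows (the column names are illustrative, and the exact YAML shape should be checked against the reference):

```yaml
task:
  writeStrategy:
    type: UPSERT_BY_KEY_AND_TIMESTAMP
    key: [order_id]        # merge key column(s)
    timestamp: updated_at  # only newer incoming rows overwrite existing ones
```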

10. Materialization Types

The transform output can be materialized as different object types:

| Type | Description |
|---|---|
| TABLE | Physical table |
| VIEW | SQL view |
| MATERIALIZED_VIEW | Precomputed materialized view |

11. Cross-Database Reads and Writes

  • connectionRef at task level sets the source database.
  • sink.connectionRef sets the target database.
  • A single transform can read from one database and write to another.
task:
  connectionRef: source_db
  sink:
    connectionRef: target_db

12. Partitioning and Clustering

task:
  sink:
    partition:
      - column1
      - column2
    clustering:
      - column3
      - column4

13. Schema and Table Override

By default, the directory name becomes the target schema and the file name becomes the target table. Both can be overridden.

task:
  domain: custom_schema
  table: custom_table

14. Column Documentation

Calculated or derived columns can be documented directly in the YAML configuration. Column-level access policies are also supported.

task:
  attributesDesc:
    - name: "total_revenue"
      comment: "Sum of quantity * sale_price per order"
      accessPolicy: "PII"

15. Access Control

Table-level ACL and row-level security (RLS) can be applied to transform outputs.

task:
  acl:
    - role: SELECT
      grants:
        - user:[email protected]
        - group:[email protected]
        - serviceAccount:[email protected]
  rls:
    - name: "USA only"
      predicate: "country = 'USA'"
      grants:
        - "group:us-team"

16. Post-Transform Expectations

Data quality assertions are evaluated after the transform completes. See the full Expectations reference for all 53 built-in macros covering completeness, validity, volume, schema, uniqueness, and numeric checks.
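As a purely illustrative sketch of the shape such a check might take (the macro name here is hypothetical; the actual macro names are listed in the Expectations reference):

```yaml
task:
  expectations:
    - expect: "is_unique('order_id')"  # hypothetical macro name
```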

17. Export to Files

Transform results can be exported to files (CSV, JSON, Parquet, Avro, ORC, XLS) instead of — or in addition to — a database table. The path is relative to the root directory defined in application.sl.yml. Cloud storage paths (GCS, S3, ADLS) are supported.

task:
  sink:
    format: csv
    extension: csv
    path: mnt/data/output.csv
    coalesce: true

18. DAG Generation and Scheduling

Starlake generates orchestration DAGs from the dependency graph for Airflow, Dagster, or Snowflake Tasks.

starlake dag-generate

Transforms can also define a cron schedule and a dagRef to trigger downstream DAGs after completion.

task:
  schedule: "0 2 * * *"
  dagRef: "downstream_pipeline"

19. Task Timeout

A taskTimeoutMs property sets the maximum execution time in milliseconds.

task:
  taskTimeoutMs: 3600000

20. Freshness Monitoring

The freshness property defines staleness thresholds for the output data.

task:
  freshness:
    warn: "6h"
    error: "1d"

21. Primary Key Declaration

The primaryKey property declares the output table's primary key columns.

task:
  primaryKey:
    - "order_id"

22. Attribute Sync Strategy

The syncStrategy property controls how the target table schema evolves:

| Strategy | Description |
|---|---|
| NONE | No synchronization |
| ADD | Add new attributes from source (default) |
| ALL | Sync all attributes (add/remove to match source) |
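For instance, to keep the target table's columns fully in step with the transform output (a minimal sketch using the property named above):

```yaml
task:
  syncStrategy: ALL  # add and remove columns to match the source schema
```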

Summary

| Capability | Category |
|---|---|
| SELECT materialization with schema inference | Core |
| Custom DML via parseSQL: false | Core |
| Incremental modelling (sl_start_date / sl_end_date) | Core |
| Variable substitution ({{variable}}) | Core |
| SQL dialect transpilation | Core |
| Pre/Post SQL hooks (presql / postsql) | Core |
| Automatic dependency detection (FROM / JOIN parsing) | Orchestration |
| Recursive execution (--recursive) | Orchestration |
| DAG generation (Airflow, Dagster, Snowflake Tasks) | Orchestration |
| Scheduling (schedule, dagRef) | Orchestration |
| Task timeout (taskTimeoutMs) | Orchestration |
| Write strategies (APPEND, OVERWRITE, UPSERT, SCD2, etc.) | Write |
| Materialization types (TABLE, VIEW, MATERIALIZED_VIEW) | Write |
| Cross-database reads and writes (connectionRef) | Write |
| Partitioning and clustering | Write |
| Export to files (CSV, JSON, Parquet, Avro, ORC, XLS) | Export |
| Schema and table name override | Configuration |
| Column documentation (attributesDesc) | Configuration |
| Attribute sync strategy (syncStrategy) | Configuration |
| Primary key declaration | Configuration |
| Access control (ACL, RLS) | Security |
| Column-level access policies (accessPolicy) | Security |
| Post-transform expectations | Data Quality |
| Freshness monitoring | Data Quality |