Python Transforms

Starlake supports Python transforms for operations that go beyond what SQL can express -- machine learning inference, text processing, external API calls, or complex algorithmic logic. Python transforms require a Spark runtime (standalone Spark, Databricks, or EMR). They do not work with DuckDB. The YAML configuration format is identical to SQL transforms: only the source file extension changes from .sql to .py.

Python vs SQL Transforms

| Aspect | SQL Transform | Python Transform |
|---|---|---|
| Source file | .sql | .py |
| Runtime | DuckDB, BigQuery, Snowflake, Spark, Databricks | Spark, Databricks, EMR only |
| Use case | Standard queries, joins, aggregations | ML, text processing, API calls |
| Output mechanism | SELECT result | SL_THIS temporary view |
| YAML configuration | Same format | Same format |

Use SQL transforms for standard data operations. Choose Python when the logic cannot be expressed in SQL.

How to Write a Python Transform

  1. Create a Python file -- Add a .py file in the metadata/transform/<domain>/ directory. The file name determines the task name.
  2. Write the PySpark logic -- Use SparkSession to read source tables, apply transformations, and produce a result DataFrame.
  3. Register the result as SL_THIS -- Call df.createOrReplaceTempView("SL_THIS") on your result DataFrame. Starlake reads this view to materialize the output table.
  4. Add an optional YAML configuration -- Create a .sl.yml file with the same name to set write strategy, partitioning, or expectations.
  5. Run the transform -- Execute starlake transform --name <domain>.<task> with optional --options key1=value1,key2=value2 to pass parameters.
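The optional YAML file in step 4 uses the same format as a SQL transform's configuration. A minimal sketch (the key names follow Starlake's transform YAML conventions; the strategy value shown is illustrative):

```yaml
# metadata/transform/analytics/sales_by_category.sl.yml
# Illustrative configuration -- check the Starlake transform YAML
# reference for the full set of supported keys.
task:
  writeStrategy:
    type: OVERWRITE
```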

Directory Structure

metadata/transform
└── <domain>
    ├── <task>.py
    └── <task>.sl.yml (optional)

Pass Arguments to the Python Script

Arguments specified with the --options flag are forwarded as command-line arguments:

starlake transform --name <domain>.<transform_name> --options key1=value1,key2=value2

Starlake converts this into:

<transform_name>.py --key1 value1 --key2 value2

Parse these arguments in your script using argparse or sys.argv.
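For example, with argparse the forwarded options can be read back as named arguments (a minimal sketch; `key1` and `key2` match the `--options` example above and are otherwise arbitrary):

```python
import argparse

parser = argparse.ArgumentParser(description="Parse options forwarded by Starlake")
parser.add_argument("--key1")
parser.add_argument("--key2")

# In a real transform, parse_args() reads sys.argv. Here we simulate the
# command line Starlake builds from --options key1=value1,key2=value2.
args = parser.parse_args(["--key1", "value1", "--key2", "value2"])

print(args.key1)  # "value1"
print(args.key2)  # "value2"
```

Unrecognized options would make `parse_args` fail; use `parser.parse_known_args()` instead if Starlake may forward options your script does not declare.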

Register the Output as SL_THIS

Before the script exits, it must create a temporary view named SL_THIS. Starlake reads this view and materializes it into the target table using the configured write strategy.

note

The SL_THIS temporary view is mandatory. Starlake cannot materialize the output without it.

Example: Aggregate Sales Data with PySpark

metadata/transform/analytics/sales_by_category.py
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()

    # Read source tables
    products = spark.table("starbake.products")
    order_lines = spark.table("starbake.order_lines")

    # Aggregate revenue by product category
    result = (
        order_lines
        .join(products, "product_id")
        .groupBy("category")
        .agg(
            F.sum(F.col("quantity") * F.col("sale_price")).alias("total_revenue"),
            F.sum("quantity").alias("total_units_sold"),
        )
    )

    # Register as SL_THIS for Starlake to materialize
    result.createOrReplaceTempView("SL_THIS")

Schema Matching

The same rules apply as for SQL transforms:

  • If the target table exists, the DataFrame schema must match.
  • If the target table does not exist, Starlake infers the schema from the DataFrame.
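Continuing the sales example above, one way to satisfy the first rule is to select columns in the target order and cast them explicitly before registering SL_THIS (a fragment, not a standalone script; the column types shown are assumptions about the target table):

```python
# Assumes `result` and `F` from the example above.
# Select columns in the target table's order and cast to its types
# (string/double/long here are illustrative).
result = result.select(
    F.col("category").cast("string"),
    F.col("total_revenue").cast("double"),
    F.col("total_units_sold").cast("long"),
)
result.createOrReplaceTempView("SL_THIS")
```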

Frequently Asked Questions

When should you use a Python transform instead of SQL in Starlake?

Python transforms are useful for complex operations that are difficult to express in SQL, such as machine learning, text transformations, or API calls. They require a Spark or Databricks runtime.

How do you pass parameters to a Python script in Starlake?

Use the --options key1=value1,key2=value2 flag in the starlake transform command. The parameters are passed as command-line arguments to the Python script.

What is SL_THIS in a Starlake Python transform?

SL_THIS is the name of the temporary view that the Python script must create before returning. Starlake reads this view to materialize the result into the target table.

Do Starlake Python transforms work with DuckDB?

No. Python transforms require a Spark runtime (standalone, Databricks, or EMR). For DuckDB, use SQL transforms.

Can you use the same YAML file for a Python and SQL transform?

Yes. The YAML format is identical. Only the source file changes: .sql for SQL, .py for Python.

Does the returned DataFrame need a specific schema?

The DataFrame must match the target table schema if it exists. Otherwise, Starlake infers the schema. The DataFrame must be registered as the SL_THIS temporary view.