Python Transforms
Starlake supports Python transforms for operations that go beyond what SQL can express -- machine learning inference, text processing, external API calls, or complex algorithmic logic. Python transforms require a Spark runtime (standalone Spark, Databricks, or EMR). They do not work with DuckDB. The YAML configuration format is identical to SQL transforms: only the source file extension changes from .sql to .py.
Python vs SQL Transforms
| Aspect | SQL Transform | Python Transform |
|---|---|---|
| Source file | .sql | .py |
| Runtime | DuckDB, BigQuery, Snowflake, Spark, Databricks | Spark, Databricks, EMR only |
| Use case | Standard queries, joins, aggregations | ML, text processing, API calls |
| Output mechanism | SELECT result | SL_THIS temporary view |
| YAML configuration | Same format | Same format |
Use SQL transforms for standard data operations. Choose Python when the logic cannot be expressed in SQL.
How to Write a Python Transform
- Create a Python file -- Add a `.py` file in the `metadata/transform/<domain>/` directory. The file name determines the task name.
- Write the PySpark logic -- Use `SparkSession` to read source tables, apply transformations, and produce a result DataFrame.
- Register the result as SL_THIS -- Call `df.createOrReplaceTempView("SL_THIS")` on your result DataFrame. Starlake reads this view to materialize the output table.
- Add an optional YAML configuration -- Create a `.sl.yml` file with the same name to set write strategy, partitioning, or expectations.
- Run the transform -- Execute `starlake transform --name <domain>.<task>` with optional `--options key1=value1,key2=value2` to pass parameters.
Directory Structure
```
metadata/transform
└── <domain>
    ├── <task>.py
    └── <task>.sl.yml (optional)
```
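For example, the sales aggregation task shown later on this page could live under a `starbake` domain (the task name `revenue_by_category` is illustrative):

```
metadata/transform
└── starbake
    ├── revenue_by_category.py
    └── revenue_by_category.sl.yml
```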
Pass Arguments to the Python Script
Arguments specified with the --options flag are forwarded as command-line arguments:
```bash
starlake transform --name <domain>.<transform_name> --options key1=value1,key2=value2
```
Starlake converts this into:
```bash
<transform_name>.py --key1 value1 --key2 value2
```
Parse these arguments in your script using argparse or sys.argv.
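Since each option arrives as a `--key value` pair, a standard `argparse` parser picks them up directly. A minimal sketch (the option names `start_date` and `env` are illustrative, not Starlake built-ins):

```python
import argparse

def parse_options(argv):
    """Parse the --key value pairs that Starlake forwards from --options."""
    parser = argparse.ArgumentParser()
    # --options start_date=2024-01-01,env=dev arrives as:
    #   --start_date 2024-01-01 --env dev
    parser.add_argument("--start_date", required=True)
    parser.add_argument("--env", default="prod")
    return parser.parse_args(argv)

# Simulate the arguments Starlake would pass on the command line
args = parse_options(["--start_date", "2024-01-01", "--env", "dev"])
print(args.start_date, args.env)  # → 2024-01-01 dev
```

In a real transform you would call `parse_options(sys.argv[1:])` at the top of the script, before building the Spark session.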
Register the Output as SL_THIS
Before the script exits, it must create a temporary view named SL_THIS. Starlake reads this view and materializes it into the target table using the configured write strategy.
The SL_THIS temporary view is mandatory. Starlake cannot materialize the output without it.
Example: Aggregate Sales Data with PySpark
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()

    # Read source tables
    products = spark.table("starbake.products")
    order_lines = spark.table("starbake.order_lines")

    # Aggregate revenue by product category
    result = (
        order_lines
        .join(products, "product_id")
        .groupBy("category")
        .agg(
            F.sum(F.col("quantity") * F.col("sale_price")).alias("total_revenue"),
            F.sum("quantity").alias("total_units_sold"),
        )
    )

    # Register as SL_THIS for Starlake to materialize
    result.createOrReplaceTempView("SL_THIS")
```
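An optional `.sl.yml` next to the script can control how the result is written. A minimal sketch (the `APPEND` strategy is one illustrative choice; see the Transform YAML Configuration page for the full schema):

```yaml
task:
  writeStrategy:
    type: APPEND
```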
Schema Matching
The same rules apply as for SQL transforms:
- If the target table exists, the DataFrame schema must match.
- If the target table does not exist, Starlake infers the schema from the DataFrame.
Next Steps
- SQL Transform Tutorial -- end-to-end walkthrough with DuckDB
- Transform YAML Configuration -- write strategies, partitioning, access control
- SQL Transform Syntax -- SELECT statements, incremental models, column documentation
Frequently Asked Questions
When should you use a Python transform instead of SQL in Starlake?
Python transforms are useful for complex operations that are difficult to express in SQL, such as machine learning inference, text transformations, or external API calls. They require a Spark runtime (standalone, Databricks, or EMR).
How do you pass parameters to a Python script in Starlake?
Use the --options key1=value1,key2=value2 flag in the starlake transform command. The parameters are passed as command-line arguments to the Python script.
What is SL_THIS in a Starlake Python transform?
SL_THIS is the name of the temporary view that the Python script must create before returning. Starlake reads this view to materialize the result into the target table.
Do Starlake Python transforms work with DuckDB?
No. Python transforms require a Spark runtime (standalone, Databricks, or EMR). For DuckDB, use SQL transforms.
Can you use the same YAML file for a Python and SQL transform?
Yes. The YAML format is identical. Only the source file changes: .sql for SQL, .py for Python.
Does the returned DataFrame need a specific schema?
The DataFrame must match the target table schema if it exists. Otherwise, Starlake infers the schema. The DataFrame must be registered as the SL_THIS temporary view.