Python Transforms

Starlake supports Python transforms for operations that go beyond what SQL can express -- machine learning inference, text processing, external API calls, or complex algorithmic logic. Python transforms require a Spark runtime (standalone Spark, Databricks, or EMR). They do not work with DuckDB. The YAML configuration format is identical to SQL transforms: only the source file extension changes from .sql to .py.

Python vs SQL Transforms

| Aspect | SQL Transform | Python Transform |
|---|---|---|
| Source file | .sql | .py |
| Runtime | DuckDB, BigQuery, Snowflake, Spark, Databricks | Spark, Databricks, EMR only |
| Use case | Standard queries, joins, aggregations | ML, text processing, API calls |
| Output mechanism | SELECT result | SL_THIS temporary view |
| YAML configuration | Same format | Same format |

Use SQL transforms for standard data operations. Choose Python when the logic cannot be expressed in SQL.

How to Write a Python Transform

  1. Create a Python file -- Add a .py file in the metadata/transform/<domain>/ directory. The file name determines the task name.
  2. Write the PySpark logic -- Use SparkSession to read source tables, apply transformations, and produce a result DataFrame.
  3. Register the result as SL_THIS -- Call df.createOrReplaceTempView("SL_THIS") on your result DataFrame. Starlake reads this view to materialize the output table.
  4. Add an optional YAML configuration -- Create a .sl.yml file with the same name to set write strategy, partitioning, or expectations.
  5. Run the transform -- Execute starlake transform --name <domain>.<task> with optional --options key1=value1,key2=value2 to pass parameters.
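The optional YAML file in step 4 uses the same format as a SQL transform's configuration. A minimal sketch (the key names follow Starlake's transform YAML conventions; the strategy value shown is illustrative):

```yaml
# metadata/transform/analytics/sales_by_category.sl.yml
# Illustrative configuration -- check the Starlake transform YAML
# reference for the full set of supported keys.
task:
  writeStrategy:
    type: OVERWRITE
```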

Directory Structure

metadata/transform
└── <domain>
    ├── <task>.py
    └── <task>.sl.yml (optional)

Pass Arguments to the Python Script

Arguments specified with the --options flag are forwarded as command-line arguments:

starlake transform --name <domain>.<transform_name> --options key1=value1,key2=value2

Starlake converts this into:

<transform_name>.py --key1 value1 --key2 value2

Parse these arguments in your script using argparse or sys.argv.
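For example, with argparse the forwarded options can be read back as named arguments (a minimal sketch; `key1` and `key2` match the `--options` example above and are otherwise arbitrary):

```python
import argparse

parser = argparse.ArgumentParser(description="Parse options forwarded by Starlake")
parser.add_argument("--key1")
parser.add_argument("--key2")

# In a real transform, parse_args() reads sys.argv. Here we simulate the
# command line Starlake builds from --options key1=value1,key2=value2.
args = parser.parse_args(["--key1", "value1", "--key2", "value2"])

print(args.key1)  # "value1"
print(args.key2)  # "value2"
```

Unrecognized options would make `parse_args` fail; use `parser.parse_known_args()` instead if Starlake may forward options your script does not declare.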

Register the Output as SL_THIS

Before the script exits, it must create a temporary view named SL_THIS. Starlake reads this view and materializes it into the target table using the configured write strategy.

note

The SL_THIS temporary view is mandatory. Starlake cannot materialize the output without it.

Example: Aggregate Sales Data with PySpark

metadata/transform/analytics/sales_by_category.py
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()

    # Read source tables
    products = spark.table("starbake.products")
    order_lines = spark.table("starbake.order_lines")

    # Aggregate revenue by product category
    result = (
        order_lines
        .join(products, "product_id")
        .groupBy("category")
        .agg(
            F.sum(F.col("quantity") * F.col("sale_price")).alias("total_revenue"),
            F.sum("quantity").alias("total_units_sold"),
        )
    )

    # Register as SL_THIS for Starlake to materialize
    result.createOrReplaceTempView("SL_THIS")

Schema Matching

The same rules apply as for SQL transforms:

  • If the target table exists, the DataFrame schema must match.
  • If the target table does not exist, Starlake infers the schema from the DataFrame.
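Continuing the sales example above, one way to satisfy the first rule is to select columns in the target order and cast them explicitly before registering SL_THIS (a fragment, not a standalone script; the column types shown are assumptions about the target table):

```python
# Assumes `result` and `F` from the example above.
# Select columns in the target table's order and cast to its types
# (string/double/long here are illustrative).
result = result.select(
    F.col("category").cast("string"),
    F.col("total_revenue").cast("double"),
    F.col("total_units_sold").cast("long"),
)
result.createOrReplaceTempView("SL_THIS")
```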

Frequently Asked Questions

When should you use a Python transform instead of SQL in Starlake?

Python transforms are useful for complex operations that are difficult to express in SQL, such as machine learning, text transformations, or API calls. They require a Spark or Databricks runtime.

How do you pass parameters to a Python script in Starlake?

Use the --options key1=value1,key2=value2 flag in the starlake transform command. The parameters are passed as command-line arguments to the Python script.

What is SL_THIS in a Starlake Python transform?

SL_THIS is the name of the temporary view that the Python script must create before returning. Starlake reads this view to materialize the result into the target table.

Do Starlake Python transforms work with DuckDB?

No. Python transforms require a Spark runtime (standalone, Databricks, or EMR). For DuckDB, use SQL transforms.

Can you use the same YAML file for a Python and SQL transform?

Yes. The YAML format is identical. Only the source file changes: .sql for SQL, .py for Python.

Does the returned DataFrame need a specific schema?

The DataFrame must match the target table schema if it exists. Otherwise, Starlake infers the schema. The DataFrame must be registered as the SL_THIS temporary view.