# Clustering and Partitioning
Clustering and partitioning reduce query cost and improve performance on large tables. Partitioning physically divides a table into segments based on a column value (typically a date). Clustering sorts data within each partition by one or more columns, speeding up queries that filter on those columns.
Starlake exposes these optimizations through the `sink` section of the table YAML definition. Both BigQuery and Spark/Databricks are supported, with additional options for materialized views, partition expiration, and Spark writer settings.
## How to configure partitioning and clustering
- **Open the table YAML file** -- Edit `metadata/load/<domain>/<table>.sl.yml`.
- **Add a `sink` section** -- Under the `table` key, add a `sink` section with clustering and partition properties.
- **Define clustering columns** -- List one or more columns under `sink.clustering`. Both BigQuery and Spark support multiple clustering columns.
- **Define a partition field** -- Set `sink.partition.field` (BigQuery, single column) or `sink.partition` (Spark, multiple columns).
- **Run the load** -- Execute `starlake load`. The target table will be created or updated with the configured partitioning and clustering.
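Putting the steps together, a minimal table file might look like the following sketch (the `sales` domain, `orders` table, and column names are hypothetical; the `...` stands for the rest of the table definition):

```yaml
# metadata/load/sales/orders.sl.yml -- hypothetical domain and table names
table:
  ...
  sink:
    clustering:
      - customer_id      # hypothetical clustering column
    partition:
      field: order_date  # hypothetical date column (BigQuery: single field)
```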
## Sink configuration by platform
Starlake adapts the sink properties to each target platform. BigQuery supports partition expiration, materialized views and partition filters. Spark supports multiple partition columns, coalesce and arbitrary writer options.
### BigQuery
```yaml
table:
  ...
  sink:
    clustering:
      - <field1>
      - <field2>
    partition:
      field: <field>                      # Only one field is allowed.
    requirePartitionFilter: <true|false>  # Default is false.
    days: <number>                        # Partition expiration in days. Default: never expire.
    materializedView: <true|false>        # Sink data as a materialized view. Default is false.
    enableRefresh: <true|false>           # Only if materializedView is true. Default is false.
    refreshIntervalMs: <number>           # Only if enableRefresh is true.
```
BigQuery-specific properties:
| Property | Type | Description |
|---|---|---|
| `partition.field` | string | Single column to partition on. BigQuery allows only one. |
| `requirePartitionFilter` | boolean | Force queries to include a partition filter. Default: false. |
| `days` | number | Partition expiration in days. Default: never expire. |
| `materializedView` | boolean | Store the table as a materialized view. Default: false. |
| `enableRefresh` | boolean | Enable automatic refresh of the materialized view. Default: false. |
| `refreshIntervalMs` | number | Refresh interval in milliseconds. Applies only when `enableRefresh` is true. |
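As an illustration, the BigQuery properties above might combine as follows (column names are hypothetical):

```yaml
table:
  ...
  sink:
    clustering:
      - country
      - product_id
    partition:
      field: event_date            # single partition column
    requirePartitionFilter: true   # queries must filter on event_date
    days: 90                       # partitions expire after 90 days
```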
### Spark

```yaml
table:
  ...
  sink:
    format: <format>       # For example, "delta" or "parquet".
    clustering:
      - <field1>
      - <field2>
    partition:
      - <field1>
      - <field2>
    coalesce: <true|false> # Default is false.
    options:
      <key>: <value>       # Options passed to the Spark writer, e.g. compression: snappy.
```
Spark-specific properties:
| Property | Type | Description |
|---|---|---|
| `format` | string | Output format. Common values: `delta`, `parquet`. |
| `partition` | list | One or more partition columns. Spark supports multiple columns. |
| `coalesce` | boolean | Coalesce output files into fewer partitions. Default: false. |
| `options` | map | Key-value pairs passed to the Spark writer (e.g., `compression: snappy`). |
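For example, a Delta sink on Databricks could be sketched as follows (column names are hypothetical):

```yaml
table:
  ...
  sink:
    format: delta
    clustering:
      - customer_id
    partition:
      - year
      - month
    coalesce: true          # reduce the number of output files
    options:
      mergeSchema: "true"   # standard Delta writer option: allow schema evolution on write
```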
## Relationship with write strategies
The `OVERWRITE_BY_PARTITION` write strategy depends on the partition configured in the `sink` section. Make sure to define `sink.partition` before using this strategy.
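For instance, pairing the strategy with a partitioned sink might look like this sketch (the placement of the `writeStrategy` key is an assumption here; see the write strategies documentation, and the column name is hypothetical):

```yaml
table:
  ...
  writeStrategy:
    type: OVERWRITE_BY_PARTITION  # placement of this key is an assumption
  sink:
    partition:
      field: event_date           # hypothetical column; required by this strategy
```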
## Frequently Asked Questions
### How do I configure clustering in Starlake for BigQuery?

Add a `sink.clustering` section in the table YAML file with the list of clustering columns. BigQuery supports multiple clustering columns.
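For instance (column names are hypothetical):

```yaml
sink:
  clustering:
    - customer_id
    - country
```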
### What is the difference between clustering and partitioning?

Partitioning physically divides the table into segments based on a column value (e.g., a date). Clustering sorts data within each partition by one or more columns, speeding up queries that filter on those columns.
### How do I enable a materialized view in Starlake for BigQuery?

Set `sink.materializedView: true` in the table configuration. Optionally enable automatic refresh with `enableRefresh: true` and set the interval via `refreshIntervalMs`.
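A sketch combining the three properties:

```yaml
sink:
  materializedView: true
  enableRefresh: true
  refreshIntervalMs: 3600000   # refresh every hour
```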
### Can I partition on multiple columns in BigQuery via Starlake?

No. BigQuery supports only a single partition field, and Starlake respects this constraint: `sink.partition.field` accepts only one column. With Spark (Databricks), multiple partition columns are supported.
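A Spark sink partitioned on several (hypothetical) columns:

```yaml
sink:
  partition:
    - year
    - month
    - day
```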
### What Spark options are available in the sink section?

The `sink.options` section passes key-value pairs to the Spark writer (e.g., `compression: snappy`). You can also define the `format` (`delta`, `parquet`), `coalesce`, and clustering/partition columns.
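For example (the option keys shown are standard Spark writer options, not Starlake-specific):

```yaml
sink:
  format: parquet
  options:
    compression: snappy
    maxRecordsPerFile: "500000"   # cap rows per output file
```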
### How do I configure partition expiration in BigQuery?

Use the `sink.days` property to define the number of days before partitions expire. By default, partitions never expire.
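For example, to expire partitions after 30 days (column name hypothetical):

```yaml
sink:
  partition:
    field: event_date   # hypothetical partition column
  days: 30              # partitions older than 30 days are dropped
```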