Skip to main content

Clustering and Partitioning

Clustering and partitioning reduce query cost and improve performance on large tables. Partitioning physically divides a table into segments based on a column value (typically a date). Clustering sorts data within each partition by one or more columns, speeding up queries that filter on those columns.

Starlake exposes these optimizations through the sink section of the table YAML definition. Both BigQuery and Spark/Databricks are supported, with additional options for materialized views, table expiration and Spark writer settings.

How to configure partitioning and clustering

  1. Open the table YAML file -- Edit metadata/load/<domain>/<table>.sl.yml.
  2. Add a sink section -- Under the table key, add a sink section with clustering and partition properties.
  3. Define clustering columns -- List one or more columns under sink.clustering. Both BigQuery and Spark support multiple clustering columns.
  4. Define a partition field -- Set sink.partition.field (BigQuery, single column) or sink.partition (Spark, multiple columns).
  5. Run the load -- Execute starlake load. The target table will be created or updated with the configured partitioning and clustering.

Sink configuration by platform

Starlake adapts the sink properties to each target platform. BigQuery supports partition expiration, materialized views and partition filters. Spark supports multiple partition columns, coalesce and arbitrary writer options.

metadata/load/<domain>/<table>.sl.yml
table:
...
sink:
clustering:
- <field1>
- <field2>
partition:
field: <field> # only one field is allowed
requirePartitionFilter: <true|false> # default is false
days: <number> # expiration in days. default is "never expire"
materializedView: <true|false> # Sink data as a materialized view. default is false
enableRefresh: <true|false> # only if materializedView is true. Default is false
refreshIntervalMs: <number> # only if enable refresh is true.

BigQuery-specific properties:

PropertyTypeDescription
partition.fieldstringSingle column for partition. BigQuery allows only one.
requirePartitionFilterbooleanForce queries to include a partition filter. Default: false.
daysnumberPartition expiration in days. Default: never expire.
materializedViewbooleanStore the table as a materialized view. Default: false.
enableRefreshbooleanEnable automatic refresh of the materialized view. Default: false.
refreshIntervalMsnumberRefresh interval in milliseconds. Applies only when enableRefresh is true.

Relationship with write strategies

The OVERWRITE_BY_PARTITION write strategy depends on the partition configured in the sink section. Make sure to define sink.partition before using this strategy.

Frequently Asked Questions

How do I configure clustering in Starlake for BigQuery?

Add a sink.clustering section in the table YAML file with the list of clustering columns. BigQuery supports multiple clustering columns.

What is the difference between clustering and partitioning?

Partitioning physically divides the table into segments based on a column value (e.g., date). Clustering sorts data within each partition by one or more columns, speeding up queries that filter on those columns.

How do I enable a materialized view in Starlake for BigQuery?

Set sink.materializedView: true in the table configuration. Optionally enable automatic refresh with enableRefresh: true and set the interval via refreshIntervalMs.

Can I partition on multiple columns in BigQuery via Starlake?

No. BigQuery supports only a single partition field. Starlake respects this constraint: sink.partition.field accepts only one column. With Spark (Databricks), multiple partition columns are supported.

What Spark options are available in the sink section?

The sink.options section allows passing key-value pairs to the Spark writer (e.g., compression: snappy). You can also define the format (delta, parquet), coalesce, and clustering/partition columns.

How do I configure partition expiration in BigQuery?

Use the sink.days property to define the number of days before expiration. By default, partitions do not expire.