# Clustering and Partitioning
Clustering and partitioning reduce query cost and improve performance on large tables. Partitioning physically divides a table into segments based on a column value (typically a date). Clustering sorts data within each partition by one or more columns, speeding up queries that filter on those columns.
Starlake exposes these optimizations through the `sink` section of the table YAML definition. Both BigQuery and Spark/Databricks are supported, with additional options for materialized views, partition expiration, and Spark writer settings.
## How to configure partitioning and clustering
- **Open the table YAML file** -- Edit `metadata/load/<domain>/<table>.sl.yml`.
- **Add a `sink` section** -- Under the `table` key, add a `sink` section with clustering and partition properties.
- **Define clustering columns** -- List one or more columns under `sink.clustering`. Both BigQuery and Spark support multiple clustering columns.
- **Define a partition field** -- Set `sink.partition.field` (BigQuery, single column) or `sink.partition` (Spark, multiple columns).
- **Run the load** -- Execute `starlake load`. The target table will be created or updated with the configured partitioning and clustering.
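Putting the steps together, a minimal table file might look like the following sketch (the `sales` domain, `orders` table, and column names are hypothetical; the `...` stands for the rest of the table definition):

```yaml
# metadata/load/sales/orders.sl.yml -- hypothetical domain and table names
table:
  ...
  sink:
    clustering:
      - customer_id      # hypothetical clustering column
    partition:
      field: order_date  # hypothetical date column (BigQuery: single field)
```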
## Sink configuration by platform
Starlake adapts the sink properties to each target platform. BigQuery supports partition expiration, materialized views and partition filters. Spark supports multiple partition columns, coalesce and arbitrary writer options.
### BigQuery
```yaml
table:
  ...
  sink:
    clustering:
      - <field1>
      - <field2>
    partition:
      field: <field>                      # Only one field is allowed.
    requirePartitionFilter: <true|false>  # Default is false.
    days: <number>                        # Partition expiration in days. Default: never expire.
    materializedView: <true|false>        # Sink data as a materialized view. Default is false.
    enableRefresh: <true|false>           # Only if materializedView is true. Default is false.
    refreshIntervalMs: <number>           # Only if enableRefresh is true.
```
BigQuery-specific properties:
| Property | Type | Description |
|---|---|---|
| `partition.field` | string | Single column to partition on. BigQuery allows only one. |
| `requirePartitionFilter` | boolean | Force queries to include a partition filter. Default: false. |
| `days` | number | Partition expiration in days. Default: never expire. |
| `materializedView` | boolean | Store the table as a materialized view. Default: false. |
| `enableRefresh` | boolean | Enable automatic refresh of the materialized view. Default: false. |
| `refreshIntervalMs` | number | Refresh interval in milliseconds. Applies only when `enableRefresh` is true. |
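As an illustration, the BigQuery properties above might combine as follows (column names are hypothetical):

```yaml
table:
  ...
  sink:
    clustering:
      - country
      - product_id
    partition:
      field: event_date            # single partition column
    requirePartitionFilter: true   # queries must filter on event_date
    days: 90                       # partitions expire after 90 days
```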
### Spark

```yaml
table:
  ...
  sink:
    format: <format>       # For example, "delta" or "parquet".
    clustering:
      - <field1>
      - <field2>
    partition:
      - <field1>
      - <field2>
    coalesce: <true|false> # Default is false.
    options:
      <key>: <value>       # Options passed to the Spark writer, e.g. compression: snappy.
```
Spark-specific properties:
| Property | Type | Description |
|---|---|---|
| `format` | string | Output format. Common values: `delta`, `parquet`. |
| `partition` | list | One or more partition columns. Spark supports multiple columns. |
| `coalesce` | boolean | Coalesce output files into fewer partitions. Default: false. |
| `options` | map | Key-value pairs passed to the Spark writer (e.g., `compression: snappy`). |
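For example, a Delta sink on Databricks could be sketched as follows (column names are hypothetical):

```yaml
table:
  ...
  sink:
    format: delta
    clustering:
      - customer_id
    partition:
      - year
      - month
    coalesce: true          # reduce the number of output files
    options:
      mergeSchema: "true"   # standard Delta writer option: allow schema evolution on write
```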
## Relationship with write strategies
The `OVERWRITE_BY_PARTITION` write strategy depends on the partition configured in the `sink` section. Make sure to define `sink.partition` before using this strategy.
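For instance, pairing the strategy with a partitioned sink might look like this sketch (the placement of the `writeStrategy` key is an assumption here; see the write strategies documentation, and the column name is hypothetical):

```yaml
table:
  ...
  writeStrategy:
    type: OVERWRITE_BY_PARTITION  # placement of this key is an assumption
  sink:
    partition:
      field: event_date           # hypothetical column; required by this strategy
```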
## Frequently Asked Questions
### How do I configure clustering in Starlake for BigQuery?

Add a `sink.clustering` section in the table YAML file with the list of clustering columns. BigQuery supports multiple clustering columns.
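For instance (column names are hypothetical):

```yaml
sink:
  clustering:
    - customer_id
    - country
```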
### What is the difference between clustering and partitioning?

Partitioning physically divides the table into segments based on a column value (e.g., a date). Clustering sorts data within each partition by one or more columns, speeding up queries that filter on those columns.
### How do I enable a materialized view in Starlake for BigQuery?

Set `sink.materializedView: true` in the table configuration. Optionally enable automatic refresh with `enableRefresh: true` and set the interval via `refreshIntervalMs`.
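A sketch combining the three properties:

```yaml
sink:
  materializedView: true
  enableRefresh: true
  refreshIntervalMs: 3600000   # refresh every hour
```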
### Can I partition on multiple columns in BigQuery via Starlake?

No. BigQuery supports only a single partition field, and Starlake respects this constraint: `sink.partition.field` accepts only one column. With Spark (Databricks), multiple partition columns are supported.
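A Spark sink partitioned on several (hypothetical) columns:

```yaml
sink:
  partition:
    - year
    - month
    - day
```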
### What Spark options are available in the sink section?

The `sink.options` section passes key-value pairs to the Spark writer (e.g., `compression: snappy`). You can also define the `format` (`delta`, `parquet`), `coalesce`, and clustering/partition columns.
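For example (the option keys shown are standard Spark writer options, not Starlake-specific):

```yaml
sink:
  format: parquet
  options:
    compression: snappy
    maxRecordsPerFile: "500000"   # cap rows per output file
```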
### How do I configure partition expiration in BigQuery?

Use the `sink.days` property to define the number of days before partitions expire. By default, partitions never expire.
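For example, to expire partitions after 30 days (column name hypothetical):

```yaml
sink:
  partition:
    field: event_date   # hypothetical partition column
  days: 30              # partitions older than 30 days are dropped
```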