Skip to main content

Incremental Data Extraction

Starlake incremental extraction pulls only new rows from a database table by tracking the maximum value of a partition column between runs. The extraction state is persisted in the SL_LAST_EXPORT audit table. This avoids re-extracting the entire table on each run and reduces both extraction time and data volume. Configuration requires setting fullExport: false and specifying a partitionColumn in the extraction YAML.

How to Set Up Incremental Extraction

  1. Choose a partition column -- Select a monotonically increasing column (auto-increment ID, timestamp, or sequence) that is never updated after insertion.
  2. Set fullExport to false -- In the extraction YAML, set fullExport: false on the table entry to enable incremental mode.
  3. Set the partitionColumn -- Add partitionColumn with the name of the chosen column. Starlake uses it to track the last exported value.
  4. Run the first extraction -- Execute starlake extract-data. Since no prior state exists, all rows are extracted. The max value is recorded in SL_LAST_EXPORT.
  5. Run subsequent extractions -- On each subsequent run, only rows where the partition column exceeds the last exported value are fetched.

partitionColumn and fullExport Configuration

partitionColumn

The partitionColumn identifies the column used to detect new rows since the last extraction. After each run, Starlake records the maximum value of this column in the SL_LAST_EXPORT audit table and uses it as the lower bound for the next extraction. This column must be:

  • Monotonically increasing -- auto-increment ID, timestamp, or sequence number.
  • Never updated after insertion -- otherwise modified rows are missed.

Typical choices: surrogate keys (order_id), creation timestamps (created_at), or update timestamps (updated_at).

warning

Incremental mode detects only new rows. Rows modified after insertion are not re-extracted unless the partition column is an updated_at timestamp that changes on every update.

fullExport

ValueBehavior
fullExport: true (default)Extracts the entire table on every run. Use for small tables or when a complete snapshot is needed.
fullExport: falseExtracts only rows where partitionColumn > last_exported_value. Requires partitionColumn to be set.

YAML Configuration for Incremental Extraction

metadata/extract/my_extract_config.sl.yml
version: 1
extract:
connectionRef: "duckdb" # The database connection to use
jdbcSchemas:
- schema: "starbake"
tables:
- name: "order" # table names or "*" to extract all tables
fullExport: false # Enable incremental extraction
partitionColumn: "order_id" # Column to track extraction progress
...

First Run vs. Subsequent Runs: Incremental Extraction Workflow

First extraction -- no prior state exists, so Starlake extracts all rows:

$ starlake extract-data --config my_extract_config --outputDir $SL_ROOT/incoming/
# Extracts all rows from starbake.order
# Records max(order_id) = 1000 in SL_LAST_EXPORT

Subsequent extraction -- Starlake reads the last exported value and fetches only new rows:

$ starlake extract-data --config my_extract_config --outputDir $SL_ROOT/incoming/
# Extracts only rows where order_id > 1000
# Updates max(order_id) = 1050 in SL_LAST_EXPORT

The extraction state is stored in the SL_LAST_EXPORT audit table. See the Monitoring page for details on how to inspect this table.

Combining Incremental and Parallel Extraction

When fullExport: false and numPartitions > 1 are configured on the same table entry, the same partitionColumn serves both purposes: parallel partitioning and incremental tracking. See the Parallel Extraction page for configuration details.

Frequently Asked Questions

What is the difference between fullExport true and false in Starlake?

With fullExport: true (default), Starlake extracts the entire table on every run. With fullExport: false, only rows where the partitionColumn value exceeds the last exported value are extracted.

What column types are suitable for partitionColumn?

The column must be monotonically increasing and never updated after insertion. Typical choices are: auto-increment keys (order_id), timestamps (created_at, updated_at), or sequence numbers.

Where does Starlake store the last extraction state?

The state is stored in the SL_LAST_EXPORT audit table. Starlake records the maximum value of the partitionColumn after each extraction.

What happens during the first incremental extraction?

Since no prior state exists, Starlake extracts all rows from the table. It then records the maximum partitionColumn value for subsequent runs.

Are updated rows detected in incremental mode?

No. Incremental mode detects only new rows (partitionColumn value greater than the last export). Modified rows are not re-extracted unless the partition column is an updated_at timestamp.

Can you combine incremental and parallel extraction?

Yes. When fullExport: false and numPartitions > 1 are configured, the same partitionColumn is used for both parallel partitioning and incremental tracking.