Incremental Data Extraction
Starlake incremental extraction pulls only new rows from a database table by tracking the maximum value of a partition column between runs. The extraction state is persisted in the SL_LAST_EXPORT audit table. This avoids re-extracting the entire table on each run and reduces both extraction time and data volume. Configuration requires setting fullExport: false and specifying a partitionColumn in the extraction YAML.
How to Set Up Incremental Extraction
- Choose a partition column -- Select a monotonically increasing column (auto-increment ID, timestamp, or sequence) that is never updated after insertion.
- Set
fullExporttofalse-- In the extraction YAML, setfullExport: falseon the table entry to enable incremental mode. - Set the
partitionColumn-- AddpartitionColumnwith the name of the chosen column. Starlake uses it to track the last exported value. - Run the first extraction -- Execute
starlake extract-data. Since no prior state exists, all rows are extracted. The max value is recorded inSL_LAST_EXPORT. - Run subsequent extractions -- On each subsequent run, only rows where the partition column exceeds the last exported value are fetched.
partitionColumn and fullExport Configuration
partitionColumn
The partitionColumn identifies the column used to detect new rows since the last extraction. After each run, Starlake records the maximum value of this column in the SL_LAST_EXPORT audit table and uses it as the lower bound for the next extraction. This column must be:
- Monotonically increasing -- auto-increment ID, timestamp, or sequence number.
- Never updated after insertion -- otherwise modified rows are missed.
Typical choices: surrogate keys (order_id), creation timestamps (created_at), or update timestamps (updated_at).
Incremental mode detects only new rows. Rows modified after insertion are not re-extracted unless the partition column is an updated_at timestamp that changes on every update.
fullExport
| Value | Behavior |
|---|---|
fullExport: true (default) | Extracts the entire table on every run. Use for small tables or when a complete snapshot is needed. |
fullExport: false | Extracts only rows where partitionColumn > last_exported_value. Requires partitionColumn to be set. |
YAML Configuration for Incremental Extraction
version: 1
extract:
connectionRef: "duckdb" # The database connection to use
jdbcSchemas:
- schema: "starbake"
tables:
- name: "order" # table names or "*" to extract all tables
fullExport: false # Enable incremental extraction
partitionColumn: "order_id" # Column to track extraction progress
...
First Run vs. Subsequent Runs: Incremental Extraction Workflow
First extraction -- no prior state exists, so Starlake extracts all rows:
$ starlake extract-data --config my_extract_config --outputDir $SL_ROOT/incoming/
# Extracts all rows from starbake.order
# Records max(order_id) = 1000 in SL_LAST_EXPORT
Subsequent extraction -- Starlake reads the last exported value and fetches only new rows:
$ starlake extract-data --config my_extract_config --outputDir $SL_ROOT/incoming/
# Extracts only rows where order_id > 1000
# Updates max(order_id) = 1050 in SL_LAST_EXPORT
The extraction state is stored in the SL_LAST_EXPORT audit table. See the Monitoring page for details on how to inspect this table.
Combining Incremental and Parallel Extraction
When fullExport: false and numPartitions > 1 are configured on the same table entry, the same partitionColumn serves both purposes: parallel partitioning and incremental tracking. See the Parallel Extraction page for configuration details.
Frequently Asked Questions
What is the difference between fullExport true and false in Starlake?
With fullExport: true (default), Starlake extracts the entire table on every run. With fullExport: false, only rows where the partitionColumn value exceeds the last exported value are extracted.
What column types are suitable for partitionColumn?
The column must be monotonically increasing and never updated after insertion. Typical choices are: auto-increment keys (order_id), timestamps (created_at, updated_at), or sequence numbers.
Where does Starlake store the last extraction state?
The state is stored in the SL_LAST_EXPORT audit table. Starlake records the maximum value of the partitionColumn after each extraction.
What happens during the first incremental extraction?
Since no prior state exists, Starlake extracts all rows from the table. It then records the maximum partitionColumn value for subsequent runs.
Are updated rows detected in incremental mode?
No. Incremental mode detects only new rows (partitionColumn value greater than the last export). Modified rows are not re-extracted unless the partition column is an updated_at timestamp.
Can you combine incremental and parallel extraction?
Yes. When fullExport: false and numPartitions > 1 are configured, the same partitionColumn is used for both parallel partitioning and incremental tracking.
Related
- Extract Tutorial -- full walkthrough for setting up your first extraction
- Load Data into Your Warehouse -- load incrementally extracted files into your warehouse
- Configure Database Connections -- set up the JDBC connection used for extraction