Parallel Data Extraction
Starlake parallel extraction splits a table read into multiple concurrent JDBC queries to reduce extraction time. It divides the data range based on the partitionColumn values and executes numPartitions simultaneous reads. Parallel extraction can be combined with incremental mode. It requires a database that supports concurrent connections.
Prerequisites for Parallel Extraction
- The database must support concurrent read-write connections. Databases like SQLite do not support this and are incompatible with parallel extraction. PostgreSQL, MySQL, Oracle, SQL Server, and DuckDB support it.
- The partitionColumn must be defined. It should be a monotonically increasing numeric column (auto-increment ID or sequence). Timestamps can also work.
- Setting numPartitions to a value greater than 1 activates parallelism. The default is 1 (no parallelism).
Parallel extraction opens numPartitions simultaneous JDBC connections to the source database. Ensure the database connection pool and server can handle the additional load.
How Starlake Partitions the Data
Starlake queries the minimum and maximum values of the partitionColumn, then divides the range into numPartitions equal segments. Each segment is read by a separate JDBC connection in parallel. The results are combined into the output.
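As a rough illustration of that range-splitting step, here is a minimal Scala sketch. It is not Starlake's actual implementation; the names (splitRange, Bound) and the equal-width arithmetic are assumptions made for the example.

```scala
// Illustrative sketch of equal-width range splitting over a numeric
// partitionColumn. Not Starlake internals.
case class Bound(lower: Long, upperExclusive: Long)

def splitRange(min: Long, max: Long, numPartitions: Int): Seq[Bound] = {
  // Each partition covers an equal share of [min, max]; the last one
  // absorbs any remainder so the full range is covered exactly once.
  val span = max - min + 1
  val step = math.max(1L, span / numPartitions)
  (0 until numPartitions).map { i =>
    val lo = min + i * step
    val hi = if (i == numPartitions - 1) max + 1 else lo + step
    Bound(lo, hi)
  }
}

// With min = 1, max = 100000 and numPartitions = 4, this yields
// per-partition predicates such as:
//   order_id >= 1     AND order_id < 25001
//   order_id >= 25001 AND order_id < 50001
//   ...
// each executed on its own JDBC connection.
```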
YAML Configuration for Parallel Extraction
```yaml
version: 1
extract:
  connectionRef: "duckdb" # The database connection to use
  jdbcSchemas:
    - schema: "starbake"
      tables:
        - name: "order" # table names or "*" to extract all tables
          fullExport: true # (optional) set to false to use incremental extraction
          partitionColumn: "order_id" # (optional) column to use for partitioning
          numPartitions: 4 # Level of parallelism (defaults to 1 aka no parallelism)
  ...
```
Both partitionColumn and numPartitions must be set together to enable parallel extraction.
Combining Parallel and Incremental Extraction
When fullExport: false and numPartitions > 1 are configured on the same table entry, the same partitionColumn serves two purposes:
- Parallel partitioning -- splitting the data range into concurrent reads.
- Incremental tracking -- recording the last exported value in SL_LAST_EXPORT.
```yaml
version: 1
extract:
  connectionRef: "duckdb" # The database connection to use
  jdbcSchemas:
    - schema: "starbake"
      tables:
        - name: "order" # table names or "*" to extract all tables
          fullExport: false # (optional) set to false to use incremental extraction
          partitionColumn: "order_id" # (optional) column to use for partitioning
          numPartitions: 4 # Level of parallelism (defaults to 1 aka no parallelism)
  ...
```
See the Incremental Extraction page for more details on incremental mode.
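To see how the two modes compose, the sketch below builds per-partition predicates on top of the last exported value. The names and the exact composition are assumptions for the example (lastExport stands in for the value Starlake tracks in SL_LAST_EXPORT, and splitRange is the sketch from the partitioning section above); they do not reflect Starlake's actual code.

```scala
// Illustrative only: an incremental lower bound combined with parallel
// range splitting. Each run reads only rows beyond lastExport, split
// across numPartitions concurrent JDBC queries.
def incrementalPartitionPredicates(
    column: String,
    lastExport: Long, // hypothetical: last value recorded in SL_LAST_EXPORT
    max: Long,        // current MAX of the partition column in the source
    numPartitions: Int
): Seq[String] =
  splitRange(lastExport + 1, max, numPartitions).map { b =>
    s"$column >= ${b.lower} AND $column < ${b.upperExclusive}"
  }

// With lastExport = 50000, max = 90000 and numPartitions = 4, only the
// 40000 new rows are read, split across 4 concurrent connections.
```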
Choosing the Number of Partitions
The optimal numPartitions value depends on:
- Source database capacity -- each additional partition opens another simultaneous connection to the source.
- Table size -- small tables benefit less from parallelism.
- Network bandwidth -- parallel reads increase throughput only if the network is not the bottleneck.
Start with 4 partitions and adjust based on observed performance and database load.
Frequently Asked Questions
How does parallel extraction work in Starlake?
Parallel extraction divides the reading of a table into multiple partitions executed simultaneously. The column specified in partitionColumn is used to split the data, and numPartitions defines the number of concurrent reads.
Is the numPartitions parameter required?
No. The default value is 1 (no parallelism). Increasing this value activates parallel extraction.
What are the prerequisites for parallel extraction?
The database must support concurrent read-write connections. The partitionColumn parameter must be defined. Databases like SQLite are incompatible with parallel extraction.
Can I use parallel and incremental extraction at the same time?
Yes. When fullExport: false and numPartitions > 1, the same partitionColumn is used for both parallel partitioning and incremental tracking.
What type of column should I use for partitionColumn in parallel mode?
A monotonically increasing numeric column (auto-increment ID, sequence) is recommended. Timestamps can also work.
How does Starlake determine the partition boundaries?
Starlake queries the minimum and maximum values of the partitionColumn, then divides the range into numPartitions equal segments. Each segment is read by a separate JDBC connection.