Parallel Data Extraction
Starlake parallel extraction splits a table read into multiple concurrent JDBC queries to reduce extraction time. It divides the data range based on the partitionColumn values and executes numPartitions simultaneous reads. Parallel extraction can be combined with incremental mode. It requires a database that supports concurrent connections.
Prerequisites for Parallel Extraction
- The database must support concurrent read-write connections. Databases like SQLite do not support this and are incompatible with parallel extraction. PostgreSQL, MySQL, Oracle, SQL Server, and DuckDB support it.
- The partitionColumn must be defined. It should be a monotonically increasing numeric column (auto-increment ID or sequence). Timestamps can also work.
- Setting numPartitions to a value greater than 1 activates parallelism. The default is 1 (no parallelism).
Parallel extraction opens numPartitions simultaneous JDBC connections to the source database. Ensure the database connection pool and server can handle the additional load.
How Starlake Partitions the Data
Starlake queries the minimum and maximum values of the partitionColumn, then divides the range into numPartitions equal segments. Each segment is read by a separate JDBC connection in parallel. The results are combined into the output.
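As a rough illustration of that range-splitting step, here is a minimal Scala sketch. It is not Starlake's actual implementation; the names (splitRange, Bound) and the equal-width arithmetic are assumptions made for the example.

```scala
// Illustrative sketch of equal-width range splitting over a numeric
// partitionColumn. Not Starlake internals.
case class Bound(lower: Long, upperExclusive: Long)

def splitRange(min: Long, max: Long, numPartitions: Int): Seq[Bound] = {
  // Each partition covers an equal share of [min, max]; the last one
  // absorbs any remainder so the full range is covered exactly once.
  val span = max - min + 1
  val step = math.max(1L, span / numPartitions)
  (0 until numPartitions).map { i =>
    val lo = min + i * step
    val hi = if (i == numPartitions - 1) max + 1 else lo + step
    Bound(lo, hi)
  }
}

// With min = 1, max = 100000 and numPartitions = 4, this yields
// per-partition predicates such as:
//   order_id >= 1     AND order_id < 25001
//   order_id >= 25001 AND order_id < 50001
//   ...
// each executed on its own JDBC connection.
```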
YAML Configuration for Parallel Extraction
```yaml
version: 1
extract:
  connectionRef: "duckdb" # The database connection to use
  jdbcSchemas:
    - schema: "starbake"
      tables:
        - name: "order" # table names or "*" to extract all tables
          fullExport: true # (optional) set to false to use incremental extraction
          partitionColumn: "order_id" # (optional) column to use for partitioning
          numPartitions: 4 # Level of parallelism (defaults to 1 aka no parallelism)
  ...
```
Both partitionColumn and numPartitions must be set together to enable parallel extraction.
Combining Parallel and Incremental Extraction
When fullExport: false and numPartitions > 1 are configured on the same table entry, the same partitionColumn serves two purposes:
- Parallel partitioning -- splitting the data range into concurrent reads.
- Incremental tracking -- recording the last exported value in SL_LAST_EXPORT.
```yaml
version: 1
extract:
  connectionRef: "duckdb" # The database connection to use
  jdbcSchemas:
    - schema: "starbake"
      tables:
        - name: "order" # table names or "*" to extract all tables
          fullExport: false # (optional) set to false to use incremental extraction
          partitionColumn: "order_id" # (optional) column to use for partitioning
          numPartitions: 4 # Level of parallelism (defaults to 1 aka no parallelism)
  ...
```
See the Incremental Extraction page for more details on incremental mode.
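To see how the two modes compose, the sketch below builds per-partition predicates on top of the last exported value. The names and the exact composition are assumptions for the example (lastExport stands in for the value Starlake tracks in SL_LAST_EXPORT, and splitRange is the sketch from the partitioning section above); they do not reflect Starlake's actual code.

```scala
// Illustrative only: an incremental lower bound combined with parallel
// range splitting. Each run reads only rows beyond lastExport, split
// across numPartitions concurrent JDBC queries.
def incrementalPartitionPredicates(
    column: String,
    lastExport: Long, // hypothetical: last value recorded in SL_LAST_EXPORT
    max: Long,        // current MAX of the partition column in the source
    numPartitions: Int
): Seq[String] =
  splitRange(lastExport + 1, max, numPartitions).map { b =>
    s"$column >= ${b.lower} AND $column < ${b.upperExclusive}"
  }

// With lastExport = 50000, max = 90000 and numPartitions = 4, only the
// 40000 new rows are read, split across 4 concurrent connections.
```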
Choosing the Number of Partitions
The optimal numPartitions value depends on:
- Source database capacity -- each additional partition opens another simultaneous connection to the source.
- Table size -- small tables benefit less from parallelism.
- Network bandwidth -- parallel reads increase throughput only if the network is not the bottleneck.
Start with 4 partitions and adjust based on observed performance and database load.
Frequently Asked Questions
How does parallel extraction work in Starlake?
Parallel extraction divides the reading of a table into multiple partitions executed simultaneously. The column specified in partitionColumn is used to split the data, and numPartitions defines the number of concurrent reads.
Is the numPartitions parameter required?
No. The default value is 1 (no parallelism). Increasing this value activates parallel extraction.
What are the prerequisites for parallel extraction?
The database must support concurrent read-write connections. The partitionColumn parameter must be defined. Databases like SQLite are incompatible with parallel extraction.
Can I use parallel and incremental extraction at the same time?
Yes. When fullExport: false and numPartitions > 1, the same partitionColumn is used for both parallel partitioning and incremental tracking.
What type of column should I use for partitionColumn in parallel mode?
A monotonically increasing numeric column (auto-increment ID, sequence) is recommended. Timestamps can also work.
How does Starlake determine the partition boundaries?
Starlake queries the minimum and maximum values of the partitionColumn, then divides the range into numPartitions equal segments. Each segment is read by a separate JDBC connection.