Extract Data from Databases

Starlake extracts tables from any JDBC-compliant database (PostgreSQL, MySQL, Oracle, SQL Server, DuckDB, and others) and writes them as CSV files. Configuration is declarative: you define a JDBC connection in application.sl.yml and list the schemas and tables to extract in a YAML file under metadata/extract/. Extraction supports full and incremental modes, parallel reads, and wildcard table selection.

Steps to Configure and Run an Extraction

  1. Install prerequisites -- Install DuckDB and download the sample duckdb.db database into $SL_ROOT/datasets.
  2. Configure the database connection -- Edit metadata/application.sl.yml to add a JDBC connection entry with the database URL and driver class.
  3. Create the extraction YAML -- Create metadata/extract/my_extract_config.sl.yml specifying connectionRef, schema name, tables to extract, and table types.
  4. Run the extract-data command -- Execute starlake extract-data --config my_extract_config --outputDir $SL_ROOT/incoming/ to export tables as CSV files.
  5. Verify the output files -- Check the $SL_ROOT/incoming/starbake folder for the extracted CSV files.
  6. (Optional) Generate schema descriptions -- Run starlake extract-schema --config my_extract_config to produce table description files for warehouse loading.

Prerequisites

  • Install DuckDB

  • Download the sample duckdb.db database and place it in the datasets folder of your project directory, $SL_ROOT/datasets

This will bring in the following database schema:

$ duckdb $SL_ROOT/datasets/duckdb.db
v0.10.0 20b1486d11
Enter ".help" for usage hints.
D show;
┌──────────┬──────────┬────────────┬────────────────────────────────────────────────────────┬─────────────────────────────────────────────────────┬───────────┐
│ database │ schema   │ name       │ column_names                                           │ column_types                                        │ temporary │
│ varchar  │ varchar  │ varchar    │ varchar[]                                              │ varchar[]                                           │ boolean   │
├──────────┼──────────┼────────────┼────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┼───────────┤
│ duckdb   │ starbake │ order      │ [customer_id, order_id, status, timestamp]             │ [BIGINT, BIGINT, VARCHAR, TIMESTAMP]                │ false     │
│ duckdb   │ starbake │ order_line │ [order_id, product_id, quantity, sale_price]           │ [BIGINT, BIGINT, BIGINT, DOUBLE]                    │ false     │
│ duckdb   │ starbake │ product    │ [category, cost, description, name, price, product_id] │ [VARCHAR, DOUBLE, VARCHAR, VARCHAR, DOUBLE, BIGINT] │ false     │
└──────────┴──────────┴────────────┴────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────┴───────────┘
D .quit
$

Configure the Connection to the Source Database

In your project's metadata folder, edit application.sl.yml and set the connection parameters for the source database.

metadata/application.sl.yml
version: 1
application:
  ...
  connections:
    duckdb:
      type: "jdbc" # Connection to DuckDB
      options:
        url: "jdbc:duckdb:{{SL_ROOT}}/datasets/duckdb.db" # Location of the DuckDB database
        driver: "org.duckdb.DuckDBDriver"
  ...

Create the Extraction Configuration

In the metadata/extract folder, create a new file my_extract_config.sl.yml. Set connectionRef to the connection defined in application.sl.yml, specify the schema to extract, and list the tables.

metadata/extract/my_extract_config.sl.yml
version: 1
extract:
  connectionRef: "duckdb" # The database connection to use
  jdbcSchemas:
    - schema: "starbake"
      tables:
        - name: "*" # table names or "*" to extract all tables
      tableTypes: # (optional) table types to extract
        - "TABLE"
        #- "VIEW"
        #- "SYSTEM TABLE"
        #- "GLOBAL TEMPORARY"
        #- "LOCAL TEMPORARY"
        #- "ALIAS"
        #- "SYNONYM"

The wildcard * extracts all tables from the specified schema. You can also list individual table names. The tableTypes field filters by object type (TABLE, VIEW, SYSTEM TABLE, GLOBAL TEMPORARY, LOCAL TEMPORARY, ALIAS, SYNONYM).
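For instance, to extract only the order and order_line tables instead of the whole schema, list them by name (same keys as in the configuration above):

```yaml
version: 1
extract:
  connectionRef: "duckdb"
  jdbcSchemas:
    - schema: "starbake"
      tables:
        - name: "order"
        - name: "order_line"
      tableTypes:
        - "TABLE"
```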

Run the extract-data Command

$ cd $SL_ROOT
$ starlake extract-data --config my_extract_config --outputDir $SL_ROOT/incoming/

The $SL_ROOT/incoming/starbake folder will contain the extracted data in CSV format.
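Once the files are in place, any CSV reader can inspect them. Here is a minimal Python sketch using an in-memory sample shaped like the starbake order table; it is illustrative only, since the real file names and delimiter depend on your Starlake settings.

```python
import csv
import io

# Illustrative stand-in for an extracted "order" CSV; the real files live
# under $SL_ROOT/incoming/starbake and may use a different delimiter.
sample = io.StringIO(
    "customer_id,order_id,status,timestamp\n"
    "1,101,Delivered,2024-01-15 10:30:00\n"
    "2,102,Pending,2024-01-16 09:05:00\n"
)

# DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(sample))
print(len(rows))          # 2
print(rows[0]["status"])  # Delivered
```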

You are now ready to load the data into the data warehouse of your choice using the starlake load command.

Generate Table Schema Descriptions with extract-schema (Optional)

The extract-schema command generates a table schema description file in the metadata/load directory. This file allows you to load the data into a data warehouse using starlake load without relying on schema inference.

$ cd $SL_ROOT
$ starlake extract-schema --config my_extract_config # extract description

The generated schema files describe columns, types, and constraints. They are stored alongside the load configuration so starlake load can use them directly.
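As a rough, hypothetical illustration only (the real files are generated for you, and key names and type mappings may differ by Starlake version), a generated description for the order table could resemble:

```yaml
# metadata/load/starbake/order.sl.yml -- indicative shape only, not verbatim output
version: 1
table:
  pattern: "order.*"
  attributes:
    - name: "customer_id"
      type: "long"
    - name: "order_id"
      type: "long"
    - name: "status"
      type: "string"
    - name: "timestamp"
      type: "timestamp"
```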

Frequently Asked Questions

What databases does Starlake support for extraction?

Starlake supports extraction from any JDBC-compliant database, including PostgreSQL, MySQL, SQL Server, Oracle, DuckDB, and more. You configure the connection using a JDBC URL and driver in the application.sl.yml file.

Can Starlake extract data incrementally?

Yes, Starlake supports incremental extraction by specifying a timestamp column in the table configuration. Only rows with a timestamp newer than the last extraction will be fetched.
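A hedged sketch of what such a configuration might look like; the key names fullExport and partitionColumn below are assumptions to verify against the Starlake extract reference for your version:

```yaml
version: 1
extract:
  connectionRef: "duckdb"
  jdbcSchemas:
    - schema: "starbake"
      fullExport: false             # assumption: request incremental rather than full export
      tables:
        - name: "order"
          partitionColumn: "timestamp"  # assumption: column used to detect new rows
```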

What is the output format of Starlake extraction?

By default, Starlake extracts data into CSV files stored in the output directory you specify. The extracted files are organized by schema name and can be directly loaded into a data warehouse using the starlake load command.

Can I extract all tables from a schema at once?

Yes, use the wildcard * as the table name to extract all tables from a given schema. You can also filter by table type, such as TABLE, VIEW, or SYSTEM TABLE.

What is the difference between extract-data and extract-schema?

extract-data exports the actual table data as CSV files. extract-schema generates table schema description files in the metadata/load directory, which are used by starlake load to load data into a data warehouse without relying on schema inference.

Where do I configure the connection to the source database?

In the metadata/application.sl.yml file, under the connections section. Each connection requires a type (e.g., jdbc), a JDBC URL, and a driver class name.

What is the CLI command to run an extraction?

starlake extract-data --config <config_name> --outputDir <output_directory>