Extract Data from Databases
Starlake extracts tables from any JDBC-compliant database (PostgreSQL, MySQL, Oracle, SQL Server, DuckDB, and others) and writes them as CSV files. Configuration is declarative: you define a JDBC connection in application.sl.yml and list the schemas and tables to extract in a YAML file under metadata/extract/. Extraction supports full and incremental modes, parallel reads, and wildcard table selection.
Steps to Configure and Run an Extraction
- Install prerequisites -- Install DuckDB and download the sample duckdb.db database into $SL_ROOT/datasets.
- Configure the database connection -- Edit metadata/application.sl.yml to add a JDBC connection entry with the database URL and driver class.
- Create the extraction YAML -- Create metadata/extract/my_extract_config.sl.yml specifying connectionRef, the schema name, the tables to extract, and the table types.
- Run the extract-data command -- Execute starlake extract-data --config my_extract_config --outputDir $SL_ROOT/incoming/ to export tables as CSV files.
- Verify the output files -- Check the $SL_ROOT/incoming/starbake folder for the extracted CSV files.
- (Optional) Generate schema descriptions -- Run starlake extract-schema --config my_extract_config to produce table description files for warehouse loading.
Prerequisites
- Install duckdb
- Download the sample duckdb.db database and store it in the datasets folder in your project directory: $SL_ROOT/datasets
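If DuckDB is not installed yet, one minimal way to get it and prepare the datasets folder is shown below (assuming macOS with Homebrew; see duckdb.org for installers on other platforms):

$ brew install duckdb          # install the DuckDB CLI (Homebrew formula)
$ mkdir -p $SL_ROOT/datasets   # folder where the sample duckdb.db file goes

Then place the downloaded duckdb.db file in $SL_ROOT/datasets.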
The sample database contains the following schema:
$ duckdb $SL_ROOT/datasets/duckdb.db
v0.10.0 20b1486d11
Enter ".help" for usage hints.
D show;
┌──────────┬──────────┬────────────┬────────────────────────────────────────────────────────┬─────────────────────────────────────────────────────┬───────────┐
│ database │ schema │ name │ column_names │ column_types │ temporary │
│ varchar │ varchar │ varchar │ varchar[] │ varchar[] │ boolean │
├──────────┼──────────┼────────────┼────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┼───────────┤
│ duckdb │ starbake │ order │ [customer_id, order_id, status, timestamp] │ [BIGINT, BIGINT, VARCHAR, TIMESTAMP] │ false │
│ duckdb │ starbake │ order_line │ [order_id, product_id, quantity, sale_price] │ [BIGINT, BIGINT, BIGINT, DOUBLE] │ false │
│ duckdb │ starbake │ product │ [category, cost, description, name, price, product_id] │ [VARCHAR, DOUBLE, VARCHAR, VARCHAR, DOUBLE, BIGINT] │ false │
└──────────┴──────────┴────────────┴────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────┴───────────┘
D .quit
$
Configure the Connection to the Source Database
In the project folder, under the metadata folder, edit the file application.sl.yml and set the connection parameters to the source database.
version: 1
application:
  ...
  connections:
    duckdb:
      type: "jdbc" # Connection to DuckDB
      options:
        url: "jdbc:duckdb:{{SL_ROOT}}/datasets/duckdb.db" # Location of the DuckDB database
        driver: "org.duckdb.DuckDBDriver"
  ...
Create the Extraction Configuration
In the metadata/extract folder, create a new file my_extract_config.sl.yml. Set connectionRef to the connection defined in application.sl.yml, specify the schema to extract, and list the tables.
version: 1
extract:
  connectionRef: "duckdb" # The database connection to use
  jdbcSchemas:
    - schema: "starbake"
      tables:
        - name: "*" # table names or "*" to extract all tables
      tableTypes: # (optional) table types to extract
        - "TABLE"
        #- "VIEW"
        #- "SYSTEM TABLE"
        #- "GLOBAL TEMPORARY"
        #- "LOCAL TEMPORARY"
        #- "ALIAS"
        #- "SYNONYM"
The wildcard * extracts all tables from the specified schema. You can also list individual table names. The tableTypes field filters by object type (TABLE, VIEW, SYSTEM TABLE, GLOBAL TEMPORARY, LOCAL TEMPORARY, ALIAS, SYNONYM).
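For example, to extract only the order and product tables rather than everything in the schema, the same configuration can be narrowed down like this (a sketch reusing the structure shown above):

version: 1
extract:
  connectionRef: "duckdb"
  jdbcSchemas:
    - schema: "starbake"
      tables:
        - name: "order"
        - name: "product"
      tableTypes:
        - "TABLE"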
Run the extract-data Command
$ cd $SL_ROOT
$ starlake extract-data --config my_extract_config --outputDir $SL_ROOT/incoming/
The $SL_ROOT/incoming/starbake folder will contain the extracted data in CSV format.
You are now ready to load the data into the data warehouse of your choice using the starlake load command.
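For example (assuming the target warehouse connection is already configured in application.sl.yml; the exact options depend on your setup, see the load guide):

$ cd $SL_ROOT
$ starlake load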
Generate Table Schema Descriptions with extract-schema (Optional)
The extract-schema command generates table schema description files in the metadata/load directory. These files allow you to load the data into a data warehouse using starlake load without relying on schema inference.
$ cd $SL_ROOT
$ starlake extract-schema --config my_extract_config # extract description
The generated schema files describe columns, types, and constraints. They are stored alongside the load configuration so starlake load can use them directly.
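As a rough illustration only (the field names below are an approximation, not a guaranteed layout; inspect the files generated under metadata/load for the exact format), a description for the order table might contain something like:

version: 1
table:
  pattern: "order.*"   # file name pattern matched at load time (approximate example)
  attributes:          # one entry per column, with its type
    - name: "order_id"
      type: "long"
    - name: "customer_id"
      type: "long"
    - name: "status"
      type: "string"
    - name: "timestamp"
      type: "timestamp"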
Frequently Asked Questions
What databases does Starlake support for extraction?
Starlake supports extraction from any JDBC-compliant database, including PostgreSQL, MySQL, SQL Server, Oracle, DuckDB, and more. You configure the connection using a JDBC URL and driver in the application.sl.yml file.
Can Starlake extract data incrementally?
Yes, Starlake supports incremental extraction by specifying a timestamp column in the table configuration. Only rows with a timestamp newer than the last extraction will be fetched.
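As an illustrative sketch (the partitionColumn and fullExport option names are assumptions; check the extraction reference for the exact incremental syntax), a table entry for incremental extraction of order might look like:

extract:
  connectionRef: "duckdb"
  jdbcSchemas:
    - schema: "starbake"
      tables:
        - name: "order"
          partitionColumn: "timestamp"  # assumed option name: column used to detect new rows
          fullExport: false             # assumed option name: extract only rows newer than the last run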
What is the output format of Starlake extraction?
By default, Starlake extracts data into CSV files stored in the output directory you specify. The extracted files are organized by schema name and can be directly loaded into a data warehouse using the starlake load command.
Can I extract all tables from a schema at once?
Yes, set the table name to the wildcard * to extract all tables from a given schema. You can also filter by table types such as TABLE, VIEW, or SYSTEM TABLE.
What is the difference between extract-data and extract-schema?
extract-data exports the actual table data as CSV files. extract-schema generates table schema description files in the metadata/load directory, which are used by starlake load to load data into a data warehouse without relying on schema inference.
Where do I configure the connection to the source database?
In the metadata/application.sl.yml file, under the connections section. Each connection requires a type (e.g., jdbc), a JDBC URL, and a driver class name.
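For example, a PostgreSQL source would follow the same pattern as the DuckDB connection shown earlier (host, port, database name and the templated credentials are placeholders; passing user and password as JDBC options is an assumption, adapt to your setup):

version: 1
application:
  connections:
    postgres:
      type: "jdbc"
      options:
        url: "jdbc:postgresql://localhost:5432/mydb" # standard PostgreSQL JDBC URL
        driver: "org.postgresql.Driver"
        user: "{{POSTGRES_USER}}"
        password: "{{POSTGRES_PASSWORD}}"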
What is the CLI command to run an extraction?
starlake extract-data --config <config_name> --outputDir <output_directory>
Related
- Load Data into Your Warehouse -- load extracted CSV files into BigQuery, Snowflake, Databricks or DuckDB
- Configure Database Connections -- set up JDBC connections used by the extract command
- Incremental Extraction -- extract only new rows using a partition column
- Unit Testing Data Pipelines -- test your extract-load-transform pipeline locally