Extract Data Using Custom SQL Queries
Starlake query extraction exports data from a JDBC database using custom SQL statements instead of extracting entire tables. You define a sql field on each table entry to apply filters, joins, aggregations, or column selections. Both extract-data and extract-schema use the same YAML configuration file. This feature is available for JDBC connections only.
Query-based extraction does not work with non-JDBC connections (filesystem, etc.). It also does not infer primary keys or foreign keys. All attributes from query results are treated as not required.
Query Extraction YAML Configuration
Add a sql field to any table entry in the extraction YAML. The name field serves as the logical name for the output file.
version: 1
extract:
connectionRef: "duckdb" # The database connection to use
jdbcSchemas:
- schema: "starbake"
tables:
- name: "order" # table names or "*" to extract all tables
sql: "select * from starbake.order"
...
You can mix table-based and query-based entries in the same configuration. Entries without a sql field extract the entire table. Entries with a sql field execute the provided query.
Full Example: JOIN and Aggregation in an Extraction Query
The following example extracts a subset of orders joined with order lines, computing a total amount per order. It also extracts a filtered product catalog.
version: 1
extract:
connectionRef: "duckdb"
jdbcSchemas:
- schema: "starbake"
tables:
- name: "recent_orders"
sql: |
SELECT
o.order_id,
o.customer_id,
o.status,
o.timestamp,
SUM(ol.quantity * ol.sale_price) AS total_amount
FROM starbake.order o
JOIN starbake.order_line ol ON o.order_id = ol.order_id
WHERE o.timestamp >= '2024-01-01'
GROUP BY o.order_id, o.customer_id, o.status, o.timestamp
- name: "product_catalog"
sql: "SELECT product_id, name, category, price FROM starbake.product WHERE price > 0"
Key points:
namedefines the output file name (e.g.,recent_orders.csv).sqlaccepts any valid SQL for the target database: JOINs, WHERE clauses, GROUP BY, aggregations.- The
schemafield identifies the source schema for metadata purposes.
Run extract-data and extract-schema Commands
Extract the data:
$ cd $SL_ROOT
$ starlake extract-data --config custom_query_extract --outputDir $SL_ROOT/incoming/
Optionally, generate a schema description file for warehouse loading:
$ starlake extract-schema --config custom_query_extract
Schema extraction attempts to retrieve table remarks (descriptions) based on the table name when they exist. Both commands use the same configuration file.
Frequently Asked Questions
What is the difference between table extraction and query extraction in Starlake?
Table extraction extracts all columns and rows from a table. Query extraction uses the sql field to execute a custom SQL query: filters, joins, aggregations, and column selection.
Does query extraction infer primary keys and foreign keys?
No. Query extraction does not infer primary keys or foreign keys. All attributes are considered not required.
Which connection types support query-based extraction?
Query-based extraction is available only on JDBC connections.
How do you specify the SQL query in the YAML configuration?
In the tables section, add a sql field with the full SQL query. The name field serves as the logical name for the output file.
Can you mix table extraction and query extraction in the same config file?
Yes. Each entry in tables can have or omit a sql field. Entries without sql extract the entire table.
Can you extract the schema at the same time as data?
Yes. Schema extraction (extract-schema) tries to retrieve remarks based on the table name if they exist. Both commands (extract-data and extract-schema) use the same configuration file.