Extract Data Using Custom SQL Queries

Q: How do you specify the SQL query in the YAML configuration?

In the 'tables' section, add a 'sql' field with the full SQL query. The 'name' field serves as the logical name for the output file.

Q: Can you mix table extraction and query extraction in the same config file?

Yes. Each entry in 'tables' can have or omit a 'sql' field. Entries without 'sql' extract the entire table.

Q: Can you extract the schema at the same time as data?

Yes. Schema extraction (extract-schema) tries to retrieve remarks based on the table name if they exist. Both commands (extract-data and extract-schema) use the same configuration file.

Starlake query extraction exports data from a JDBC database using custom SQL statements instead of extracting entire tables. You define a sql field on each table entry to apply filters, joins, aggregations, or column selections. Both extract-data and extract-schema use the same YAML configuration file. This feature is available for JDBC connections only.

info

Query-based extraction does not work with non-JDBC connections (filesystem, etc.). It also does not infer primary keys or foreign keys. All attributes from query results are treated as not required.

Query Extraction YAML Configuration

Add a sql field to any table entry in the extraction YAML. The name field serves as the logical name for the output file.

metadata/extract/my_extract_config.sl.yml
version: 1
extract:
  connectionRef: "duckdb" # The database connection to use
  jdbcSchemas:
    - schema: "starbake"
      tables:
        - name: "order"               # table names or  "*" to extract all tables
          sql: "select * from starbake.order"
  ...

You can mix table-based and query-based entries in the same configuration. Entries without a sql field extract the entire table. Entries with a sql field execute the provided query.

Full Example: JOIN and Aggregation in an Extraction Query

The following example extracts a subset of orders joined with order lines, computing a total amount per order. It also extracts a filtered product catalog.

metadata/extract/custom_query_extract.sl.yml
version: 1
extract:
  connectionRef: "duckdb"
  jdbcSchemas:
    - schema: "starbake"
      tables:
        - name: "recent_orders"
          sql: |
            SELECT
              o.order_id,
              o.customer_id,
              o.status,
              o.timestamp,
              SUM(ol.quantity * ol.sale_price) AS total_amount
            FROM starbake.order o
            JOIN starbake.order_line ol ON o.order_id = ol.order_id
            WHERE o.timestamp >= '2024-01-01'
            GROUP BY o.order_id, o.customer_id, o.status, o.timestamp
        - name: "product_catalog"
          sql: "SELECT product_id, name, category, price FROM starbake.product WHERE price > 0"

Key points:

name defines the output file name (e.g., recent_orders.csv).
sql accepts any valid SQL for the target database: JOINs, WHERE clauses, GROUP BY, aggregations.
The schema field identifies the source schema for metadata purposes.

Run extract-data and extract-schema Commands

Extract the data:

$ cd $SL_ROOT
$ starlake extract-data --config custom_query_extract --outputDir $SL_ROOT/incoming/

Optionally, generate a schema description file for warehouse loading:

$ starlake extract-schema --config custom_query_extract

Schema extraction attempts to retrieve table remarks (descriptions) based on the table name when they exist. Both commands use the same configuration file.

Frequently Asked Questions

What is the difference between table extraction and query extraction in Starlake?

Table extraction extracts all columns and rows from a table. Query extraction uses the sql field to execute a custom SQL query: filters, joins, aggregations, and column selection.

Does query extraction infer primary keys and foreign keys?

No. Query extraction does not infer primary keys or foreign keys. All attributes are considered not required.

Which connection types support query-based extraction?

Query-based extraction is available only on JDBC connections.

How do you specify the SQL query in the YAML configuration?

In the tables section, add a sql field with the full SQL query. The name field serves as the logical name for the output file.

Can you mix table extraction and query extraction in the same config file?

Yes. Each entry in tables can have or omit a sql field. Entries without sql extract the entire table.

Can you extract the schema at the same time as data?

Yes. Schema extraction (extract-schema) tries to retrieve remarks based on the table name if they exist. Both commands (extract-data and extract-schema) use the same configuration file.

Query Extraction YAML Configuration​

Full Example: JOIN and Aggregation in an Extraction Query​

Run extract-data and extract-schema Commands​

Frequently Asked Questions​

What is the difference between table extraction and query extraction in Starlake?​

Does query extraction infer primary keys and foreign keys?​

Which connection types support query-based extraction?​

How do you specify the SQL query in the YAML configuration?​

Can you mix table extraction and query extraction in the same config file?​

Can you extract the schema at the same time as data?​