Starlake Glossary

A reference of key concepts and terminology used throughout Starlake and its documentation.


Declarative Data Engineering

A methodology where data pipelines are defined by describing the desired outcome in YAML and SQL (the "what"), rather than writing imperative code (the "how"). Starlake implements this paradigm, acting as "Terraform for Data Pipelines" -- you declare your schemas, transforms, and orchestration, and Starlake generates the execution code automatically.

Learn more in the Overview

Domain

A logical grouping of related tables in Starlake, equivalent to a database schema or a business area. Domains are defined using a _config.sl.yml file within a directory and can carry shared metadata (format, write strategy, scheduling) that applies to all tables within them.
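
As an illustration, a domain configuration might look like the sketch below. The path and key names are assumptions for illustration; check the Domains documentation for the exact options available in your version.

```yaml
# metadata/load/sales/_config.sl.yml -- illustrative example
load:
  name: "sales"
  metadata:
    writeStrategy:
      type: "APPEND"   # default for every table in this domain unless a table overrides it
```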

Learn more about Domains

Schema (Table Schema)

A YAML definition of a table's structure, including column names, data types, validation rules, privacy settings, and required constraints. Schemas drive how Starlake validates and loads incoming data. They are defined in .sl.yml files under the metadata/load directory.
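
A hedged sketch of a table schema definition follows; the attribute options (including the privacy transformation name) are shown as assumptions to illustrate the general shape, not as an exhaustive reference.

```yaml
# metadata/load/sales/orders.sl.yml -- illustrative example
table:
  pattern: "orders.*.csv"            # incoming files matching this pattern load into this table
  metadata:
    writeStrategy:
      type: "UPSERT_BY_KEY"
      key: ["order_id"]
  attributes:
    - name: "order_id"
      type: "string"
      required: true
    - name: "amount"
      type: "decimal"
    - name: "customer_email"
      type: "string"
      privacy: "SHA256"              # assumed privacy transformation name
```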

Learn more about Schemas

Load

The process of ingesting files (CSV, TSV, JSON, XML, Fixed-width, Parquet, Avro) into a data warehouse. Starlake automatically validates data against the declared schema, applies privacy rules, and writes to the target table using the configured write strategy.
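
As a rough illustration, the parsing options that drive a load are declared in the table's (or domain's) metadata block; the enum value below is an assumption.

```yaml
# Illustrative file-parsing options inside a table's metadata block
metadata:
  format: "DSV"          # assumed enum value for delimiter-separated (CSV/TSV) input
  separator: ";"
  withHeader: true
  encoding: "UTF-8"
```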

Learn more about Load

Autoload

A zero-configuration loading mode where Starlake infers schemas and orchestration logic directly from file names and directory structure. Autoload removes the need for explicit YAML schema definitions, making it the fastest way to get data into a warehouse.

Learn more about Autoload

Transform

A SQL or Python-based data transformation step that produces derived tables from previously loaded or transformed data. Transforms are defined using standard SQL SELECT statements combined with YAML configuration. Starlake automatically applies the correct merge strategy (INSERT OVERWRITE or MERGE INTO) based on the write strategy.
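
The sketch below assumes the layout where the SQL statement and its YAML configuration live side by side; the key names are illustrative and may differ by version.

```yaml
# metadata/transform/analytics/order_totals.sl.yml -- illustrative example
# The SELECT statement itself lives next to this file in order_totals.sql
task:
  writeStrategy:
    type: "UPSERT_BY_KEY_AND_TIMESTAMP"
    key: ["order_id"]
    timestamp: "updated_at"
```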

Learn more about Transform

Extract

The process of pulling data from any JDBC/ODBC-compliant database (PostgreSQL, MySQL, SQL Server, Oracle, DuckDB, etc.) into files. Starlake supports full and incremental extraction, parallel reads via partition columns, and automatic schema evolution detection.
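
For illustration, an extraction configuration might look like the following sketch (file path and key names are assumptions):

```yaml
# metadata/extract/shop.sl.yml -- illustrative example
extract:
  connectionRef: "postgres_source"   # points at a connection defined in application.sl.yml
  jdbcSchemas:
    - schema: "public"
      tables:
        - name: "customers"
        - name: "orders"
```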

Learn more about Extract

Connection

A named configuration entry in application.sl.yml that defines how to connect to a database or storage system. Connections specify the JDBC URL, driver class, and any authentication options. They are referenced by name (connectionRef) in extract and load configurations.
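
A minimal sketch of a JDBC connection entry, assuming a PostgreSQL source; option names are illustrative:

```yaml
# metadata/application.sl.yml -- illustrative excerpt
application:
  connections:
    postgres_source:
      type: "jdbc"
      options:
        url: "jdbc:postgresql://localhost:5432/shop"
        driver: "org.postgresql.Driver"
        user: "${POSTGRES_USER}"         # credentials pulled from environment variables
        password: "${POSTGRES_PASSWORD}"
```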

Learn more about Connections

Write Strategy

A configuration property (metadata.writeStrategy) that determines how data is written to the target table. Starlake supports several strategies, each suited to different use cases.
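
In YAML the property is declared roughly as in the sketch below; the key column field applies to key-based strategies, and the exact field names should be treated as assumptions.

```yaml
metadata:
  writeStrategy:
    type: "UPSERT_BY_KEY"
    key: ["customer_id"]
```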

Learn more about Write Strategies

APPEND

Inserts all incoming rows into the table without modifying existing data. If the table does not exist, it is created automatically.

OVERWRITE

Replaces all existing rows in the table with the incoming data. Useful for dimension tables or complete snapshots.

UPSERT_BY_KEY

Merges incoming rows with existing data based on a key column. Rows with matching keys are updated; new rows are inserted.

UPSERT_BY_KEY_AND_TIMESTAMP

Similar to UPSERT_BY_KEY, but only updates rows when the incoming timestamp is newer than the existing one. This prevents stale data from overwriting more recent records.

DELETE_THEN_INSERT

Deletes rows in the target table for which matching keys exist in the incoming data, then inserts all incoming rows. Useful when you need a clean replacement of specific records.

SCD2

Slowly Changing Dimension Type 2 -- a write strategy that preserves the full history of record changes. Each version of a row is tracked with start (sl_start_ts) and end (sl_end_ts) timestamps, enabling point-in-time queries across the entire history of a record.
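
A hedged sketch of how this might be declared; the key and timestamp fields are assumptions modeled on the other key-based strategies.

```yaml
metadata:
  writeStrategy:
    type: "SCD2"
    key: ["customer_id"]
    timestamp: "updated_at"
# Validity windows are materialized as sl_start_ts / sl_end_ts columns on the target table
```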

Learn more about SCD2

ADAPTATIVE

A dynamic write strategy determined at runtime based on criteria such as the day of the week, file size, or patterns in the file name. This allows switching between strategies (e.g., APPEND on weekdays, OVERWRITE on Sundays) without changing the pipeline code.

Learn more about Adaptive Write Strategy

Orchestration

Automated scheduling and dependency management of pipeline jobs. Starlake analyzes dependencies between SQL tasks and generates DAGs using predefined or custom templates for orchestrators like Apache Airflow, Dagster, or Google Cloud Composer. No orchestration code needs to be written manually.
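
For illustration only, a DAG generation configuration might look like the sketch below; the template name and option keys are assumptions, so refer to the Orchestration docs for the real catalog of templates.

```yaml
# metadata/dags/nightly.sl.yml -- illustrative example
dag:
  comment: "Nightly load of the sales domain"
  template: "load/scheduled_table_bash.py.j2"   # assumed template name
  filename: "sales_nightly.py"
```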

Learn more about Orchestration

DAG

Directed Acyclic Graph -- a representation of the execution order of pipeline jobs where each node is a task and edges represent dependencies. Starlake automatically infers DAGs from SQL task dependencies and generates orchestrator-specific code (e.g., Airflow Python files) from configurable templates.

Learn more about DAGs

Expectations

Data quality assertions that are evaluated during load or transform operations. Expectations define rules that incoming data must satisfy, such as non-null checks, value ranges, uniqueness constraints, or custom SQL predicates. Rows that fail expectations can be rejected or flagged.
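
The snippet below is only an illustrative sketch of the general shape; the actual expectation macro syntax and YAML keys are documented on the Expectations page.

```yaml
# Illustrative sketch inside a table definition (keys and macro syntax are assumptions)
expectations:
  - expect: "is_col_value_not_null('order_id')"
    failOnError: true
```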

Learn more about Expectations

Metrics

Statistical measurements computed during data ingestion for data profiling and monitoring. Metrics can be continuous (min, max, mean, standard deviation) or discrete (count, frequency distribution). They provide visibility into data quality trends over time.
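
As a sketch, metrics are typically switched on per attribute; the metricType key shown here is an assumption.

```yaml
# Illustrative attribute-level metric declarations
attributes:
  - name: "amount"
    type: "decimal"
    metricType: "continuous"   # min, max, mean, standard deviation
  - name: "country"
    type: "string"
    metricType: "discrete"     # count and frequency distribution
```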

Learn more about Metrics

Native Load

A loading mode that leverages the data warehouse's built-in loading mechanism (e.g., BigQuery Load API, Snowflake COPY INTO) instead of Spark. Native Load provides optimal performance for straightforward ingestion tasks that do not require complex transformations or validations during load.

Learn more about Native Load

Lineage

Automatic tracking of data dependencies between tables and transforms. Starlake detects both column-level and table-level lineage by parsing SQL queries, without requiring manual annotation. Lineage information powers DAG generation and helps teams understand the impact of changes.

Learn more about Lineage

Environment Variables

Configuration values that vary between deployment environments (development, staging, production). Starlake supports environment-specific files (e.g., env.sl.yml) that override default settings, enabling the same pipeline code to run across different environments with different database URLs, credentials, or warehouse targets.
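
A minimal sketch, assuming a per-environment file naming convention; the variable names are illustrative.

```yaml
# metadata/env.DEV.sl.yml -- illustrative example
env:
  root_path: "/tmp/quickstart/datasets"
  connection_ref: "duckdb_local"
# Values declared here can be referenced in other configuration files, e.g. {{root_path}}
```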

Learn more about Environment Variables

Bootstrap

A CLI command (starlake bootstrap) that creates a new Starlake project with the standard directory structure (metadata/, datasets/, incoming/) and default configuration files. Bootstrap is the recommended starting point for any new Starlake project.

Learn more about Bootstrap

Partition Column

A column used to split data during extraction or to organize data physically in the target warehouse. In extraction, specifying a partition column enables parallel reads from the source database, significantly improving performance. In loading, it controls how data is partitioned on disk for efficient querying.
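
On the extraction side, this might be declared as in the sketch below; the option names are assumptions meant to illustrate the idea of fanning reads out across partitions.

```yaml
# Illustrative extract-side settings for parallel reads
tables:
  - name: "orders"
    partitionColumn: "order_id"
    numPartitions: 8
```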

Learn more about Partition Columns

Sink

Target table configuration that defines physical storage properties in the destination data warehouse. Sink properties include partitioning columns, clustering keys, and warehouse-specific options that optimize query performance and storage costs.
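
A hedged sketch of a sink block inside a table's metadata; exact option names vary per warehouse and should be checked against the Sink documentation.

```yaml
# Illustrative sink configuration
metadata:
  sink:
    partition: ["order_date"]
    clustering: ["customer_id"]
```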

Learn more about Sink Configuration

Infer Schema

A CLI command (starlake infer-schema) that analyzes sample data files and automatically generates the corresponding YAML schema definition. Infer Schema detects column names, data types, separators, and encoding, accelerating the onboarding of new data sources.

Learn more about Infer Schema