Starlake Glossary
A reference of key concepts and terminology used throughout Starlake and its documentation.
Declarative Data Engineering
A methodology where data pipelines are defined by describing the desired outcome in YAML and SQL (the "what"), rather than writing imperative code (the "how"). Starlake implements this paradigm, acting as "Terraform for Data Pipelines" -- you declare your schemas, transforms, and orchestration, and Starlake generates the execution code automatically.
Domain
A logical grouping of related tables in Starlake, equivalent to a database schema or a business area. Domains are defined using a _config.sl.yml file within a directory and can carry shared metadata (format, write strategy, scheduling) that applies to all tables within them.
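For illustration, a minimal domain configuration might look like the sketch below (the domain name and metadata values are placeholders; the full option list is in the Starlake reference):

```yaml
# metadata/load/sales/_config.sl.yml -- illustrative sketch
load:
  name: "sales"            # the domain, equivalent to a database schema
  metadata:
    format: "DSV"          # shared defaults inherited by every table in the domain
    separator: ";"
    withHeader: true
```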
Schema (Table Schema)
A YAML definition of a table's structure, including column names, data types, validation rules, privacy settings, and required constraints. Schemas drive how Starlake validates and loads incoming data. They are defined in .sl.yml files under the metadata/load directory.
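As an example, a table schema file might look like the sketch below (the table, column names, and privacy setting are placeholders chosen for illustration):

```yaml
# metadata/load/sales/orders.sl.yml -- illustrative sketch
table:
  name: "orders"
  pattern: "orders.*\\.csv"      # incoming files matching this regex feed the table
  attributes:
    - name: "order_id"
      type: "long"
      required: true
    - name: "customer_email"
      type: "string"
      privacy: "SHA1"            # hash the value at load time
    - name: "amount"
      type: "decimal"
```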
Load
The process of ingesting files (CSV, TSV, JSON, XML, Fixed-width, Parquet, Avro) into a data warehouse. Starlake automatically validates data against the declared schema, applies privacy rules, and writes to the target table using the configured write strategy.
Autoload
A zero-configuration loading mode where Starlake infers schemas and orchestration logic directly from file names and directory structure. Autoload removes the need for explicit YAML schema definitions, making it the fastest way to get data into a warehouse.
Transform
A SQL or Python-based data transformation step that produces derived tables from previously loaded or transformed data. Transforms are defined using standard SQL SELECT statements combined with YAML configuration. Starlake automatically applies the correct merge strategy (INSERT OVERWRITE or MERGE INTO) based on the write strategy.
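By way of example, a transform usually consists of a .sql file holding the SELECT statement plus a small YAML config next to it; a hedged sketch follows (the task name and key column are invented for illustration):

```yaml
# metadata/transform/kpi/revenue_by_day.sl.yml -- illustrative sketch
# The SELECT statement lives in a sibling revenue_by_day.sql file; Starlake wraps it
# in the INSERT OVERWRITE or MERGE INTO statement implied by the write strategy below.
task:
  writeStrategy:
    type: "UPSERT_BY_KEY"
    key: ["order_date"]
```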
Extract
The process of pulling data from any JDBC/ODBC-compliant database (PostgreSQL, MySQL, SQL Server, Oracle, DuckDB, etc.) into files. Starlake supports full and incremental extraction, parallel reads via partition columns, and automatic schema evolution detection.
Connection
A named configuration entry in application.sl.yml that defines how to connect to a database or storage system. Connections specify the JDBC URL, driver class, and any authentication options. They are referenced by name (connectionRef) in extract and load configurations.
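For illustration, a JDBC connection entry looks roughly like this (the connection name, URL, and credential placeholders are invented for the example):

```yaml
# metadata/application.sl.yml -- illustrative sketch
application:
  connections:
    pg_source:                        # referenced elsewhere as connectionRef: "pg_source"
      type: "jdbc"
      options:
        url: "jdbc:postgresql://localhost:5432/mydb"
        driver: "org.postgresql.Driver"
        user: "{{PG_USER}}"           # resolved from environment variables at runtime
        password: "{{PG_PASSWORD}}"
```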
Write Strategy
A configuration property (metadata.writeStrategy) that determines how data is written to the target table. Starlake supports several strategies, each suited to different use cases.
Learn more about Write Strategies
APPEND
Inserts all incoming rows into the table without modifying existing data. If the table does not exist, it is created automatically.
OVERWRITE
Replaces all existing rows in the table with the incoming data. Useful for dimension tables or complete snapshots.
UPSERT_BY_KEY
Merges incoming rows with existing data based on a key column. Rows with matching keys are updated; new rows are inserted.
UPSERT_BY_KEY_AND_TIMESTAMP
Similar to UPSERT_BY_KEY, but only updates rows when the incoming timestamp is newer than the existing one. This prevents stale data from overwriting more recent records.
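For example, this strategy is declared on the table's metadata with a key and a timestamp column, roughly as sketched below (the column names are placeholders):

```yaml
# excerpt from a table's .sl.yml -- illustrative sketch
metadata:
  writeStrategy:
    type: "UPSERT_BY_KEY_AND_TIMESTAMP"
    key: ["order_id"]          # rows sharing this key are merged
    timestamp: "updated_at"    # an existing row is updated only if the incoming timestamp is newer
```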
DELETE_THEN_INSERT
Deletes rows in the target table for which matching keys exist in the incoming data, then inserts all incoming rows. Useful when you need a clean replacement of specific records.
SCD2
Slowly Changing Dimension Type 2 -- a write strategy that preserves the full history of record changes. Each version of a row is tracked with start (sl_start_ts) and end (sl_end_ts) timestamps, enabling point-in-time queries across the entire history of a record.
ADAPTATIVE
A dynamic write strategy determined at runtime based on criteria such as the day of the week, file size, or patterns in the file name. It allows switching between strategies (e.g., APPEND on weekdays, OVERWRITE on Sundays) without changing the pipeline code.
Learn more about Adaptive Write Strategy
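As an assumption-laden illustration of the idea (the exact expression variables and syntax are covered in the guide linked above), an adaptive strategy maps candidate strategies to runtime conditions instead of declaring a single type:

```yaml
# excerpt from a table's .sl.yml -- illustrative sketch of an adaptive write strategy
metadata:
  writeStrategy:
    types:
      OVERWRITE: 'dayOfWeek == 7'   # full refresh on Sundays
      APPEND: 'dayOfWeek != 7'      # incremental load the rest of the week
```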
Orchestration
Automated scheduling and dependency management of pipeline jobs. Starlake analyzes dependencies between SQL tasks and generates DAGs using predefined or custom templates for orchestrators like Apache Airflow, Dagster, or Google Cloud Composer. No orchestration code needs to be written manually.
Learn more about Orchestration
DAG
Directed Acyclic Graph -- a representation of the execution order of pipeline jobs where each node is a task and edges represent dependencies. Starlake automatically infers DAGs from SQL task dependencies and generates orchestrator-specific code (e.g., Airflow Python files) from configurable templates.
Expectations
Data quality assertions that are evaluated during load or transform operations. Expectations define rules that incoming data must satisfy, such as non-null checks, value ranges, uniqueness constraints, or custom SQL predicates. Rows that fail expectations can be rejected or flagged.
Metrics
Statistical measurements computed during data ingestion for data profiling and monitoring. Metrics can be continuous (min, max, mean, standard deviation) or discrete (count, frequency distribution). They provide visibility into data quality trends over time.
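As a hedged sketch (the metricType attribute option shown here is an assumption used to illustrate the idea; verify the exact field name against the Starlake reference), metrics are typically requested per column in the table schema:

```yaml
# excerpt from a table's attribute list -- illustrative sketch
attributes:
  - name: "amount"
    type: "double"
    metricType: "continuous"   # compute min/max/mean/stddev during ingestion
  - name: "country"
    type: "string"
    metricType: "discrete"     # compute counts and frequency distribution
```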
Native Load
A loading mode that leverages the data warehouse's built-in loading mechanism (e.g., BigQuery Load API, Snowflake COPY INTO) instead of Spark. Native Load provides optimal performance for straightforward ingestion tasks that do not require complex transformations or validations during load.
Lineage
Automatic tracking of data dependencies between tables and transforms. Starlake detects both column-level and table-level lineage by parsing SQL queries, without requiring manual annotation. Lineage information powers DAG generation and helps teams understand the impact of changes.
Environment Variables
Configuration values that vary between deployment environments (development, staging, production). Starlake supports environment-specific files (e.g., env.sl.yml) that override default settings, enabling the same pipeline code to run across different environments with different database URLs, credentials, or warehouse targets.
Learn more about Environment Variables
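As a hedged sketch (the variable names and values are placeholders, and the file naming and lookup rules are covered on the page linked above), an environment file declares key/value pairs that are substituted wherever they are referenced in the YAML configuration:

```yaml
# metadata/env.sl.yml -- illustrative sketch; environment-specific variants override these defaults
env:
  connection_ref: "duckdb_local"    # e.g. referenced as {{ connection_ref }} in table configs
  incoming_path: "/tmp/incoming"
```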
Bootstrap
A CLI command (starlake bootstrap) that creates a new Starlake project with the standard directory structure (metadata/, datasets/, incoming/) and default configuration files. Bootstrap is the recommended starting point for any new Starlake project.
Partition Column
A column used to split data during extraction or to organize data physically in the target warehouse. In extraction, specifying a partition column enables parallel reads from the source database, significantly improving performance. In loading, it controls how data is partitioned on disk for efficient querying.
Learn more about Partition Columns
Sink
A target table configuration that defines physical storage properties in the destination data warehouse. Sink properties include partitioning columns, clustering keys, and warehouse-specific options that optimize query performance and storage costs.
Learn more about Sink Configuration
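For illustration only (option names differ per warehouse; see the page linked above for the authoritative list), a sink section on a table might look like:

```yaml
# excerpt from a table's .sl.yml -- illustrative sketch
metadata:
  sink:
    partition: ["order_date"]    # physical partition column(s) in the target warehouse
    clustering: ["customer_id"]  # clustering / sort key(s)
```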
Infer Schema
A CLI command (starlake infer-schema) that analyzes sample data files and automatically generates the corresponding YAML schema definition. Infer Schema detects column names, data types, separators, and encoding, accelerating the onboarding of new data sources.