Architecture
1. Declarative Pipeline Model
Starlake is a declarative data platform: all pipeline behavior is defined in YAML configuration and SQL queries, with no imperative code required. The same definitions run unchanged across multiple engines (BigQuery, Snowflake, Spark, DuckDB, or any JDBC database), with SQL dialects transpiled automatically.
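As a sketch of what "declarative" means in practice, the YAML below describes a load target purely through configuration. The path, key names, and types are illustrative and may differ from the exact schema of your Starlake version:

```yaml
# metadata/load/sales/orders.sl.yml (illustrative path and keys)
table:
  pattern: "orders.*\\.csv"     # incoming files matching this pattern feed the table
  metadata:
    format: "DSV"               # delimiter-separated values
    separator: ","
    withHeader: true
  attributes:                   # schema + validation rules, no imperative code
    - name: "order_id"
      type: "long"
      required: true
    - name: "customer_email"
      type: "email"             # semantic type backed by a regex in metadata/types/
```

The same file drives ingestion regardless of whether the target warehouse is BigQuery, Snowflake, or DuckDB; only the connection settings change.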
2. Pipeline Stages
Data flows through four stages:
EXTRACT ──→ LOAD ──→ TRANSFORM ──→ ORCHESTRATE
| Stage | Input | Output | Purpose |
|---|---|---|---|
| Extract | JDBC databases | Files (CSV, JSON, Parquet) | Pull data from external sources |
| Load | Files (CSV, JSON, XML, Parquet, fixed-width) | Validated warehouse tables | Ingest, validate, and write to target tables |
| Transform | Warehouse tables | Derived tables, views, or files | SQL/Python analytics and KPI computation |
| Orchestrate | YAML + SQL dependencies | DAGs (Airflow, Dagster, Snowflake Tasks) | Schedule and coordinate execution |
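Each stage is itself configured declaratively. As an example for the Extract stage, the sketch below assumes a JDBC connection named `source_pg` declared in `application.sl.yml`; the keys mirror Starlake's extract configuration but should be checked against the version you run:

```yaml
# metadata/extract/source_pg.sl.yml (illustrative)
extract:
  connectionRef: "source_pg"    # JDBC connection defined in application.sl.yml
  jdbcSchemas:
    - schema: "sales"
      tables:
        - name: "orders"        # extracted to files, then picked up by Load
        - name: "customers"
```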
3. Project Directory Structure
Every Starlake project follows a standardized layout:
```
$SL_ROOT/
├── metadata/                        # Pipeline definitions (source of truth)
│   ├── application.sl.yml           # Global config: connections, engines, audit
│   ├── env.sl.yml                   # Environment variables
│   ├── env.SNOWFLAKE.sl.yml         # Engine-specific overrides
│   ├── types/                       # Data type definitions (regex-based)
│   │   └── default.sl.yml
│   ├── load/                        # Ingestion definitions
│   │   └── {domain}/
│   │       ├── _config.sl.yml       # Domain-level defaults
│   │       └── {table}.sl.yml       # Table schema + validation rules
│   ├── extract/                     # JDBC extraction definitions
│   │   └── {connection}.sl.yml
│   ├── transform/                   # SQL/Python transformation definitions
│   │   └── {domain}/
│   │       ├── _config.sl.yml
│   │       ├── {task}.sql
│   │       └── {task}.sl.yml        # Optional task config
│   ├── expectations/                # Data quality macros (.j2 templates)
│   ├── dags/                        # DAG templates for orchestrators
│   └── tests/                       # Unit test definitions
│       ├── load/
│       └── transform/
```
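To tie the transform entries together: a task is typically a `{task}.sql` file with an optional `{task}.sl.yml` sidecar that controls how results are written. The pair below is a hypothetical sketch; key names such as `writeStrategy` vary across Starlake versions, so treat them as assumptions rather than the canonical schema:

```yaml
# metadata/transform/kpi/revenue_by_day.sl.yml (illustrative)
# Companion SQL lives in revenue_by_day.sql, e.g.:
#   SELECT order_date, SUM(amount) AS revenue
#   FROM sales.orders
#   GROUP BY order_date
task:
  writeStrategy:
    type: "OVERWRITE"           # assumed key: rebuild kpi.revenue_by_day on each run
```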