Load | Starlake

📄️ Load CSV, JSON and XML Files into Your Data Warehouse

Step-by-step tutorial: validate CSV, JSON, and XML files and load them into BigQuery, Snowflake, Databricks, or DuckDB using Starlake autoload. Includes schema inference, data validation, and write strategies.

📄️ Autoload — Zero-Config Data Loading

Starlake autoload detects CSV, JSON, and XML file formats, infers schemas, and loads data into your warehouse with zero configuration. Learn naming conventions, directory structure, and format detection rules.

📄️ Manual Load Configuration

Configure Starlake manual load for multi-character delimiters, custom folder layouts, and per-environment incoming paths. Define domain and table settings in YAML.

📄️ Load CSV and DSV Files

Load CSV and delimiter-separated files into BigQuery, Snowflake, Databricks, or DuckDB with Starlake. Configure separators, encoding, headers, column validation, and write strategies in YAML.

📄️ Load JSON Files

Load JSON files with nested and repeated attributes into BigQuery, Snowflake, or Databricks using Starlake. Supports JSON, JSON_FLAT, and JSON_ARRAY formats with automatic schema inference.

📄️ Load XML Files

Load XML files into BigQuery, Snowflake, or Databricks using Starlake. Configure rowTag, attribute prefixes, XSD validation, and nested element mapping in YAML.

📄️ Load Fixed-Width Files

Load fixed-width positional files into BigQuery, Snowflake, Databricks, or DuckDB using Starlake. Define column positions in YAML, infer schemas automatically, and validate data during loading.

📄️ Load Strategies — File Ordering

Configure file processing order in Starlake with built-in load strategies (chronological or alphabetical) or implement a custom Scala strategy. Set loadStrategyClass in application.sl.yml.

📄️ Write Strategies in Starlake: APPEND, OVERWRITE, UPSERT, SCD2, DELETE_THEN_INSERT

Configure write strategies in Starlake: APPEND, OVERWRITE, UPSERT_BY_KEY, SCD2, DELETE_THEN_INSERT and ADAPTATIVE. Declarative YAML config for BigQuery, Snowflake and Databricks.

📄️ Change Data Capture (CDC) with Starlake

Implement end-to-end CDC pipelines with Starlake. Capture database changes through Debezium/Kafka, incremental extraction, or file-based CDC, then apply them with the UPSERT, SCD2, and DELETE_THEN_INSERT write strategies.

📄️ Clustering and Partitioning in Starlake — BigQuery, Databricks, Spark

Configure clustering, partitioning, materialized views and expiration in Starlake for BigQuery and Spark/Databricks. YAML-based sink configuration for query optimization.

📄️ Native Load Mode — Skip Spark Validation in Starlake

Bypass Spark validation and load files directly into BigQuery, Snowflake or Databricks using Starlake native load mode. Configure at table, domain or project level with YAML.

📄️ Type Validation in Starlake — Built-in and Custom Types with Regex

Validate data types during ingestion with Starlake. Define schemas with built-in types and custom types based on regex patterns, whose primitive types map to SQL types. Rejected records go to an audit table.

📄️ Transform on Load — Computed Columns, Ignore and Foreign Keys in Starlake

Add computed columns, ignore sensitive fields (GDPR) and define foreign keys during data ingestion in Starlake. Use Spark SQL functions and file metadata to enrich records.

📄️ Data Quality Expectations — Post-Load Assertions in Starlake

Define post-load data quality expectations in Starlake using Jinja SQL macros. Assert uniqueness, row counts and custom conditions on target tables with failOnError support.

📄️ Orchestrate Load Jobs with Starlake — Airflow, Dagster, Snowflake Tasks

Generate orchestration DAGs automatically for Starlake load jobs. Supports Airflow, Dagster and Snowflake Tasks. Run starlake dag-generate to analyze dependencies and produce DAG files.

📄️ Ingestion Metrics in Starlake — Continuous and Discrete Data Profiling

Compute continuous and discrete metrics during data ingestion with Starlake. Track mean, variance, percentiles, skewness, kurtosis, category frequency and more. Compare metrics across loads.