📄️ Load CSV, JSON and XML Files into Your Data Warehouse
Step-by-step tutorial: validate CSV, JSON, and XML files and load them into BigQuery, Snowflake, Databricks or DuckDB using Starlake autoload. Covers schema inference, data validation and write strategies.
📄️ Autoload — Zero-Config Data Loading
Starlake autoload detects CSV, JSON, and XML file formats, infers schemas, and loads data into your warehouse with zero configuration. Learn naming conventions, directory structure, and format detection rules.
📄️ Manual Load Configuration
Configure Starlake manual load for multi-character delimiters, custom folder layouts, and per-environment incoming paths. Define domain and table settings in YAML.
📄️ Load CSV and DSV Files
Load CSV and delimiter-separated files into BigQuery, Snowflake, Databricks or DuckDB with Starlake. Configure separators, encoding, headers, column validation, and write strategies in YAML.
📄️ Load JSON Files
Load JSON files with nested and repeated attributes into BigQuery, Snowflake, or Databricks using Starlake. Supports JSON, JSON_FLAT, and JSON_ARRAY formats with automatic schema inference.
📄️ Load XML Files
Load XML files into BigQuery, Snowflake, or Databricks using Starlake. Configure rowTag, attribute prefixes, XSD validation, and nested element mapping in YAML.
📄️ Load Fixed-Width Files
Load fixed-width positional files into BigQuery, Snowflake, Databricks, or DuckDB using Starlake. Define column positions in YAML, infer schemas automatically, and validate data during loading.
📄️ Load Strategies — File Ordering
Configure file processing order in Starlake with built-in load strategies (chronological or alphabetical) or implement a custom Scala strategy. Set loadStrategyClass in application.sl.yml.
📄️ Write Strategies in Starlake: APPEND, OVERWRITE, UPSERT, SCD2, DELETE_THEN_INSERT
Configure write strategies in Starlake: APPEND, OVERWRITE, UPSERT_BY_KEY, SCD2, DELETE_THEN_INSERT and ADAPTATIVE. Declarative YAML config for BigQuery, Snowflake and Databricks.
📄️ Clustering and Partitioning in Starlake — BigQuery, Databricks, Spark
Configure clustering, partitioning, materialized views and expiration in Starlake for BigQuery and Spark/Databricks. YAML-based sink configuration for query optimization.
📄️ Native Load Mode — Skip Spark Validation in Starlake
Bypass Spark validation and load files directly into BigQuery, Snowflake or Databricks using Starlake native load mode. Configure at table, domain or project level with YAML.
📄️ Type Validation in Starlake — Built-in and Custom Types with Regex
Validate data types during ingestion with Starlake. Define schemas using built-in types or custom types based on regex patterns, each mapped to a SQL primitive type. Rejected records go to an audit table.
📄️ Transform on Load — Computed Columns, Ignore and Foreign Keys in Starlake
Add computed columns, ignore sensitive fields (GDPR) and define foreign keys during data ingestion in Starlake. Use Spark SQL functions and file metadata to enrich records.
📄️ Access Control in Starlake — Table, Row and Column Level Security for BigQuery and Databricks
Define table ACL, row-level security (RLS) with SQL predicates, and column-level access policies (PII) in Starlake. YAML configuration for BigQuery and Databricks.
📄️ Data Quality Expectations — Post-Load Assertions in Starlake
Define post-load data quality expectations in Starlake using Jinja SQL macros. Assert uniqueness, row counts and custom conditions on target tables with failOnError support.
📄️ Orchestrate Load Jobs with Starlake — Airflow, Dagster, Snowflake Tasks
Generate orchestration DAGs automatically for Starlake load jobs. Supports Airflow, Dagster and Snowflake Tasks. Run starlake dag-generate to analyze dependencies and produce DAG files.
📄️ Ingestion Metrics in Starlake — Continuous and Discrete Data Profiling
Compute continuous and discrete metrics during data ingestion with Starlake. Track mean, variance, percentiles, skewness, kurtosis, category frequency and more. Compare metrics across loads.