Autoload: Zero-Config File Loading into Your Data Warehouse

Starlake autoload detects CSV, JSON, and XML file formats, infers target table schemas, and loads data into BigQuery, Snowflake, Databricks, DuckDB, or any other supported warehouse, all in a single command with zero configuration. It uses file naming conventions and the directory structure to determine the target domain and table, and it handles file staging, validation, and archiving automatically.

For more control over parsing (e.g., multi-character delimiters or non-standard directories), use the manual load workflow instead. For a hands-on walkthrough, see the load tutorial.

Domain and Schema Terminology

Note: Data warehouses organize tables into schemas. Depending on the engine, a schema is called a schema (Snowflake, Databricks), a catalog, or a dataset (BigQuery). In Starlake, the term domain designates a schema, catalog, or dataset.

The folder name under incoming/ becomes the domain name in the target warehouse.

File Naming Conventions and Directory Structure

Autoload relies on conventions to map files to warehouse tables:

  • Files are stored under the incoming directory: $SL_ROOT/incoming/<domain>/<table><suffix>.<extension>
  • The folder name becomes the target schema (domain).
  • The file name (without extension or suffix) becomes the target table name.
  • A date/time suffix triggers APPEND mode (incremental loading). Files without a suffix use OVERWRITE mode.

For example, order_20240228.csv and order_line_20240228.csv load into tables order and order_line respectively, both in APPEND mode.

Directory structure before running autoload
incoming/
└── starbake/
    ├── product.xml
    ├── order_20240228.csv
    └── order_line_20240228.csv
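
The naming conventions above can be sketched in a few lines of Python. This is an illustration only, not Starlake's implementation: the function name resolve_target is hypothetical, and the regular expression assumes that a date/time suffix is an underscore followed by 8 to 14 digits, which may differ from the pattern Starlake actually applies.

```python
import re
from pathlib import PurePosixPath

# Assumption: a date/time suffix is "_" followed by 8 to 14 digits
# (e.g. _20240228 or _20240228103000). Starlake's real pattern may differ.
SUFFIX_RE = re.compile(r"^(?P<table>.+?)_(?P<stamp>\d{8,14})$")


def resolve_target(path: str) -> dict:
    """Map an incoming file path to a domain, a table name, and a write mode."""
    p = PurePosixPath(path)
    domain = p.parent.name   # folder name under incoming/
    stem = p.stem            # file name without its extension
    match = SUFFIX_RE.match(stem)
    if match:
        # A date/time suffix means incremental loading.
        return {"domain": domain, "table": match.group("table"), "mode": "APPEND"}
    # No suffix: the table is replaced on each load.
    return {"domain": domain, "table": stem, "mode": "OVERWRITE"}


if __name__ == "__main__":
    for name in ("product.xml", "order_20240228.csv", "order_line_20240228.csv"):
        print(resolve_target(f"incoming/starbake/{name}"))
```

With the sample directory above, product.xml resolves to table product in OVERWRITE mode, while the two suffixed files resolve to order and order_line in APPEND mode, all in the starbake domain.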

Running Autoload

Run the autoload command:

starlake autoload

Autoload performs four steps:

  1. Detects the file format from extension and content
  2. Infers the schema and generates YAML configuration files
  3. Stages files for loading
  4. Loads files into the target warehouse

Directory Structure After Autoload

After autoload runs, Starlake generates schema files and organizes data files across several directories.

Table schemas inferred from the directory structure
metadata/load/
└── starbake/
    ├── _config.sl.yml
    ├── product.sl.yml
    ├── order.sl.yml
    └── order_line.sl.yml

datasets/
├── archive/          # Files moved here after successful loading
│   └── starbake/
├── ingesting/        # Files moved here during loading
│   └── starbake/
├── replay/           # Invalid records that need to be replayed
│   └── starbake/
├── stage/            # Files staged before loading
│   └── starbake/
├── unresolved/       # Files that do not match a valid pattern
└── incoming/
    └── starbake/
        ├── product.xml
        ├── order_20240228.csv
        └── order_line_20240228.csv

Generated Configuration Files

  • _config.sl.yml: Domain configuration file. Defines how the domain (schema) is created in the target database.
  • <table>.sl.yml: Table configuration file. Describes how files are parsed, validated, and loaded into the target table.
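
As a quick sanity check after a run, the following Python sketch lists which of the expected configuration files are missing. It relies only on the metadata/load/<domain>/ layout described above. The helper name missing_config_files and its parameters are hypothetical and not part of Starlake.

```python
from pathlib import Path


def missing_config_files(root, domain, tables):
    """Return the expected .sl.yml files that do not exist under root.

    Expected layout (from the section above):
      metadata/load/<domain>/_config.sl.yml
      metadata/load/<domain>/<table>.sl.yml
    """
    root = Path(root)
    base = root / "metadata" / "load" / domain
    expected = [base / "_config.sl.yml"]
    expected += [base / f"{table}.sl.yml" for table in tables]
    # Report paths relative to the project root, with forward slashes.
    return [p.relative_to(root).as_posix() for p in expected if not p.is_file()]


if __name__ == "__main__":
    # Assumes the current directory is the project root.
    print(missing_config_files(".", "starbake", ["product", "order", "order_line"]))
```

An empty list means that every table named in the call has a generated configuration file, along with the domain-level _config.sl.yml.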

How Autoload Detects File Format (CSV, JSON, XML)

Autoload uses the following rules to detect file formats:

  • The file extension determines the base format. Supported extensions: .csv, .psv, .dsv, .json, .xml.
  • For CSV files: the first line is inspected to detect the separator and column names. The entire file content is analyzed to detect column types.
  • For JSON files: if the first character is [, the format is JSON_ARRAY. If the first character is {, the format is JSON (one object per line).
  • For XML files: the first line is inspected to detect the root element and the record tag.
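
The rules above can be approximated with the Python sketch below. It is a simplified illustration, not Starlake's detection code: the function name detect_format, the list of candidate separators, and the "most frequent candidate wins" heuristic are all assumptions, and column type inference is left out.

```python
from pathlib import PurePosixPath

CSV_EXTENSIONS = {".csv", ".psv", ".dsv"}
# Assumption: separators considered when inspecting the first line.
CANDIDATE_SEPARATORS = [",", ";", "|", "\t"]


def detect_format(filename: str, content: str) -> dict:
    """Guess the file format from the extension and the start of the content."""
    ext = PurePosixPath(filename).suffix.lower()

    if ext in CSV_EXTENSIONS:
        first_line = content.splitlines()[0] if content else ""
        # Heuristic (assumption): the most frequent candidate is the separator.
        separator = max(CANDIDATE_SEPARATORS, key=first_line.count)
        return {
            "format": "CSV",
            "separator": separator,
            "columns": first_line.split(separator),
        }

    if ext == ".json":
        first_char = content[:1]
        if first_char == "[":
            return {"format": "JSON_ARRAY"}
        if first_char == "{":
            return {"format": "JSON"}  # one object per line
        raise ValueError(f"unexpected first character in {filename!r}")

    if ext == ".xml":
        return {"format": "XML"}

    raise ValueError(f"unsupported extension: {ext!r}")


if __name__ == "__main__":
    print(detect_format("order_20240228.csv", "id;customer;amount\n1;42;9.90\n"))
    print(detect_format("events.json", '[{"id": 1}]'))
```

In this sketch the extension is checked first and the content is only inspected to refine the result, which mirrors the order of the rules listed above.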

Expected XML Structure

XML files must follow this structure: a root element containing repeated child elements (rows), where each child contains sub-elements (columns) and, optionally, node attributes.

<?xml version="1.0"?>
<myroot>
  <myrecord attr1="value1">
    <column1>value1</column1>
    <column2>value2</column2>
  </myrecord>
</myroot>
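
To make the row and column mapping concrete, here is a minimal Python sketch that reads a document of this shape with the standard library and flattens each record into a dictionary. It illustrates the expected structure only and is not how Starlake parses XML. The function name xml_records is hypothetical, as is the choice to merge attributes and sub-elements into one flat row.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0"?>
<myroot>
  <myrecord attr1="value1">
    <column1>value1</column1>
    <column2>value2</column2>
  </myrecord>
</myroot>"""


def xml_records(xml_text: str):
    """Return the root tag and one flat dict per child element (row)."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root:              # each child of the root is a row
        row = dict(record.attrib)    # node attributes become columns
        for column in record:        # sub-elements become columns
            row[column.tag] = column.text
        rows.append(row)
    return root.tag, rows


if __name__ == "__main__":
    print(xml_records(SAMPLE))
```

For the sample document, the root tag is myroot and the single row contains attr1, column1, and column2.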

Running Autoload with Spark Locally

When running against Spark locally, tables are created as files in the datasets/ directory. When running against a real warehouse, tables are created in the target database.

Extra files when running autoload against Spark locally
datasets/
├── starbake.db/      # Tables loaded into the starbake schema
│   ├── order/
│   ├── order_line/
│   └── product/
├── audit.db/         # Audit tables
│   ├── audit/
│   └── rejected/

Frequently Asked Questions

How does Starlake autoload detect the file format?

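Autoload uses the file extension (.csv, .psv, .dsv, .json, .xml) to determine the base format. For CSV files, it inspects the first line to detect the separator and column names, then analyzes the file content to detect column types. For JSON, it checks whether the file starts with [ (JSON_ARRAY) or { (JSON, one object per line). For XML, it inspects the first line to detect the root element and the record tag.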

How does autoload decide between APPEND and OVERWRITE mode?

Files with a date/time suffix (e.g., order_20240228.csv) are loaded in APPEND mode. Files without a suffix are loaded in OVERWRITE mode.

What happens to files that do not match any expected pattern?

Files whose names do not match a valid pattern are moved to the datasets/unresolved/ directory. They are not loaded.

What is the difference between a domain and a schema in Starlake?

In Starlake, "domain" refers to what databases call a schema, catalog, or dataset (depending on the engine). The folder name under incoming/ becomes the domain name.

Where does autoload store inferred schema files?

Schema files are generated under metadata/load/<domain>/. Each table gets a <table>.sl.yml file, and the domain gets a _config.sl.yml file.

What directory structure does autoload create?

Autoload uses directories under datasets/: incoming (source), stage (pre-load), ingesting (during load), archive (after load), replay (invalid records), and unresolved (unmatched files).

Does autoload work with Spark locally?

Yes. When running locally with Spark, tables are created as files in datasets/<domain>.db/ directories, with an audit.db/ directory for audit and rejected tables.

What XML structure does autoload expect?

Autoload expects XML files with a root element containing repeated child elements (rows). Each child element contains sub-elements (columns) and optionally node attributes.