Autoload: Zero-Config File Loading into Your Data Warehouse
Starlake autoload detects CSV, JSON, and XML file formats, infers target table schemas, and loads data into BigQuery, Snowflake, Databricks, DuckDB, or any supported warehouse -- all in a single command with zero configuration. It uses file naming conventions and directory structure to determine the target domain and table. Autoload handles file staging, validation, and archiving automatically.
For more control over parsing (e.g., multi-character delimiters or non-standard directories), use the manual load workflow instead. For a hands-on walkthrough, see the load tutorial.
Domain and Schema Terminology
Data warehouses organize tables into schemas. Depending on the engine, this grouping is called a schema (Snowflake, Databricks), a catalog, or a dataset (BigQuery). In Starlake, the term domain designates a schema, catalog, or dataset.
The folder name under incoming/ becomes the domain name in the target warehouse.
File Naming Conventions and Directory Structure
Autoload relies on conventions to map files to warehouse tables:
- Files are stored under the incoming directory: $SL_ROOT/incoming/<domain>/<table><suffix>.<extension>
- The folder name becomes the target schema (domain).
- The file name (without extension or suffix) becomes the target table name.
- A date/time suffix triggers APPEND mode (incremental loading). Files without a suffix use OVERWRITE mode.
For example, order_20240228.csv and order_line_20240228.csv load into the tables order and order_line respectively, both in APPEND mode, while product.xml loads into product in OVERWRITE mode.
incoming/
└── starbake/
├── product.xml
├── order_20240228.csv
└── order_line_20240228.csv
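The naming convention can be sketched as a small shell function. This is illustrative only: the `resolve_target` helper and the eight-digit date pattern are assumptions for the sketch, not Starlake's actual file-matching logic.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the naming convention, not Starlake's real resolver:
# a trailing date/time suffix selects APPEND, otherwise OVERWRITE.
resolve_target() {
  local base="${1%.*}"                       # strip the extension
  if [[ "$base" =~ ^(.+)_[0-9]{8}$ ]]; then  # assume an 8-digit date suffix
    echo "table=${BASH_REMATCH[1]} mode=APPEND"
  else
    echo "table=$base mode=OVERWRITE"
  fi
}

resolve_target order_20240228.csv   # table=order mode=APPEND
resolve_target product.xml          # table=product mode=OVERWRITE
```

Starlake performs this resolution internally when it scans the incoming directory; the sketch only makes the suffix rule explicit.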
Running Autoload
Run the autoload command:
starlake autoload
Autoload performs four steps:
- Detects the file format from extension and content
- Infers the schema and generates YAML configuration files
- Stages files for loading
- Loads files into the target warehouse
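End to end, a minimal session looks like this. Paths and the sample file are illustrative; the final step assumes the starlake CLI is installed and on your PATH.

```shell
# Stage a sample file under the incoming directory, following the conventions above.
export SL_ROOT="${SL_ROOT:-$PWD}"
mkdir -p "$SL_ROOT/incoming/starbake"
printf 'id,name\n1,Baguette\n' > "$SL_ROOT/incoming/starbake/product_20240228.csv"

# Detect formats, infer schemas, stage, and load in a single command
# (guarded so the sketch is harmless where starlake is not installed).
if command -v starlake >/dev/null 2>&1; then
  starlake autoload
fi
```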
Directory Structure After Autoload
After autoload runs, Starlake generates schema files and organizes data files across several directories.
metadata/load/
└── starbake/
├── _config.sl.yml
├── product.sl.yml
└── order.sl.yml
datasets/
├── archive/ # Files moved here after successful loading
│ └── starbake/
├── ingesting/ # Files moved here during loading
│ └── starbake/
├── replay/ # Invalid records that need to be replayed
│ └── starbake/
├── stage/ # Files staged before loading
│ └── starbake/
├── unresolved/ # Files that do not match a valid pattern
└── incoming/
└── starbake/
├── product.xml
├── order_20240228.csv
└── order_line_20240228.csv
Generated Configuration Files
- _config.sl.yml: Domain configuration file. Defines how the domain (schema) is created in the target database.
- <table>.sl.yml: Table configuration file. Describes how files are parsed, validated, and loaded into the target table.
How Autoload Detects File Format (CSV, JSON, XML)
Autoload uses the following rules to detect file formats:
- The file extension determines the base format. Supported extensions: .csv, .psv, .dsv, .json, .xml.
- For CSV files: the first line is inspected to detect the separator and column names. The entire file content is analyzed to detect column types.
- For JSON files: if the first character is [, the format is JSON_ARRAY. If the first character is {, the format is JSON (one object per line).
- For XML files: the first line is inspected to detect the root element and the record tag.
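The JSON rule comes down to a check on the file's first byte, which can be mimicked in a few lines of shell. A sketch only: `detect_json_variant` is a hypothetical helper, not part of the Starlake CLI.

```shell
# Classify a JSON file the way the rule above describes:
# '[' -> JSON_ARRAY, '{' -> JSON (one object per line).
detect_json_variant() {
  case "$(head -c1 "$1")" in
    '[') echo JSON_ARRAY ;;
    '{') echo JSON ;;
    *)   echo UNKNOWN ;;
  esac
}

printf '[{"id": 1}, {"id": 2}]' > /tmp/order_array.json
printf '{"id": 1}\n{"id": 2}\n' > /tmp/order_lines.json
detect_json_variant /tmp/order_array.json   # JSON_ARRAY
detect_json_variant /tmp/order_lines.json   # JSON
```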
Expected XML Structure
XML files must follow this structure -- a root element containing repeated child elements (rows), where each child contains sub-elements (columns) and, optionally, attributes on the record element:
<?xml version="1.0"?>
<myroot>
<myrecord attr1="value1">
<column1>value1</column1>
<column2>value2</column2>
</myrecord>
</myroot>
Running Autoload with Spark Locally
When running against Spark locally, tables are created as files in the datasets/ directory. When running against a real warehouse, tables are created in the target database.
datasets/
├── starbake.db/ # Tables loaded into the starbake schema