Load CSV, JSON and XML Files into Your Data Warehouse

Starlake loads and validates CSV, JSON, XML, fixed-width, Parquet, and Avro files into BigQuery, Snowflake, Databricks, DuckDB, and other warehouses. The autoload command handles format detection, schema inference, staging, and loading in a single step. For more control, the manual workflow lets you run infer-schema, stage, and load separately.

Every record is validated against the inferred or user-defined schema. Valid records go to the target table. Rejected records are stored in a dedicated audit.rejected table with the rejection reason.

Prerequisites: Prepare Your Incoming Data Files

If you skipped the extract step above, copy the folder and its files from $SL_ROOT/sample-data/ to the $SL_ROOT/incoming/ folder.
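The copy step can be sketched as follows. This is a standalone illustration: a temp directory stands in for `$SL_ROOT`, and the sample files are created empty so the commands run anywhere.

```shell
# Standalone sketch: a temp dir stands in for $SL_ROOT, and empty files
# stand in for the real sample data shipped with the tutorial.
SL_ROOT="$(mktemp -d)"
mkdir -p "$SL_ROOT/sample-data/starbake"
touch "$SL_ROOT/sample-data/starbake/product.xml" \
      "$SL_ROOT/sample-data/starbake/order_20240228.csv" \
      "$SL_ROOT/sample-data/starbake/order_line_20240228.csv"

# Copy the whole domain folder into incoming/ so Starlake can pick it up
mkdir -p "$SL_ROOT/incoming"
cp -r "$SL_ROOT/sample-data/starbake" "$SL_ROOT/incoming/"
```

In a real project you would run only the last two commands against your actual `$SL_ROOT`.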

The directory structure under incoming/ determines how files map to your warehouse:

  • The folder name (starbake) becomes the database schema (called "dataset" in BigQuery, "schema" in Snowflake/Databricks).
  • Each file name (minus extension and date suffix) becomes a target table name.
  • Files suffixed by a date/time (e.g., order_20240228.csv) are loaded incrementally using the APPEND write strategy. Files without a suffix default to OVERWRITE.
Expected directory structure
incoming/
└── starbake/
    ├── product.xml
    ├── order_20240228.csv
    └── order_line_20240228.csv
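The naming convention above can be sketched as a small shell loop. This is a hypothetical illustration of the mapping rule, not Starlake code: it derives the target table and write strategy from each file name.

```shell
# Hypothetical sketch of the convention: folder -> schema, file name -> table,
# and a date suffix switching the write strategy from OVERWRITE to APPEND.
for f in product.xml order_20240228.csv order_line_20240228.csv; do
  base="${f%.*}"              # drop the extension
  table="${base%_[0-9]*}"     # drop a trailing _YYYYMMDD date suffix, if present
  if [ "$table" = "$base" ]; then strategy=OVERWRITE; else strategy=APPEND; fi
  echo "$f -> starbake.$table ($strategy)"
done
```

For example, `order_20240228.csv` maps to the `starbake.order` table with the APPEND strategy, while `product.xml` maps to `starbake.product` with OVERWRITE.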

Autoload: Load Files in One Command

The autoload command loads all files from the default incoming folder into the warehouse. It performs four steps automatically:

  1. Detects the file type based on extension and content
  2. Infers the schema of each target table
  3. Stages the files for loading
  4. Loads the files into the target tables

warning

autoload skips schema inference if a .sl.yml schema file already exists in metadata/load/. Delete or move existing schema files to force re-inference.

Clean the metadata/load folder of any schemas you generated previously, then run:

starlake autoload
note

Starlake looks at files present in the $SL_ROOT/datasets/incoming folder.

The default incoming folder can be changed by setting the SL_INCOMING environment variable or by configuring the application.area.incoming property in metadata/application.sl.yml.
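For example, the property-based override might look like the fragment below. The directory value is illustrative; the nesting follows the `application.area.incoming` property path named above.

```yaml
# metadata/application.sl.yml — the path value below is illustrative
application:
  area:
    incoming: "/data/landing"
```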

Once loaded, files are moved to $SL_ROOT/datasets/archive/ if the SL_ARCHIVE environment variable is set to true (the default). To replay a load, move the files back to the incoming/ folder.
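Replaying a load can be sketched as follows. This is a standalone illustration: a temp directory stands in for `$SL_ROOT`, and an empty file stands in for an archived data file.

```shell
# Standalone sketch of replaying a load: recreate the archive layout in a
# temp dir, then move the archived file back to incoming/.
SL_ROOT="$(mktemp -d)"
mkdir -p "$SL_ROOT/datasets/archive/starbake" "$SL_ROOT/incoming/starbake"
touch "$SL_ROOT/datasets/archive/starbake/order_20240228.csv"

# Move archived files back so the next load picks them up again
mv "$SL_ROOT/datasets/archive/starbake/"* "$SL_ROOT/incoming/starbake/"
```

In a real project only the `mv` command is needed, run against your actual `$SL_ROOT`.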

Query Loaded Data with DuckDB

After loading, query your warehouse to verify the data.

note

Install DuckDB if you haven't done so already.

Open the DuckDB shell:

$ cd $SL_ROOT
$ duckdb datasets/duckdb.db
v0.10.0 20b1486d11
Enter ".help" for usage hints.

List all loaded tables:

D show;
┌──────────┬──────────┬─────────────────┬──────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────┬───────────┐
│ database │ schema │ name │ column_names │ column_types │ temporary │
│ varchar │ varchar │ varchar │ varchar[] │ varchar[] │ boolean │
├──────────┼──────────┼─────────────────┼──────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────┼───────────┤
│ duckdb │ audit │ audit │ [JOBID, PATHS, DOM… │ [VARCHAR, VARCHAR, VARCHAR, VARCHAR, BOOLEAN, BIGINT, BIGINT, BIGINT, TIMESTAMP, INTEGER, VARCH… │ false │
│ duckdb │ audit │ rejected │ [JOBID, TIMESTAMP,… │ [VARCHAR, TIMESTAMP, VARCHAR, VARCHAR, VARCHAR, VARCHAR] │ false │
│ duckdb │ starbake │ order │ [customer_id, orde… │ [BIGINT, BIGINT, VARCHAR, TIMESTAMP] │ false │
│ duckdb │ starbake │ order_line │ [order_id, product… │ [BIGINT, BIGINT, BIGINT, DOUBLE] │ false │
│ duckdb │ starbake │ product │ [category, cost, d… │ [VARCHAR, DOUBLE, VARCHAR, VARCHAR, DOUBLE, BIGINT] │ false │
└──────────┴──────────┴─────────────────┴──────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────┴───────────┘

D select * from starbake.order limit 5;
┌─────────────┬──────────┬───────────┬─────────────────────────┐
│ customer_id │ order_id │ status │ timestamp │
│ int64 │ int64 │ varchar │ timestamp │
├─────────────┼──────────┼───────────┼─────────────────────────┤
│ 6 │ 1 │ Cancelled │ 2024-02-05 22:19:15.454 │
│ 23 │ 2 │ Delivered │ 2024-01-02 11:44:37.59 │
│ 20 │ 3 │ Delivered │ 2024-02-10 23:10:30.685 │
│ 6 │ 4 │ Delivered │ 2024-01-17 19:31:22.917 │
│ 17 │ 5 │ Pending │ 2024-01-19 01:26:06.674 │
└─────────────┴──────────┴───────────┴─────────────────────────┘

Count loaded records:

D SELECT COUNT(*) FROM starbake.order;

Check for rejected records:

D SELECT * FROM audit.rejected;

To exit the DuckDB shell, type .quit and press Enter.

D .quit
$

Manual Load: Infer Schema, Stage and Load Separately

The autoload command is a shortcut that combines three separate commands. If you already ran autoload, you can skip this section. Use the manual workflow when you need more control over schema customization before loading.

Step 1: Infer Schema

Generate YAML schema files from your data files:

starlake infer-schema --input incoming/starbake/ --clean

Your metadata/load folder now contains the starbake folder with the following files:

  • product.sl.yml
  • order.sl.yml
  • order_line.sl.yml
  • _config.sl.yml

The infer-schema command detects column names, types, and parsing options from the source files. In a real scenario, review and adjust these schema files before loading.
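As a rough guide, an inferred table schema has the shape sketched below. The attribute names and options shown are illustrative; consult the generated file for the exact layout Starlake produced from your data.

```yaml
# Illustrative shape of an inferred schema file (values are examples only;
# check the generated order.sl.yml for what infer-schema actually detected)
table:
  name: "order"
  pattern: "order_.*.csv"
  metadata:
    format: "DSV"
    withHeader: true
    separator: ","
  attributes:
    - name: "order_id"
      type: "long"
    - name: "status"
      type: "string"
```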

Step 2: Stage Before Loading

Starlake can stage files before loading them. This is useful when files arrive in a different folder from the one used for loading.

starlake stage

Step 3: Load Data into Your Warehouse

Run the load command. Since this tutorial targets DuckDB, set the SL_ENV variable to activate the env.DUCKDB.sl.yml configuration file:

SL_ENV=DUCKDB starlake load

Your raw files are now loaded into the warehouse, and you can start transforming the data to build insights.

Frequently Asked Questions

What file formats does Starlake support for loading?

Starlake supports loading CSV, TSV, JSON, XML, Fixed-width, Parquet, and Avro files. The file format is auto-detected based on the file extension and content when using the autoload command.

What is the difference between autoload and load in Starlake?

The autoload command is a shortcut that combines infer-schema, stage, and load into a single step. It automatically detects file types, infers schemas, and loads data. The manual load command requires you to run infer-schema and stage separately before loading.

Does Starlake validate data during loading?

Yes. Starlake validates every record against the inferred or user-defined schema. Valid records go to the target table. Rejected records are stored in the audit.rejected table with the rejection reason.

Where do files go after being loaded by Starlake?

Once loaded, files are moved to the $SL_ROOT/datasets/archive folder if the SL_ARCHIVE environment variable is set to true (the default). You can replay them by moving them back to the incoming folder.

How does Starlake map files to database schemas and tables?

The folder name under incoming/ becomes the database schema (domain). Each file name (minus extension and date suffix) becomes the target table name.

Can I load data incrementally with Starlake?

Yes. Files with a date/time suffix (e.g., order_20240228.csv) are loaded incrementally using the APPEND write strategy. Files without a suffix default to OVERWRITE.

How do I change the default incoming directory?

Set the SL_INCOMING environment variable, or configure the application.area.incoming property in metadata/application.sl.yml.

How do I review rejected records?

Query the audit.rejected table. It contains the job ID, timestamp, domain, table name, the rejected record, and the rejection reason.