Load CSV and Delimiter-Separated Files into Your Data Warehouse

Starlake loads CSV and delimiter-separated value (DSV) files into BigQuery, Snowflake, Databricks, DuckDB, and other warehouses. It supports configurable separators, encodings, headers, and write strategies. Schema inference auto-detects column names and types. Every record is validated against the schema; rejected records are stored in the audit.rejected table for review.

You can use autoload for zero-config loading, or configure the manual load workflow for full control over parsing options.

Infer CSV Schema Automatically with infer-schema

Start by inferring the schema from a sample data file. The infer-schema command detects column names, types, separator, and encoding, then generates a YAML configuration file.

starlake infer-schema --input incoming/starbake

This generates a <table>.sl.yml file under metadata/load/<domain>/. Review and customize it before loading. See the autoload format detection documentation for details on how detection works.
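For illustration, the generated file for an order_line table in a starbake domain might look like the sketch below; the detected separator, encoding, and column types depend entirely on your sample data:

metadata/load/starbake/order_line.sl.yml (sketch of inferred output)
table:
  pattern: "order_line.*.csv"
  metadata:
    format: "DSV"
    withHeader: true
    separator: ";"
    encoding: "UTF-8"
  attributes:
    - name: "order_id"
      type: "int"
    - name: "sale_price"
      type: "double"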

CSV Parsing Options: Separator, Encoding, Quote, Escape

Most of the time, infer-schema generates a working configuration. Customize the metadata section when you need to override parsing behavior.

metadata/load/<domain>/<table>.sl.yml - parsing section
table:
  pattern: "order_line.*.csv"
  metadata:
    format: "DSV"
    withHeader: true
    separator: ";"
    ...
  attributes:
    - ...

Available Parsing Options

| Option | Description | Default |
|--------|-------------|---------|
| encoding | File encoding (e.g., UTF-8, ISO-8859-1, Windows-1252) | UTF-8 |
| withHeader | Whether the file contains a header row | true |
| separator | Column delimiter character | ; |
| quote | Quote character for wrapping field values | " |
| escape | Escape character inside quoted fields | \ |
| filter | Condition/regex to filter lines. Matching lines are loaded; others are discarded | None |
| ack | Load the file only if a file with the same name and this extension exists in the same directory | None |
| options | Map of options passed to the parser. See the Apache Spark CSV documentation for all available options | None |
| emptyIsNull | Treat empty column values as NULL | false |
| fillWithDefaultValue | Fill NULL values with the column's default value | false |
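As an illustration, the metadata section below combines several of these options for a hypothetical pipe-delimited, Windows-1252 file; the pattern, the ack extension, and the parser option values are assumptions, not values taken from this guide:

table:
  pattern: "customers.*.psv"
  metadata:
    format: "DSV"
    withHeader: true
    separator: "|"
    encoding: "Windows-1252"
    emptyIsNull: true
    ack: "ok"                # load only when a matching .ok file is present (hypothetical extension)
    options:
      multiLine: "true"      # Spark CSV parser option for quoted fields containing newlines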

Column Validation: Types, Required Fields, Transforms

The attributes section defines column-level validation rules. Each attribute maps to a column in the source file and the target table.

metadata/load/<domain>/<table>.sl.yml - attributes section
table:
  pattern: "order_line.*.csv"
  metadata:
    ...
  attributes:
    - name: "order_id"
      type: "int"
    - name: "product_id"
      type: "int"
    - name: "quantity"
      type: "int"
    - name: "sale_price"
      type: "double"

Available Attribute Properties

| Property | Description | Default |
|----------|-------------|---------|
| name | Column name from the file header. If the file has no header, refers to the column index (0-based). Also used as the target table column name unless rename is set. | (required) |
| type | Column type: string, int, long, double, float, boolean, or any date/timestamp variant defined in metadata/types/default.sl.yml. You can add custom types in the metadata/types/ directory. | string |
| required | Whether the column must have a non-null value | false |
| privacy | Privacy transformation to apply (e.g., masking, hashing) | None |
| script | SQL transformation to apply to the column value | None |
| comment | Column comment | None |
| rename | Target column name in the database (overrides name) | name |
| default | Default value when the column is NULL | None |
| trim | Trim strategy: None, Left, Right, Both | None |
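For illustration, the attribute list below combines several of these properties; the column names and the SQL expression are hypothetical, and the exact script syntax is covered in the Transform on load documentation:

attributes:
  - name: "cust_email"
    type: "string"
    required: true
    rename: "customer_email"       # stored under a different name in the target table
    trim: "Both"
    script: "LOWER(cust_email)"    # hypothetical SQL transform applied on load
  - name: "discount"
    type: "double"
    default: "0.0"                 # applied when the column is NULL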

To add, replace, or ignore attributes during loading, see the Transform on load documentation.

Complete CSV Load Configuration Example

The configuration below shows a complete setup for loading order_line CSV files with the APPEND write strategy:

metadata/load/<domain>/<table>.sl.yml
table:
  pattern: "order_line.*.csv"
  metadata:
    format: "DSV"
    withHeader: true
    separator: ";"
    writeStrategy:
      type: "APPEND"
  attributes:
    - name: "order_id"
      type: "int"
    - name: "product_id"
      type: "int"
    - name: "quantity"
      type: "int"
    - name: "sale_price"
      type: "double"

The writeStrategy.type controls how data is written to the target table. Available strategies: APPEND, OVERWRITE, OVERWRITE_BY_PARTITION, UPSERT_BY_KEY, UPSERT_BY_KEY_AND_TIMESTAMP, and SCD2.
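For example, switching the same table to an upsert strategy only changes the writeStrategy block; the merge keys below are an assumption based on the sample schema:

  metadata:
    ...
    writeStrategy:
      type: "UPSERT_BY_KEY"
      key: ["order_id", "product_id"]   # rows matching these keys are updated, others inserted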

Frequently Asked Questions

What delimiters does Starlake support for CSV files?

Starlake supports any single-character delimiter through the separator property in the metadata configuration. Common delimiters include semicolons (;), commas (,), pipes (|), and tabs. The default separator is semicolon (;). For multi-character delimiters, use the manual load configuration.
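For instance, a tab-separated file can be declared with a YAML escape in a double-quoted string (a minimal sketch):

  metadata:
    format: "DSV"
    separator: "\t"    # double-quoted YAML renders \t as a tab character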

What encoding does Starlake use for CSV files?

The default encoding is UTF-8. Change it by setting the encoding property in the metadata section. Any Java-supported encoding is accepted (e.g., ISO-8859-1, Windows-1252).

How do I load a CSV file without a header row?

Set withHeader: false in the metadata section. Column names in the attributes section then refer to column indices (0-based) instead of header names.
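A minimal sketch for a headerless file follows; writing the index as a quoted string is an assumption, so check the output of infer-schema for the exact notation:

  metadata:
    format: "DSV"
    withHeader: false
  attributes:
    - name: "0"              # first column in the file (0-based index)
      rename: "order_id"
      type: "int"
    - name: "1"
      rename: "product_id"
      type: "int"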

How does Starlake infer the schema of a CSV file?

Run starlake infer-schema --input incoming/<domain>. This generates a .sl.yml configuration file with detected column names, types, and parsing options.

What write strategies are available for CSV loading?

Starlake supports APPEND, OVERWRITE, OVERWRITE_BY_PARTITION, UPSERT_BY_KEY, UPSERT_BY_KEY_AND_TIMESTAMP, and SCD2.

Can I rename columns during CSV loading?

Yes. Use the rename property on any attribute to set a different column name in the target table.

How does Starlake handle empty values in CSV files?

By default, empty values are kept as empty strings. Set emptyIsNull: true in the metadata section to treat empty values as NULL.
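Combining this with fillWithDefaultValue replaces those NULLs with each column's declared default; the sketch below assumes a discount column with a default of 0.0:

  metadata:
    emptyIsNull: true             # empty strings become NULL
    fillWithDefaultValue: true    # NULLs are then replaced by the column default
  attributes:
    - name: "discount"
      type: "double"
      default: "0.0"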

What happens to rows that fail validation?

Rows that do not match the defined schema (wrong type, missing required field) are sent to the audit.rejected table with the rejection reason.

Can I apply transformations during CSV loading?

Yes. Use the script and privacy properties on individual attributes, or see the Transform on load documentation for adding, replacing, or ignoring attributes.