Load CSV and Delimiter-Separated Files into Your Data Warehouse
Starlake loads CSV and delimiter-separated value (DSV) files into BigQuery, Snowflake, Databricks, DuckDB, and other warehouses. It supports configurable separators, encodings, headers, and write strategies. Schema inference auto-detects column names and types. Every record is validated against the schema; rejected records are stored in the audit.rejected table for review.
You can use autoload for zero-config loading, or configure the manual load workflow for full control over parsing options.
Infer CSV Schema Automatically with infer-schema
Start by inferring the schema from a sample data file. The infer-schema command detects column names, types, separator, and encoding, then generates a YAML configuration file.
```bash
starlake infer-schema --input incoming/starbake
```
This generates a <table>.sl.yml file under metadata/load/<domain>/. Review and customize it before loading. See the autoload format detection documentation for details on how detection works.
CSV Parsing Options: Separator, Encoding, Quote, Escape
Most of the time, infer-schema generates a working configuration. Customize the metadata section when you need to override parsing behavior.
```yaml
table:
  pattern: "order_line.*.csv"
  metadata:
    format: "DSV"
    withHeader: true
    separator: ";"
    ...
  attributes:
    - ...
```
Available Parsing Options
| Option | Description | Default |
|---|---|---|
| encoding | File encoding (e.g., UTF-8, ISO-8859-1, Windows-1252) | UTF-8 |
| withHeader | Whether the file contains a header row | true |
| separator | Column delimiter character | ; |
| quote | Quote character for wrapping field values | " |
| escape | Escape character inside quoted fields | \ |
| filter | Condition/regex to filter lines. Matching lines are loaded; others are discarded | None |
| ack | Load the file only if a file with the same name and this extension exists in the same directory | None |
| options | Map of options passed to the parser. See the Apache Spark CSV documentation for all available options | None |
| emptyIsNull | Treat empty column values as NULL | false |
| fillWithDefaultValue | Fill NULL values with the column's default value | false |
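
As an illustration, the metadata block below overrides several of these options for a hypothetical comma-separated, Latin-1 encoded file. The pattern and file name are invented, and only the options you set differ from the defaults:

```yaml
table:
  pattern: "customer.*.csv"   # illustrative pattern, not from the examples above
  metadata:
    format: "DSV"
    withHeader: true
    separator: ","            # comma instead of the default ";"
    encoding: "ISO-8859-1"    # Latin-1 source files
    quote: "\""               # fields wrapped in double quotes (the default)
    escape: "\\"              # backslash escapes quotes inside fields (the default)
    emptyIsNull: true         # load empty cells as NULL instead of ""
```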
Column Validation: Types, Required Fields, Transforms
The attributes section defines column-level validation rules. Each attribute maps to a column in the source file and the target table.
```yaml
table:
  pattern: "order_line.*.csv"
  metadata:
    ...
  attributes:
    - name: "order_id"
      type: "int"
    - name: "product_id"
      type: "int"
    - name: "quantity"
      type: "int"
    - name: "sale_price"
      type: "double"
```
Available Attribute Properties
| Property | Description | Default |
|---|---|---|
| name | Column name from the file header. If no header, refers to the column index (0-based). Also used as the target table column name unless rename is set. | (required) |
| type | Column type: string, int, long, double, float, boolean, or any date/timestamp variant defined in metadata/types/default.sl.yml. You can add custom types in the metadata/types/ directory. | string |
| required | Whether the column must have a non-null value | false |
| privacy | Privacy transformation to apply (e.g., masking, hashing) | None |
| script | SQL transformation to apply to the column value | None |
| comment | Column comment | None |
| rename | Target column name in the database (overrides name) | name |
| default | Default value when the column is NULL | None |
| trim | Trim strategy: None, Left, Right, Both | None |
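
The attributes fragment below sketches how these properties combine. The column names and rules are illustrative and should be adapted to your own files:

```yaml
attributes:
  - name: "order_id"
    type: "int"
    required: true              # rows with a NULL order_id are rejected
  - name: "cust_name"
    type: "string"
    rename: "customer_name"     # column name in the target table
    trim: "Both"                # strip leading and trailing whitespace
    comment: "Customer full name"
  - name: "discount"
    type: "double"
    default: "0"                # used when the source value is NULL
```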
To add, replace, or ignore attributes during loading, see the Transform on load documentation.
Complete CSV Load Configuration Example
The configuration below describes a complete setup for loading order_line CSV files with the APPEND write strategy:
```yaml
table:
  pattern: "order_line.*.csv"
  metadata:
    format: "DSV"
    withHeader: true
    separator: ";"
    writeStrategy:
      type: "APPEND"
  attributes:
    - name: "order_id"
      type: "int"
    - name: "product_id"
      type: "int"
    - name: "quantity"
      type: "int"
    - name: "sale_price"
      type: "double"
```
The writeStrategy.type controls how data is written to the target table. Available strategies: APPEND, OVERWRITE, OVERWRITE_BY_PARTITION, UPSERT_BY_KEY, UPSERT_BY_KEY_AND_TIMESTAMP, and SCD2.
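
For example, switching the same table from append to an upsert only changes the writeStrategy block. The sketch below assumes order_id uniquely identifies a row and that merge keys are declared through a key list; check the write strategies documentation for the exact fields each strategy expects:

```yaml
metadata:
  format: "DSV"
  withHeader: true
  separator: ";"
  writeStrategy:
    type: "UPSERT_BY_KEY"
    key: ["order_id"]   # assumed merge key; adjust to your table's unique key
```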
Frequently Asked Questions
What delimiters does Starlake support for CSV files?
Starlake supports any single-character delimiter through the separator property in the metadata configuration. Common delimiters include semicolons (;), commas (,), pipes (|), and tabs. The default separator is semicolon (;). For multi-character delimiters, use the manual load configuration.
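
For example, a pipe-delimited file only needs the separator overridden; a tab can usually be written as "\t" inside double quotes (verify against your parser options):

```yaml
metadata:
  format: "DSV"
  withHeader: true
  separator: "|"    # pipe-delimited file
```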
What encoding does Starlake use for CSV files?
The default encoding is UTF-8. Change it by setting the encoding property in the metadata section. Any Java-supported encoding is accepted (e.g., ISO-8859-1, Windows-1252).
How do I load a CSV file without a header row?
Set withHeader: false in the metadata section. Column names in the attributes section then refer to column indices (0-based) instead of header names.
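
A minimal sketch for a headerless file, assuming the 0-based indices are used as attribute names and rename supplies the target column names:

```yaml
table:
  pattern: "order_line.*.csv"
  metadata:
    format: "DSV"
    withHeader: false          # the file has no header row
    separator: ";"
  attributes:
    - name: "0"                # first column in the file
      rename: "order_id"
      type: "int"
    - name: "1"                # second column
      rename: "quantity"
      type: "int"
```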
How does Starlake infer the schema of a CSV file?
Run starlake infer-schema --input incoming/<domain>. This generates a .sl.yml YAML file with detected column names, types, and parsing options.
What write strategies are available for CSV loading?
Starlake supports APPEND, OVERWRITE, OVERWRITE_BY_PARTITION, UPSERT_BY_KEY, UPSERT_BY_KEY_AND_TIMESTAMP, and SCD2.
Can I rename columns during CSV loading?
Yes. Use the rename property on any attribute to set a different column name in the target table.
How does Starlake handle empty values in CSV files?
By default, empty values are kept as empty strings. Set emptyIsNull: true in the metadata section to treat empty values as NULL.
What happens to rows that fail validation?
Rows that do not match the defined schema (wrong type, missing required field) are sent to the audit.rejected table with the rejection reason.
Can I apply transformations during CSV loading?
Yes. Use the script and privacy properties on individual attributes, or see the Transform on load documentation for adding, replacing, or ignoring attributes.
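
As a sketch, the fragment below applies a script and a privacy transformation to individual columns. The SQL expression is illustrative, and the privacy strategy name is an assumption that must exist in your privacy configuration:

```yaml
attributes:
  - name: "product_code"
    type: "string"
    script: "upper(product_code)"   # illustrative SQL expression applied at load time
  - name: "email"
    type: "string"
    privacy: "SHA256"               # assumed strategy name; must be defined in your privacy settings
```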