Load CSV and Delimiter-Separated Files into Your Data Warehouse
Starlake loads CSV and delimiter-separated value (DSV) files into BigQuery, Snowflake, Databricks, DuckDB, and other warehouses. It supports configurable separators, encodings, headers, and write strategies. Schema inference auto-detects column names and types. Every record is validated against the schema; rejected records are stored in the audit.rejected table for review.
You can use autoload for zero-config loading, or configure the manual load workflow for full control over parsing options.
Infer CSV Schema Automatically with infer-schema
Start by inferring the schema from a sample data file. The infer-schema command detects column names, types, separator, and encoding, then generates a YAML configuration file.
```bash
starlake infer-schema --input incoming/starbake
```
This generates a <table>.sl.yml file under metadata/load/<domain>/. Review and customize it before loading. See the autoload format detection documentation for details on how detection works.
CSV Parsing Options: Separator, Encoding, Quote, Escape
Most of the time, infer-schema generates a working configuration. Customize the metadata section when you need to override parsing behavior.
```yaml
table:
  pattern: "order_line.*.csv"
  metadata:
    format: "DSV"
    withHeader: true
    separator: ";"
    ...
  attributes:
    - ...
```
Available Parsing Options
| Option | Description | Default |
|---|---|---|
| encoding | File encoding (e.g., UTF-8, ISO-8859-1, Windows-1252) | UTF-8 |
| withHeader | Whether the file contains a header row | true |
| separator | Column delimiter character | ; |
| quote | Quote character for wrapping field values | " |
| escape | Escape character inside quoted fields | \ |
| filter | Condition/regex to filter lines. Matching lines are loaded; others are discarded | None |
| ack | Load the file only if a file with the same name and this extension exists in the same directory | None |
| options | Map of options passed to the parser. See the Apache Spark CSV documentation for all available options | None |
| emptyIsNull | Treat empty column values as NULL | false |
| fillWithDefaultValue | Fill NULL values with the column's default value | false |
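
As an illustration, the metadata block below overrides several of these options for a hypothetical comma-separated, Latin-1 encoded file. The pattern and file name are invented, and only the options you set differ from the defaults:

```yaml
table:
  pattern: "customer.*.csv"   # illustrative pattern, not from the examples above
  metadata:
    format: "DSV"
    withHeader: true
    separator: ","            # comma instead of the default ";"
    encoding: "ISO-8859-1"    # Latin-1 source files
    quote: "\""               # fields wrapped in double quotes (the default)
    escape: "\\"              # backslash escapes quotes inside fields (the default)
    emptyIsNull: true         # load empty cells as NULL instead of ""
```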
Column Validation: Types, Required Fields, Transforms
The attributes section defines column-level validation rules. Each attribute maps to a column in the source file and the target table.
```yaml
table:
  pattern: "order_line.*.csv"
  metadata:
    ...
  attributes:
    - name: "order_id"
      type: "int"
    - name: "product_id"
      type: "int"
    - name: "quantity"
      type: "int"
    - name: "sale_price"
      type: "double"
```
Available Attribute Properties
| Property | Description | Default |
|---|---|---|
| name | Column name from the file header. If no header, refers to the column index (0-based). Also used as the target table column name unless rename is set. | (required) |
| type | Column type: string, int, long, double, float, boolean, or any date/timestamp variant defined in metadata/types/default.sl.yml. You can add custom types in the metadata/types/ directory. | string |
| required | Whether the column must have a non-null value | false |
| privacy | Privacy transformation to apply (e.g., masking, hashing) | None |
| script | SQL transformation to apply to the column value | None |
| comment | Column comment | None |
| rename | Target column name in the database (overrides name) | name |
| default | Default value when the column is NULL | None |
| trim | Trim strategy: None, Left, Right, Both | None |
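
The attributes fragment below sketches how these properties combine. The column names and rules are illustrative and should be adapted to your own files:

```yaml
attributes:
  - name: "order_id"
    type: "int"
    required: true              # rows with a NULL order_id are rejected
  - name: "cust_name"
    type: "string"
    rename: "customer_name"     # column name in the target table
    trim: "Both"                # strip leading and trailing whitespace
    comment: "Customer full name"
  - name: "discount"
    type: "double"
    default: "0"                # used when the source value is NULL
```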
To add, replace, or ignore attributes during loading, see the Transform on load documentation.
Complete CSV Load Configuration Example
The configuration below describes a complete setup for loading order_line CSV files with the APPEND write strategy:
```yaml
table:
  pattern: "order_line.*.csv"
  metadata:
    format: "DSV"
    withHeader: true
    separator: ";"
    writeStrategy:
      type: "APPEND"
  attributes:
    - name: "order_id"
      type: "int"
    - name: "product_id"
      type: "int"
    - name: "quantity"
      type: "int"
    - name: "sale_price"
      type: "double"
```
The writeStrategy.type controls how data is written to the target table. Available strategies: APPEND, OVERWRITE, OVERWRITE_BY_PARTITION, UPSERT_BY_KEY, UPSERT_BY_KEY_AND_TIMESTAMP, and SCD2.
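
For example, switching the same table from append to an upsert only changes the writeStrategy block. The sketch below assumes order_id uniquely identifies a row and that merge keys are declared through a key list; check the write strategies documentation for the exact fields each strategy expects:

```yaml
metadata:
  format: "DSV"
  withHeader: true
  separator: ";"
  writeStrategy:
    type: "UPSERT_BY_KEY"
    key: ["order_id"]   # assumed merge key; adjust to your table's unique key
```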
Frequently Asked Questions
What delimiters does Starlake support for CSV files?
Starlake supports any single-character delimiter through the separator property in the metadata configuration. Common delimiters include semicolons (;), commas (,), pipes (|), and tabs. The default separator is semicolon (;). For multi-character delimiters, use the manual load configuration.
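
For example, a pipe-delimited file only needs the separator overridden; a tab can usually be written as "\t" inside double quotes (verify against your parser options):

```yaml
metadata:
  format: "DSV"
  withHeader: true
  separator: "|"    # pipe-delimited file
```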
What encoding does Starlake use for CSV files?
The default encoding is UTF-8. Change it by setting the encoding property in the metadata section. Any Java-supported encoding is accepted (e.g., ISO-8859-1, Windows-1252).
How do I load a CSV file without a header row?
Set withHeader: false in the metadata section. Column names in the attributes section then refer to column indices (0-based) instead of header names.
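
A minimal sketch for a headerless file, assuming the 0-based indices are used as attribute names and rename supplies the target column names:

```yaml
table:
  pattern: "order_line.*.csv"
  metadata:
    format: "DSV"
    withHeader: false          # the file has no header row
    separator: ";"
  attributes:
    - name: "0"                # first column in the file
      rename: "order_id"
      type: "int"
    - name: "1"                # second column
      rename: "quantity"
      type: "int"
```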
How does Starlake infer the schema of a CSV file?
Run starlake infer-schema --input incoming/<domain>. This generates a .sl.yml YAML file with detected column names, types, and parsing options.
What write strategies are available for CSV loading?
Starlake supports APPEND, OVERWRITE, OVERWRITE_BY_PARTITION, UPSERT_BY_KEY, UPSERT_BY_KEY_AND_TIMESTAMP, and SCD2.
Can I rename columns during CSV loading?
Yes. Use the rename property on any attribute to set a different column name in the target table.
How does Starlake handle empty values in CSV files?
By default, empty values are kept as empty strings. Set emptyIsNull: true in the metadata section to treat empty values as NULL.
What happens to rows that fail validation?
Rows that do not match the defined schema (wrong type, missing required field) are sent to the audit.rejected table with the rejection reason.
Can I apply transformations during CSV loading?
Yes. Use the script and privacy properties on individual attributes, or see the Transform on load documentation for adding, replacing, or ignoring attributes.
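
As a sketch, the fragment below applies a script and a privacy transformation to individual columns. The SQL expression is illustrative, and the privacy strategy name is an assumption that must exist in your privacy configuration:

```yaml
attributes:
  - name: "product_code"
    type: "string"
    script: "upper(product_code)"   # illustrative SQL expression applied at load time
  - name: "email"
    type: "string"
    privacy: "SHA256"               # assumed strategy name; must be defined in your privacy settings
```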