Skip to main content

Load JSON Files with Nested and Repeated Attributes

Starlake loads JSON files into BigQuery, Snowflake, Databricks, DuckDB, and other warehouses with full support for nested and repeated attributes. It supports three JSON formats: JSON (one object per line), JSON_FLAT (flat objects, faster parsing), and JSON_ARRAY (a single array of objects). Schema inference automatically detects nested structures and generates the corresponding YAML configuration.

Loading JSON is similar to loading CSV files, with one key difference: JSON attributes can be nested (objects within objects) and repeated (arrays of objects).

Infer JSON Schema Automatically

Start by inferring the schema from a sample file. The infer-schema command detects nested structures, column types, and generates a YAML configuration file.

starlake infer-schema --input incoming/starbake

This bootstraps the configuration file for the table under metadata/load/<domain>/. Review and customize it before loading. See the autoload format detection documentation for details on how JSON detection works.

JSON Format Options: JSON, JSON_FLAT, JSON_ARRAY

The format property in the metadata section determines how Starlake parses the JSON file.

FormatDescriptionWhen to use
JSONOne JSON object per line (newline-delimited JSON)Files with nested or repeated fields
JSON_FLATFlat JSON objects with no nesting or arraysFaster parsing for simple, flat data
JSON_ARRAYA single JSON array containing multiple objectsFiles that start with [
metadata/load/<domain>/<table>.sl.yml - parsing section
table:
pattern: "order.*.json"
metadata:
format: "JSON" # or "JSON_FLAT" for flat objects, "JSON_ARRAY" for arrays
writeStrategy:
type: "APPEND"
attributes:
- ...

During autoload, Starlake auto-detects the format: if the file starts with [, it uses JSON_ARRAY; if it starts with {, it uses JSON.

Define Nested and Array Attributes in YAML

JSON attributes can contain nested objects (struct) and arrays of objects. Define them in the attributes section using:

  • type: "struct" for nested objects
  • array: true for repeated elements (arrays)

Example: JSON with Nested Arrays

Given this JSON record:

{
"order_id": 1,
"order_date": "2021-01-01",
"order_lines": [
{
"line_id": 1,
"product_id": 1,
"quantity": 2
},
{
"line_id": 2,
"product_id": 2,
"quantity": 3
}
]
}

The YAML configuration defines the nested structure:

metadata/load/<domain>/<table>.sl.yml - attributes section
table:
pattern: "order.*.json"
metadata:
format: "JSON"
writeStrategy:
type: "APPEND"
attributes:
- name: "order_id"
type: "integer"
- name: "order_date"
type: "date"
- name: "order_lines"
type: "struct"
array: true
attributes:
- name: "line_id"
type: "integer"
- name: "product_id"
type: "integer"
- name: "quantity"
type: "integer"

You can nest multiple levels deep. Each struct attribute can contain its own attributes list, with further struct or array definitions.

Warehouse Support for Nested Attributes

Not all warehouses support nested types natively:

WarehouseNested support
BigQueryNative STRUCT and ARRAY types
DatabricksNative STRUCT and ARRAY types
SnowflakeVARIANT columns
RedshiftRequires flattening before loading
DuckDBNative STRUCT and ARRAY types

Frequently Asked Questions

What JSON formats does Starlake support?

Starlake supports three JSON formats: JSON (one object per line), JSON_FLAT (flat objects without nesting, faster parsing), and JSON_ARRAY (a single array of objects).

How do I load a JSON file with nested attributes?

Define nested attributes using type: "struct" in the YAML schema. For repeated (array) attributes, add array: true to the attribute definition.

What is the difference between JSON and JSON_FLAT in Starlake?

JSON_FLAT is optimized for files where no object contains nested or repeated fields. It parses faster than JSON. Use JSON when your data has nested structures.

How does Starlake detect if a JSON file is an array?

During autoload, Starlake checks the first character of the file. If it starts with [, the format is JSON_ARRAY. If it starts with {, the format is JSON.

Can I load JSON files with mixed nesting levels?

Yes. Define each nesting level in the attributes section using type: "struct" for objects and array: true for arrays of objects.

How do I infer the schema of a JSON file?

Run starlake infer-schema --input incoming/<domain>. The command detects nested structures and generates the corresponding YAML configuration.

Which data warehouses support nested JSON attributes?

BigQuery and Databricks natively support nested and repeated attributes (STRUCT and ARRAY types). Snowflake supports them via VARIANT columns. Warehouses like Redshift require flattening.

What write strategies are available for JSON loading?

All Starlake write strategies work with JSON files: APPEND, OVERWRITE, OVERWRITE_BY_PARTITION, UPSERT_BY_KEY, UPSERT_BY_KEY_AND_TIMESTAMP, and SCD2.