Load JSON Files with Nested and Repeated Attributes
Starlake loads JSON files into BigQuery, Snowflake, Databricks, DuckDB, and other warehouses with full support for nested and repeated attributes. It supports three JSON formats: JSON (one object per line), JSON_FLAT (flat objects, faster parsing), and JSON_ARRAY (a single array of objects). Schema inference automatically detects nested structures and generates the corresponding YAML configuration.
Loading JSON is similar to loading CSV files, with one key difference: JSON attributes can be nested (objects within objects) and repeated (arrays of objects).
Infer JSON Schema Automatically
Start by inferring the schema from a sample file. The infer-schema command detects nested structures, column types, and generates a YAML configuration file.
starlake infer-schema --input incoming/starbake
This bootstraps the configuration file for the table under metadata/load/<domain>/. Review and customize it before loading. See the autoload format detection documentation for details on how JSON detection works.
JSON Format Options: JSON, JSON_FLAT, JSON_ARRAY
The format property in the metadata section determines how Starlake parses the JSON file.
| Format | Description | When to use |
|---|---|---|
JSON | One JSON object per line (newline-delimited JSON) | Files with nested or repeated fields |
JSON_FLAT | Flat JSON objects with no nesting or arrays | Faster parsing for simple, flat data |
JSON_ARRAY | A single JSON array containing multiple objects | Files that start with [ |
table:
pattern: "order.*.json"
metadata:
format: "JSON" # or "JSON_FLAT" for flat objects, "JSON_ARRAY" for arrays
writeStrategy:
type: "APPEND"
attributes:
- ...
During autoload, Starlake auto-detects the format: if the file starts with [, it uses JSON_ARRAY; if it starts with {, it uses JSON.
Define Nested and Array Attributes in YAML
JSON attributes can contain nested objects (struct) and arrays of objects. Define them in the attributes section using:
type: "struct"for nested objectsarray: truefor repeated elements (arrays)
Example: JSON with Nested Arrays
Given this JSON record:
{
"order_id": 1,
"order_date": "2021-01-01",
"order_lines": [
{
"line_id": 1,
"product_id": 1,
"quantity": 2
},
{
"line_id": 2,
"product_id": 2,
"quantity": 3
}
]
}
The YAML configuration defines the nested structure:
table:
pattern: "order.*.json"
metadata:
format: "JSON"
writeStrategy:
type: "APPEND"
attributes:
- name: "order_id"
type: "integer"
- name: "order_date"
type: "date"
- name: "order_lines"
type: "struct"
array: true
attributes:
- name: "line_id"
type: "integer"
- name: "product_id"
type: "integer"
- name: "quantity"
type: "integer"
You can nest multiple levels deep. Each struct attribute can contain its own attributes list, with further struct or array definitions.
Warehouse Support for Nested Attributes
Not all warehouses support nested types natively:
| Warehouse | Nested support |
|---|---|
| BigQuery | Native STRUCT and ARRAY types |
| Databricks | Native STRUCT and ARRAY types |
| Snowflake | VARIANT columns |
| Redshift | Requires flattening before loading |
| DuckDB | Native STRUCT and ARRAY types |
Frequently Asked Questions
What JSON formats does Starlake support?
Starlake supports three JSON formats: JSON (one object per line), JSON_FLAT (flat objects without nesting, faster parsing), and JSON_ARRAY (a single array of objects).
How do I load a JSON file with nested attributes?
Define nested attributes using type: "struct" in the YAML schema. For repeated (array) attributes, add array: true to the attribute definition.
What is the difference between JSON and JSON_FLAT in Starlake?
JSON_FLAT is optimized for files where no object contains nested or repeated fields. It parses faster than JSON. Use JSON when your data has nested structures.
How does Starlake detect if a JSON file is an array?
During autoload, Starlake checks the first character of the file. If it starts with [, the format is JSON_ARRAY. If it starts with {, the format is JSON.
Can I load JSON files with mixed nesting levels?
Yes. Define each nesting level in the attributes section using type: "struct" for objects and array: true for arrays of objects.
How do I infer the schema of a JSON file?
Run starlake infer-schema --input incoming/<domain>. The command detects nested structures and generates the corresponding YAML configuration.
Which data warehouses support nested JSON attributes?
BigQuery and Databricks natively support nested and repeated attributes (STRUCT and ARRAY types). Snowflake supports them via VARIANT columns. Warehouses like Redshift require flattening.
What write strategies are available for JSON loading?
All Starlake write strategies work with JSON files: APPEND, OVERWRITE, OVERWRITE_BY_PARTITION, UPSERT_BY_KEY, UPSERT_BY_KEY_AND_TIMESTAMP, and SCD2.