Load JSON Files with Nested and Repeated Attributes

Q: What JSON formats does Starlake support?

Starlake supports three JSON formats: JSON (one object per line), JSON_FLAT (flat objects without nesting, faster parsing), and JSON_ARRAY (a single array of objects).

Q: How do I load a JSON file with nested attributes?

Define nested attributes using type: 'struct' in the YAML schema. For repeated (array) attributes, add array: true to the attribute definition.

Q: What is the difference between JSON and JSON_FLAT in Starlake?

JSON_FLAT is optimized for files where no object contains nested or repeated fields. It parses faster than JSON. Use JSON when your data has nested structures.

Q: How do I infer the schema of a JSON file?

Run 'starlake infer-schema --input incoming/ '. The command detects nested structures and generates the corresponding YAML configuration.

Starlake loads JSON files into BigQuery, Snowflake, Databricks, DuckDB, and other warehouses with full support for nested and repeated attributes. It supports three JSON formats: JSON (one object per line), JSON_FLAT (flat objects, faster parsing), and JSON_ARRAY (a single array of objects). Schema inference automatically detects nested structures and generates the corresponding YAML configuration.

Loading JSON is similar to loading CSV files, with one key difference: JSON attributes can be nested (objects within objects) and repeated (arrays of objects).

Infer JSON Schema Automatically

Start by inferring the schema from a sample file. The infer-schema command detects nested structures, column types, and generates a YAML configuration file.

starlake infer-schema --input incoming/starbake

This bootstraps the configuration file for the table under metadata/load/<domain>/. Review and customize it before loading. See the autoload format detection documentation for details on how JSON detection works.

JSON Format Options: JSON, JSON_FLAT, JSON_ARRAY

The format property in the metadata section determines how Starlake parses the JSON file.

Format	Description	When to use
`JSON`	One JSON object per line (newline-delimited JSON)	Files with nested or repeated fields
`JSON_FLAT`	Flat JSON objects with no nesting or arrays	Faster parsing for simple, flat data
`JSON_ARRAY`	A single JSON array containing multiple objects	Files that start with `[`

metadata/load/<domain>/<table>.sl.yml - parsing section
table:
  pattern: "order.*.json"
  metadata:
    format: "JSON" # or "JSON_FLAT" for flat objects, "JSON_ARRAY" for arrays
    writeStrategy:
      type: "APPEND"
  attributes:
    - ...

During autoload, Starlake auto-detects the format: if the file starts with [, it uses JSON_ARRAY; if it starts with {, it uses JSON.

Define Nested and Array Attributes in YAML

JSON attributes can contain nested objects (struct) and arrays of objects. Define them in the attributes section using:

type: "struct" for nested objects
array: true for repeated elements (arrays)

Example: JSON with Nested Arrays

Given this JSON record:

{
  "order_id": 1,
  "order_date": "2021-01-01",
  "order_lines": [
    {
      "line_id": 1,
      "product_id": 1,
      "quantity": 2
    },
    {
      "line_id": 2,
      "product_id": 2,
      "quantity": 3
    }
  ]
}

The YAML configuration defines the nested structure:

metadata/load/<domain>/<table>.sl.yml - attributes section
table:
  pattern: "order.*.json"
  metadata:
    format: "JSON"
    writeStrategy:
      type: "APPEND"
  attributes:
    - name: "order_id"
      type: "integer"
    - name: "order_date"
      type: "date"
    - name: "order_lines"
      type: "struct"
      array: true
      attributes:
        - name: "line_id"
          type: "integer"
        - name: "product_id"
          type: "integer"
        - name: "quantity"
          type: "integer"

You can nest multiple levels deep. Each struct attribute can contain its own attributes list, with further struct or array definitions.

Warehouse Support for Nested Attributes

Not all warehouses support nested types natively:

Warehouse	Nested support
BigQuery	Native STRUCT and ARRAY types
Databricks	Native STRUCT and ARRAY types
Snowflake	VARIANT columns
Redshift	Requires flattening before loading
DuckDB	Native STRUCT and ARRAY types

Frequently Asked Questions

What JSON formats does Starlake support?

Starlake supports three JSON formats: JSON (one object per line), JSON_FLAT (flat objects without nesting, faster parsing), and JSON_ARRAY (a single array of objects).

How do I load a JSON file with nested attributes?

Define nested attributes using type: "struct" in the YAML schema. For repeated (array) attributes, add array: true to the attribute definition.

What is the difference between JSON and JSON_FLAT in Starlake?

JSON_FLAT is optimized for files where no object contains nested or repeated fields. It parses faster than JSON. Use JSON when your data has nested structures.

How does Starlake detect if a JSON file is an array?

During autoload, Starlake checks the first character of the file. If it starts with [, the format is JSON_ARRAY. If it starts with {, the format is JSON.

Can I load JSON files with mixed nesting levels?

Yes. Define each nesting level in the attributes section using type: "struct" for objects and array: true for arrays of objects.

How do I infer the schema of a JSON file?

Run starlake infer-schema --input incoming/<domain>. The command detects nested structures and generates the corresponding YAML configuration.

Which data warehouses support nested JSON attributes?

BigQuery and Databricks natively support nested and repeated attributes (STRUCT and ARRAY types). Snowflake supports them via VARIANT columns. Warehouses like Redshift require flattening.

What write strategies are available for JSON loading?

All Starlake write strategies work with JSON files: APPEND, OVERWRITE, OVERWRITE_BY_PARTITION, UPSERT_BY_KEY, UPSERT_BY_KEY_AND_TIMESTAMP, and SCD2.

Infer JSON Schema Automatically​

JSON Format Options: JSON, JSON_FLAT, JSON_ARRAY​

Define Nested and Array Attributes in YAML​

Example: JSON with Nested Arrays​

Warehouse Support for Nested Attributes​

Frequently Asked Questions​

What JSON formats does Starlake support?​

How do I load a JSON file with nested attributes?​

What is the difference between JSON and JSON_FLAT in Starlake?​

How does Starlake detect if a JSON file is an array?​

Can I load JSON files with mixed nesting levels?​

How do I infer the schema of a JSON file?​

Which data warehouses support nested JSON attributes?​

What write strategies are available for JSON loading?​