JSON Files Ingestion

1. Format Detection

Starlake supports three JSON formats, selected via the format metadata property or auto-detected from file content:

JSON: one JSON object per line (newline-delimited JSON). Supports nested structures.
JSON_FLAT: flat objects without nesting or arrays (faster parsing).
JSON_ARRAY: a single JSON array containing multiple objects.

Auto-detection: if the file starts with [ it is treated as JSON_ARRAY; if it starts with { it is treated as JSON.

table:
  metadata:
    format: "JSON"

2. Multiline Support

When multiline is true, JSON objects can span multiple lines. Default is false (one object per line), which is faster for large files.

table:
  metadata:
    multiline: false

3. Nested Structure Support

JSON objects can contain nested structures mapped with type: "struct" and repeated elements mapped with array: true. Multi-level nesting is supported.

attributes:
  - name: "order_lines"
    type: "struct"
    array: true
    attributes:
      - name: "line_id"
        type: "integer"
      - name: "product_id"
        type: "integer"
      - name: "quantity"
        type: "integer"

Warehouse support for nested types varies:

Warehouse	Nested Type
BigQuery	Native STRUCT and ARRAY
Databricks	Native STRUCT and ARRAY
Snowflake	VARIANT column
Redshift	Requires flattening
DuckDB	Native STRUCT and ARRAY

4. Encoding

The file is read according to the configured encoding (default: UTF-8). Override with the encoding property in the table metadata.

5. Row Filtering

The filter option at the table level applies a SQL WHERE clause to filter accepted data before writing. Applied after column renaming.

table:
  filter: "status != 'DELETED'"

6. Type Validation

Every record is validated against the type schema before loading. Each attribute is assigned a type backed by a regex pattern. Values that match are loaded into the target table; values that fail are rejected to the audit.rejected table.

Built-in types include: string, int, long, double, decimal, boolean, date, timestamp, and numerous ISO date/time variants (ISO_DATE_TIME, ISO_LOCAL_DATE, RFC_1123_DATE_TIME, etc.). Custom types with user-defined regex patterns are also supported.

7. Required Field Check

Each attribute can be marked required: true. When set, the field must be present and non-null in the source record or the row is rejected.

8. Column Renaming

The rename property maps a source field to a different target column name in the database. Table-level renaming is also supported via table.rename.

attributes:
  - name: "src_field"
    rename: "target_col"

9. Privacy Transformations

The privacy property applies a transformation to sensitive column values during ingestion. Built-in algorithms:

Algorithm	Description
`NONE`	No transformation
`HIDE`	Replace with a fixed string (e.g., `***`)
`MD5`	MD5 hash
`SHA1`	SHA-1 hash
`SHA256`	SHA-256 hash
`SHA512`	SHA-512 hash
`AES`	AES encryption (requires parameters)

SQL-based privacy expressions are also supported:

attributes:
  - name: "email"
    privacy: "SHA256"
  - name: "phone"
    privacy: "SQL: CONCAT(SUBSTR(phone, 1, 3), '****')"

10. Computed Columns (Script)

The script property defines a Spark SQL expression to compute a derived column. Scripts can reference other columns, Spark SQL functions, and the file metadata column sl_input_file_name. Scripted fields must appear at the end of the attribute list and are automatically made optional (nullable).

attributes:
  - name: "order_id"
    type: "integer"
  - name: "source_file"
    script: "regexp_extract(sl_input_file_name, '.+/(.+)$', 1)"
  - name: "loaded_at"
    script: "current_timestamp()"
  - name: "full_name"
    script: "concat(first_name, ' ', last_name)"

11. Ignored Columns

Columns marked with ignore: true are excluded from the target table but remain available for use in script expressions.

12. Default Values

Optional attributes can specify a default value used when the source value is NULL. Not valid for struct or array types.

attributes:
  - name: "country"
    type: "string"
    default: "UNKNOWN"

13. Foreign Keys

The foreignKey property declares a relationship to another table. Supported syntaxes: table, domain.table, table.column, domain.table.column.

attributes:
  - name: "customer_id"
    type: "int"
    foreignKey: "customers.id"

14. Pre/Post SQL Hooks

SQL statements can be executed before and after ingestion via presql and postsql.

table:
  presql:
    - "TRUNCATE TABLE staging.events"
  postsql:
    - "CALL refresh_materialized_view('events_summary')"

15. Rejection Routing

Records that fail any validation check are not silently dropped — they are routed to the audit.rejected table for inspection and reporting.

16. Post-Load Expectations

After data is written to the target table, optional data quality assertions are evaluated. See the full Expectations reference for all 53 built-in macros covering completeness, validity, volume, schema, uniqueness, and numeric checks.

17. Freshness Monitoring

The freshness property defines staleness thresholds for the loaded data.

table:
  metadata:
    freshness:
      warn: "6h"
      error: "1d"

18. Primary Key Declaration

The primaryKey property declares the table's primary key columns, used for merge operations and data quality.

19. Ingestion Metrics

Attributes can be tagged with a metricType to compute statistics during ingestion:

CONTINUOUS — min, max, mean, median, variance, standard deviation, skewness, kurtosis, percentiles.
DISCRETE — count distinct, category frequency, category count.
TEXT — text field statistics.

Summary

Check	Phase
Format detection (JSON / JSON_FLAT / JSON_ARRAY)	Pre-load
Multiline handling (`multiline`)	Pre-load
Nested structure parsing (`struct`, `array`)	Pre-load
Encoding	Pre-load
Pre-SQL hooks (`presql`)	Pre-load
Type validation via regex	Pre-load
Required field presence	Pre-load
Column renaming (`rename`)	Pre-load
Privacy transformations (`privacy`)	Pre-load
Computed columns (`script`)	Pre-load
Default values (`default`)	Pre-load
Ignored columns (`ignore`)	Pre-load
Row filtering (`filter`)	Pre-load
Rejection to `audit.rejected`	Load
Post-SQL hooks (`postsql`)	Post-load
Post-load expectations	Post-load
Freshness monitoring	Post-load
Ingestion metrics	Post-load

1. Format Detection​

2. Multiline Support​

3. Nested Structure Support​

4. Encoding​

5. Row Filtering​

6. Type Validation​

7. Required Field Check​

8. Column Renaming​

9. Privacy Transformations​

10. Computed Columns (Script)​

11. Ignored Columns​

12. Default Values​

13. Foreign Keys​

14. Pre/Post SQL Hooks​

15. Rejection Routing​

16. Post-Load Expectations​

17. Freshness Monitoring​

18. Primary Key Declaration​

19. Ingestion Metrics​

Summary​