Skip to main content

Load XML Files with Node Attributes and Nested Elements

Starlake loads XML files into BigQuery, Snowflake, Databricks, DuckDB, and other warehouses with full support for node attributes, child elements, and nested structures. It uses a configurable rowTag to identify records, prefixes node attributes with _ to avoid naming collisions, and supports optional XSD validation. Nested child elements map to STRUCT types and repeated elements to ARRAY types -- similar to JSON loading.

XML Structure and Mapping Concepts

Loading XML files is similar to loading JSON files: attributes can be nested and repeated. The key difference is that XML uses child elements and node attributes to represent data.

Consider this XML file:

<?xml version="1.0"?>
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/>
</country>
</data>

How Starlake Maps XML Elements to Table Columns

Starlake identifies a row by the rowTag element -- in the example above, country. Each country element becomes one row in the target table.

Child elements and node attributes map to columns as follows:

  • Child elements (rank, year, gdppc) become regular columns.
  • Node attributes (name on country) become columns prefixed with _ (e.g., _name).
  • Repeated child elements (neighbor) become ARRAY columns.
  • Nested attributes on repeated elements (_name, _direction on neighbor) are prefixed with _.

The resulting table structure for country:

ColumnSource
_nameNode attribute on country
rankChild element
yearChild element
gdppcChild element
neighbor (array of struct)Repeated child element
neighbor._nameNode attribute on neighbor
neighbor._directionNode attribute on neighbor

The _ prefix distinguishes node attributes from child elements and prevents naming collisions.

note

The target warehouse must support nested and repeated attributes. BigQuery and Databricks support them natively. Redshift does not -- flatten the XML structure before loading.

Infer XML Schema Automatically

Start by inferring the schema from a sample file. The infer-schema command detects the XML structure and generates a YAML configuration.

starlake infer-schema --input incoming/starbake

This generates a <table>.sl.yml file under metadata/load/<domain>/. Review and customize it before loading. See the autoload format detection documentation for details.

XML Parsing Options: rowTag, attributePrefix, XSD Validation

Customize the generated configuration with the following options:

OptionDescriptionDefault
rowTagThe XML element that represents a single record (row)Auto-detected
attributePrefixPrefix for node attributes to avoid collisions with child elements_
rowValidationXSDPathPath to an XSD file for record-level validationNone
metadata/load/<domain>/<table>.sl.yml
table:
pattern: "country.*.xml"
metadata:
format: "XML"
writeStrategy:
type: "OVERWRITE"
options:
rowTag: "country"
attributePrefix: "_"
rowValidationXSDPath: "/path/to/schema.xsd"
attributes:
- name: "_name"
type: "string"
- name: "rank"
type: "int"
- name: "year"
type: "int"
- name: "gdppc"
type: "int"
- name: "neighbor"
type: "struct"
array: true
attributes:
- name: "_name"
type: "string"
- name: "_direction"
type: "string"

For the full list of XML parsing options, see the Spark XML reference.

Frequently Asked Questions

How does Starlake distinguish XML node attributes from child elements?

Starlake prefixes node attributes with _ (underscore) by default. For example, <country name="France"> produces a column _name. This avoids naming collisions with child elements.

What is the rowTag in Starlake XML configuration?

The rowTag specifies which XML element represents a single record (row). For example, if your XML has <data><country>...</country></data>, the rowTag is country.

Can I validate XML files against an XSD schema?

Yes. Set the rowValidationXSDPath option in the metadata section to point to your XSD file. Starlake validates each record against the schema during loading.

Which data warehouses support nested XML attributes?

BigQuery and Databricks support nested and repeated attributes natively. Redshift does not support nested types. For unsupported warehouses, flatten the XML structure before loading.

How do I change the attribute prefix for XML node attributes?

Set the attributePrefix option in the metadata section. The default is _. You can change it to any string.

Can Starlake load XML files with multiple levels of nesting?

Yes. Nested child elements are mapped to STRUCT types, and repeated elements to ARRAY types, similar to JSON loading.

Where can I find all XML parsing options?

Starlake uses the Spark XML library. The full list of options is available in the Spark XML GitHub repository documentation.

What write strategies work with XML files?

All Starlake write strategies work with XML files: APPEND, OVERWRITE, OVERWRITE_BY_PARTITION, UPSERT_BY_KEY, UPSERT_BY_KEY_AND_TIMESTAMP, and SCD2.