Load XML Files with Node Attributes and Nested Elements
Starlake loads XML files into BigQuery, Snowflake, Databricks, DuckDB, and other warehouses with full support for node attributes, child elements, and nested structures. It uses a configurable rowTag to identify records, prefixes node attributes with _ to avoid naming collisions, and supports optional XSD validation. Nested child elements map to STRUCT types and repeated elements to ARRAY types -- similar to JSON loading.
XML Structure and Mapping Concepts
Loading XML files is similar to loading JSON files: attributes can be nested and repeated. The key difference is that XML uses child elements and node attributes to represent data.
Consider this XML file:
<?xml version="1.0"?>
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/>
</country>
</data>
How Starlake Maps XML Elements to Table Columns
Starlake identifies a row by the rowTag element -- in the example above, country. Each country element becomes one row in the target table.
Child elements and node attributes map to columns as follows:
- Child elements (
rank,year,gdppc) become regular columns. - Node attributes (
nameoncountry) become columns prefixed with_(e.g.,_name). - Repeated child elements (
neighbor) become ARRAY columns. - Nested attributes on repeated elements (
_name,_directiononneighbor) are prefixed with_.
The resulting table structure for country:
| Column | Source |
|---|---|
_name | Node attribute on country |
rank | Child element |
year | Child element |
gdppc | Child element |
neighbor (array of struct) | Repeated child element |
neighbor._name | Node attribute on neighbor |
neighbor._direction | Node attribute on neighbor |
The _ prefix distinguishes node attributes from child elements and prevents naming collisions.
The target warehouse must support nested and repeated attributes. BigQuery and Databricks support them natively. Redshift does not -- flatten the XML structure before loading.
Infer XML Schema Automatically
Start by inferring the schema from a sample file. The infer-schema command detects the XML structure and generates a YAML configuration.
starlake infer-schema --input incoming/starbake
This generates a <table>.sl.yml file under metadata/load/<domain>/. Review and customize it before loading. See the autoload format detection documentation for details.
XML Parsing Options: rowTag, attributePrefix, XSD Validation
Customize the generated configuration with the following options:
| Option | Description | Default |
|---|---|---|
rowTag | The XML element that represents a single record (row) | Auto-detected |
attributePrefix | Prefix for node attributes to avoid collisions with child elements | _ |
rowValidationXSDPath | Path to an XSD file for record-level validation | None |
table:
pattern: "country.*.xml"
metadata:
format: "XML"
writeStrategy:
type: "OVERWRITE"
options:
rowTag: "country"
attributePrefix: "_"
rowValidationXSDPath: "/path/to/schema.xsd"
attributes:
- name: "_name"
type: "string"
- name: "rank"
type: "int"
- name: "year"
type: "int"
- name: "gdppc"
type: "int"
- name: "neighbor"
type: "struct"
array: true
attributes:
- name: "_name"
type: "string"
- name: "_direction"
type: "string"
For the full list of XML parsing options, see the Spark XML reference.
Frequently Asked Questions
How does Starlake distinguish XML node attributes from child elements?
Starlake prefixes node attributes with _ (underscore) by default. For example, <country name="France"> produces a column _name. This avoids naming collisions with child elements.
What is the rowTag in Starlake XML configuration?
The rowTag specifies which XML element represents a single record (row). For example, if your XML has <data><country>...</country></data>, the rowTag is country.
Can I validate XML files against an XSD schema?
Yes. Set the rowValidationXSDPath option in the metadata section to point to your XSD file. Starlake validates each record against the schema during loading.
Which data warehouses support nested XML attributes?
BigQuery and Databricks support nested and repeated attributes natively. Redshift does not support nested types. For unsupported warehouses, flatten the XML structure before loading.
How do I change the attribute prefix for XML node attributes?
Set the attributePrefix option in the metadata section. The default is _. You can change it to any string.
Can Starlake load XML files with multiple levels of nesting?
Yes. Nested child elements are mapped to STRUCT types, and repeated elements to ARRAY types, similar to JSON loading.
Where can I find all XML parsing options?
Starlake uses the Spark XML library. The full list of options is available in the Spark XML GitHub repository documentation.
What write strategies work with XML files?
All Starlake write strategies work with XML files: APPEND, OVERWRITE, OVERWRITE_BY_PARTITION, UPSERT_BY_KEY, UPSERT_BY_KEY_AND_TIMESTAMP, and SCD2.