Transform on Load
Starlake enriches records during ingestion through three features declared in the table YAML:
- Computed columns -- Add new columns using any Spark SQL expression.
- Ignore attribute -- Exclude sensitive or unnecessary columns from the target table (useful for GDPR compliance).
- Foreign keys -- Declare relationships between tables for documentation and ER diagram generation.
All transformations happen at load time, before data reaches the target table.
How to add computed columns
- Open the table YAML file -- Edit
metadata/load/<domain>/<table>.sl.yml. - Add an attribute with a
scriptproperty -- The script accepts any Spark SQL expression and can reference other columns or_metadataproperties. - Optionally ignore source columns -- Set
ignore: trueon any attribute you want to exclude from the target table while still using it in scripts. - Optionally declare foreign keys -- Add
foreignKey: "<table>.<column>"on an attribute to create a relationship. - Run the load -- Execute
starlake loadto ingest files with the computed columns applied.
Computed columns with the script property
Add extra attributes that do not exist in the source file. The script property accepts any Spark SQL function and can reference any column in the same record, including ignored columns.
File metadata properties
In addition to Spark SQL functions, you can reference properties from the hidden _metadata column:
| Property | Type | Description |
|---|---|---|
file_name | string | Name of the file being loaded |
file_path | string | Path of the file being loaded |
file_size | long | Size of the file in bytes |
file_modification_time | timestamp | Last modified date of the file |
Example: computed columns and file metadata
table:
pattern: "order_line.*.csv"
...
attributes:
...
- name: "total_price"
script: "quantity * sale_price"
- name: "quantity_in_hexadecimal"
script: "hex(quantity)"
- name: "file_name"
script: "_metadata.file_name"
In this example:
total_pricemultiplies two source columns.quantity_in_hexadecimalapplies thehex()Spark SQL function.file_namecaptures the name of the ingested file.
Ignore columns
Set ignore: true on an attribute to exclude it from the target table. The column is still available in other attributes' script expressions. This supports GDPR compliance by preventing sensitive data from reaching the data warehouse. See also column-level security for access control on BigQuery.
You may ignore a column and still use it inside the script attribute of another column.
table:
pattern: "order_line.*.csv"
...
attributes:
...
- name: "product_id"
type: "int"
ignore: true
Foreign key declarations
Declare foreign keys on attributes using the foreignKey property. Starlake sets the constraint on warehouses that support it and generates an entity-relational diagram in all cases.
table:
pattern: "order_line.*.csv"
...
attributes:
...
- name: "product_id"
type: "int"
foreignKey: "product.id"
Supported foreignKey syntax
| Syntax | Description |
|---|---|
foreignKey: "<table>" | Foreign key on the primary key of the specified table in the same domain. |
foreignKey: "<domain>.<table>" | Foreign key on the primary key of the specified table in the specified domain. |
foreignKey: "<domain>.<table>.<column>" | Foreign key on a specific column in the specified table and domain. |
foreignKey: "<table>.<column>" | Foreign key on a specific column in the specified table in the same domain. |
Some data warehouses do not support foreign keys. The constraint is ignored at runtime but still appears in the auto-generated entity-relational diagram.
Frequently Asked Questions
How do I add a computed column during load in Starlake?
Add an attribute with a script property in the table YAML file. The script accepts any Spark SQL function and can reference other columns in the same record.
What functions are available in the script property?
All Spark SQL functions are available, as well as the hidden _metadata column properties: file_name, file_path, file_size and file_modification_time.
How do I add the source file name as a column?
Define an attribute with script: "_metadata.file_name". The name of the file being loaded will be added as the value of that column.
Can I ignore columns during load?
Yes. Set ignore: true on an attribute. The column will not be loaded into the target table but remains usable in other columns' scripts.
Can the ignore attribute help with GDPR compliance?
Yes. Ignoring an attribute containing sensitive data prevents it from being loaded into the target table. The ignored column can still be used in a script for intermediate computation.
How do I define a foreign key on a column?
Use the foreignKey property on the attribute. Supported syntaxes: "<table>", "<domain>.<table>", "<domain>.<table>.<column>", or "<table>.<column>".
Are foreign keys enforced on all warehouses?
No. Some data warehouses do not support foreign keys. In that case, the constraint is ignored at runtime but still appears in the auto-generated entity-relational diagram.
What file metadata properties are available during load?
Four properties: file_name (name), file_path (path), file_size (size in bytes), and file_modification_time (last modification date).