Transform on Load

Q: How do I add a computed column during load in Starlake?

Add an attribute with a script property in the table YAML file. The script accepts any Spark SQL function and can reference other columns in the same record.

Q: How do I add the source file name as a column?

Define an attribute with script: "_metadata.file_name". The name of the file being loaded will be added as the value of that column.

Q: Can I ignore columns during load?

Yes. Set ignore: true on an attribute. The column will not be loaded into the target table but remains usable in other columns' scripts.

Q: How do I define a foreign key on a column?

Use the foreignKey property on the attribute. Supported syntaxes: " ", " . ", " . . ", or " . ".

Q: What file metadata properties are available during load?

Four properties: file_name (name), file_path (path), file_size (size in bytes), and file_modification_time (last modification date).

Starlake enriches records during ingestion through three features declared in the table YAML:

Computed columns -- Add new columns using any Spark SQL expression.
Ignore attribute -- Exclude sensitive or unnecessary columns from the target table (useful for GDPR compliance).
Foreign keys -- Declare relationships between tables for documentation and ER diagram generation.

All transformations happen at load time, before data reaches the target table.

How to add computed columns

Open the table YAML file -- Edit metadata/load/<domain>/<table>.sl.yml.
Add an attribute with a script property -- The script accepts any Spark SQL expression and can reference other columns or _metadata properties.
Optionally ignore source columns -- Set ignore: true on any attribute you want to exclude from the target table while still using it in scripts.
Optionally declare foreign keys -- Add foreignKey: "<table>.<column>" on an attribute to create a relationship.
Run the load -- Execute starlake load to ingest files with the computed columns applied.

Computed columns with the `script` property

Add extra attributes that do not exist in the source file. The script property accepts any Spark SQL function and can reference any column in the same record, including ignored columns.

File metadata properties

In addition to Spark SQL functions, you can reference properties from the hidden _metadata column:

Property	Type	Description
`file_name`	string	Name of the file being loaded
`file_path`	string	Path of the file being loaded
`file_size`	long	Size of the file in bytes
`file_modification_time`	timestamp	Last modified date of the file

Example: computed columns and file metadata

metadata/load/<domain>/<table>.sl.yml
table:
  pattern: "order_line.*.csv"
  ...
  attributes:
  ...
  - name: "total_price"
    script: "quantity * sale_price"
  - name: "quantity_in_hexadecimal"
    script: "hex(quantity)"
  - name: "file_name"
    script: "_metadata.file_name"

In this example:

total_price multiplies two source columns.
quantity_in_hexadecimal applies the hex() Spark SQL function.
file_name captures the name of the ingested file.

Ignore columns

Set ignore: true on an attribute to exclude it from the target table. The column is still available in other attributes' script expressions. This supports GDPR compliance by preventing sensitive data from reaching the data warehouse. See also column-level security for access control on BigQuery.

tip

You may ignore a column and still use it inside the script attribute of another column.

metadata/load/<domain>/<table>.sl.yml
table:
  pattern: "order_line.*.csv"
  ...
  attributes:
  ...
  - name: "product_id"
    type: "int"
    ignore: true

Foreign key declarations

Declare foreign keys on attributes using the foreignKey property. Starlake sets the constraint on warehouses that support it and generates an entity-relational diagram in all cases.

metadata/load/<domain>/<table>.sl.yml
table:
  pattern: "order_line.*.csv"
  ...
  attributes:
  ...
  - name: "product_id"
    type: "int"
    foreignKey: "product.id"

Supported `foreignKey` syntax

Syntax	Description
`foreignKey: "<table>"`	Foreign key on the primary key of the specified table in the same domain.
`foreignKey: "<domain>.<table>"`	Foreign key on the primary key of the specified table in the specified domain.
`foreignKey: "<domain>.<table>.<column>"`	Foreign key on a specific column in the specified table and domain.
`foreignKey: "<table>.<column>"`	Foreign key on a specific column in the specified table in the same domain.

note

Some data warehouses do not support foreign keys. The constraint is ignored at runtime but still appears in the auto-generated entity-relational diagram.

Frequently Asked Questions

How do I add a computed column during load in Starlake?

Add an attribute with a script property in the table YAML file. The script accepts any Spark SQL function and can reference other columns in the same record.

What functions are available in the script property?

All Spark SQL functions are available, as well as the hidden _metadata column properties: file_name, file_path, file_size and file_modification_time.

How do I add the source file name as a column?

Define an attribute with script: "_metadata.file_name". The name of the file being loaded will be added as the value of that column.

Can I ignore columns during load?

Yes. Set ignore: true on an attribute. The column will not be loaded into the target table but remains usable in other columns' scripts.

Yes. Ignoring an attribute containing sensitive data prevents it from being loaded into the target table. The ignored column can still be used in a script for intermediate computation.

How do I define a foreign key on a column?

Use the foreignKey property on the attribute. Supported syntaxes: "<table>", "<domain>.<table>", "<domain>.<table>.<column>", or "<table>.<column>".

Are foreign keys enforced on all warehouses?

No. Some data warehouses do not support foreign keys. In that case, the constraint is ignored at runtime but still appears in the auto-generated entity-relational diagram.

What file metadata properties are available during load?

Four properties: file_name (name), file_path (path), file_size (size in bytes), and file_modification_time (last modification date).

How to add computed columns​

Computed columns with the script property​

File metadata properties​

Example: computed columns and file metadata​

Ignore columns​

Foreign key declarations​

Supported foreignKey syntax​

Frequently Asked Questions​

How do I add a computed column during load in Starlake?​

What functions are available in the script property?​

How do I add the source file name as a column?​

Can I ignore columns during load?​

Can the ignore attribute help with GDPR compliance?​

How do I define a foreign key on a column?​

Are foreign keys enforced on all warehouses?​

What file metadata properties are available during load?​