Data Quality Expectations
Expectations are post-load data quality assertions that run automatically after Starlake writes data to the target table. They complement pre-load type validation by checking business rules on the loaded dataset: uniqueness, row count ranges, value distributions and custom conditions.
Each expectation references a reusable Jinja2 SQL macro, passes parameters and evaluates a condition on the query result. Set failOnError: true to halt the pipeline on failure, turning expectations into a lightweight data quality gate.
How to add data quality expectations
- Create or reuse an expectation macro -- Place a
.j2Jinja2 template in theexpectationsdirectory. UseSL_THISas a placeholder for the target table name. - Open the table YAML file -- Edit
metadata/load/<domain>/<table>.sl.yml. - Add an
expectationssection -- Under thetablekey, define a list of entries. - Write the expectation expression -- Use the format
<query_name>(<params>) => <condition>. Available condition variables arecount,resultandresults. - Set
failOnError-- Set totrueto halt the pipeline on failure, orfalseto log and continue. - Run the load -- Execute
starlake load. Expectations are evaluated after the data has been written.
Defining expectations in the table YAML
Add expectations in the expectations section of the table definition. Each entry contains an expect expression and an optional failOnError flag.
table:
...
attributes:
- name: id
type: integer
...
expectations:
- expect: "is_col_value_not_unique('id') => result(0) == 1"
failOnError: true # or false if you want to continue
Expectation expression format
The expectation expression follows this pattern:
<query_name>(<param>*) => <condition>
query_name-- Name of a Jinja2 macro defined in theexpectationsdirectory. The macro generates a SQLSELECTstatement run against the target table.param-- Parameters passed to the macro, separated by commas.condition-- A boolean expression evaluated against the query result.
Condition variables
| Variable | Type | Description |
|---|---|---|
count | Long | Number of rows in the result |
result | Seq[Any] | First row as a collection of values (one per column) |
results | Seq[Seq[Any]] | All rows as a collection of rows, each row as a collection of values |
Writing expectation query macros
Expectation queries are Jinja2 macros stored as .j2 files in the expectations directory. The SL_THIS placeholder represents the target table. You can organize macros in subdirectories.
Built-in macro examples
Starlake provides the following reusable macros. They are customizable and extensible.
{% macro is_col_value_not_unique(col, table='SL_THIS') %}
SELECT max(cnt)
FROM (SELECT {{ col }}, count(*) as cnt FROM {{ table }}
GROUP BY {{ col }}
HAVING cnt > 1)
{% endmacro %}
{% macro is_row_count_to_be_between(min_value, max_value, table_name = 'SL_THIS') -%}
select
case
when count(*) between {{min_value}} and {{max_value}} then 1
else
0
end
from {{table_name}}
{%- endmacro %}
{% macro col_value_count_greater_than(col, min_count, table_name='SL_THIS') %}
SELECT {{ col }}, count(*) FROM {{ table_name }}
GROUP BY {{ col }}
HAVING count(*) > {{ min_count }}
{% endmacro %}
{% macro count_by_value(col, value, table='SL_THIS') %}
SELECT count(*)
FROM {{ table }}
WHERE {{ col }} LIKE '{{ value }}'
{% endmacro %}
{% macro column_occurs(col, times, table='SL_THIS') %}
SELECT max(cnt)
FROM (
SELECT {{ col }}, count(*) as cnt FROM {{ table }}
GROUP BY {{ col }}
HAVING cnt == {{ times }}
)
{% endmacro %}
Creating your own macros
Create a .j2 file in the expectations directory. Define a Jinja2 macro that generates a SQL SELECT statement. Use SL_THIS as the table placeholder. The query result is evaluated against the condition in the expectation expression.
Expectations vs type validation
| Aspect | Type validation | Expectations |
|---|---|---|
| When it runs | Before load (pre-load) | After load (post-load) |
| What it checks | Individual field format (regex) | Business rules on the full dataset |
| Failed records | Rejected to audit table | Pipeline halted or warning logged |
| Engine | Spark only | All engines |
Use both for comprehensive data quality: type validation catches format errors at the record level, and expectations verify aggregate conditions on the loaded data. You can also base expectations on ingestion metrics.
Frequently Asked Questions
What is an expectation in Starlake?
An expectation is a post-load assertion executed after data loading. It consists of a SQL query evaluated on the target table, whose result is compared to an expected condition.
How do I define an expectation in the table YAML file?
Add an expectations section with a list of entries. Each entry contains expect (query name + condition) and optionally failOnError: true to halt the pipeline on failure.
What is the format of an expectation expression?
The format is <query_name>(<params>) => <condition>. The query_name references a Jinja template defined in the expectations directory. The condition uses the variables count, result or results.
How do I write an expectation query template?
Templates are Jinja2 macros with a .j2 extension placed in the expectations directory. They generate SQL. The SL_THIS placeholder represents the target table.
What variables are available in an expectation condition?
Three variables: count (number of rows in the result), result (first row as a collection of values), and results (all rows as a collection of collections).
Can I stop the pipeline if an expectation fails?
Yes. Set failOnError: true on the expectation. If the condition is not satisfied, the pipeline halts with an error.
What built-in expectation macros does Starlake provide?
Starlake documents: is_col_value_not_unique, is_row_count_to_be_between, col_value_count_greater_than, count_by_value and column_occurs. They are customizable and extensible.