Skip to main content

Data Quality Expectations

Expectations are post-load data quality assertions that run automatically after Starlake writes data to the target table. They complement pre-load type validation by checking business rules on the loaded dataset: uniqueness, row count ranges, value distributions and custom conditions.

Each expectation references a reusable Jinja2 SQL macro, passes parameters and evaluates a condition on the query result. Set failOnError: true to halt the pipeline on failure, turning expectations into a lightweight data quality gate.

How to add data quality expectations

  1. Create or reuse an expectation macro -- Place a .j2 Jinja2 template in the expectations directory. Use SL_THIS as a placeholder for the target table name.
  2. Open the table YAML file -- Edit metadata/load/<domain>/<table>.sl.yml.
  3. Add an expectations section -- Under the table key, define a list of entries.
  4. Write the expectation expression -- Use the format <query_name>(<params>) => <condition>. Available condition variables are count, result and results.
  5. Set failOnError -- Set to true to halt the pipeline on failure, or false to log and continue.
  6. Run the load -- Execute starlake load. Expectations are evaluated after the data has been written.

Defining expectations in the table YAML

Add expectations in the expectations section of the table definition. Each entry contains an expect expression and an optional failOnError flag.

table:
...
attributes:
- name: id
type: integer
...
expectations:
- expect: "is_col_value_not_unique('id') => result(0) == 1"
failOnError: true # or false if you want to continue

Expectation expression format

The expectation expression follows this pattern:

<query_name>(<param>*) => <condition>
  • query_name -- Name of a Jinja2 macro defined in the expectations directory. The macro generates a SQL SELECT statement run against the target table.
  • param -- Parameters passed to the macro, separated by commas.
  • condition -- A boolean expression evaluated against the query result.

Condition variables

VariableTypeDescription
countLongNumber of rows in the result
resultSeq[Any]First row as a collection of values (one per column)
resultsSeq[Seq[Any]]All rows as a collection of rows, each row as a collection of values

Writing expectation query macros

Expectation queries are Jinja2 macros stored as .j2 files in the expectations directory. The SL_THIS placeholder represents the target table. You can organize macros in subdirectories.

Built-in macro examples

Starlake provides the following reusable macros. They are customizable and extensible.

{% macro is_col_value_not_unique(col, table='SL_THIS') %}
SELECT max(cnt)
FROM (SELECT {{ col }}, count(*) as cnt FROM {{ table }}
GROUP BY {{ col }}
HAVING cnt > 1)
{% endmacro %}

{% macro is_row_count_to_be_between(min_value, max_value, table_name = 'SL_THIS') -%}
select
case
when count(*) between {{min_value}} and {{max_value}} then 1
else
0
end
from {{table_name}}
{%- endmacro %}

{% macro col_value_count_greater_than(col, min_count, table_name='SL_THIS') %}
SELECT {{ col }}, count(*) FROM {{ table_name }}
GROUP BY {{ col }}
HAVING count(*) > {{ min_count }}
{% endmacro %}


{% macro count_by_value(col, value, table='SL_THIS') %}
SELECT count(*)
FROM {{ table }}
WHERE {{ col }} LIKE '{{ value }}'
{% endmacro %}


{% macro column_occurs(col, times, table='SL_THIS') %}
SELECT max(cnt)
FROM (
SELECT {{ col }}, count(*) as cnt FROM {{ table }}
GROUP BY {{ col }}
HAVING cnt == {{ times }}
)
{% endmacro %}

Creating your own macros

Create a .j2 file in the expectations directory. Define a Jinja2 macro that generates a SQL SELECT statement. Use SL_THIS as the table placeholder. The query result is evaluated against the condition in the expectation expression.

Expectations vs type validation

AspectType validationExpectations
When it runsBefore load (pre-load)After load (post-load)
What it checksIndividual field format (regex)Business rules on the full dataset
Failed recordsRejected to audit tablePipeline halted or warning logged
EngineSpark onlyAll engines

Use both for comprehensive data quality: type validation catches format errors at the record level, and expectations verify aggregate conditions on the loaded data. You can also base expectations on ingestion metrics.

Frequently Asked Questions

What is an expectation in Starlake?

An expectation is a post-load assertion executed after data loading. It consists of a SQL query evaluated on the target table, whose result is compared to an expected condition.

How do I define an expectation in the table YAML file?

Add an expectations section with a list of entries. Each entry contains expect (query name + condition) and optionally failOnError: true to halt the pipeline on failure.

What is the format of an expectation expression?

The format is <query_name>(<params>) => <condition>. The query_name references a Jinja template defined in the expectations directory. The condition uses the variables count, result or results.

How do I write an expectation query template?

Templates are Jinja2 macros with a .j2 extension placed in the expectations directory. They generate SQL. The SL_THIS placeholder represents the target table.

What variables are available in an expectation condition?

Three variables: count (number of rows in the result), result (first row as a collection of values), and results (all rows as a collection of collections).

Can I stop the pipeline if an expectation fails?

Yes. Set failOnError: true on the expectation. If the condition is not satisfied, the pipeline halts with an error.

What built-in expectation macros does Starlake provide?

Starlake documents: is_col_value_not_unique, is_row_count_to_be_between, col_value_count_greater_than, count_by_value and column_occurs. They are customizable and extensible.