Ingestion Metrics

Starlake automatically computes data profiling statistics on each incoming batch during ingestion. Two metric types are available: continuous for numeric attributes and discrete for categorical attributes. Metrics are stored in dedicated tables with an ingestion timestamp, so values can be compared across loads.

Use ingestion metrics to monitor data quality trends, detect anomalies and feed post-load expectations with statistical baselines.

How to enable metrics

Set the metricType property on individual attributes in the table YAML file:

metadata/load/<domain>/<table>.sl.yml
table:
  pattern: "business.*.csv"
  attributes:
    - name: "review_count"
      type: "long"
      metricType: continuous
    - name: "city"
      type: "string"
      metricType: discrete

Note: Only top-level attributes are supported for metric computation. Nested attributes are not supported.

Continuous metrics

When metricType is set to continuous, Starlake computes the following statistics on the attribute:

| Metric | Description |
|---|---|
| minimum | Smallest value |
| maximum | Largest value |
| sum | Sum of all values |
| mean | Arithmetic average |
| median | Middle value separating the upper and lower halves |
| variance | How far values are spread out from the mean |
| standard deviation | Square root of the variance; measures spread of values |
| missing values | Count of null or missing values |
| skewness | Asymmetry of the probability distribution. Negative skew: tail on the left. Positive skew: tail on the right. |
| kurtosis | Extent to which the distribution is outlier-prone compared to a normal distribution. Higher kurtosis means heavier tails. |
| percentile 25 | Value below which 25% of values fall |
| percentile 75 | Value below which 75% of values fall |
| row count | Total number of rows in the batch |
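
To make the definitions above concrete, here is a minimal Python sketch that computes the same statistics for one column of a small batch. It is an illustration only, not Starlake's implementation: the function name `continuous_metrics` is hypothetical, the result keys borrow the column names from the example output further down, and the choice of estimators (sample variance, population-moment skewness, excess kurtosis, inclusive quartiles) is an assumption that may differ from what Starlake actually reports.

```python
import math
import statistics


def continuous_metrics(values):
    """Profile one numeric column of a batch; None stands for a missing value.

    Needs at least two non-missing values.
    """
    present = [v for v in values if v is not None]
    n = len(present)
    mean = statistics.fmean(present)
    # Central moments divided by n (population form). Assumption: the engine
    # behind Starlake may use a different estimator for skewness and kurtosis.
    m2 = sum((v - mean) ** 2 for v in present) / n
    m3 = sum((v - mean) ** 3 for v in present) / n
    m4 = sum((v - mean) ** 4 for v in present) / n
    p25, median, p75 = statistics.quantiles(present, n=4, method="inclusive")
    variance = statistics.variance(present)  # sample variance (divides by n - 1)
    return {
        "min": min(present),
        "max": max(present),
        "sum": sum(present),
        "mean": mean,
        "median": median,
        "variance": variance,
        "standardDev": math.sqrt(variance),
        "missingValues": len(values) - n,
        "skewness": m3 / m2 ** 1.5 if m2 else None,
        "kurtosis": m4 / m2 ** 2 - 3 if m2 else None,  # excess kurtosis
        "percentile25": p25,
        "percentile75": p75,
        "count": len(values),  # row count of the batch, missing values included
    }


metrics = continuous_metrics([1, 2, 3, 4, 5, None])
print(metrics["mean"], metrics["missingValues"], metrics["count"])
```

The sketch counts missing values separately and excludes them from every other statistic, which is one reasonable reading of the table above.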

Discrete metrics

When metricType is set to discrete, Starlake computes the following statistics on the attribute:

| Metric | Description |
|---|---|
| count distinct | Number of distinct values |
| category frequency | Percentage for each distinct value |
| category count | Number of occurrences per distinct value |
| row count | Total number of rows in the batch |
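
The discrete statistics can be sketched the same way. The snippet below is a hypothetical illustration (the function name `discrete_metrics` and the result layout are not part of Starlake). It assumes that frequency is expressed as a fraction of all rows in the batch and that missing values are not counted as a category; Starlake's own conventions may differ.

```python
from collections import Counter


def discrete_metrics(values):
    """Profile one categorical column of a batch; None stands for a missing value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)  # occurrences per distinct value
    total = len(values)        # row count of the batch
    return {
        "countDistinct": len(counts),
        "missingValuesDiscrete": total - len(present),
        "count": total,
        # Assumption: frequency is relative to all rows, missing ones included.
        "categories": {
            category: {"count": n, "frequency": n / total}
            for category, n in counts.items()
        },
    }


profile = discrete_metrics(["Phoenix", "Tempe", "Phoenix", None])
print(profile["countDistinct"], profile["categories"]["Phoenix"])
```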

Storage and timestamps

Each metric computation is scoped to the incoming batch only, not the full table. Starlake stores the results in dedicated metric tables with an ingestion timestamp. This allows you to compare metric values between successive loads and track data quality trends over time.
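
Because every metric row carries a timestamp, comparing two loads comes down to ordering rows by that timestamp and looking at the change. The following Python sketch shows one possible check. The function `drifted`, the 20% threshold, and the second load's values are all hypothetical; the rows are plain dictionaries standing in for records read from the metrics table, and only the first row reuses values from the example output on this page.

```python
def drifted(history, metric, threshold=0.2):
    """Return True when `metric` moved by more than `threshold` (relative)
    between the two most recent loads in `history`."""
    ordered = sorted(history, key=lambda row: row["timestamp"])
    if len(ordered) < 2:
        return False  # nothing to compare against yet
    previous, latest = ordered[-2][metric], ordered[-1][metric]
    if previous == 0:
        return latest != 0
    return abs(latest - previous) / abs(previous) > threshold


loads = [
    {"attribute": "review_count", "timestamp": 1650471642737, "mean": 38.675},
    # Hypothetical later load, invented for illustration only.
    {"attribute": "review_count", "timestamp": 1650558042737, "mean": 52.1},
]
print(drifted(loads, "mean"))
```

A relative threshold is only one option; a baseline built from several past loads (for example, mean plus or minus a few standard deviations) would be less sensitive to a single unusual batch.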

Example output

Assuming a file with attributes city (discrete) and review_count (continuous):

Discrete metrics table:

+-----------+-------------+---------------------+-----------+-------------------+------+--------+-----+-------------+----------+
|attribute  |countDistinct|missingValuesDiscrete|slMetric   |jobId              |domain|schema  |count|timestamp    |slStage   |
+-----------+-------------+---------------------+-----------+-------------------+------+--------+-----+-------------+----------+
|city       |53           |0                    |Discrete   |local-1650471634299|yelp  |business|200  |1650471642737|UNIT      |
+-----------+-------------+---------------------+-----------+-------------------+------+--------+-----+-------------+----------+

Continuous metrics table:

+------------+---+-----+------+-------------+--------+-----------+------+--------+--------+------------+------+------------+-----------+-------------------+------+--------+-----+-------------+----------+
|attribute   |min|max  |mean  |missingValues|variance|standardDev|sum   |skewness|kurtosis|percentile25|median|percentile75|slMetric   |jobId              |domain|schema  |count|timestamp    |slStage   |
+------------+---+-----+------+-------------+--------+-----------+------+--------+--------+------------+------+------------+-----------+-------------------+------+--------+-----+-------------+----------+
|review_count|3.0|664.0|38.675|0            |7974.944|89.303     |7735.0|4.359   |21.423  |5.0         |9.0   |25.0        |Continuous |local-1650471634299|yelp  |business|200  |1650471642737|UNIT      |
+------------+---+-----+------+-------------+--------+-----------+------+--------+--------+------------+------+------------+-----------+-------------------+------+--------+-----+-------------+----------+
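
As a quick sanity check on the row above, the mean should equal sum divided by count, and the standard deviation should be the square root of the variance. The short Python sketch below verifies both using only the values printed in the table, with a small tolerance because the output is rounded to three decimals.

```python
import math

# Values copied from the review_count row of the continuous metrics table.
row = {
    "sum": 7735.0,
    "count": 200,
    "mean": 38.675,
    "variance": 7974.944,
    "standardDev": 89.303,
}

# mean = sum / count
mean_ok = math.isclose(row["sum"] / row["count"], row["mean"], abs_tol=1e-3)
# standard deviation = sqrt(variance)
std_ok = math.isclose(math.sqrt(row["variance"]), row["standardDev"], abs_tol=1e-3)

print(mean_ok, std_ok)
```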

Category frequency table:

+---------+---------------+-----+---------+-------------------+------+--------+-------------+----------+
|attribute|category       |count|frequency|jobId              |domain|schema  |timestamp    |slStage   |
+---------+---------------+-----+---------+-------------------+------+--------+-------------+----------+
|city     |Tempe          |200  |0.01     |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |North Las Vegas|200  |0.01     |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Phoenix        |200  |0.085    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |West Mifflin   |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Newmarket      |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Wickliffe      |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |McKeesport     |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Scottsdale     |200  |0.06     |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Scarborough    |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Wexford        |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Willoughby     |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Chandler       |200  |0.02     |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Surprise       |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Cleveland      |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Litchfield Park|200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Verona         |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Richmond Hill  |200  |0.01     |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Hudson         |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Etobicoke      |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Cuyahoga Falls |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |.............. |...  |.....    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
+---------+---------------+-----+---------+-------------------+------+--------+-------------+----------+
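
In this output every row shows a count of 200, which matches the batch row count in the other two tables, while frequency appears as a fraction between 0 and 1. Under that reading (an assumption drawn from the sample, not a documented guarantee), the number of occurrences of a category can be recovered by multiplying the two columns. The helper `occurrences` below is hypothetical.

```python
def occurrences(frequency, row_count):
    """Estimate how many rows hold a category, assuming `frequency` is a
    fraction of the batch and `row_count` is the total number of rows."""
    return round(frequency * row_count)


# Values copied from the category frequency table above.
phoenix_rows = occurrences(0.085, 200)
scottsdale_rows = occurrences(0.06, 200)
print(phoenix_rows, scottsdale_rows)
```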

Frequently Asked Questions

What types of metrics does Starlake compute during ingestion?

Two types: continuous (min, max, sum, mean, median, variance, standard deviation, skewness, kurtosis, percentiles 25/75, missing values, row count) and discrete (count distinct, category frequency, category count, row count).

How do I enable metrics on an attribute?

Set the metricType property on the attribute in the table YAML file. Use continuous for numeric metrics or discrete for categorical metrics.

Where are ingestion metrics stored?

Metrics are stored in dedicated tables in the data warehouse. Each row carries an ingestion timestamp, which lets you compare values between successive loads.

Can I compute metrics on nested attributes?

No. Currently, only top-level attributes are supported for metric computation.

What continuous metrics are computed?

Minimum, maximum, sum, mean, median, variance, standard deviation, missing values, skewness, kurtosis, percentile 25, percentile 75 and row count.

What discrete metrics are computed?

Count distinct, category frequency (percentage), category count (number of occurrences per distinct value) and row count.

Are metrics computed on all data or only the incoming batch?

Only on the incoming batch. Each computation is associated with the ingestion timestamp, enabling tracking over time.