Ingestion Metrics
Starlake automatically computes data profiling statistics on each incoming batch during ingestion. Two metric types are available: continuous for numeric attributes and discrete for categorical attributes. Results are stored in dedicated tables with an ingestion timestamp, so values can be compared across loads.
Use ingestion metrics to monitor data quality trends, detect anomalies and feed post-load expectations with statistical baselines.
How to enable metrics
Set the metricType property on individual attributes in the table YAML file:
table:
  pattern: "business.*.csv"
  attributes:
    - name: "review_count"
      type: "long"
      metricType: continuous
    - name: "city"
      type: "string"
      metricType: discrete
Metric computation supports top-level attributes only; nested attributes are not supported.
Continuous metrics
When metricType is set to continuous, Starlake computes the following statistics on the attribute:
| Metric | Description |
|---|---|
| minimum | Smallest value |
| maximum | Largest value |
| sum | Sum of all values |
| mean | Arithmetic average |
| median | Middle value separating the upper and lower halves |
| variance | Average squared deviation of values from the mean |
| standard deviation | Square root of the variance; measures spread of values |
| missing values | Count of null or missing values |
| skewness | Asymmetry of the probability distribution. Negative skew: tail on the left. Positive skew: tail on the right. |
| kurtosis | Extent to which the distribution is outlier-prone compared to a normal distribution. Higher kurtosis means heavier tails. |
| percentile 25 | Value below which 25% of values fall |
| percentile 75 | Value below which 75% of values fall |
| row count | Total number of rows in the batch |
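To make the definitions in this table concrete, the statistics can be reproduced on a small sample. The sketch below uses only the Python standard library. The function name `continuous_metrics`, the population (rather than sample) formulas, the use of excess kurtosis, and the "inclusive" quantile method are assumptions made for illustration; the exact estimators Starlake uses are not stated on this page. Result keys mirror the column names of the continuous metrics table.

```python
import math
import statistics


def continuous_metrics(values):
    """Batch-level statistics for one numeric attribute.

    `values` may contain None for missing entries and must hold at least
    two non-missing values. Population formulas are an assumption.
    """
    present = [float(v) for v in values if v is not None]
    n = len(present)
    mean = statistics.fmean(present)
    variance = statistics.pvariance(present, mu=mean)

    # Third and fourth central moments, used for skewness and kurtosis.
    m3 = sum((x - mean) ** 3 for x in present) / n
    m4 = sum((x - mean) ** 4 for x in present) / n
    if variance > 0:
        skewness = m3 / variance ** 1.5
        kurtosis = m4 / variance ** 2 - 3.0  # excess kurtosis (assumption)
    else:
        skewness = kurtosis = float("nan")  # undefined for constant data

    # Quartile cut points; the interpolation method is an assumption.
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")

    return {
        "min": min(present),
        "max": max(present),
        "sum": sum(present),
        "mean": mean,
        "median": median,
        "variance": variance,
        "standardDev": math.sqrt(variance),
        "missingValues": len(values) - n,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "percentile25": q1,
        "percentile75": q3,
        "count": len(values),  # every row in the batch, missing or not
    }


metrics = continuous_metrics([1, 2, 3, 4, None])
```

Here the single `None` is counted under `missingValues` and still contributes to `count`, while all other statistics are computed on the four non-missing values.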
Discrete metrics
When metricType is set to discrete, Starlake computes the following statistics on the attribute:
| Metric | Description |
|---|---|
| count distinct | Number of distinct values |
| category frequency | Relative frequency of each distinct value, reported as a fraction (for example 0.085 in the example output) |
| category count | Number of occurrences per distinct value |
| row count | Total number of rows in the batch |
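The discrete statistics amount to a frequency count over the attribute's values. The following is a minimal sketch, not Starlake's implementation: the function name `discrete_metrics` and the result layout are hypothetical, and dividing by the total batch row count (rather than by the number of non-missing values) is an assumption.

```python
from collections import Counter


def discrete_metrics(values):
    """Batch-level statistics for one categorical attribute.

    `values` may contain None for missing entries.
    """
    present = [v for v in values if v is not None]
    counts = Counter(present)
    total = len(values)  # row count of the batch, missing values included

    return {
        "countDistinct": len(counts),
        "missingValuesDiscrete": total - len(present),
        "count": total,
        # Occurrences and relative frequency for each distinct value.
        # Assumption: frequency is relative to the batch row count.
        "categories": {
            value: {"count": n, "frequency": n / total}
            for value, n in counts.items()
        },
    }


metrics = discrete_metrics(["Phoenix", "Tempe", "Phoenix", None])
```

With this input, `Phoenix` occurs twice in a batch of four rows, so its frequency is 0.5 and `Tempe` gets 0.25.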
Storage and timestamps
Each metric computation is scoped to the incoming batch only, not the full table. Starlake stores the results in dedicated metric tables with an ingestion timestamp. This allows you to compare metric values between successive loads and track data quality trends over time.
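Because every metric row carries a timestamp, two successive loads can be compared with a few lines of code once the rows have been read from the metric table. The sketch below assumes the rows are already available as dictionaries with a `timestamp` key and one key per metric; the function names, the 20% threshold and the sample values are illustrative and not part of Starlake.

```python
def relative_change(previous, current):
    """Relative change between two metric values; None if previous is zero."""
    if previous == 0:
        return None
    return (current - previous) / abs(previous)


def flag_drift(metric_rows, metric="mean", threshold=0.2):
    """Compare the two most recent loads of one attribute.

    Returns None when fewer than two loads are available.
    """
    ordered = sorted(metric_rows, key=lambda row: row["timestamp"])
    if len(ordered) < 2:
        return None  # nothing to compare against yet
    previous, current = ordered[-2], ordered[-1]
    change = relative_change(previous[metric], current[metric])
    return {
        "previous_timestamp": previous["timestamp"],
        "current_timestamp": current["timestamp"],
        "change": change,
        "drifted": change is not None and abs(change) > threshold,
    }


# Illustrative values only: two loads of the same attribute.
loads = [
    {"timestamp": 1650471642737, "mean": 38.675},
    {"timestamp": 1650558042737, "mean": 50.0},
]
report = flag_drift(loads)
```

In this example the mean rises by roughly 29% between the two loads, which exceeds the illustrative threshold, so `report["drifted"]` is `True`.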
Example output
Assuming a file with attributes city (discrete) and review_count (continuous):
Discrete metrics table:
+-----------+-------------+---------------------+-----------+-------------------+------+--------+-----+-------------+----------+
|attribute  |countDistinct|missingValuesDiscrete|slMetric   |jobId              |domain|schema  |count|timestamp    |slStage   |
+-----------+-------------+---------------------+-----------+-------------------+------+--------+-----+-------------+----------+
|city       |53           |0                    |Discrete   |local-1650471634299|yelp  |business|200  |1650471642737|UNIT      |
+-----------+-------------+---------------------+-----------+-------------------+------+--------+-----+-------------+----------+
Continuous metrics table:
+------------+---+-----+------+-------------+--------+-----------+------+--------+--------+------------+------+------------+-----------+-------------------+------+--------+-----+-------------+----------+
|attribute   |min|max  |mean  |missingValues|variance|standardDev|sum   |skewness|kurtosis|percentile25|median|percentile75|slMetric   |jobId              |domain|schema  |count|timestamp    |slStage   |
+------------+---+-----+------+-------------+--------+-----------+------+--------+--------+------------+------+------------+-----------+-------------------+------+--------+-----+-------------+----------+
|review_count|3.0|664.0|38.675|0            |7974.944|89.303     |7735.0|4.359   |21.423  |5.0         |9.0   |25.0        |Continuous |local-1650471634299|yelp  |business|200  |1650471642737|UNIT      |
+------------+---+-----+------+-------------+--------+-----------+------+--------+--------+------------+------+------------+-----------+-------------------+------+--------+-----+-------------+----------+
Category frequency table:
+---------+---------------+-----+---------+-------------------+------+--------+-------------+----------+
|attribute|category       |count|frequency|jobId              |domain|schema  |timestamp    |slStage   |
+---------+---------------+-----+---------+-------------------+------+--------+-------------+----------+
|city     |Tempe          |200  |0.01     |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |North Las Vegas|200  |0.01     |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Phoenix        |200  |0.085    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |West Mifflin   |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Newmarket      |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Wickliffe      |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |McKeesport     |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Scottsdale     |200  |0.06     |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Scarborough    |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Wexford        |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Willoughby     |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Chandler       |200  |0.02     |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Surprise       |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Cleveland      |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Litchfield Park|200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Verona         |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Richmond Hill  |200  |0.01     |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Hudson         |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Etobicoke      |200  |0.01     |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |Cuyahoga Falls |200  |0.005    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
|city     |.............. |...  |.....    |local-1650471634299|yelp  |business|1650471642737|UNIT      |
+---------+---------------+-----+---------+-------------------+------+--------+-------------+----------+
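The sample rows above can be cross-checked with a little arithmetic, which also shows how the columns relate to each other. The values below are copied from the example output. Reading `frequency` as a share of the 200 batch rows is an assumption, suggested by the fact that `count` is 200 on every row of the category frequency sample.

```python
import math

# Values copied from the review_count row of the continuous metrics table.
row = {
    "count": 200,
    "sum": 7735.0,
    "mean": 38.675,
    "variance": 7974.944,
    "standardDev": 89.303,
}

# The mean is the sum divided by the row count (no missing values here).
mean_matches = math.isclose(row["sum"] / row["count"], row["mean"])

# The standard deviation is the square root of the variance,
# up to the three-decimal rounding of the displayed values.
std_matches = abs(math.sqrt(row["variance"]) - row["standardDev"]) < 1e-3

# Assumption: frequency is relative to the batch row count, so the
# implied number of Phoenix rows is frequency * count.
phoenix_rows = round(0.085 * row["count"])  # 17
```

Under that assumption, the smallest displayed frequency, 0.005, corresponds to a single row out of 200.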
Frequently Asked Questions
What types of metrics does Starlake compute during ingestion?
Two types: continuous (min, max, sum, mean, median, variance, standard deviation, skewness, kurtosis, percentiles 25/75, missing values, row count) and discrete (count distinct, category frequency, category count, row count).
How do I enable metrics on an attribute?
Set the metricType property on the attribute in the table YAML file. Use continuous for numeric metrics or discrete for categorical metrics.
Where are ingestion metrics stored?
Metrics are stored in dedicated tables in the data warehouse, with an ingestion timestamp allowing you to compare values between successive loads.
Can I compute metrics on nested attributes?
No. Currently, only top-level attributes are supported for metric computation.
What continuous metrics are computed?
Minimum, maximum, sum, mean, median, variance, standard deviation, missing values, skewness, kurtosis, percentile 25, percentile 75 and row count.
What discrete metrics are computed?
Count distinct, category frequency (relative frequency of each distinct value), category count (number of occurrences per distinct value) and row count.
Are metrics computed on all data or only the incoming batch?
Only on the incoming batch. Each computation is associated with the ingestion timestamp, enabling tracking over time.