Row Count Anomaly Detection

volume | medium

Asserts that the current row count is within a specified number of standard deviations of the historical average. Uses statistical anomaly detection (a z-score) to catch unexpected volume spikes or drops without requiring hard-coded thresholds.
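The z-score check described above can be sketched in Python as follows. This is a minimal illustration, not the rule's actual implementation: the function name `row_count_is_normal`, the `max_z` parameter, its default of 3.0, and the sample history are all assumptions made for the example.

```python
from statistics import mean, stdev


def row_count_is_normal(history, current, max_z=3.0):
    """Return (ok, z): whether `current` lies within `max_z` sample
    standard deviations of the mean of `history`, and the z-score itself.

    All names and the default max_z=3.0 are hypothetical.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical row counts")
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation
    if sigma == 0:
        # Flat history: any deviation at all counts as an anomaly.
        return current == mu, 0.0 if current == mu else float("inf")
    z = (current - mu) / sigma
    return abs(z) <= max_z, z


# Hypothetical daily row counts for a table.
history = [1000, 1020, 980, 1010, 990]
ok_small, z_small = row_count_is_normal(history, 1005)
ok_spike, z_spike = row_count_is_normal(history, 1200)
print(ok_small, ok_spike)
```

With this history, a count close to the mean passes and a large spike fails. How much history to keep, and whether to use a rolling window, are design choices the sketch leaves open.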

v1.0.0 by dqhub | 363 downloads | 5 (14)

volume, row-count, anomaly-detection, statistics, z-score, monitoring

Parameters

column_name (string, required)

The column containing email addresses

threshold (float, default: 0.99)

Minimum fraction of valid emails (0.0 to 1.0)
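To illustrate how `threshold` might be applied, the sketch below converts it to a maximum invalid percentage (the `(1 - threshold) * 100` expression used in the Soda template) and checks a valid/total pair against it. The helper names are hypothetical, and treating the threshold as an inclusive minimum and an empty table as passing are assumptions.

```python
def max_invalid_percent(threshold):
    """Convert a minimum valid fraction into a maximum invalid percentage."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return (1.0 - threshold) * 100.0


def meets_threshold(valid, total, threshold=0.99):
    """Return True when the valid fraction is at least `threshold`.

    Assumption: an empty table (total == 0) is treated as passing.
    """
    if total == 0:
        return True
    return valid / total >= threshold


print(meets_threshold(995, 1000))  # 99.5% valid
print(meets_threshold(980, 1000))  # 98.0% valid
```

Floating-point arithmetic means `max_invalid_percent(0.99)` is only approximately 1, so comparisons against it are best made with a tolerance.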

Install

soda

checks for {{table_name}}:
  - invalid_percent({{column_name}}) < {{(1 - threshold) * 100}}:
      valid regex: '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

dbt

{% test valid_email(model, column_name) %}
select {{ column_name }}
from {{ model }}
where {{ column_name }} not regexp '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'
{% endtest %}

sql

SELECT COUNT(*) as total,
  SUM(CASE WHEN {{column_name}} REGEXP
    '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'
    THEN 1 ELSE 0 END) as valid
FROM {{table_name}}

Great Expectations

{
  "expectation_type": "expect_column_values_to_match_regex",
  "kwargs": {
    "column": "{{column_name}}",
    "regex": "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$",
    "mostly": {{threshold}}
  }
}

spark

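from pyspark.sql.functions import col
# df is assumed to be an existing DataFrame containing {{column_name}}
pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
# Count rows that fail the pattern; NULL values are dropped by the filter, so they are not counted
invalid = df.filter(~col("{{column_name}}").rlike(pattern)).count()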

Test Data

Passing Examples

id	value
1	alice@example.com
2	bob.smith@company.co.uk
3	charlie+tag@domain.org

Failing Examples

id	value
1	not-an-email
2	@missing-local.com
3	spaces in@email.com
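As a quick sanity check, the regex used in the templates can be run against the passing and failing examples above in plain Python. This is only a sketch: the `is_valid_email` helper is a hypothetical name, and Python's `re` module may differ in small ways from the regex engine of a given warehouse.

```python
import re

# Same pattern as in the Soda, dbt, SQL, Great Expectations, and Spark templates.
EMAIL_PATTERN = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

passing = [
    "alice@example.com",
    "bob.smith@company.co.uk",
    "charlie+tag@domain.org",
]
failing = [
    "not-an-email",
    "@missing-local.com",
    "spaces in@email.com",
]


def is_valid_email(value):
    """Return True when `value` matches the email pattern."""
    return EMAIL_PATTERN.match(value) is not None


for value in passing + failing:
    print(f"{value!r}: {is_valid_email(value)}")
```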

CLI

Terminal

npx dqhub install row-count-anomaly --format soda --table YOUR_TABLE
npx dqhub install row-count-anomaly --format dbt --model YOUR_MODEL
npx dqhub install row-count-anomaly --format sql --dialect snowflake