Stale Data Detection

freshness · medium

Asserts that every record in the table has been updated within the specified number of days. Flags records that incremental update pipelines may have missed or that are stuck in a stale state.

v1.0.0 · by dqhub · 648 downloads · 4.4 (28)

Tags: freshness, staleness, stale-records, data-aging, incremental
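
The staleness assertion described above can be illustrated in plain Python. This is a minimal sketch, not the rule's packaged implementation: the function name `find_stale_records`, the `updated_at` key, and the `max_age_days` parameter are hypothetical names chosen for the example, and treating a missing timestamp as stale is an assumption.

```python
from datetime import datetime, timedelta, timezone

def find_stale_records(records, max_age_days, now=None, ts_key="updated_at"):
    """Return records whose last update is older than max_age_days.

    Hypothetical helper: `ts_key` names the last-updated timestamp field.
    Records with no timestamp at all are treated as stale (an assumption).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [r for r in records if r[ts_key] is None or r[ts_key] < cutoff]

# Fixed reference time so the example is reproducible.
now = datetime(2024, 6, 30, tzinfo=timezone.utc)
records = [
    {"id": 1, "updated_at": datetime(2024, 6, 29, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    {"id": 3, "updated_at": datetime(2024, 6, 1, tzinfo=timezone.utc)},
]

stale = find_stale_records(records, max_age_days=30, now=now)
stale_ids = [r["id"] for r in stale]

# The check passes only when no stale records are found.
check_passed = len(stale) == 0
```

With a 30-day limit and the reference time above, only the record last updated on 1 May falls before the cutoff, so the check fails and reports that record.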

Parameters

column_name (string, required)

The column containing email addresses

threshold (float, default: 0.99)

Minimum fraction of valid emails (0.0 to 1.0)

Install

soda

checks for {{table_name}}:
  - invalid_percent({{column_name}}) < {{(1 - threshold) * 100}}:
      valid regex: '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

dbt

{% test valid_email(model, column_name) %}
select {{ column_name }}
from {{ model }}
where {{ column_name }} not regexp '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'
{% endtest %}

sql

SELECT COUNT(*) as total,
  SUM(CASE WHEN {{column_name}} REGEXP
    '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'
    THEN 1 ELSE 0 END) as valid
FROM {{table_name}}

Great Expectations

{
  "expectation_type": "expect_column_values_to_match_regex",
  "kwargs": {
    "column": "{{column_name}}",
    "regex": "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$",
    "mostly": {{threshold}}
  }
}

spark

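from pyspark.sql.functions import col

# `df` is assumed to be an existing DataFrame that contains {{column_name}}
pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
# Count rows whose value does not match the pattern (NULL values are not counted)
invalid = df.filter(~col("{{column_name}}").rlike(pattern)).count()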

Test Data

Passing Examples

id	value
1	alice@example.com
2	bob.smith@company.co.uk
3	charlie+tag@domain.org

Failing Examples

id	value
1	not-an-email
2	@missing-local.com
3	spaces in@email.com
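
The test data above can be checked against the pattern used in the Install snippets. The following is a minimal sketch in plain Python, assuming `threshold` is compared against the fraction of matching values, as the parameter description states. The helper name `valid_fraction` is hypothetical, and counting missing values as invalid is an assumption made for this example.

```python
import re

# Same pattern as in the Install snippets.
EMAIL_PATTERN = re.compile(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')

def valid_fraction(values):
    """Fraction of values that match EMAIL_PATTERN (hypothetical helper)."""
    if not values:
        return 1.0  # assumption: an empty column passes trivially
    # Assumption: None (missing) values count as invalid here.
    valid = sum(1 for v in values if v is not None and EMAIL_PATTERN.match(v))
    return valid / len(values)

passing = ["alice@example.com", "bob.smith@company.co.uk", "charlie+tag@domain.org"]
failing = ["not-an-email", "@missing-local.com", "spaces in@email.com"]

threshold = 0.99  # default from the Parameters section
passing_ok = valid_fraction(passing) >= threshold
failing_ok = valid_fraction(failing) >= threshold
```

Every value in the passing set matches the pattern and none in the failing set does, so the first set meets the default threshold and the second does not.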

CLI

npx dqhub install stale-data-detection --format soda --table YOUR_TABLE
npx dqhub install stale-data-detection --format dbt --model YOUR_MODEL
npx dqhub install stale-data-detection --format sql --dialect snowflake