audit_silver¶
Purpose¶
AuditSilverDataJob runs data quality audits on all silver-layer Delta tables using Great Expectations (GX). It samples data using a weighted reservoir algorithm, applies schema-derived validation rules, and produces HTML audit reports. This job is the primary data quality visualization for the silver layer.
I didn't include an audit gold job, but this could be easily done.
Entrypoint¶
| Method | Command |
|---|---|
| CLI | dip audit-silver |
| Airflow | get_tasks() returns one S3 path per Delta table; process_task(s3_path) audits a single table |
Audit reports can be viewed via dip audit-docs which serves the generated HTML locally.
How It Works¶
Job layer (AuditSilverDataJob)¶
- Discovery (
get_data_to_process): UsesCloudStorageClient.get_delta_tables(prefix=SILVER_DATA_FOLDER)to find all Delta tables undersilver/. Yields each table path. - Processing (
process_data): - Creates an
AuditMetadatawithdataset_stage='silver'. - Instantiates an
AuditRunnerand callsrunner.run(metadata). - On failure, captures the error in metadata.
- Always marks metadata as complete.
Core layer: AuditRunner¶
The AuditRunner in auditor/core/audit_runner.py orchestrates sampling, rule assembly, and execution:
- Model resolution: Resolves the data model and primary/partition keys via
ModelMapper. - Sampling: Instantiates
DeltaWeightedDataSamplerwith: weight_column: the partition key (typicallyiso2_country_code), used to ensure geographic diversity in the sample.target_total_rows: fromAUDIT_TOTAL_ROWS(default 1000).- Rule assembly: Creates a
DataAuditorwith the data model, dataset stage, and additional uniqueness rules for the primary key. - Data fetch: Calls
sampler.get_filtered_data(columns_filter=data_auditor.audit_columns)to retrieve only the columns needed for auditing. - Execution:
data_auditor.run(data=data_sample)runs the GX validation anddata_auditor.export_docs()generates HTML reports. - Metadata enrichment: Records sampling statistics (total raw records, sampled records, raw/sample distributions) and audit columns into the metadata.
Core layer: DeltaWeightedDataSampler¶
The sampler in auditor/io/delta_weighted_data_sampler.py implements two reservoir sampling algorithms:
- Weighted mode (A-Res): When
weightsare provided, usesscore = random(0,1) ^ (1/weight)to bias the sample toward higher-weight groups without excluding rare ones. - Standard mode (Algorithm R): When no weights are provided (the default for silver audits), all records have equal probability.
Key properties:
- Memory-efficient: Maintains a fixed-size heap of
target_total_rowsentries regardless of stream size. - Streaming: Processes data in batches from the Delta table without materializing the full dataset.
- Distribution tracking: Tracks per-group counts for both the raw data and the sample, enabling distribution comparison.
Core layer: DataAuditor¶
The DataAuditor in auditor/core/data_auditor.py manages GX expectations:
- Expectation definitions: Builds a list of pattern-matched rules using
fnmatchagainst the PyArrow schema column names. Rules include: - Primary key not-null and uniqueness checks (critical severity).
- ISO2 country code: length = 2, not-null at 98% (critical).
- Country name: not-null at 95% (warning).
- URL fields: not-null thresholds and regex validation (warning).
- Employee count: range 0 to 10M (warning).
- Founded year: range 1000 to current year + 1 (warning).
- Revenue fields: range 0 to 10^15 (warning).
- Description/text fields: not-null at 35% (info).
- Expectation binding: Each template (
ModelExpectationTemplate) is applied to matching columns, producingModelExpectationinstances that wrap actual GX expectation objects. - Execution: Converts PyArrow data to pandas (GX's pandas datasource), runs the validation suite, and processes results by severity level.
- Result classification:
- Exception failures: hard stop.
- Critical failures: hard stop.
- Warning failures: pass with warnings.
- Info failures: pass with info.
Metadata¶
| Field | Description |
|---|---|
s3_path |
Delta table path being audited |
dataset_stage |
'bronze', 'silver', or 'gold' |
total_raw_records |
Total rows in the Delta table before sampling |
total_sampled_records |
Number of rows in the sample passed to Great Expectations |
raw_data_distribution |
Per-partition-key value counts across the full table |
sample_data_distribution |
Per-partition-key value counts in the sample |
audit_columns |
List of columns that were audited |
error_message |
Set if the audit raised an exception; None on success |
Variables¶
| Variable | Default | Description |
|---|---|---|
AUDIT_TOTAL_ROWS |
1000 |
Target reservoir size for the weighted sampler — controls how many rows are audited per table |
fnmatchpattern matching depends on the flat naming convention: Expectation rules target columns by patterns likedesc__description_*or*revenue*. If the silver schema changes its naming convention (e.g. drops the prefix), rules will silently stop matching and those columns will go unaudited. There is no warning for zero-match patterns.