Data Retention & Scrape Cache Policy
Introduced in v0.1.51
This release introduces two related policies that govern how long raw scrape data is kept and how frequently a product can be re-scraped.
Raw Scrape Data Retention
All raw data collected during a scrape (review text, ratings, pricing signals, feature mentions, etc.) is stored temporarily and then automatically purged once the retention window expires.
Default
The default retention period is 90 days.
Purpose
- Prevents unbounded database growth from accumulating historical raw payloads.
- Ensures that only relatively recent source data influences dossiers and scores.
- Keeps storage costs predictable as the product directory corpus scales.
Note: Purging raw scrape data does not delete processed dossiers, scores, or any derived analysis — only the original scraped source payload is removed.
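The retention rule above boils down to a simple age check against the 90-day window. A minimal sketch, assuming each raw payload carries a scraped-at timestamp (the names here are illustrative, not the actual schema):

```typescript
// Hypothetical sketch of the retention check a purge job might apply.
const RETENTION_DAYS = 90;
const DAY_MS = 24 * 60 * 60 * 1000;

function isPastRetention(scrapedAt: Date, now: Date): boolean {
  // Raw payloads older than the retention window are eligible for purging.
  // Dossiers, scores, and derived analysis are untouched by this check.
  return now.getTime() - scrapedAt.getTime() > RETENTION_DAYS * DAY_MS;
}
```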
Minimum Re-scrape Interval (Scrape Cache Guard)
To prevent redundant work, the deep-scrape pipeline enforces a minimum interval between successive scrapes of the same product.
Default
The minimum re-scrape interval is 7 days.
How It Works
- When a deep-scrape Inngest job is triggered for a product, the function first checks the product's `last_scraped_at` timestamp.
- If the product was last scraped within the minimum interval, the job exits early and no scrape is performed.
- If the product has not been scraped recently (or has never been scraped), the full deep-scrape proceeds as normal.
- `last_scraped_at` is updated on every successful scrape completion.
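The early-exit logic described above can be sketched as a small predicate. This is an illustrative sketch, assuming `last_scraped_at` is available as a `Date` (or `null` for a never-scraped product), not the actual implementation:

```typescript
// Hypothetical sketch of the scrape cache guard's early-exit condition.
const MIN_RESCRAPE_INTERVAL_DAYS = 7;
const DAY_MS = 24 * 60 * 60 * 1000;

function shouldSkipScrape(lastScrapedAt: Date | null, now: Date): boolean {
  // Never scraped: always allow the full deep-scrape to proceed.
  if (lastScrapedAt === null) return false;
  // Scraped within the minimum interval: exit early, perform no scrape.
  const elapsedMs = now.getTime() - lastScrapedAt.getTime();
  return elapsedMs < MIN_RESCRAPE_INTERVAL_DAYS * DAY_MS;
}
```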
Manual Override
The interval guard can be bypassed on a per-product basis when you need a forced, immediate refresh — for example, after a supplier updates their pricing or feature set. Triggering a scrape with the manual override flag skips the `last_scraped_at` check entirely.
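One way the override could compose with the guard is as a short-circuit ahead of the timestamp check. A hedged sketch — the `force` parameter name is illustrative, not a documented flag:

```typescript
// Hypothetical: a per-product manual override bypassing the interval guard.
const MIN_INTERVAL_MS = 7 * 24 * 60 * 60 * 1000;

function guardAllowsScrape(
  lastScrapedAt: Date | null,
  now: Date,
  force = false
): boolean {
  if (force) return true; // override: skip the last_scraped_at check entirely
  if (lastScrapedAt === null) return true; // never scraped
  return now.getTime() - lastScrapedAt.getTime() >= MIN_INTERVAL_MS;
}
```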
Configuration Defaults Summary
| Policy | Default | Override Available |
|---|---|---|
| Raw scrape data retention | 90 days | — |
| Minimum re-scrape interval | 7 days | Yes (per-product) |
Technical Detail
The `last_scraped_at` guard lives inside the deep-scrape Inngest function. The check is performed as an early-exit condition before any HTTP requests are made to the upstream directory or product pages, meaning no external traffic is generated for skipped jobs.