Data Retention & Scrape Cache Policy
As of v0.1.50, the platform enforces two complementary policies that govern how long raw scrape data is kept and how frequently a product can be re-scraped.
Raw Data Retention
Raw scrape data (HTML payloads, extracted text, and intermediate artefacts) is retained for a configurable window before being automatically purged.
- Default: 90 days
- Purging removes raw input data only — processed dossiers, scores, and analysis results are not affected.
- This policy prevents unbounded database growth as the crawler indexes more products over time.
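The retention window can be illustrated with a small sketch. This is not the platform's purge job: the `RawScrapeRecord` shape, the `scrapedAt` field, and both function names are hypothetical, and it assumes the window is measured from the time the raw data was stored.

```typescript
// Hypothetical record shape; the real raw-data schema is not described here.
interface RawScrapeRecord {
  id: string;
  productId: string;
  scrapedAt: Date;
}

const RETENTION_MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the instant before which raw data counts as expired.
function retentionCutoff(now: Date, retentionDays: number = 90): Date {
  return new Date(now.getTime() - retentionDays * RETENTION_MS_PER_DAY);
}

// Selects the raw records a purge run would remove.
// Processed dossiers, scores, and analysis results are deliberately not modelled.
function selectExpired(
  records: RawScrapeRecord[],
  now: Date,
  retentionDays: number = 90
): RawScrapeRecord[] {
  const cutoff = retentionCutoff(now, retentionDays).getTime();
  return records.filter((record) => record.scrapedAt.getTime() < cutoff);
}
```

Keeping the cutoff calculation separate from the selection makes it easy to check the boundary with fixed dates before wiring it to a real delete query.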
Minimum Re-scrape Interval
To prevent redundant work, the deep-scrape pipeline will refuse to re-process a product that has already been scraped within the minimum interval.
- Default: 7 days
- The check is implemented as a `last_scraped_at` guard inside the deep-scrape Inngest function. When a scrape job is triggered, the function reads `last_scraped_at` for the target product. If the timestamp falls within the minimum interval, the job exits early without performing any network requests or analysis.
- This protects against accidental duplicate runs, webhook retries, and bulk re-queue operations.
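The guard logic described above might look roughly like the following. This is a minimal sketch, not the actual Inngest function: `shouldSkipScrape` and its parameters are hypothetical names, and it assumes a product that has never been scraped has a null `last_scraped_at`.

```typescript
const GUARD_MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns true when the job should exit early because the product
// was last scraped within the minimum interval.
function shouldSkipScrape(
  lastScrapedAt: Date | null,
  now: Date,
  minIntervalDays: number = 7
): boolean {
  // Assumption: a product with no recorded scrape is always eligible.
  if (lastScrapedAt === null) {
    return false;
  }
  const ageMs = now.getTime() - lastScrapedAt.getTime();
  return ageMs <= minIntervalDays * GUARD_MS_PER_DAY;
}
```

Running a check like this before any network request is what makes retries and bulk re-queues cheap: a skipped job costs one timestamp read.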
Manual Override
The minimum interval can be bypassed on a per-product basis. Use the manual override when you need a forced refresh — for example, after a product updates its pricing or feature set.
How to trigger a manual override: Initiate a scrape from the product dossier page and enable the Force re-scrape option. This sets a bypass flag that skips the `last_scraped_at` guard for that single run.
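One way to picture the bypass flag is as a per-request field that is checked before the timestamp comparison. The sketch below is illustrative only: the `force` field, the `ScrapeRequest` shape, and `decideScrape` are assumed names, not the platform's actual event payload or API.

```typescript
// Hypothetical request shape; `force` stands in for the bypass flag
// set by the Force re-scrape option.
interface ScrapeRequest {
  productId: string;
  force?: boolean;
}

type ScrapeDecision = "run" | "skip";

const OVERRIDE_MS_PER_DAY = 24 * 60 * 60 * 1000;

function decideScrape(
  request: ScrapeRequest,
  lastScrapedAt: Date | null,
  now: Date,
  minIntervalDays: number = 7
): ScrapeDecision {
  // The override applies to this single request only; nothing is persisted.
  if (request.force === true) {
    return "run";
  }
  if (lastScrapedAt === null) {
    return "run";
  }
  const ageMs = now.getTime() - lastScrapedAt.getTime();
  return ageMs <= minIntervalDays * OVERRIDE_MS_PER_DAY ? "skip" : "run";
}
```

Because the flag travels with the request rather than being stored on the product, the next automated scrape is subject to the normal interval again.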
How the Two Policies Work Together
Scrape job triggered
│
▼
┌───────────────────────┐
│ last_scraped_at ≤ 7d │──► Skip (no-op) unless manual override
└───────────────────────┘
│ (>7 days OR override)
▼
Execute scrape
│
▼
Store raw data
│
▼
Purge after 90 days
- The re-scrape guard controls write frequency: how often fresh data enters the system.
- The retention policy controls data longevity: how long that data persists before being cleaned up.
Configuration Defaults
| Setting | Default | Notes |
|---|---|---|
| `DATA_RETENTION_DAYS` | 90 | Days before raw scrape data is purged |
| `MIN_RESCRAPE_INTERVAL_DAYS` | 7 | Minimum days between automated scrapes of the same product |
Contact an admin to adjust these values for your deployment.
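For illustration, the two settings and their defaults could be resolved as shown below. This sketch assumes the values arrive as string key/value pairs (for example, environment variables), which this document does not confirm; `PolicyConfig`, `parsePositiveInt`, and `loadPolicyConfig` are hypothetical names.

```typescript
interface PolicyConfig {
  dataRetentionDays: number;
  minRescrapeIntervalDays: number;
}

// Falls back to the default when the value is missing, empty,
// or not a positive whole number.
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") {
    return fallback;
  }
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Assumption: settings are supplied as a string map; the defaults
// match the table above.
function loadPolicyConfig(
  source: Record<string, string | undefined>
): PolicyConfig {
  return {
    dataRetentionDays: parsePositiveInt(source["DATA_RETENTION_DAYS"], 90),
    minRescrapeIntervalDays: parsePositiveInt(
      source["MIN_RESCRAPE_INTERVAL_DAYS"],
      7
    ),
  };
}
```

Falling back to the documented defaults on invalid input is one possible design; a stricter deployment might prefer to fail at startup instead.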