Data Retention & Scrape Cache Policy
Introduced in v0.1.51
This release introduces two related policies that govern how long raw scrape data is kept and how frequently a product can be re-scraped.
Raw Scrape Data Retention
All raw data collected during a scrape (review text, ratings, pricing signals, feature mentions, etc.) is stored temporarily and then automatically purged once the retention window expires.
Default
The default retention period is 90 days.
Purpose
- Prevents unbounded database growth from accumulating historical raw payloads.
- Ensures that only relatively recent source data influences dossiers and scores.
- Keeps storage costs predictable as the product directory corpus scales.
Note: Purging raw scrape data does not delete processed dossiers, scores, or any derived analysis — only the original scraped source payload is removed.
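The retention rule above boils down to a simple age check against the 90-day window. A minimal sketch, assuming each raw payload carries a scraped-at timestamp (the names here are illustrative, not the actual schema):

```typescript
// Hypothetical sketch of the retention check a purge job might apply.
const RETENTION_DAYS = 90;
const DAY_MS = 24 * 60 * 60 * 1000;

function isPastRetention(scrapedAt: Date, now: Date): boolean {
  // Raw payloads older than the retention window are eligible for purging.
  // Dossiers, scores, and derived analysis are untouched by this check.
  return now.getTime() - scrapedAt.getTime() > RETENTION_DAYS * DAY_MS;
}
```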
Minimum Re-scrape Interval (Scrape Cache Guard)
To prevent redundant work, the deep-scrape pipeline enforces a minimum interval between successive scrapes of the same product.
Default
The minimum re-scrape interval is 7 days.
How It Works
- When a deep-scrape Inngest job is triggered for a product, the function first checks the product's `last_scraped_at` timestamp.
- If the product was last scraped within the minimum interval, the job exits early and no scrape is performed.
- If the product has not been scraped recently (or has never been scraped), the full deep-scrape proceeds as normal.
- `last_scraped_at` is updated on every successful scrape completion.
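The early-exit logic described above can be sketched as a small predicate. This is an illustrative sketch, assuming `last_scraped_at` is available as a `Date` (or `null` for a never-scraped product), not the actual implementation:

```typescript
// Hypothetical sketch of the scrape cache guard's early-exit condition.
const MIN_RESCRAPE_INTERVAL_DAYS = 7;
const DAY_MS = 24 * 60 * 60 * 1000;

function shouldSkipScrape(lastScrapedAt: Date | null, now: Date): boolean {
  // Never scraped: always allow the full deep-scrape to proceed.
  if (lastScrapedAt === null) return false;
  // Scraped within the minimum interval: exit early, perform no scrape.
  const elapsedMs = now.getTime() - lastScrapedAt.getTime();
  return elapsedMs < MIN_RESCRAPE_INTERVAL_DAYS * DAY_MS;
}
```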
Manual Override
The interval guard can be bypassed on a per-product basis when you need a forced, immediate refresh — for example, after a supplier updates their pricing or feature set. Triggering a scrape with the manual override flag skips the `last_scraped_at` check entirely.
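One way the override could compose with the guard is as a short-circuit ahead of the timestamp check. A hedged sketch — the `force` parameter name is illustrative, not a documented flag:

```typescript
// Hypothetical: a per-product manual override bypassing the interval guard.
const MIN_INTERVAL_MS = 7 * 24 * 60 * 60 * 1000;

function guardAllowsScrape(
  lastScrapedAt: Date | null,
  now: Date,
  force = false
): boolean {
  if (force) return true; // override: skip the last_scraped_at check entirely
  if (lastScrapedAt === null) return true; // never scraped
  return now.getTime() - lastScrapedAt.getTime() >= MIN_INTERVAL_MS;
}
```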
Configuration Defaults Summary
| Policy | Default | Override Available |
|---|---|---|
| Raw scrape data retention | 90 days | — |
| Minimum re-scrape interval | 7 days | Yes (per-product) |
Technical Detail
The `last_scraped_at` guard lives inside the deep-scrape Inngest function. The check is performed as an early-exit condition before any HTTP requests are made to the upstream directory or product pages, meaning no external traffic is generated for skipped jobs.