Data Retention & Scrape Cache Policy
Available from v0.1.49
The platform includes built-in controls to manage how long raw scrape data is kept and how frequently a product can be re-scraped. These policies reduce unnecessary crawl load, prevent redundant Inngest function executions, and keep storage usage predictable.
Raw Scrape Data Retention
Raw scrape data (the unprocessed HTML, JSON payloads, and extracted text collected during a crawl) is retained for a configurable window before being automatically purged.
- Default retention period: 90 days
- Purging applies to raw scrape data only. Processed dossiers, opportunity scores, and product records are not affected.
- After purging, a product can be re-scraped at any time to refresh the raw data.
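The retention rule above can be sketched as a simple age check. This is only an illustration under stated assumptions: the `RawScrapeRecord` shape, the `collectedAt` field, and the function names are hypothetical and are not taken from the platform's code, and whether a record exactly 90 days old counts as expired is also an assumption.

```typescript
// Hypothetical sketch of a retention check; not the platform's actual purge job.
const MS_PER_DAY = 24 * 60 * 60 * 1000;
const RETENTION_DAYS = 90; // documented default retention period

// Assumed shape of a stored raw scrape payload.
interface RawScrapeRecord {
  productId: string;
  collectedAt: Date; // assumed field: when the raw payload was stored
}

// A record is treated as expired once it is strictly older than the retention window.
function isExpired(
  record: RawScrapeRecord,
  now: Date,
  retentionDays: number = RETENTION_DAYS,
): boolean {
  const ageMs = now.getTime() - record.collectedAt.getTime();
  return ageMs > retentionDays * MS_PER_DAY;
}

// Returns only the raw records that would be purged. Dossiers, opportunity
// scores, and product records are never passed in, so they are not affected.
function selectForPurge(records: RawScrapeRecord[], now: Date): RawScrapeRecord[] {
  return records.filter((record) => isExpired(record, now));
}
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test.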
Minimum Re-scrape Interval
To prevent redundant re-scraping, the deep-scrape pipeline enforces a minimum interval between successive scrapes of the same product.
- Default minimum interval: 7 days
- The check is performed inside the deep-scrape Inngest function using the last_scraped_at field on the product record.
- If last_scraped_at falls within the minimum interval (that is, the product was last scraped less than 7 days ago by default), the function exits early and no scrape is performed.
Manual Override
Any product can be force re-scraped regardless of the last_scraped_at value by triggering a manual override. Use this when:
- A supplier has recently updated their listing or pricing.
- You suspect the cached data is stale ahead of the normal interval.
- You are testing or debugging the scrape pipeline.
How the last_scraped_at Guard Works
The guard runs at the start of the deep-scrape Inngest function:
- The function reads the last_scraped_at timestamp for the target product.
- It compares the timestamp against the configured minimum interval (default: 7 days).
- If within the interval and no manual override flag is present → function exits early, scrape is skipped.
- If outside the interval or a manual override is present → scrape proceeds normally and last_scraped_at is updated on completion.
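The decision steps above can be sketched as a pure function. This is a minimal sketch, not the actual Inngest function: the `GuardInput` shape, the `force` flag name, and `decideScrape` are hypothetical. It also assumes that a product that has never been scraped always proceeds and that a scrape exactly 7 days old counts as outside the interval; the source states neither.

```typescript
// Hypothetical sketch of the last_scraped_at guard; names are illustrative only.
const DAY_MS = 24 * 60 * 60 * 1000;

type GuardDecision = "skip" | "scrape";

interface GuardInput {
  lastScrapedAt: Date | null; // value of last_scraped_at, or null if never scraped (assumption)
  force: boolean;             // assumed name for the manual override flag
  now: Date;
  minIntervalDays?: number;   // documented default: 7
}

function decideScrape({
  lastScrapedAt,
  force,
  now,
  minIntervalDays = 7,
}: GuardInput): GuardDecision {
  // A manual override bypasses the interval check entirely.
  if (force) return "scrape";

  // Assumption: a product with no previous scrape is always scraped.
  if (lastScrapedAt === null) return "scrape";

  // Within the interval: exit early. Otherwise: proceed with the scrape.
  const elapsedMs = now.getTime() - lastScrapedAt.getTime();
  return elapsedMs < minIntervalDays * DAY_MS ? "skip" : "scrape";
}
```

In a real function the caller would return early on `"skip"` and write a fresh last_scraped_at only after the scrape completes, as described above.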
Configuration Defaults
| Setting | Default | Description |
|---|---|---|
| Raw data retention | 90 days | How long raw scrape payloads are kept before purging |
| Minimum re-scrape interval | 7 days | Shortest time between automatic re-scrapes of the same product |
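For illustration, the two defaults in the table could be modelled as a typed configuration object with optional overrides. This page does not describe how the settings are actually configured, so the `ScrapePolicyConfig` type, its field names, and `resolvePolicy` are hypothetical.

```typescript
// Hypothetical configuration shape; the real setting names are not documented here.
interface ScrapePolicyConfig {
  rawDataRetentionDays: number;    // how long raw scrape payloads are kept
  minRescrapeIntervalDays: number; // shortest time between automatic re-scrapes
}

// Defaults taken from the table above.
const DEFAULT_SCRAPE_POLICY: Readonly<ScrapePolicyConfig> = {
  rawDataRetentionDays: 90,
  minRescrapeIntervalDays: 7,
};

// Merges caller overrides onto the defaults and rejects non-positive values.
function resolvePolicy(overrides: Partial<ScrapePolicyConfig> = {}): ScrapePolicyConfig {
  const merged: ScrapePolicyConfig = { ...DEFAULT_SCRAPE_POLICY, ...overrides };
  for (const [key, value] of Object.entries(merged)) {
    if (!Number.isFinite(value) || value <= 0) {
      throw new RangeError(`${key} must be a positive number of days`);
    }
  }
  return merged;
}
```

Because the two values are separate fields, changing one leaves the other at its default, which mirrors the independence of the two controls.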
Notes
- The retention purge and the re-scrape interval are independent controls. A product with a recent last_scraped_at will still have its raw data purged after 90 days if it has not been re-scraped.
- The minimum re-scrape interval protects against runaway re-scrape loops when many products are queued simultaneously.