Updated March 12, 2026

Data Retention & Scrape Cache Policy

Available from v0.1.49

The platform includes built-in controls to manage how long raw scrape data is kept and how frequently a product can be re-scraped. These policies reduce unnecessary crawl load, prevent redundant Inngest function executions, and keep storage usage predictable.


Raw Scrape Data Retention

Raw scrape data (the unprocessed HTML, JSON payloads, and extracted text collected during a crawl) is retained for a configurable window before being automatically purged.

  • Default retention period: 90 days
  • Purging applies to raw scrape data only. Processed dossiers, opportunity scores, and product records are not affected.
  • After purging, a product can be re-scraped at any time to refresh the raw data.
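The purge rule can be sketched as a simple cutoff comparison. This is a minimal illustration, not the platform's implementation: the RawScrapeRecord shape, the scrapedAt field, and the isPurgeable and selectPurgeable helpers are hypothetical names, and the sketch assumes the retention window is measured from the time the raw data was collected.

```typescript
// Hypothetical record shape; the platform's actual schema may differ.
interface RawScrapeRecord {
  productId: string;
  scrapedAt: Date; // assumed: when the raw payload was collected
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;
const DEFAULT_RETENTION_DAYS = 90; // default stated on this page

// Returns true when the record is older than the retention window.
function isPurgeable(
  record: RawScrapeRecord,
  now: Date,
  retentionDays: number = DEFAULT_RETENTION_DAYS,
): boolean {
  const cutoffMs = now.getTime() - retentionDays * MS_PER_DAY;
  return record.scrapedAt.getTime() < cutoffMs;
}

// Selects the raw records a purge job would delete. Processed dossiers,
// opportunity scores, and product records are not part of this input.
function selectPurgeable(
  records: RawScrapeRecord[],
  now: Date,
): RawScrapeRecord[] {
  return records.filter((record) => isPurgeable(record, now));
}
```

Under these assumptions, a record collected 101 days ago is selected for purging, while one collected 11 days ago is kept.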

Minimum Re-scrape Interval

To prevent redundant re-scraping, the deep-scrape pipeline enforces a minimum interval between successive scrapes of the same product.

  • Default minimum interval: 7 days
  • The check is performed inside the deep-scrape Inngest function using the last_scraped_at field on the product record.
  • If last_scraped_at falls within the minimum interval (by default, less than 7 days ago) and no manual override is set, the function exits early and no scrape is performed.

Manual Override

Any product can be force re-scraped regardless of the last_scraped_at value by triggering a manual override. Use this when:

  • A supplier has recently updated their listing or pricing.
  • You suspect the cached data is stale ahead of the normal interval.
  • You are testing or debugging the scrape pipeline.

How the last_scraped_at Guard Works

The guard runs at the start of the deep-scrape Inngest function:

  1. The function reads the last_scraped_at timestamp for the target product.
  2. It compares the timestamp against the configured minimum interval (default: 7 days).
  3. If the timestamp is within the interval and no manual override flag is present, the function exits early and the scrape is skipped.
  4. If the timestamp is outside the interval, or a manual override flag is present, the scrape proceeds normally and last_scraped_at is updated on completion.
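The steps above can be sketched as a pure decision function. This is an illustration under stated assumptions rather than the actual Inngest function: evaluateScrapeGuard, GuardInput, GuardDecision, and the manualOverride flag name are hypothetical, a product that has never been scraped is assumed to pass the guard, and a product scraped exactly one interval ago is assumed to be eligible again.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;
const DEFAULT_MIN_INTERVAL_DAYS = 7; // default stated on this page

// Hypothetical input shape for the guard.
interface GuardInput {
  lastScrapedAt: Date | null; // null: never scraped (assumed to allow a scrape)
  now: Date;
  manualOverride?: boolean; // hypothetical name for the override flag
  minIntervalDays?: number;
}

interface GuardDecision {
  scrape: boolean;
  reason: "manual_override" | "never_scraped" | "within_interval" | "interval_elapsed";
}

function evaluateScrapeGuard(input: GuardInput): GuardDecision {
  const {
    lastScrapedAt,
    now,
    manualOverride = false,
    minIntervalDays = DEFAULT_MIN_INTERVAL_DAYS,
  } = input;

  // A manual override bypasses the interval check entirely.
  if (manualOverride) {
    return { scrape: true, reason: "manual_override" };
  }
  if (lastScrapedAt === null) {
    return { scrape: true, reason: "never_scraped" };
  }

  // Compare the time since the last scrape against the minimum interval.
  const elapsedMs = now.getTime() - lastScrapedAt.getTime();
  if (elapsedMs < minIntervalDays * MS_PER_DAY) {
    return { scrape: false, reason: "within_interval" }; // exit early, skip scrape
  }
  return { scrape: true, reason: "interval_elapsed" };
}
```

In a real function, a decision with scrape set to false would correspond to the early exit, and last_scraped_at would be written only after a scrape completes. Keeping the decision separate from the scrape itself makes the guard easy to unit test.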

Configuration Defaults

Setting                      Default   Description
Raw data retention           90 days   How long raw scrape payloads are kept before purging
Minimum re-scrape interval   7 days    Shortest time between automatic re-scrapes of the same product
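As an illustration, the two defaults could be modelled as a typed configuration object with optional overrides. This page does not specify how the settings are actually configured, so ScrapePolicyConfig, its field names, and resolveScrapePolicy are hypothetical; only the default values (90 and 7 days) come from the table above.

```typescript
// Hypothetical config shape; field names are illustrative only.
interface ScrapePolicyConfig {
  rawRetentionDays: number;
  minRescrapeIntervalDays: number;
}

// Defaults taken from the table above.
const DEFAULT_SCRAPE_POLICY: ScrapePolicyConfig = {
  rawRetentionDays: 90,
  minRescrapeIntervalDays: 7,
};

function assertPositiveDays(name: string, value: number): void {
  if (!Number.isFinite(value) || value <= 0) {
    throw new RangeError(`${name} must be a positive number of days`);
  }
}

// Merges caller overrides onto the defaults and validates the result.
function resolveScrapePolicy(
  overrides: Partial<ScrapePolicyConfig> = {},
): ScrapePolicyConfig {
  const policy: ScrapePolicyConfig = {
    rawRetentionDays:
      overrides.rawRetentionDays ?? DEFAULT_SCRAPE_POLICY.rawRetentionDays,
    minRescrapeIntervalDays:
      overrides.minRescrapeIntervalDays ??
      DEFAULT_SCRAPE_POLICY.minRescrapeIntervalDays,
  };
  assertPositiveDays("rawRetentionDays", policy.rawRetentionDays);
  assertPositiveDays("minRescrapeIntervalDays", policy.minRescrapeIntervalDays);
  return policy;
}
```

Calling resolveScrapePolicy() with no arguments yields the documented defaults, and resolveScrapePolicy({ minRescrapeIntervalDays: 14 }) changes only the re-scrape interval, which reflects that the two settings are independent.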

Notes

  • The retention purge and the re-scrape interval are independent controls. The re-scrape guard does not extend retention: a product's raw data is still purged once the retention period (default: 90 days) has passed if the product has not been re-scraped in the meantime.
  • The minimum re-scrape interval protects against runaway re-scrape loops when many products are queued simultaneously.