Updated March 12, 2026

Data Retention & Scrape Cache Policy

As of v0.1.50, the platform enforces two complementary policies: one governs how long raw scrape data is kept, and the other governs how frequently a product can be re-scraped.


Raw Data Retention

Raw scrape data (HTML payloads, extracted text, and intermediate artefacts) is retained for a configurable window before being automatically purged.

  • Default: 90 days
  • Purging removes raw input data only — processed dossiers, scores, and analysis results are not affected.
  • This policy prevents unbounded database growth as the crawler indexes more products over time.
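The retention rule above can be sketched as an age-based filter. This is a minimal illustration, not the platform's purge job: the `RawScrape` record, its `scraped_at` field, and the `select_expired` helper are hypothetical names, and treating a record that is exactly 90 days old as still retained is an assumption. Only the 90-day default comes from this page.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

DATA_RETENTION_DAYS = 90  # default stated on this page


@dataclass
class RawScrape:
    """Hypothetical raw-data record; field names are illustrative only."""
    product_id: str
    scraped_at: datetime


def select_expired(records, now, retention_days=DATA_RETENTION_DAYS):
    """Return the raw records that are older than the retention window."""
    cutoff = now - timedelta(days=retention_days)
    # Assumption: a record exactly at the cutoff is kept (strict comparison).
    return [r for r in records if r.scraped_at < cutoff]


now = datetime(2026, 3, 12, tzinfo=timezone.utc)
records = [
    RawScrape("a", now - timedelta(days=120)),  # past the window
    RawScrape("b", now - timedelta(days=30)),   # still retained
]
print([r.product_id for r in select_expired(records, now)])  # ['a']
```

Only raw records would pass through a filter like this; processed dossiers, scores, and analysis results sit outside it.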

Minimum Re-scrape Interval

To prevent redundant work, the deep-scrape pipeline will refuse to re-process a product that has already been scraped within the minimum interval.

  • Default: 7 days
  • The check is implemented as a last_scraped_at guard inside the deep-scrape Inngest function. When a scrape job is triggered, the function reads last_scraped_at for the target product. If the timestamp falls within the minimum interval, the job exits early without performing any network requests or analysis.
  • This protects against accidental duplicate runs, webhook retries, and bulk re-queue operations.
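The guard described above amounts to a single timestamp comparison. The sketch below is illustrative rather than the actual Inngest function body: `within_min_interval` is a hypothetical helper, and treating a never-scraped product (no `last_scraped_at`) as eligible is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MIN_RESCRAPE_INTERVAL_DAYS = 7  # default stated on this page


def within_min_interval(
    last_scraped_at: Optional[datetime],
    now: datetime,
    min_interval_days: int = MIN_RESCRAPE_INTERVAL_DAYS,
) -> bool:
    """Return True when the product was scraped within the minimum interval."""
    if last_scraped_at is None:
        # Assumption: a product that was never scraped is always eligible.
        return False
    return now - last_scraped_at <= timedelta(days=min_interval_days)


now = datetime(2026, 3, 12, tzinfo=timezone.utc)
print(within_min_interval(now - timedelta(days=2), now))   # True: job exits early
print(within_min_interval(now - timedelta(days=10), now))  # False: scrape proceeds
```

Running the check before any network request is what makes a skipped job cheap: a retried webhook or a bulk re-queue costs one timestamp read per product.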

Manual Override

The minimum interval can be bypassed on a per-product basis. Use the manual override when you need a forced refresh — for example, after a product updates its pricing or feature set.

How to trigger a manual override: Initiate a scrape from the product dossier page and enable the Force re-scrape option. This sets a bypass flag that skips the last_scraped_at guard for that single run.
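One way to picture the bypass flag is as a short-circuit evaluated before the timestamp check. The `decide` function and its `force` parameter below are hypothetical; the page does not specify how the flag is named or passed to the job.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MIN_RESCRAPE_INTERVAL = timedelta(days=7)  # default stated on this page


def decide(
    last_scraped_at: Optional[datetime],
    now: datetime,
    force: bool = False,
) -> str:
    """Return "scrape" or "skip" for a single triggered job."""
    if force:
        # Force re-scrape: the guard is bypassed for this run only.
        return "scrape"
    if last_scraped_at is not None and now - last_scraped_at <= MIN_RESCRAPE_INTERVAL:
        return "skip"
    return "scrape"


now = datetime(2026, 3, 12, tzinfo=timezone.utc)
recent = now - timedelta(days=1)
print(decide(recent, now))              # skip
print(decide(recent, now, force=True))  # scrape
```

Because the flag is an argument to one call rather than stored state, the next automated run for the same product is guarded again.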


How the Two Policies Work Together

Scrape job triggered
        │
        ▼
┌───────────────────────┐
│ last scrape ≤ 7d ago? │──► Skip (no-op) unless manual override
└───────────────────────┘
        │ (>7 days OR override)
        ▼
   Execute scrape
        │
        ▼
  Store raw data
        │
        ▼
  Purge after 90 days

  • The re-scrape guard controls write frequency: how often fresh data enters the system.
  • The retention policy controls data lifetime: how long that raw data persists before being cleaned up.
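As a worked example with the default values, the two windows for a single scrape can be computed directly. The `policy_timeline` helper and the example date are illustrative only.

```python
from datetime import datetime, timedelta, timezone

MIN_RESCRAPE_INTERVAL_DAYS = 7  # default stated on this page
DATA_RETENTION_DAYS = 90        # default stated on this page


def policy_timeline(scraped_at: datetime):
    """Return (end of the re-scrape guard window, raw-data purge time)."""
    guard_ends = scraped_at + timedelta(days=MIN_RESCRAPE_INTERVAL_DAYS)
    purge_at = scraped_at + timedelta(days=DATA_RETENTION_DAYS)
    return guard_ends, purge_at


scraped_at = datetime(2026, 3, 12, tzinfo=timezone.utc)  # example date
guard_ends, purge_at = policy_timeline(scraped_at)
print(guard_ends.date())  # 2026-03-19: automated re-scrapes are skipped until after this
print(purge_at.date())    # 2026-06-10: raw data from this scrape becomes eligible for purge
```

With the defaults, a product scraped as often as the guard allows accumulates roughly a dozen raw snapshots (90 / 7 ≈ 12.9) before the oldest one is purged.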

Configuration Defaults

| Setting | Default | Notes |
|---|---|---|
| DATA_RETENTION_DAYS | 90 | Days before raw scrape data is purged |
| MIN_RESCRAPE_INTERVAL_DAYS | 7 | Minimum days between automated scrapes of the same product |

Contact an admin to adjust these values for your deployment.
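If these settings are exposed as environment variables (an assumption; this page lists only their names and defaults), reading them with fallbacks might look like the sketch below. The `load_policy_config` helper is hypothetical, and so is the rule that values must be positive.

```python
import os


def load_policy_config(env=None):
    """Read both policy settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    config = {
        "DATA_RETENTION_DAYS": int(env.get("DATA_RETENTION_DAYS", 90)),
        "MIN_RESCRAPE_INTERVAL_DAYS": int(env.get("MIN_RESCRAPE_INTERVAL_DAYS", 7)),
    }
    for name, value in config.items():
        # Assumption: zero or negative day counts are treated as misconfiguration.
        if value <= 0:
            raise ValueError(f"{name} must be a positive number of days, got {value}")
    return config


print(load_policy_config({}))
print(load_policy_config({"MIN_RESCRAPE_INTERVAL_DAYS": "14"}))
```

Passing a plain dictionary instead of the process environment keeps the helper easy to exercise in isolation.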