v0.1.49: Smarter Scraping with Data Retention & Cache Policy
Release v0.1.49 introduces two complementary infrastructure improvements that make the scrape pipeline more efficient and the underlying data store easier to manage: configurable raw data retention and a minimum re-scrape interval guard.
The Problem
As the platform crawls more supplier directories and deep-analyses more products, two friction points emerge:
- Raw scrape data accumulates without bound. Every crawl stores HTML payloads, extracted text, and intermediate JSON. Without a purge policy, storage grows continuously even for products that haven't changed.
- Products can be re-scraped repeatedly. When a directory is re-processed or a product appears in multiple listing pages, the deep-scrape Inngest function could trigger multiple times for the same product in quick succession — burning credits, adding latency, and producing no new signal.
What Changed
90-Day Raw Data Retention
Raw scrape data now has a default lifetime of 90 days. After that window, the raw payloads are purged automatically. Processed dossiers, opportunity scores, and product records are untouched — only the raw crawl artefacts are cleaned up.
This keeps the data layer lean while preserving everything that actually drives the dashboard and dossier views.
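The retention policy boils down to a rolling cutoff. Here's a minimal sketch of that logic in TypeScript — the record shape and names (`RawScrape`, `scrapedAt`, `pruneExpired`) are illustrative, not the actual schema, and in production this would be a database-side delete rather than an in-memory filter:

```typescript
// Default lifetime for raw crawl artefacts (hypothetical constant name).
const RAW_RETENTION_DAYS = 90;

// Illustrative record shape — not the real schema.
interface RawScrape {
  productId: string;
  scrapedAt: Date;  // when the raw payload was captured
  payload: string;  // HTML, extracted text, or intermediate JSON
}

// Keep only records still inside the retention window; everything older
// than the cutoff would be purged. Processed dossiers are never touched.
function pruneExpired(records: RawScrape[], now: Date = new Date()): RawScrape[] {
  const cutoff = new Date(now.getTime() - RAW_RETENTION_DAYS * 24 * 60 * 60 * 1000);
  return records.filter((r) => r.scrapedAt > cutoff);
}
```

In a real deployment the same cutoff would typically drive a scheduled `DELETE ... WHERE scraped_at < cutoff` job, so storage is reclaimed without any application-level scan.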
7-Day Scrape Cache via last_scraped_at
The deep-scrape Inngest function now checks a last_scraped_at timestamp before doing any work. If the product was scraped within the last 7 days, the function exits early — no HTTP requests, no parsing, no cost.
The guard is intentionally lightweight: it lives at the very start of the function, so the short-circuit is as cheap as a single database read.
When you genuinely need fresh data — because a supplier just updated their pricing, or you're debugging a dossier — a manual override bypasses the interval check entirely.
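The guard and its override can be sketched as a single pure predicate. The names below (`shouldDeepScrape`, `force`) are illustrative — the real check sits at the top of the deep-scrape Inngest function, where the timestamp comes from a single database read:

```typescript
// Minimum gap between automatic re-scrapes of the same product.
const MIN_RESCRAPE_INTERVAL_DAYS = 7;

// Returns true when a deep scrape should proceed. A manual override
// (force) bypasses the interval check entirely; a product that has
// never been scraped (null timestamp) always proceeds.
function shouldDeepScrape(
  lastScrapedAt: Date | null,
  force: boolean = false,
  now: Date = new Date()
): boolean {
  if (force || lastScrapedAt === null) return true;
  const ageMs = now.getTime() - lastScrapedAt.getTime();
  return ageMs >= MIN_RESCRAPE_INTERVAL_DAYS * 24 * 60 * 60 * 1000;
}
```

Because the predicate needs only `last_scraped_at` and a flag, the early exit costs one lightweight read — no HTTP requests, no parsing.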
Configuration at a Glance
| Setting | Default |
|---|---|
| Raw scrape retention | 90 days |
| Minimum re-scrape interval | 7 days |
What This Means in Practice
- Routine directory crawls will skip products already analysed in the past week, so large directory re-runs complete faster and don't re-queue known products.
- Storage costs stay predictable as the product catalogue grows — raw data from old scrapes is cleaned up on a rolling basis.
- Manual re-scrapes remain fully on-demand whenever you need the latest data from a supplier's listing.
This release is part of ongoing work to keep the crawl pipeline reliable and cost-efficient as the number of tracked products grows.