v0.1.49: Smarter Scraping with Data Retention & Cache Policy
Release v0.1.49 introduces two complementary infrastructure improvements that make the scrape pipeline more efficient and the underlying data store easier to manage: configurable raw data retention and a minimum re-scrape interval guard.
The Problem
As the platform crawls more supplier directories and deep-analyses more products, two friction points emerge:
- Raw scrape data accumulates without bound. Every crawl stores HTML payloads, extracted text, and intermediate JSON. Without a purge policy, storage grows continuously even for products that haven't changed.
- Products can be re-scraped repeatedly. When a directory is re-processed or a product appears in multiple listing pages, the deep-scrape Inngest function could trigger multiple times for the same product in quick succession — burning credits, adding latency, and producing no new signal.
What Changed
90-Day Raw Data Retention
Raw scrape data now has a default lifetime of 90 days. After that window, the raw payloads are purged automatically. Processed dossiers, opportunity scores, and product records are untouched — only the raw crawl artefacts are cleaned up.
This keeps the data layer lean while preserving everything that actually drives the dashboard and dossier views.
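The retention policy boils down to a rolling cutoff. Here's a minimal sketch of that logic in TypeScript — the record shape and names (`RawScrape`, `scrapedAt`, `pruneExpired`) are illustrative, not the actual schema, and in production this would be a database-side delete rather than an in-memory filter:

```typescript
// Default lifetime for raw crawl artefacts (hypothetical constant name).
const RAW_RETENTION_DAYS = 90;

// Illustrative record shape — not the real schema.
interface RawScrape {
  productId: string;
  scrapedAt: Date;  // when the raw payload was captured
  payload: string;  // HTML, extracted text, or intermediate JSON
}

// Keep only records still inside the retention window; everything older
// than the cutoff would be purged. Processed dossiers are never touched.
function pruneExpired(records: RawScrape[], now: Date = new Date()): RawScrape[] {
  const cutoff = new Date(now.getTime() - RAW_RETENTION_DAYS * 24 * 60 * 60 * 1000);
  return records.filter((r) => r.scrapedAt > cutoff);
}
```

In a real deployment the same cutoff would typically drive a scheduled `DELETE ... WHERE scraped_at < cutoff` job, so storage is reclaimed without any application-level scan.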
7-Day Scrape Cache via last_scraped_at
The deep-scrape Inngest function now checks a last_scraped_at timestamp before doing any work. If the product was scraped within the last 7 days, the function exits early — no HTTP requests, no parsing, no cost.
The guard is intentionally lightweight: it lives at the very start of the function, so the short-circuit is as cheap as a single database read.
When you genuinely need fresh data — because a supplier just updated their pricing, or you're debugging a dossier — a manual override bypasses the interval check entirely.
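The guard and its override can be sketched as a single pure predicate. The names below (`shouldDeepScrape`, `force`) are illustrative — the real check sits at the top of the deep-scrape Inngest function, where the timestamp comes from a single database read:

```typescript
// Minimum gap between automatic re-scrapes of the same product.
const MIN_RESCRAPE_INTERVAL_DAYS = 7;

// Returns true when a deep scrape should proceed. A manual override
// (force) bypasses the interval check entirely; a product that has
// never been scraped (null timestamp) always proceeds.
function shouldDeepScrape(
  lastScrapedAt: Date | null,
  force: boolean = false,
  now: Date = new Date()
): boolean {
  if (force || lastScrapedAt === null) return true;
  const ageMs = now.getTime() - lastScrapedAt.getTime();
  return ageMs >= MIN_RESCRAPE_INTERVAL_DAYS * 24 * 60 * 60 * 1000;
}
```

Because the predicate needs only `last_scraped_at` and a flag, the early exit costs one lightweight read — no HTTP requests, no parsing.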
Configuration at a Glance
| Setting | Default |
|---|---|
| Raw scrape retention | 90 days |
| Minimum re-scrape interval | 7 days |
What This Means in Practice
- Routine directory crawls will skip products already analysed in the past week, so large directory re-runs complete faster and don't re-queue known products.
- Storage costs stay predictable as the product catalogue grows — raw data from old scrapes is cleaned up on a rolling basis.
- Manual re-scrapes remain fully on-demand whenever you need the latest data from a supplier's listing.
This release is part of ongoing work to keep the crawl pipeline reliable and cost-efficient as the number of tracked products grows.