# v1.0.0: The Web Scraping Bot Engine — What's Missing and Why It Matters
**Published:** v1.0.0
**Status:** 🔴 Critical Gap Identified
## The Core Promise
HMOwatch exists to protect letting agents and landlords from costly HMO licensing compliance failures. The platform's primary mechanism is a fleet of automated bots that continuously scrape and monitor all 400+ UK local authority (LA) licensing sources, converting fragmented, ever-changing regulation into a single real-time alert system.
This document describes the current state of that mechanism — and the work still required to deliver it.
## Current State
As of v1.0.0, the scraping bot engine does not exist in the codebase. A full audit found:
| Component | Status |
|---|---|
| Headless browser driver (Playwright / Puppeteer) | ❌ Not present |
| HTTP crawler / spider | ❌ Not present |
| HTML parser (cheerio / JSDOM) | ❌ Not present |
| Scraping background functions (Inngest) | ❌ Not present |
| Bot orchestration / scheduling layer | ❌ Not present |
| Proxy pool / IP rotation | ❌ Not present |
| Rate-limit & back-off handling | ❌ Not present |
| Change-detection / diffing pipeline | ❌ Not present |
| Alert trigger hooks from scraped data | ❌ Not present |
## Why This Is Blocking
Every user-facing feature of HMOwatch — licensing alerts, compliance dashboards, regulation change notifications — depends on a live, continuously updated feed of data from LA websites. Without the scraping layer, there is no data source. No other component can compensate for its absence.
## What Needs to Be Built
### 1. Per-LA Scraper Modules
Each of the 400+ local authority sources will require its own extraction strategy. Some publish licensing information as static HTML, which a plain HTTP fetch and an HTML parser can handle; others render it dynamically via JavaScript frameworks and only expose content once the page executes in a browser. A headless browser driver (Playwright is recommended) is needed to handle the dynamic cases reliably.
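One way such per-LA modules could be shaped is a common contract that each source implements. This is a minimal sketch; every type and field name here (`ScraperModule`, `LicenceRecord`, the example GSS-code comment) is hypothetical and not taken from the codebase, and the `extract` body stands in for a real cheerio-based parser:

```typescript
// Hypothetical per-LA scraper module contract (illustrative names only).

interface LicenceRecord {
  laId: string;          // e.g. a GSS code or an internal slug (assumption)
  schemeType: "mandatory" | "additional" | "selective";
  sourceUrl: string;
  fetchedAt: string;     // ISO timestamp
  rawText: string;       // extracted licensing text for downstream diffing
}

interface ScraperModule {
  laId: string;
  // "static" pages can be fetched with plain HTTP and an HTML parser;
  // "dynamic" pages need a headless browser to render JavaScript first.
  strategy: "static" | "dynamic";
  sourceUrls: string[];
  // Given fetched/rendered HTML, extract structured records.
  extract(html: string): LicenceRecord[];
}

// A trivial static-strategy module using naive string extraction,
// standing in for a real parser such as cheerio:
const exampleModule: ScraperModule = {
  laId: "example-la",
  strategy: "static",
  sourceUrls: ["https://example.gov.uk/hmo-licensing"],
  extract(html: string): LicenceRecord[] {
    const match = html.match(/<main>([\s\S]*?)<\/main>/);
    return [{
      laId: "example-la",
      schemeType: "additional",
      sourceUrl: "https://example.gov.uk/hmo-licensing",
      fetchedAt: new Date().toISOString(),
      rawText: (match ? match[1] : html).replace(/<[^>]+>/g, " ").trim(),
    }];
  },
};
```

A registry keyed by `laId` would then let the orchestration layer look up the right module and strategy for each scheduled scrape.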
### 2. Scheduling & Orchestration
Scrapes must run on a recurring schedule, not as one-off jobs, so that changes are detected shortly after they occur. A background job framework such as Inngest is the intended orchestration layer. It must:
- Schedule scrapes per LA at appropriate intervals
- Handle retries and failure recovery
- Track last-scraped timestamps and next-run schedules
- Prioritise sources with a history of frequent changes
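The scheduling requirements above can be sketched as a simple adaptive-interval policy. Everything here is an illustrative assumption, not a platform decision: the interval bounds, the halving rule per recent change, and the function names are all invented for the sketch.

```typescript
// Illustrative adaptive scheduling: LAs with a history of frequent
// changes are scraped more often. Intervals are assumed defaults.

interface SourceSchedule {
  laId: string;
  lastScrapedAt: number;      // epoch ms
  recentChangeCount: number;  // changes detected in a rolling window
}

const MIN_INTERVAL_MS = 6 * 60 * 60 * 1000;       // floor: 6 hours
const MAX_INTERVAL_MS = 7 * 24 * 60 * 60 * 1000;  // ceiling: 7 days

// Halve the interval for each recent change, clamped to [MIN, MAX].
function nextRunAt(s: SourceSchedule): number {
  const interval = Math.max(
    MIN_INTERVAL_MS,
    MAX_INTERVAL_MS / Math.pow(2, s.recentChangeCount),
  );
  return s.lastScrapedAt + interval;
}

// Pick the most overdue sources first, up to a concurrency budget.
function dueSources(
  all: SourceSchedule[],
  now: number,
  budget: number,
): SourceSchedule[] {
  return all
    .filter((s) => nextRunAt(s) <= now)
    .sort((a, b) => nextRunAt(a) - nextRunAt(b))
    .slice(0, budget);
}
```

In an Inngest deployment, a cron-triggered function could call `dueSources` each tick and fan out one scrape event per due LA, letting Inngest handle the retries and concurrency limits.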
### 3. Proxy & Rate-Limit Handling
Scraping 400+ sources at scale will trigger bot-detection mechanisms. The engine needs:
- A proxy pool or IP-rotation strategy
- Per-domain rate limiting and configurable back-off
- User-agent rotation and request header normalisation
- Graceful handling of CAPTCHAs and access blocks
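Per-domain back-off can be sketched in a few lines. The base delay, cap, and full-jitter strategy below are illustrative defaults rather than decided values:

```typescript
// Sketch of per-domain back-off state; all constants are assumptions.

interface DomainState {
  failures: number;   // consecutive failures (429s, blocks, timeouts)
  notBefore: number;  // epoch ms before which no request may be sent
}

const BASE_DELAY_MS = 2_000;
const MAX_DELAY_MS = 15 * 60 * 1000;

// Exponential back-off with full jitter:
// delay is uniform in [0, min(MAX, BASE * 2^failures)).
function backoffDelay(
  failures: number,
  rand: () => number = Math.random,
): number {
  const cap = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** failures);
  return Math.floor(rand() * cap);
}

function onFailure(s: DomainState, now: number): DomainState {
  const failures = s.failures + 1;
  return { failures, notBefore: now + backoffDelay(failures) };
}

function onSuccess(): DomainState {
  return { failures: 0, notBefore: 0 };
}

function canRequest(s: DomainState, now: number): boolean {
  return now >= s.notBefore;
}
```

Full jitter (a random delay up to the exponential cap) spreads retries out, so hundreds of scrapers recovering from the same block do not hit a domain again in lockstep.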
### 4. Ingestion & Change Detection
Raw scraped content must be processed before it becomes useful:
- Normalise HTML/JSON into a structured internal schema
- Diff new content against previously stored snapshots
- Flag meaningful regulation changes (new licence types, boundary changes, fee updates, etc.)
- Suppress noise from cosmetic or irrelevant page changes
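As a minimal sketch of the diff-and-suppress step, assuming snapshots are compared by a fingerprint of normalised text, the pipeline could hash page content after stripping markup and volatile noise. The noise patterns shown are examples only; a real filter would need per-source tuning.

```typescript
// Illustrative change detection using only Node's standard library.
import { createHash } from "node:crypto";

// Strip markup and volatile noise (scripts, "updated" dates) before
// hashing, so cosmetic page changes don't raise false alerts.
function normalise(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\b\d{1,2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w* \d{4}\b/g, "")
    .replace(/\s+/g, " ")
    .trim()
    .toLowerCase();
}

function contentFingerprint(html: string): string {
  return createHash("sha256").update(normalise(html)).digest("hex");
}

// True when the page content changed relative to the stored snapshot.
function hasChanged(previousFingerprint: string | null, html: string): boolean {
  return previousFingerprint !== contentFingerprint(html);
}
```

With this shape, only the fingerprint (not the full page) needs to be stored per source for the cheap "did anything change?" check, with full snapshots kept separately for producing a human-readable diff.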
### 5. Alert Trigger Hooks
Once a meaningful change is detected, the pipeline must emit events that downstream systems (email alerts, in-app notifications, webhook delivery) can consume.
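A possible event contract for that hand-off is sketched below. The event name, field names, and change kinds are all illustrative assumptions; with Inngest, the emit step would typically be an `inngest.send({ name, data })` call rather than the in-process fan-out shown here.

```typescript
// Hypothetical event contract between change detection and alerting.

interface RegulationChangeEvent {
  name: "la/regulation.changed";   // illustrative event name
  data: {
    laId: string;
    sourceUrl: string;
    changeKind: "new-scheme" | "boundary" | "fee" | "other";
    detectedAt: string;  // ISO timestamp
    summary: string;     // human-readable diff summary for the alert body
  };
}

type EventSink = (event: RegulationChangeEvent) => void;

// Fan a detected change out to all registered sinks
// (email alerts, in-app notifications, webhook delivery).
function emitChange(sinks: EventSink[], event: RegulationChangeEvent): void {
  for (const sink of sinks) sink(event);
}
```

Keeping the event payload self-describing (source URL, change kind, summary) means each downstream consumer can format its own alert without querying the scraping layer again.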
## Recommended Technology Choices
| Need | Recommended Tool | Rationale |
|---|---|---|
| Headless browsing | Playwright | Handles JS-rendered pages; strong stealth and resilience options |
| HTML parsing | cheerio | Lightweight, jQuery-like API for static HTML extraction |
| Job scheduling | Inngest | Already referenced in the platform architecture; supports retries, concurrency, fan-out |
| Proxy management | Bright Data / Oxylabs | Enterprise-grade residential proxy pools suited to scraping public-sector sites at this scale |
## Summary
The scraping bot engine is the single most important missing component in HMOwatch. Until it is built, the platform cannot monitor any local authority source and cannot deliver its core value proposition. This gap is classified as critical and must be addressed before any other feature work is considered complete.