# v1.0.0: The Web Scraping Bot Engine — What's Missing and Why It Matters
**Published:** v1.0.0
**Status:** 🔴 Critical Gap Identified
## The Core Promise
HMOwatch exists to protect letting agents and landlords from costly HMO licensing compliance failures. The platform's primary mechanism is a fleet of automated bots that continuously scrape and monitor all 400+ UK local authority (LA) licensing sources, converting fragmented, ever-changing regulation into a single real-time alert system.
This document describes the current state of that mechanism — and the work still required to deliver it.
## Current State
As of v1.0.0, the scraping bot engine does not exist in the codebase. A full audit found:
| Component | Status |
|---|---|
| Headless browser driver (Playwright / Puppeteer) | ❌ Not present |
| HTTP crawler / spider | ❌ Not present |
| HTML parser (cheerio / JSDOM) | ❌ Not present |
| Scraping background functions (Inngest) | ❌ Not present |
| Bot orchestration / scheduling layer | ❌ Not present |
| Proxy pool / IP rotation | ❌ Not present |
| Rate-limit & back-off handling | ❌ Not present |
| Change-detection / diffing pipeline | ❌ Not present |
| Alert trigger hooks from scraped data | ❌ Not present |
## Why This Is Blocking
Every user-facing feature of HMOwatch — licensing alerts, compliance dashboards, regulation change notifications — depends on a live, continuously updated feed of data from LA websites. Without the scraping layer, there is no data source. No other component can compensate for its absence.
## What Needs to Be Built
### 1. Per-LA Scraper Modules
Each of the 400+ local authority sources will require its own extraction strategy. Some publish licensing information as static HTML, which a plain HTTP fetch and an HTML parser can handle; others render it dynamically via JavaScript frameworks and only expose content once the page executes in a browser. A headless browser driver (Playwright is recommended) is needed to handle the dynamic cases reliably.
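One way such per-LA modules could be shaped is a common contract that each source implements. This is a minimal sketch; every type and field name here (`ScraperModule`, `LicenceRecord`, the example GSS-code comment) is hypothetical and not taken from the codebase, and the `extract` body stands in for a real cheerio-based parser:

```typescript
// Hypothetical per-LA scraper module contract (illustrative names only).

interface LicenceRecord {
  laId: string;          // e.g. a GSS code or an internal slug (assumption)
  schemeType: "mandatory" | "additional" | "selective";
  sourceUrl: string;
  fetchedAt: string;     // ISO timestamp
  rawText: string;       // extracted licensing text for downstream diffing
}

interface ScraperModule {
  laId: string;
  // "static" pages can be fetched with plain HTTP and an HTML parser;
  // "dynamic" pages need a headless browser to render JavaScript first.
  strategy: "static" | "dynamic";
  sourceUrls: string[];
  // Given fetched/rendered HTML, extract structured records.
  extract(html: string): LicenceRecord[];
}

// A trivial static-strategy module using naive string extraction,
// standing in for a real parser such as cheerio:
const exampleModule: ScraperModule = {
  laId: "example-la",
  strategy: "static",
  sourceUrls: ["https://example.gov.uk/hmo-licensing"],
  extract(html: string): LicenceRecord[] {
    const match = html.match(/<main>([\s\S]*?)<\/main>/);
    return [{
      laId: "example-la",
      schemeType: "additional",
      sourceUrl: "https://example.gov.uk/hmo-licensing",
      fetchedAt: new Date().toISOString(),
      rawText: (match ? match[1] : html).replace(/<[^>]+>/g, " ").trim(),
    }];
  },
};
```

A registry keyed by `laId` would then let the orchestration layer look up the right module and strategy for each scheduled scrape.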
### 2. Scheduling & Orchestration
Scrapes must run on a recurring schedule, not as one-off jobs, so that changes are detected shortly after they occur. A background job framework such as Inngest is the intended orchestration layer. It must:
- Schedule scrapes per LA at appropriate intervals
- Handle retries and failure recovery
- Track last-scraped timestamps and next-run schedules
- Prioritise sources with a history of frequent changes
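The scheduling requirements above can be sketched as a simple adaptive-interval policy. Everything here is an illustrative assumption, not a platform decision: the interval bounds, the halving rule per recent change, and the function names are all invented for the sketch.

```typescript
// Illustrative adaptive scheduling: LAs with a history of frequent
// changes are scraped more often. Intervals are assumed defaults.

interface SourceSchedule {
  laId: string;
  lastScrapedAt: number;      // epoch ms
  recentChangeCount: number;  // changes detected in a rolling window
}

const MIN_INTERVAL_MS = 6 * 60 * 60 * 1000;       // floor: 6 hours
const MAX_INTERVAL_MS = 7 * 24 * 60 * 60 * 1000;  // ceiling: 7 days

// Halve the interval for each recent change, clamped to [MIN, MAX].
function nextRunAt(s: SourceSchedule): number {
  const interval = Math.max(
    MIN_INTERVAL_MS,
    MAX_INTERVAL_MS / Math.pow(2, s.recentChangeCount),
  );
  return s.lastScrapedAt + interval;
}

// Pick the most overdue sources first, up to a concurrency budget.
function dueSources(
  all: SourceSchedule[],
  now: number,
  budget: number,
): SourceSchedule[] {
  return all
    .filter((s) => nextRunAt(s) <= now)
    .sort((a, b) => nextRunAt(a) - nextRunAt(b))
    .slice(0, budget);
}
```

In an Inngest deployment, a cron-triggered function could call `dueSources` each tick and fan out one scrape event per due LA, letting Inngest handle the retries and concurrency limits.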
### 3. Proxy & Rate-Limit Handling
Scraping 400+ sources at scale will trigger bot-detection mechanisms. The engine needs:
- A proxy pool or IP-rotation strategy
- Per-domain rate limiting and configurable back-off
- User-agent rotation and request header normalisation
- Graceful handling of CAPTCHAs and access blocks
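Per-domain back-off can be sketched in a few lines. The base delay, cap, and full-jitter strategy below are illustrative defaults rather than decided values:

```typescript
// Sketch of per-domain back-off state; all constants are assumptions.

interface DomainState {
  failures: number;   // consecutive failures (429s, blocks, timeouts)
  notBefore: number;  // epoch ms before which no request may be sent
}

const BASE_DELAY_MS = 2_000;
const MAX_DELAY_MS = 15 * 60 * 1000;

// Exponential back-off with full jitter:
// delay is uniform in [0, min(MAX, BASE * 2^failures)).
function backoffDelay(
  failures: number,
  rand: () => number = Math.random,
): number {
  const cap = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** failures);
  return Math.floor(rand() * cap);
}

function onFailure(s: DomainState, now: number): DomainState {
  const failures = s.failures + 1;
  return { failures, notBefore: now + backoffDelay(failures) };
}

function onSuccess(): DomainState {
  return { failures: 0, notBefore: 0 };
}

function canRequest(s: DomainState, now: number): boolean {
  return now >= s.notBefore;
}
```

Full jitter (a random delay up to the exponential cap) spreads retries out, so hundreds of scrapers recovering from the same block do not hit a domain again in lockstep.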
### 4. Ingestion & Change Detection
Raw scraped content must be processed before it becomes useful:
- Normalise HTML/JSON into a structured internal schema
- Diff new content against previously stored snapshots
- Flag meaningful regulation changes (new licence types, boundary changes, fee updates, etc.)
- Suppress noise from cosmetic or irrelevant page changes
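As a minimal sketch of the diff-and-suppress step, assuming snapshots are compared by a fingerprint of normalised text, the pipeline could hash page content after stripping markup and volatile noise. The noise patterns shown are examples only; a real filter would need per-source tuning.

```typescript
// Illustrative change detection using only Node's standard library.
import { createHash } from "node:crypto";

// Strip markup and volatile noise (scripts, "updated" dates) before
// hashing, so cosmetic page changes don't raise false alerts.
function normalise(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\b\d{1,2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w* \d{4}\b/g, "")
    .replace(/\s+/g, " ")
    .trim()
    .toLowerCase();
}

function contentFingerprint(html: string): string {
  return createHash("sha256").update(normalise(html)).digest("hex");
}

// True when the page content changed relative to the stored snapshot.
function hasChanged(previousFingerprint: string | null, html: string): boolean {
  return previousFingerprint !== contentFingerprint(html);
}
```

With this shape, only the fingerprint (not the full page) needs to be stored per source for the cheap "did anything change?" check, with full snapshots kept separately for producing a human-readable diff.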
### 5. Alert Trigger Hooks
Once a meaningful change is detected, the pipeline must emit events that downstream systems (email alerts, in-app notifications, webhook delivery) can consume.
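A possible event contract for that hand-off is sketched below. The event name, field names, and change kinds are all illustrative assumptions; with Inngest, the emit step would typically be an `inngest.send({ name, data })` call rather than the in-process fan-out shown here.

```typescript
// Hypothetical event contract between change detection and alerting.

interface RegulationChangeEvent {
  name: "la/regulation.changed";   // illustrative event name
  data: {
    laId: string;
    sourceUrl: string;
    changeKind: "new-scheme" | "boundary" | "fee" | "other";
    detectedAt: string;  // ISO timestamp
    summary: string;     // human-readable diff summary for the alert body
  };
}

type EventSink = (event: RegulationChangeEvent) => void;

// Fan a detected change out to all registered sinks
// (email alerts, in-app notifications, webhook delivery).
function emitChange(sinks: EventSink[], event: RegulationChangeEvent): void {
  for (const sink of sinks) sink(event);
}
```

Keeping the event payload self-describing (source URL, change kind, summary) means each downstream consumer can format its own alert without querying the scraping layer again.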
## Recommended Technology Choices
| Need | Recommended Tool | Rationale |
|---|---|---|
| Headless browsing | Playwright | Handles JS-rendered pages; strong stealth and resilience options |
| HTML parsing | cheerio | Lightweight, jQuery-like API for static HTML extraction |
| Job scheduling | Inngest | Already referenced in the platform architecture; supports retries, concurrency, fan-out |
| Proxy management | Bright Data / Oxylabs | Enterprise-grade residential proxy pools suited to scraping public-sector sites at this scale |
## Summary
The scraping bot engine is the single most important missing component in HMOwatch. Until it is built, the platform cannot monitor any local authority source and cannot deliver its core value proposition. This gap is classified as critical and must be addressed before any other feature work is considered complete.