HMOwatch · Features · Updated February 20, 2026

v1.0.0: The Web Scraping Bot Engine — What's Missing and Why It Matters


Published: v1.0.0
Status: 🔴 Critical Gap Identified


The Core Promise

HMOwatch exists to protect letting agents and landlords from costly HMO licensing compliance failures. The platform's primary mechanism is a fleet of automated bots that continuously scrape and monitor all 400+ UK local authority (LA) licensing sources, converting fragmented, ever-changing regulatory information into a single real-time alert stream.

This document describes the current state of that mechanism — and the work still required to deliver it.


Current State

As of v1.0.0, the scraping bot engine does not exist in the codebase. A full audit found:

Component                                        | Status
Headless browser driver (Playwright / Puppeteer) | ❌ Not present
HTTP crawler / spider                            | ❌ Not present
HTML parser (cheerio / JSDOM)                    | ❌ Not present
Scraping background functions (Inngest)          | ❌ Not present
Bot orchestration / scheduling layer             | ❌ Not present
Proxy pool / IP rotation                         | ❌ Not present
Rate-limit & back-off handling                   | ❌ Not present
Change-detection / diffing pipeline              | ❌ Not present
Alert trigger hooks from scraped data            | ❌ Not present

Why This Is Blocking

Every user-facing feature of HMOwatch — licensing alerts, compliance dashboards, regulation change notifications — depends on a live, continuously updated feed of data from LA websites. Without the scraping layer, there is no data source. No other component can compensate for its absence.


What Needs to Be Built

1. Per-LA Scraper Modules

Each of the 400+ local authority sources will require its own extraction strategy. Some publish licensing information as static HTML; others render it dynamically via JavaScript frameworks. A headless browser driver (Playwright is recommended) is needed to handle both cases reliably.

2. Scheduling & Orchestration

Scrapes must run on a continuous schedule — not once, but repeatedly — to detect changes as soon as they occur. A background job framework such as Inngest is the intended orchestration layer. It must:

  • Schedule scrapes per LA at appropriate intervals
  • Handle retries and failure recovery
  • Track last-scraped timestamps and next-run schedules
  • Prioritise sources with a history of frequent changes
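The last two bullets can be combined into an adaptive schedule: poll sources that change often more frequently. A minimal sketch of that policy, which could feed the cron/interval configuration of an Inngest function, follows; the interval bounds and linear interpolation are illustrative assumptions.

```typescript
// Sketch of adaptive per-LA scheduling: sources with a history of
// frequent changes are polled more often. Bounds are assumptions.

interface SourceState {
  lastScrapedAt: number;   // epoch ms of the last completed scrape
  recentChanges: number;   // changes detected over the last N scrapes
  recentScrapes: number;   // N
}

const MIN_INTERVAL_MS = 6 * 60 * 60 * 1000;       // 6 hours
const MAX_INTERVAL_MS = 7 * 24 * 60 * 60 * 1000;  // 7 days

function nextRunAt(state: SourceState): number {
  const changeRate =
    state.recentScrapes === 0 ? 0 : state.recentChanges / state.recentScrapes;
  // Linear interpolation: a source that changed on every recent scrape
  // gets the minimum interval; a completely static one gets the maximum.
  const interval =
    MAX_INTERVAL_MS - changeRate * (MAX_INTERVAL_MS - MIN_INTERVAL_MS);
  return state.lastScrapedAt + interval;
}
```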

3. Proxy & Rate-Limit Handling

Scraping 400+ sources at scale will trigger bot-detection mechanisms. The engine needs:

  • A proxy pool or IP-rotation strategy
  • Per-domain rate limiting and configurable back-off
  • User-agent rotation and request header normalisation
  • Graceful handling of CAPTCHAs and access blocks
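The per-domain back-off requirement can be sketched as a small throttle that doubles the delay after each consecutive failure and resets on success. The base delay and cap below are illustrative defaults, not figures from the platform.

```typescript
// Sketch of per-domain rate limiting with exponential back-off.
// Base delay (1s) and cap (5 minutes) are illustrative assumptions.

class DomainThrottle {
  private failures = new Map<string, number>();

  // Delay before the next request to `domain`, doubling per
  // consecutive failure: 1s, 2s, 4s, ... capped at 5 minutes.
  delayMs(domain: string): number {
    const n = this.failures.get(domain) ?? 0;
    return Math.min(1000 * 2 ** n, 5 * 60 * 1000);
  }

  recordFailure(domain: string): void {
    this.failures.set(domain, (this.failures.get(domain) ?? 0) + 1);
  }

  recordSuccess(domain: string): void {
    this.failures.delete(domain);  // reset back-off on success
  }
}
```

Keying the throttle by domain rather than by URL matters here: many LA licensing pages live on a shared council domain, so requests to them must share one budget.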

4. Ingestion & Change Detection

Raw scraped content must be processed before it becomes useful:

  • Normalise HTML/JSON into a structured internal schema
  • Diff new content against previously stored snapshots
  • Flag meaningful regulation changes (new licence types, boundary changes, fee updates, etc.)
  • Suppress noise from cosmetic or irrelevant page changes
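The diffing and noise-suppression steps can be sketched as normalise-then-hash: strip volatile markup (scripts, comments, whitespace) before fingerprinting, so cosmetic edits do not register as changes. The stripping rules below are illustrative assumptions; a production pipeline would diff the structured schema, not raw HTML.

```typescript
// Sketch of noise-suppressing change detection. The normalisation
// rules are illustrative assumptions.

import { createHash } from "node:crypto";

function normalise(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, "")  // analytics, CSRF tokens
    .replace(/<!--[\s\S]*?-->/g, "")             // comments
    .replace(/\s+/g, " ")                        // whitespace churn
    .trim();
}

function fingerprint(html: string): string {
  return createHash("sha256").update(normalise(html)).digest("hex");
}

function hasChanged(previous: string, current: string): boolean {
  return fingerprint(previous) !== fingerprint(current);
}
```

Storing only the fingerprint of the previous snapshot keeps the comparison cheap; the full snapshot is still worth retaining so that a flagged change can be reviewed by a human.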

5. Alert Trigger Hooks

Once a meaningful change is detected, the pipeline must emit events that downstream systems (email alerts, in-app notifications, webhook delivery) can consume.
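The contract between the pipeline and its consumers can be sketched as a typed event plus a fan-out emitter. The field names, the change-kind categories, and the emitter shape are assumptions for illustration; in practice this role would likely be played by Inngest events rather than an in-process list.

```typescript
// Sketch of the event emitted on a meaningful regulation change.
// Field names and kind categories are illustrative assumptions.

interface RegulationChangeEvent {
  laCode: string;
  sourceUrl: string;
  detectedAt: string;  // ISO 8601 timestamp
  kind: "new-licence-type" | "boundary-change" | "fee-update" | "other";
  summary: string;     // human-readable description of the change
}

type Handler = (event: RegulationChangeEvent) => void;
const handlers: Handler[] = [];

function onRegulationChange(handler: Handler): void {
  handlers.push(handler);
}

function emitRegulationChange(event: RegulationChangeEvent): void {
  // Fan out to every consumer: email alerts, in-app
  // notifications, webhook delivery, audit log, etc.
  for (const handler of handlers) handler(event);
}
```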


Recommended Technology Choices

Need              | Recommended Tool      | Rationale
Headless browsing | Playwright            | Handles JS-rendered pages; strong stealth and resilience options
HTML parsing      | cheerio               | Lightweight, jQuery-like API for static HTML extraction
Job scheduling    | Inngest               | Already referenced in the platform architecture; supports retries, concurrency, fan-out
Proxy management  | Bright Data / Oxylabs | Enterprise-grade residential proxy pools suited to public-sector scraping

Summary

The scraping bot engine is the single most important missing component in HMOwatch. Until it is built, the platform cannot monitor any local authority source and cannot deliver its core value proposition. This gap is classified as critical and must be addressed before any other feature work is considered complete.