Features | HMOwatch | Updated March 15, 2026

HMO Register Aggregation Pipeline

HMOwatch continuously scrapes UK local authority public HMO registers, extracts structured records using AI, and presents them in a searchable national dashboard. This page documents how the pipeline works, how to configure data sources, and how to query the data via the API.

Overview

The pipeline consists of four layers:

  1. Source registry: a database table of known council register URLs.
  2. Background jobs: scheduled and event-driven Inngest functions that fetch and process each source.
  3. AI extraction: Claude Haiku parses raw HTML into structured records.
  4. API + UI: tRPC endpoints and a Next.js dashboard page expose the aggregated data.

[Council Register URLs]
        │
        ▼
[hmo-register-scraper]  ← daily cron @ 02:00 UTC
        │  fan-out per source
        ▼
[hmo-register-source-scraper]  ← event-driven
        │  fetch HTML → Claude Haiku extraction
        ▼
[hmo_register_entries table]  ← upserted, SHA-256 deduplicated
        │
        ▼
[tRPC hmoRegister router]  ←  /dashboard/register UI

Database Tables

council_register_sources

Tracks every known public HMO register URL.

Column            Type       Description
id                uuid       Primary key
councilId         text       Reference to the parent council
url               text       Public register URL to scrape
status            enum       active | inactive | needs_review
dataFormatHint    text       Optional hint to guide AI extraction
lastScrapedAt     timestamp  When the source was last successfully scraped
lastErrorAt       timestamp  When the source last errored
lastErrorMessage  text       Most recent error detail

hmo_register_entries

Stores individual HMO property/landlord records extracted from council registers.

Column         Type       Description
id             uuid       Primary key
councilId      text       Owning council
addressLine1   text       Property address
postcode       text       Property postcode
landlordName   text       Licence holder name
licenseNumber  text       Council-issued licence number
licenseType    text       e.g. Mandatory, Additional, Selective
licenseStatus  text       e.g. Active, Expired, Pending
issueDate      text       Licence issue date
expiryDate     text       Licence expiry date
maxOccupants   int        Maximum permitted occupants
bedrooms       int        Number of bedrooms
hashKey        text       SHA-256 of (councilId + licenseNumber + postcode + address); unique constraint used for upsert deduplication
lastSeenAt     timestamp  Timestamp of the most recent scrape confirming the record

Background Jobs

Jobs run via Inngest and are registered at the /api/inngest endpoint.

hmo-register-scraper (cron)

  • Schedule: daily at 02:00 UTC
  • Behaviour: queries council_register_sources for all active sources where lastScrapedAt is older than 7 days (or null), then emits one hmo/register.source.scrape event per source.
  • This fan-out pattern keeps individual scrape jobs isolated and independently retriable.
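
The selection and fan-out rule above can be sketched as two pure functions. This is a minimal illustration, not the actual job code: the RegisterSource shape, the function names, and the event payload ({ sourceId }) are assumptions made for the example.

```typescript
// Hypothetical shapes: only the fields the cron job needs are modelled here.
type SourceStatus = "active" | "inactive" | "needs_review";

interface RegisterSource {
  id: string;
  councilId: string;
  url: string;
  status: SourceStatus;
  lastScrapedAt: Date | null;
}

interface ScrapeEvent {
  name: "hmo/register.source.scrape";
  data: { sourceId: string }; // payload shape is an assumption
}

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// Active sources that have never been scraped, or were last scraped
// more than 7 days before `now`.
function selectDueSources(sources: RegisterSource[], now: Date): RegisterSource[] {
  const cutoff = now.getTime() - SEVEN_DAYS_MS;
  return sources.filter(
    (s) =>
      s.status === "active" &&
      (s.lastScrapedAt === null || s.lastScrapedAt.getTime() < cutoff),
  );
}

// One event per due source, so each scrape runs and retries on its own.
function buildScrapeEvents(sources: RegisterSource[], now: Date): ScrapeEvent[] {
  return selectDueSources(sources, now).map((s) => ({
    name: "hmo/register.source.scrape" as const,
    data: { sourceId: s.id },
  }));
}
```

In the real pipeline the filtering would more likely happen in the database query; keeping it as a pure function here just makes the 7-day rule easy to test.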

hmo-register-source-scraper (event-driven)

  • Trigger: hmo/register.source.scrape
  • Behaviour:
    1. Fetches the register URL.
    2. Passes the raw HTML text to the Claude Haiku extractor.
    3. Upserts returned records into hmo_register_entries using onConflictDoUpdate on hashKey.
    4. Updates lastScrapedAt on the source row on success, or lastErrorAt / lastErrorMessage on failure.
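
The four steps above can be sketched as a single function with its side effects injected. Everything named here (ScrapeDeps, scrapeSource, the helper signatures) is hypothetical and stands in for the real fetch, extractor, and database calls; rethrowing after recording the error is also an assumption, chosen so the job runner can retry.

```typescript
// Hypothetical record and dependency shapes for illustration only.
interface ExtractedRecord {
  licenseNumber: string | null;
  address: string | null;
}

interface SourceRow {
  id: string;
  councilId: string;
  url: string;
  dataFormatHint?: string;
}

interface ScrapeDeps {
  fetchHtml(url: string): Promise<string>;
  extractRecords(html: string, hint?: string): Promise<ExtractedRecord[]>;
  upsertEntries(councilId: string, records: ExtractedRecord[]): Promise<void>;
  markSuccess(sourceId: string, at: Date): Promise<void>;
  markError(sourceId: string, at: Date, message: string): Promise<void>;
}

// Runs one scrape: fetch, extract, upsert, then record the outcome on the source row.
async function scrapeSource(
  source: SourceRow,
  deps: ScrapeDeps,
  now: Date = new Date(),
): Promise<number> {
  try {
    const html = await deps.fetchHtml(source.url);
    const records = await deps.extractRecords(html, source.dataFormatHint);
    await deps.upsertEntries(source.councilId, records);
    await deps.markSuccess(source.id, now); // sets lastScrapedAt
    return records.length;
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    await deps.markError(source.id, now, message); // sets lastErrorAt / lastErrorMessage
    throw err; // assumption: rethrow so the job runner can retry this source
  }
}
```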

AI Extraction

The extractor (src/lib/anthropic/extract-register.ts) uses Claude Haiku to parse raw HTML page text into structured records.

  • Handles up to 200 records per page.
  • Tolerates varying council page formats — no per-council template configuration required.
  • Returns an array of objects with the fields: address, landlordName, licenseNumber, licenseType, licenseStatus, issueDate, expiryDate, maxOccupants, bedrooms.
  • Requires the ANTHROPIC_API_KEY environment variable to be set.
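
Because model output is untrusted, it is worth validating before it reaches the database. The sketch below assumes the model is prompted to reply with a JSON array of objects; that reply format, the parseExtractedRecords name, and the choice to truncate at 200 records are illustrative assumptions, not a description of extract-register.ts.

```typescript
// Field list taken from the documentation above; nullability is an assumption.
interface ExtractedRecord {
  address: string | null;
  landlordName: string | null;
  licenseNumber: string | null;
  licenseType: string | null;
  licenseStatus: string | null;
  issueDate: string | null;
  expiryDate: string | null;
  maxOccupants: number | null;
  bedrooms: number | null;
}

const MAX_RECORDS_PER_PAGE = 200;

// Non-empty trimmed string, or null for anything else.
function asText(v: unknown): string | null {
  return typeof v === "string" && v.trim() !== "" ? v.trim() : null;
}

// Non-negative integer, or null for anything else (including numeric strings).
function asInt(v: unknown): number | null {
  return typeof v === "number" && Number.isInteger(v) && v >= 0 ? v : null;
}

// Parses a JSON array reply into records; malformed input yields an empty array.
function parseExtractedRecords(raw: string): ExtractedRecord[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return [];
  }
  if (!Array.isArray(parsed)) return [];
  return parsed
    .filter(
      (r): r is Record<string, unknown> =>
        typeof r === "object" && r !== null && !Array.isArray(r),
    )
    .slice(0, MAX_RECORDS_PER_PAGE)
    .map((r) => ({
      address: asText(r["address"]),
      landlordName: asText(r["landlordName"]),
      licenseNumber: asText(r["licenseNumber"]),
      licenseType: asText(r["licenseType"]),
      licenseStatus: asText(r["licenseStatus"]),
      issueDate: asText(r["issueDate"]),
      expiryDate: asText(r["expiryDate"]),
      maxOccupants: asInt(r["maxOccupants"]),
      bedrooms: asInt(r["bedrooms"]),
    }));
}
```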

tRPC API Reference

All procedures are under the hmoRegister router.

hmoRegister.stats

Returns aggregate counts for the KPI cards.

Input: none

Response:

{
  totalEntries: number;
  councilsCovered: number;
  activeSources: number;
  recentlyUpdated: number; // entries updated in the last 7 days
}
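
To make the meaning of each counter concrete, here is a small in-memory sketch of how the four values could be derived. It is illustrative only: the real procedure presumably runs aggregate queries, and treating lastSeenAt as the "updated" timestamp is an assumption.

```typescript
// Minimal hypothetical row shapes; only the fields the counters depend on.
interface EntryLike {
  councilId: string;
  lastSeenAt: Date;
}

interface SourceLike {
  status: "active" | "inactive" | "needs_review";
}

interface RegisterStats {
  totalEntries: number;
  councilsCovered: number;
  activeSources: number;
  recentlyUpdated: number;
}

const STATS_WINDOW_MS = 7 * 24 * 60 * 60 * 1000;

function computeStats(entries: EntryLike[], sources: SourceLike[], now: Date): RegisterStats {
  const cutoff = now.getTime() - STATS_WINDOW_MS;
  return {
    totalEntries: entries.length,
    // Distinct councils that have at least one entry.
    councilsCovered: new Set(entries.map((e) => e.councilId)).size,
    activeSources: sources.filter((s) => s.status === "active").length,
    // Assumption: "updated" means lastSeenAt falls within the last 7 days.
    recentlyUpdated: entries.filter((e) => e.lastSeenAt.getTime() >= cutoff).length,
  };
}
```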

hmoRegister.list

Paginated search across all register entries.

Input:

{
  limit?: number;       // default 25
  cursor?: string;      // opaque pagination cursor
  search?: string;      // full-text match on address, landlord, licence number
  postcode?: string;    // filter by postcode
}

Response:

{
  items: RegisterEntry[];
  nextCursor: string | null;
  hasMore: boolean;
}
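
A caller pages through results by passing each response's nextCursor back as cursor until hasMore is false. The sketch below shows that loop against an injected list function, so it does not depend on any particular tRPC client setup; fetchAllEntries, ListFn, the trimmed RegisterEntry shape, and the maxPages safety limit are assumptions for the example.

```typescript
// Trimmed, hypothetical entry shape; the real RegisterEntry has more fields.
interface RegisterEntry {
  licenseNumber: string | null;
  addressLine1: string | null;
  postcode: string | null;
}

interface ListInput {
  limit?: number;
  cursor?: string;
  search?: string;
  postcode?: string;
}

interface ListOutput {
  items: RegisterEntry[];
  nextCursor: string | null;
  hasMore: boolean;
}

// Stand-in for the hmoRegister.list query call.
type ListFn = (input: ListInput) => Promise<ListOutput>;

// Follows nextCursor until the server reports no more pages or maxPages is reached.
async function fetchAllEntries(
  list: ListFn,
  filters: Omit<ListInput, "cursor">,
  maxPages = 10,
): Promise<RegisterEntry[]> {
  const all: RegisterEntry[] = [];
  let cursor: string | undefined;
  for (let page = 0; page < maxPages; page++) {
    const input: ListInput = cursor === undefined ? { ...filters } : { ...filters, cursor };
    const res = await list(input);
    all.push(...res.items);
    if (!res.hasMore || res.nextCursor === null) break;
    cursor = res.nextCursor;
  }
  return all;
}
```

The maxPages guard keeps a client bug or an unexpectedly large result set from turning into an unbounded loop.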

hmoRegister.listSources

Returns all configured register source URLs with their scrape status.

Input: {} (empty object)

Response: Array of council_register_sources rows.


hmoRegister.addSource

Adds a new council register URL to the scrape queue.

Input:

{
  councilId: string;
  url: string;
  dataFormatHint?: string;
}

hmoRegister.updateSourceStatus

Sets the status of an existing register source: active to enable scraping, inactive to disable it, or needs_review to flag it for attention.

Input:

{
  id: string;
  status: "active" | "inactive" | "needs_review";
}

Dashboard Page

The national HMO register browser is available at /dashboard/register.

KPI Cards

Card              Description
Total Entries     Total HMO records across all councils
Councils Covered  Number of distinct councils with at least one scraped entry
Active Sources    Number of enabled register source URLs
Recently Updated  Entries refreshed in the last 7 days

Register Sources Panel

Collapsible panel listing every configured source URL with its scrape health (last scraped timestamp, error state, status badge). Use this to monitor scrape coverage and identify sources that need attention.

Search & Filter

  • Search: full-text match against address, landlord name, and licence number.
  • Postcode filter: narrow results to a specific postcode district or sector.
  • Changing either filter resets pagination to page 1.

Data Table

Displays paginated results with columns: Address, Council, Landlord, Licence Number, Licence Type, Status (colour-coded badge), Expiry Date.

Status badge colours:

  • Green: active
  • Red: expired
  • Amber: pending
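
The colour mapping can be expressed as a small lookup. This is a sketch only: the badgeColour name, the case-insensitive comparison, and the grey fallback for unrecognised statuses are assumptions, since licenseStatus is free text extracted from council pages.

```typescript
type BadgeColour = "green" | "red" | "amber" | "grey";

// Maps a licence status string to a badge colour, ignoring case and surrounding whitespace.
function badgeColour(licenseStatus: string | null): BadgeColour {
  switch ((licenseStatus ?? "").trim().toLowerCase()) {
    case "active":
      return "green";
    case "expired":
      return "red";
    case "pending":
      return "amber";
    default:
      return "grey"; // assumption: neutral fallback for unknown or missing statuses
  }
}
```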

Environment Variables

Variable           Required  Description
ANTHROPIC_API_KEY  Yes       API key for Claude Haiku extraction. Obtain from console.anthropic.com.

Deduplication

Each record is hashed with SHA-256 over the concatenation of councilId + licenseNumber + postcode + address. The hashKey column has a unique constraint. On re-scrape, onConflictDoUpdate refreshes the record fields and updates lastSeenAt, ensuring the table always reflects the latest data without creating duplicates.
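
A minimal sketch of the hash key computation, using Node's built-in crypto module. The field order follows the description above; the normalisation step (trim, lowercase, collapse whitespace) and the "|" separator are assumptions added so that cosmetic differences between scrapes do not produce new keys, and the real implementation may concatenate the raw values directly.

```typescript
import { createHash } from "node:crypto";

// Hypothetical input shape: the four fields named in the documentation.
interface HashKeyParts {
  councilId: string;
  licenseNumber: string | null;
  postcode: string | null;
  address: string | null;
}

// Assumption: normalise before hashing so "LS6 1AA" and "ls6  1aa" map to the same key.
function normalise(value: string | null): string {
  return (value ?? "").trim().toLowerCase().replace(/\s+/g, " ");
}

// SHA-256 over councilId, licenseNumber, postcode, address (in that order), as lowercase hex.
function computeHashKey(parts: HashKeyParts): string {
  const material = [parts.councilId, parts.licenseNumber, parts.postcode, parts.address]
    .map(normalise)
    .join("|"); // assumption: separator avoids ambiguous concatenations
  return createHash("sha256").update(material, "utf8").digest("hex");
}
```

With a key like this, a re-scraped record that differs only in spacing or letter case hits the unique constraint and is updated in place instead of being inserted again.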