HMO Register Aggregation Pipeline
HMOwatch continuously scrapes UK local authority public HMO registers, extracts structured records using AI, and presents them in a searchable national dashboard. This page documents how the pipeline works, how to configure data sources, and how to query the data via the API.
Overview
The pipeline consists of four layers:
- Source registry — a database table of known council register URLs.
- Background jobs — scheduled and event-driven Inngest functions that fetch and process each source.
- AI extraction — Claude Haiku parses raw HTML into structured records.
- API + UI — tRPC endpoints and a Next.js dashboard page expose the aggregated data.
[Council Register URLs]
│
▼
[hmo-register-scraper] ← daily cron @ 02:00 UTC
│ fan-out per source
▼
[hmo-register-source-scraper] ← event-driven
│ fetch HTML → Claude Haiku extraction
▼
[hmo_register_entries table] ← upserted, SHA-256 deduplicated
│
▼
[tRPC hmoRegister router] ← /dashboard/register UI
Database Tables
council_register_sources
Tracks every known public HMO register URL.
| Column | Type | Description |
|---|---|---|
| id | uuid | Primary key |
| councilId | text | Reference to the parent council |
| url | text | Public register URL to scrape |
| status | enum | `active`, `inactive`, or `needs_review` |
| dataFormatHint | text | Optional hint to guide AI extraction |
| lastScrapedAt | timestamp | When the source was last successfully scraped |
| lastErrorAt | timestamp | When the source last errored |
| lastErrorMessage | text | Most recent error detail |
hmo_register_entries
Stores individual HMO property/landlord records extracted from council registers.
| Column | Type | Description |
|---|---|---|
| id | uuid | Primary key |
| councilId | text | Owning council |
| addressLine1 | text | Property address |
| postcode | text | Property postcode |
| landlordName | text | Licence holder name |
| licenseNumber | text | Council-issued licence number |
| licenseType | text | e.g. Mandatory, Additional, Selective |
| licenseStatus | text | e.g. Active, Expired, Pending |
| issueDate | text | Licence issue date |
| expiryDate | text | Licence expiry date |
| maxOccupants | int | Maximum permitted occupants |
| bedrooms | int | Number of bedrooms |
| hashKey | text | SHA-256 of (councilId + licenseNumber + postcode + address); unique constraint for upsert deduplication |
| lastSeenAt | timestamp | Timestamp of most recent scrape confirming the record |
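The row shape above can be written out as a TypeScript type. This is a sketch derived from the column list (the name `RegisterEntry` matches the type returned by the API below); nullability of the optional fields is an assumption, since register data varies by council, and the sample values are invented for illustration.

```typescript
// Shape of one hmo_register_entries row, matching the table above.
interface RegisterEntry {
  id: string; // uuid
  councilId: string;
  addressLine1: string;
  postcode: string | null;
  landlordName: string | null;
  licenseNumber: string | null;
  licenseType: string | null; // e.g. "Mandatory", "Additional", "Selective"
  licenseStatus: string | null; // e.g. "Active", "Expired", "Pending"
  issueDate: string | null;
  expiryDate: string | null;
  maxOccupants: number | null;
  bedrooms: number | null;
  hashKey: string; // SHA-256 hex, unique
  lastSeenAt: string; // ISO timestamp
}

// Hypothetical example record for illustration only.
const example: RegisterEntry = {
  id: "00000000-0000-0000-0000-000000000000",
  councilId: "example-council",
  addressLine1: "1 Example Street",
  postcode: "M1 1AA",
  landlordName: "A. Landlord",
  licenseNumber: "HMO/2024/0001",
  licenseType: "Mandatory",
  licenseStatus: "Active",
  issueDate: "2024-01-01",
  expiryDate: "2029-01-01",
  maxOccupants: 5,
  bedrooms: 4,
  hashKey: "",
  lastSeenAt: new Date().toISOString(),
};
```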
Background Jobs
Jobs run via Inngest and are registered at the /api/inngest endpoint.
hmo-register-scraper (cron)
- Schedule: daily at 02:00 UTC
- Behaviour: queries `council_register_sources` for all `active` sources where `lastScrapedAt` is older than 7 days (or null), then emits one `hmo/register.source.scrape` event per source.
- This fan-out pattern keeps individual scrape jobs isolated and independently retriable.
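The "older than 7 days, or never scraped" check can be sketched as a small predicate. This is an illustrative sketch, not the actual job code (in practice the same condition would likely be expressed as a SQL `WHERE` clause so only due rows are fetched); the field names come from the table above.

```typescript
// A source is due for scraping if it is active and was last scraped
// more than 7 days ago (or never scraped at all).
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

interface SourceRow {
  status: "active" | "inactive" | "needs_review";
  lastScrapedAt: Date | null;
}

function needsScrape(source: SourceRow, now: Date = new Date()): boolean {
  if (source.status !== "active") return false;
  if (source.lastScrapedAt === null) return true;
  return now.getTime() - source.lastScrapedAt.getTime() > SEVEN_DAYS_MS;
}
```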
hmo-register-source-scraper (event-driven)
- Trigger: `hmo/register.source.scrape`
- Behaviour:
  1. Fetches the register URL.
  2. Passes the raw HTML text to the Claude Haiku extractor.
  3. Upserts returned records into `hmo_register_entries` using `onConflictDoUpdate` on `hashKey`.
  4. Updates `lastScrapedAt` on the source row on success, or `lastErrorAt`/`lastErrorMessage` on failure.
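The upsert semantics — insert new rows, refresh existing ones keyed on `hashKey`, never create duplicates — can be mirrored with an in-memory equivalent. The real job uses the database's `onConflictDoUpdate`; this sketch only demonstrates the behaviour, with a deliberately reduced record type.

```typescript
// Reduced record type for illustration.
interface ExtractedRecord {
  hashKey: string;
  licenseStatus: string;
  lastSeenAt: Date;
}

// Mirrors ON CONFLICT (hash_key) DO UPDATE: existing rows are refreshed
// in place, new rows are inserted; no duplicates are ever created.
function upsertAll(
  table: Map<string, ExtractedRecord>,
  records: ExtractedRecord[],
): void {
  for (const record of records) {
    table.set(record.hashKey, record); // insert or overwrite by hashKey
  }
}
```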
AI Extraction
The extractor (`src/lib/anthropic/extract-register.ts`) uses Claude Haiku to parse raw HTML page text into structured records.
- Handles up to 200 records per page.
- Tolerates varying council page formats — no per-council template configuration required.
- Returns an array of objects with the fields: `address`, `landlordName`, `licenseNumber`, `licenseType`, `licenseStatus`, `issueDate`, `expiryDate`, `maxOccupants`, `bedrooms`.
- Requires the `ANTHROPIC_API_KEY` environment variable to be set.
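Since model output can be malformed, a validation pass over the returned array is prudent before upserting. A sketch under stated assumptions: the 200-record cap comes from the constraint above, but the specific "keep records that have at least an address or a licence number" rule is an invented illustration, not the extractor's documented behaviour.

```typescript
// Loosely-typed record as it might come back from the model.
interface RawRecord {
  address?: unknown;
  licenseNumber?: unknown;
  [key: string]: unknown;
}

const MAX_RECORDS_PER_PAGE = 200; // extractor handles up to 200 records per page

// Keep only records that carry at least an address or a licence number;
// anything sparser cannot be deduplicated or usefully displayed.
function filterUsableRecords(records: RawRecord[]): RawRecord[] {
  return records
    .slice(0, MAX_RECORDS_PER_PAGE)
    .filter(
      (r) =>
        (typeof r.address === "string" && r.address.trim() !== "") ||
        (typeof r.licenseNumber === "string" && r.licenseNumber.trim() !== ""),
    );
}
```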
tRPC API Reference
All procedures are under the hmoRegister router.
hmoRegister.stats
Returns aggregate counts for the KPI cards.
Input: none
Response:
{
totalEntries: number;
councilsCovered: number;
activeSources: number;
recentlyUpdated: number; // entries updated in the last 7 days
}
hmoRegister.list
Paginated search across all register entries.
Input:
{
limit?: number; // default 25
cursor?: string; // opaque pagination cursor
search?: string; // full-text match on address, landlord, licence number
postcode?: string; // filter by postcode
}
Response:
{
items: RegisterEntry[];
nextCursor: string | null;
hasMore: boolean;
}
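Consuming `hmoRegister.list` follows the usual cursor loop: pass `nextCursor` back into the next call until `hasMore` is false. A generic client-side sketch — the `fetchPage` parameter stands in for the actual tRPC client call, which is an assumption here:

```typescript
interface Page<T> {
  items: T[];
  nextCursor: string | null;
  hasMore: boolean;
}

// Drains a cursor-paginated endpoint into a single array.
async function fetchAll<T>(
  fetchPage: (cursor?: string) => Promise<Page<T>>,
): Promise<T[]> {
  const all: T[] = [];
  let cursor: string | undefined = undefined;
  while (true) {
    const page = await fetchPage(cursor);
    all.push(...page.items);
    if (!page.hasMore || page.nextCursor === null) break;
    cursor = page.nextCursor;
  }
  return all;
}
```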
hmoRegister.listSources
Returns all configured register source URLs with their scrape status.
Input: {} (empty object)
Response: Array of council_register_sources rows.
hmoRegister.addSource
Adds a new council register URL to the scrape queue.
Input:
{
councilId: string;
url: string;
dataFormatHint?: string;
}
hmoRegister.updateSourceStatus
Enables or disables an existing register source.
Input:
{
id: string;
status: "active" | "inactive" | "needs_review";
}
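Before a new source row is inserted, the input should at least carry a non-empty council id and a parseable http(s) URL. A hypothetical pre-insert check (the real router likely validates with a schema library; this plain-TypeScript version is only an illustration of the rules):

```typescript
interface AddSourceInput {
  councilId: string;
  url: string;
  dataFormatHint?: string;
}

// Returns an error message, or null if the input is acceptable.
function validateAddSource(input: AddSourceInput): string | null {
  if (input.councilId.trim() === "") return "councilId is required";
  let parsed: URL;
  try {
    parsed = new URL(input.url);
  } catch {
    return "url is not a valid URL";
  }
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    return "url must be http(s)";
  }
  return null;
}
```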
Dashboard Page
The national HMO register browser is available at /dashboard/register.
KPI Cards
| Card | Description |
|---|---|
| Total Entries | Total HMO records across all councils |
| Councils Covered | Number of distinct councils with at least one scraped entry |
| Active Sources | Number of enabled register source URLs |
| Recently Updated | Entries refreshed in the last 7 days |
Register Sources Panel
Collapsible panel listing every configured source URL with its scrape health (last scraped timestamp, error state, status badge). Use this to monitor scrape coverage and identify sources that need attention.
Search & Filter
- Search: full-text match against address, landlord name, and licence number.
- Postcode filter: narrow results to a specific postcode district or sector.
- Changing either filter resets pagination to page 1.
Data Table
Displays paginated results with columns: Address, Council, Landlord, Licence Number, Licence Type, Status (colour-coded badge), Expiry Date.
Status badge colours:
- Green — `active`
- Red — `expired`
- Amber — `pending`
Environment Variables
| Variable | Required | Description |
|---|---|---|
| ANTHROPIC_API_KEY | Yes | API key for Claude Haiku extraction. Obtain from console.anthropic.com. |
Deduplication
Each record is hashed with SHA-256 over the concatenation of councilId + licenseNumber + postcode + address. The hashKey column has a unique constraint. On re-scrape, onConflictDoUpdate refreshes the record fields and updates lastSeenAt, ensuring the table always reflects the latest data without creating duplicates.
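The hash can be computed with Node's built-in crypto module. A minimal sketch: whether the real implementation inserts a separator between the four fields is not documented here, so this version concatenates them directly as described.

```typescript
import { createHash } from "node:crypto";

// SHA-256 over councilId + licenseNumber + postcode + address,
// hex-encoded; identical inputs always produce the same 64-char key.
function makeHashKey(
  councilId: string,
  licenseNumber: string,
  postcode: string,
  address: string,
): string {
  return createHash("sha256")
    .update(councilId + licenseNumber + postcode + address)
    .digest("hex");
}
```

Because the hash is deterministic, a re-scraped licence record maps to the same row, which is what makes the `onConflictDoUpdate` upsert idempotent.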