Directory Crawl Job
The directory crawl job is an Inngest background function that automates the discovery of proptech vendors and products from supplier directories, starting with Kerfuffle.com.
Overview
When triggered with a directory URL, the job crawls all listing pages in that directory and extracts structured data for every vendor or product it finds. Results are persisted to the discovered_products table and used to feed the scoring and analysis pipeline.
How to Trigger a Crawl
- Log in to the platform (access restricted to glyn@agentos.com and dylan@agentos.com).
- On the main dashboard, paste a directory URL into the input field (e.g. https://kerfuffle.com/suppliers).
- Submit the URL; the Inngest background function is dispatched immediately.
- The job runs asynchronously; results populate the dashboard as they are persisted.
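The submission step above can be sketched as building an event payload for the background function. This is a minimal illustration, not the platform's actual code; the event name `directory/crawl.requested` and the payload field names are assumptions.

```python
def build_crawl_event(directory_url: str) -> dict:
    """Build the event payload handed to the background job dispatcher.

    Hypothetical sketch: the event name and data keys are assumptions,
    not the platform's real schema.
    """
    if not directory_url.startswith(("http://", "https://")):
        raise ValueError(f"not an absolute URL: {directory_url}")
    return {
        "name": "directory/crawl.requested",  # assumed event name
        "data": {"directory_url": directory_url},
    }
```

Validating the URL before dispatch keeps obviously malformed input out of the job queue.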
What Gets Extracted
For each listing found, the crawler extracts and stores the following fields in discovered_products:
| Field | Description |
|---|---|
| name | Vendor or product name |
| url | URL of the vendor or product |
| category_tags | Category tags assigned in the directory |
| description | Short description from the listing |
| source_listing_url | The directory listing page the entry was found on |
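The extracted record maps directly onto these five columns. A minimal sketch of that shape (field names match the table above; the dataclass itself is illustrative, not the job's actual model):

```python
from dataclasses import dataclass, field

@dataclass
class DiscoveredProduct:
    """Mirrors the columns persisted to discovered_products."""
    name: str
    url: str
    category_tags: list[str] = field(default_factory=list)
    description: str = ""
    source_listing_url: str = ""
```

Defaulting the optional fields lets sparse listings (no tags, no description) persist cleanly.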
Pagination
The crawler automatically follows paginated listing pages. A single job invocation will walk all pages in the directory, so you do not need to submit individual page URLs.
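The page-walking behaviour can be sketched as follows. The `fetch` callable stands in for the real HTTP-and-parse step and is assumed to return a dict with the page's listings and an optional next-page URL; both names are illustrative.

```python
from collections.abc import Callable, Iterator

def walk_pages(start_url: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Follow next-page links until the directory runs out of pages.

    Sketch only: `fetch` is a stand-in for the job's real HTTP client
    and is assumed to return {"listings": [...], "next_url": str | None}.
    """
    url = start_url
    seen: set[str] = set()
    while url and url not in seen:  # guard against pagination loops
        seen.add(url)
        page = fetch(url)
        yield page
        url = page.get("next_url")
```

The `seen` set prevents an infinite crawl if a directory's pagination ever links back to an earlier page.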
Rate-Limiting
Outbound requests to the target directory are rate-limited internally. This keeps crawl behaviour polite and reduces the risk of IP-level blocking from the source site. Crawl jobs may therefore take several minutes to complete for large directories.
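One common way to implement this kind of polite throttling is a minimum-interval delay between requests. The sketch below shows the idea; the actual interval used by the job is internal and not documented here.

```python
import time

class Throttle:
    """Enforce a minimum delay between outbound requests.

    Illustrative only: the real job's limiter and its interval are
    internal implementation details.
    """
    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> None:
        # Sleep just long enough that consecutive calls are spaced
        # at least min_interval seconds apart.
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Spacing requests this way is why a crawl of a large directory takes minutes rather than seconds.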
Deduplication
Before persisting a discovered product, the job checks whether an entry with the same source URL already exists in discovered_products. If it does, the record is skipped. This means you can safely re-trigger a crawl on the same directory URL without creating duplicate rows — only net-new listings will be inserted.
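The check-then-insert step can be sketched against a SQLite table with the columns described above. This is an assumption-laden illustration: the real job's database, driver, and the exact column used as the dedup key may differ.

```python
import sqlite3

def insert_if_new(conn: sqlite3.Connection, record: dict) -> bool:
    """Insert into discovered_products only if no row shares the same url.

    Returns True when a new row was written, False when the record was
    skipped as a duplicate. Sketch only: using `url` as the dedup key
    is an assumption about the job's behaviour.
    """
    cur = conn.execute(
        "SELECT 1 FROM discovered_products WHERE url = ?", (record["url"],)
    )
    if cur.fetchone():
        return False  # already crawled: skip
    conn.execute(
        "INSERT INTO discovered_products"
        " (name, url, category_tags, description, source_listing_url)"
        " VALUES (?, ?, ?, ?, ?)",
        (
            record["name"],
            record["url"],
            ",".join(record.get("category_tags", [])),
            record.get("description", ""),
            record["source_listing_url"],
        ),
    )
    return True
```

Because the check is keyed on the URL rather than the crawl run, re-running the same directory only inserts net-new listings.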
Supported Directories
| Directory | Status |
|---|---|
| Kerfuffle.com | ✅ Supported |
Support for additional proptech supplier directories is planned for future releases.
Data Flow
User pastes URL
│
▼
Inngest job dispatched
│
▼
Crawler fetches listing pages (paginated)
│
▼
Per-listing extraction (name, URL, tags, description, source URL)
│
▼
Deduplication check against discovered_products
│
├─ Already exists → skip
│
└─ New entry → insert into discovered_products
│
▼
Dashboard updated
Notes
- The crawl job runs entirely in the background via Inngest. Closing your browser after submission will not interrupt the job.
- Rate-limiting is applied automatically — do not submit the same directory URL multiple times in rapid succession.
- The discovered_products table is the source for all downstream analysis, scoring, and dossier generation.