Directory Crawl Job
The directory crawl job is an Inngest background function that automates the discovery of proptech vendors and products from supplier directories, starting with Kerfuffle.com.
Overview
When triggered with a directory URL, the job crawls all listing pages in that directory and extracts structured data for every vendor or product it finds. Results are persisted to the discovered_products table and used to feed the scoring and analysis pipeline.
How to Trigger a Crawl
- Log in to the platform (access restricted to glyn@agentos.com and dylan@agentos.com).
- On the main dashboard, paste a directory URL into the input field (e.g. https://kerfuffle.com/suppliers).
- Submit the URL; the Inngest background function is dispatched immediately.
- The job runs asynchronously; results populate the dashboard as they are persisted.
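The submission step above can be sketched as building an event payload for the background function. This is a minimal illustration, not the platform's actual code; the event name `directory/crawl.requested` and the payload field names are assumptions.

```python
def build_crawl_event(directory_url: str) -> dict:
    """Build the event payload handed to the background job dispatcher.

    Hypothetical sketch: the event name and data keys are assumptions,
    not the platform's real schema.
    """
    if not directory_url.startswith(("http://", "https://")):
        raise ValueError(f"not an absolute URL: {directory_url}")
    return {
        "name": "directory/crawl.requested",  # assumed event name
        "data": {"directory_url": directory_url},
    }
```

Validating the URL before dispatch keeps obviously malformed input out of the job queue.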
What Gets Extracted
For each listing found, the crawler extracts and stores the following fields in discovered_products:
| Field | Description |
|---|---|
| name | Vendor or product name |
| url | URL of the vendor or product |
| category_tags | Category tags assigned in the directory |
| description | Short description from the listing |
| source_listing_url | The directory listing page the entry was found on |
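The extracted record maps directly onto these five columns. A minimal sketch of that shape (field names match the table above; the dataclass itself is illustrative, not the job's actual model):

```python
from dataclasses import dataclass, field

@dataclass
class DiscoveredProduct:
    """Mirrors the columns persisted to discovered_products."""
    name: str
    url: str
    category_tags: list[str] = field(default_factory=list)
    description: str = ""
    source_listing_url: str = ""
```

Defaulting the optional fields lets sparse listings (no tags, no description) persist cleanly.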
Pagination
The crawler automatically follows paginated listing pages. A single job invocation will walk all pages in the directory, so you do not need to submit individual page URLs.
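The page-walking behaviour can be sketched as follows. The `fetch` callable stands in for the real HTTP-and-parse step and is assumed to return a dict with the page's listings and an optional next-page URL; both names are illustrative.

```python
from collections.abc import Callable, Iterator

def walk_pages(start_url: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Follow next-page links until the directory runs out of pages.

    Sketch only: `fetch` is a stand-in for the job's real HTTP client
    and is assumed to return {"listings": [...], "next_url": str | None}.
    """
    url = start_url
    seen: set[str] = set()
    while url and url not in seen:  # guard against pagination loops
        seen.add(url)
        page = fetch(url)
        yield page
        url = page.get("next_url")
```

The `seen` set prevents an infinite crawl if a directory's pagination ever links back to an earlier page.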
Rate-Limiting
Outbound requests to the target directory are rate-limited internally. This keeps crawl behaviour polite and reduces the risk of IP-level blocking from the source site. Crawl jobs may therefore take several minutes to complete for large directories.
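One common way to implement this kind of polite throttling is a minimum-interval delay between requests. The sketch below shows the idea; the actual interval used by the job is internal and not documented here.

```python
import time

class Throttle:
    """Enforce a minimum delay between outbound requests.

    Illustrative only: the real job's limiter and its interval are
    internal implementation details.
    """
    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> None:
        # Sleep just long enough that consecutive calls are spaced
        # at least min_interval seconds apart.
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Spacing requests this way is why a crawl of a large directory takes minutes rather than seconds.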
Deduplication
Before persisting a discovered product, the job checks whether an entry with the same source URL already exists in discovered_products. If it does, the record is skipped. This means you can safely re-trigger a crawl on the same directory URL without creating duplicate rows — only net-new listings will be inserted.
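The check-then-insert step can be sketched against a SQLite table with the columns described above. This is an assumption-laden illustration: the real job's database, driver, and the exact column used as the dedup key may differ.

```python
import sqlite3

def insert_if_new(conn: sqlite3.Connection, record: dict) -> bool:
    """Insert into discovered_products only if no row shares the same url.

    Returns True when a new row was written, False when the record was
    skipped as a duplicate. Sketch only: using `url` as the dedup key
    is an assumption about the job's behaviour.
    """
    cur = conn.execute(
        "SELECT 1 FROM discovered_products WHERE url = ?", (record["url"],)
    )
    if cur.fetchone():
        return False  # already crawled: skip
    conn.execute(
        "INSERT INTO discovered_products"
        " (name, url, category_tags, description, source_listing_url)"
        " VALUES (?, ?, ?, ?, ?)",
        (
            record["name"],
            record["url"],
            ",".join(record.get("category_tags", [])),
            record.get("description", ""),
            record["source_listing_url"],
        ),
    )
    return True
```

Because the check is keyed on the URL rather than the crawl run, re-running the same directory only inserts net-new listings.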
Supported Directories
| Directory | Status |
|---|---|
| Kerfuffle.com | ✅ Supported |
Support for additional proptech supplier directories is planned for future releases.
Data Flow
User pastes URL
│
▼
Inngest job dispatched
│
▼
Crawler fetches listing pages (paginated)
│
▼
Per-listing extraction (name, URL, tags, description, source URL)
│
▼
Deduplication check against discovered_products
│
├─ Already exists → skip
│
└─ New entry → insert into discovered_products
│
▼
Dashboard updated
Notes
- The crawl job runs entirely in the background via Inngest. Closing your browser after submission will not interrupt the job.
- Rate-limiting is applied automatically — do not submit the same directory URL multiple times in rapid succession.
- The discovered_products table is the source for all downstream analysis, scoring, and dossier generation.