Scraper Plugin Architecture
As of v0.1.46, the scraper engine supports a plugin registry that allows site-specific extraction rules to be defined for known high-value domains. This replaces the previous one-size-fits-all approach for supported sites while preserving the generic fallback for everything else.
Overview
The plugin registry maps a domain (or URL pattern) to a plugin module. When the scraper encounters a URL, it checks the registry for a matching plugin. If one is found, that plugin's extraction rules are used; otherwise the generic heuristic pipeline runs instead.
Incoming URL
│
▼
┌─────────────────────┐
│ Plugin Registry │ ◄── src/scrapers/plugins/
│ (domain lookup) │
└─────────┬───────────┘
│
┌─────┴──────┐
│ │
Match No match
│ │
▼ ▼
Plugin Generic
Extractor Heuristics
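The lookup step in the diagram above can be sketched in TypeScript. The `ScraperPlugin` shape is trimmed to the two fields needed for matching, and `resolvePlugin`, `registry`, and the domains shown are illustrative names, not the actual implementation:

```typescript
// Hypothetical sketch of the registry lookup; all names and domains are illustrative.
interface ScraperPlugin {
  domain: string | RegExp; // matched against the URL's hostname
  name: string;
}

const registry: ScraperPlugin[] = [
  { domain: 'kerfuffle.co.uk', name: 'Kerfuffle' },
  { domain: /capterra\.com/, name: 'Capterra PropTech Pages' },
];

/** Return the first plugin whose domain matches the URL's host, or undefined. */
function resolvePlugin(url: string): ScraperPlugin | undefined {
  const host = new URL(url).hostname;
  return registry.find((p) =>
    typeof p.domain === 'string'
      ? host === p.domain || host.endsWith('.' + p.domain)
      : p.domain.test(host),
  );
}
```

A match routes the URL to that plugin's extractor; `undefined` routes it to the generic heuristics.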
Plugin Location
All plugins live in:
src/scrapers/plugins/
Each plugin is its own file within this directory and must conform to the standard plugin interface (see below).
Standard Plugin Interface
Every plugin exports an object (or class instance) that satisfies the following shape:
interface ScraperPlugin {
/** Domain or URL pattern this plugin handles */
domain: string | RegExp;
/** Human-readable name for logging and the UI */
name: string;
/**
* Extraction rules. Use one or more of:
* - selectors: CSS selector map
* - xpath: XPath expression map
* - api: API endpoint definitions
*/
rules: {
selectors?: Record<string, string>;
xpath?: Record<string, string>;
api?: ApiEndpointDefinition[];
};
/**
* Optional transform applied to raw extracted data
* before it enters the scoring pipeline.
*/
transform?: (raw: Record<string, unknown>) => ScrapedProduct;
}
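As one illustration of the optional `transform` hook, the sketch below normalizes raw selector output into a typed product record. The real `ScrapedProduct` fields live in the codebase, so the minimal shape here is an assumption:

```typescript
// Assumed minimal shapes; the real types are in src/scrapers/plugins/index.ts.
interface ScrapedProduct {
  name: string;
  rating: number;
  reviewCount: number;
}

// Illustrative transform: raw extracted strings -> typed ScrapedProduct.
function transform(raw: Record<string, unknown>): ScrapedProduct {
  return {
    name: String(raw.productName ?? '').trim(),
    // "4.5 out of 5" -> 4.5; defaults to 0 if no number is present.
    rating: parseFloat(String(raw.rating).match(/[\d.]+/)?.[0] ?? '0'),
    // "1,234 reviews" -> 1234.
    reviewCount: parseInt(String(raw.reviewCount).replace(/[^\d]/g, '') || '0', 10),
  };
}
```

Whatever `transform` returns is what the scoring pipeline sees, so this is the place to strip units, parse numbers, and fill defaults.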
The exact TypeScript types are defined in src/scrapers/plugins/index.ts (the registry entry point).
Bundled Plugins
Kerfuffle
Targets the Kerfuffle proptech supplier directory — the primary data source for this tool.
- Extracts supplier listing cards, product names, category tags, and review summaries using CSS selectors matched to Kerfuffle's listing page structure.
- Feeds directly into the product discovery and scoring pipeline.
Capterra PropTech Pages
Targets Capterra's PropTech category pages.
- Extracts product names, aggregate ratings, review counts, feature lists, and pricing signals.
- Uses a combination of CSS selectors and XPath to handle Capterra's mixed rendering patterns.
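A `rules` object mixing both mechanisms might look like the following sketch. The selector and XPath strings are invented for illustration and do not reflect Capterra's real markup:

```typescript
// Illustrative only: these selector and XPath strings are made up.
const capterraRules = {
  selectors: {
    productName: 'h1[data-testid="product-name"]',
    rating: 'span.rating-value',
  },
  xpath: {
    // XPath suits text-node matching that CSS selectors cannot express.
    reviewCount: '//a[contains(text(), "reviews")]',
    pricingTier: '//div[@class="pricing"]//span[1]',
  },
};
```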
Fallback Behaviour
If no plugin is registered for a given domain, the scraper automatically falls back to the generic heuristic extractor. This means:
- Pasting any directory URL will always produce a result.
- Data quality will be highest for domains with a registered plugin.
- No configuration or manual flagging is needed for unknown sites.
Adding a New Plugin
- Create a new file in src/scrapers/plugins/, e.g. g2.ts.
- Implement the ScraperPlugin interface.
- Register the plugin in the registry (e.g. src/scrapers/plugins/index.ts).
- Deploy; the registry is loaded at startup.
Minimal example:
// src/scrapers/plugins/g2.ts
import type { ScraperPlugin } from './index';
const g2Plugin: ScraperPlugin = {
domain: 'g2.com',
name: 'G2',
rules: {
selectors: {
productName: 'h1.product-name',
rating: '[data-test-id="rating-value"]',
reviewCount: '[data-test-id="review-count"]',
pricingTier: '.pricing-tier-label',
},
},
};
export default g2Plugin;
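The registration step is not shown in this document. One plausible shape for the registry entry point, with every name assumed (in the real index.ts the plugin would be pulled in via `import g2Plugin from './g2'` rather than defined inline):

```typescript
// Hypothetical sketch of src/scrapers/plugins/index.ts; the real file may differ.
interface ScraperPlugin {
  domain: string | RegExp;
  name: string;
}

// Stands in for `import g2Plugin from './g2'` so this sketch is self-contained.
const g2Plugin: ScraperPlugin = { domain: 'g2.com', name: 'G2' };

export const registry: ScraperPlugin[] = [
  // ...existing bundled plugins (Kerfuffle, Capterra)...
  g2Plugin, // newly added plugin
];
```

Because the registry is loaded at startup, a redeploy is all that is needed for the new plugin to take effect.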
Relationship to the Scoring Pipeline
Extracted data — regardless of whether it came from a plugin or the generic heuristic — flows into the same downstream scoring pipeline:
- Replicability score — complexity signals from the feature list.
- Market demand score — review volume and sentiment.
- Revenue potential score — pricing tier and market size signals.
- Competitive gaps — themes surfaced from negative reviews.
Higher-fidelity extraction (via plugins) directly improves the accuracy of these scores for the domains that matter most.
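To make the hand-off concrete, here is a deliberately simplified sketch of how extracted fields could feed one of the scores. The actual formulas are not documented here, so every weight and threshold below is an assumption:

```typescript
// Entirely illustrative: the real pipeline's weights and formulas are not
// specified in this document, so everything below is an assumption.
interface ScrapedProduct {
  rating: number;      // e.g. 0-5 aggregate rating
  reviewCount: number; // total review volume
}

/** Hypothetical market-demand score in [0, 100]. */
function marketDemandScore(p: ScrapedProduct): number {
  // Log-scale review volume so 10 vs 100 reviews matters more than 1000 vs 1090.
  const volume = Math.min(Math.log10(1 + p.reviewCount) / 4, 1); // caps near 10k reviews
  const sentiment = p.rating / 5;
  return Math.round(100 * (0.6 * volume + 0.4 * sentiment));
}
```

The point of the sketch is the dependency, not the arithmetic: a plugin that parses `reviewCount` and `rating` cleanly produces a meaningfully better score than the generic heuristics guessing at the same fields.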