v0.1.6: Responsible Crawling with Rate Limits & Robots.txt Compliance
Version 0.1.6 adds a compliance and reliability layer to the crawler. These changes ensure the platform respects the crawling policies of every domain it visits, reduces the chance of IP blocks, and gives operators better visibility into why certain domains are skipped.
Why This Matters
Crawling proptech supplier directories at scale means touching dozens of domains in a single run. Without rate limiting and robots.txt compliance, the crawler risks being blocked, triggering abuse flags, or violating the terms of service of the sites it reads. This release addresses all three concerns.
What Changed
1. Per-Domain Rate Limiting
The crawler now enforces a request rate limit independently for each domain it visits.
- Default: 1 request per second per domain.
- Configurable: The rate can be adjusted per crawl run if a target domain is known to permit higher throughput — or requires a more conservative approach.
This prevents any single domain from being hammered and keeps the crawler's footprint light.
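The per-domain limiter can be sketched as a map from domain to the time of its last request; the class and method names below are illustrative, not the actual implementation:

```python
import time
from collections import defaultdict

class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, requests_per_second: float = 1.0):
        self.min_interval = 1.0 / requests_per_second
        # Maps domain -> monotonic timestamp of the last request (0.0 if none yet).
        self.last_request = defaultdict(float)

    def wait(self, domain: str) -> None:
        """Block until a request to `domain` is allowed, then record it."""
        elapsed = time.monotonic() - self.last_request[domain]
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request[domain] = time.monotonic()
```

Because each domain has its own timestamp, waiting on one domain never delays requests to another, which is what keeps a multi-domain run moving at full aggregate speed.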
2. Robots.txt Compliance
Before issuing any requests to a new domain, the crawler fetches and parses robots.txt.
- Disallowed paths are automatically excluded from the crawl queue.
- The user-agent declared in requests is the same one used to evaluate robots.txt rules, so site-specific bot policies are applied correctly.
- If a domain blocks the crawler entirely, that domain is skipped and logged; no requests are made.
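The standard library's `urllib.robotparser` covers this pattern. A minimal sketch follows; the bot name is hypothetical, and the policy is parsed inline here rather than fetched over the network as the crawler would do once per domain:

```python
from urllib import robotparser

# Hypothetical bot name; must match the User-Agent sent in requests.
USER_AGENT = "ExampleCrawler/0.1"

def build_parser(robots_lines):
    """Parse robots.txt content (as a list of lines) into a rule evaluator."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    rp.modified()  # mark the rules as loaded so can_fetch() evaluates them
    return rp

# Example policy: this bot may crawl everything except /private/.
rp = build_parser([
    "User-agent: ExampleCrawler",
    "Disallow: /private/",
])

def is_allowed(url: str) -> bool:
    # Evaluate the rules with the same user-agent the crawler declares.
    return rp.can_fetch(USER_AGENT, url)
```

A policy of `Disallow: /` under the crawler's user-agent makes `can_fetch` return `False` for every URL, which is the "blocked entirely" case: the domain is logged and no requests are issued.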
3. Named User-Agent Header
Every request now carries a User-Agent header that identifies the crawler by name. This is required for robots.txt rules to apply correctly and is standard practice for well-behaved bots.
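Attaching the header is a one-liner with the standard library; the user-agent string below is a placeholder, since the release notes do not specify the actual value:

```python
import urllib.request

# Hypothetical identifier; real bots often include a contact or info URL.
USER_AGENT = "ExampleCrawler/0.1 (+https://example.com/bot-info)"

def build_request(url: str) -> urllib.request.Request:
    """Attach the crawler's User-Agent so robots.txt policies can target it."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

The same string must be passed to the robots.txt evaluator, otherwise a site's rules for this bot would be matched against a different name than the one its server logs see.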
4. Exponential Back-off on 429 / 503
When a server signals that the crawler is moving too fast (429 Too Many Requests) or is temporarily unavailable (503 Service Unavailable), the crawler now backs off and retries using an exponential delay.
This means:
- Transient throttling no longer causes a crawl run to fail outright.
- The crawler self-regulates under load without manual intervention.
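The retry loop can be sketched as follows, assuming a `fetch` callable that returns a status code and body (the function names and delay schedule are illustrative):

```python
import time

RETRYABLE = {429, 503}  # Too Many Requests / Service Unavailable

def fetch_with_backoff(fetch, url, max_retries=5, base_delay=1.0):
    """Call `fetch(url)`, retrying on 429/503 with exponentially growing delays.

    Delays double on each attempt: base_delay, 2x, 4x, ... Returns the final
    (status, body) pair, whether or not the last attempt succeeded.
    """
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status not in RETRYABLE:
            return status, body
        if attempt < max_retries:
            time.sleep(base_delay * (2 ** attempt))
    return status, body
```

A production version would typically also honour a `Retry-After` header when the server sends one, and add jitter so concurrent workers do not retry in lockstep.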
5. Blocked Domain Logging
Any domain blocked by robots.txt is recorded in the crawl run record. You can inspect the run after completion to see exactly which domains were skipped and why — useful for auditing and for deciding whether to revisit a domain's policy manually.
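Conceptually, the run record accumulates one entry per skipped domain. A sketch of that shape, with purely illustrative field names:

```python
from dataclasses import dataclass, field

@dataclass
class CrawlRunRecord:
    """Hypothetical per-run record; real field names may differ."""
    run_id: str
    blocked_domains: list = field(default_factory=list)

    def log_blocked(self, domain: str, reason: str = "robots.txt") -> None:
        """Record a domain that was skipped without issuing any requests."""
        self.blocked_domains.append({"domain": domain, "reason": reason})
```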
Behaviour Summary
| Scenario | Behaviour |
|---|---|
| Domain allows crawling | Proceeds at configured rate (default 1 req/s) |
| Domain disallows specific paths | Those paths are skipped silently |
| Domain blocks crawler entirely | Domain logged as blocked; no requests made |
| Server returns 429 or 503 | Exponential back-off applied; request retried |
No Changes To
- Dashboard ranking and sorting
- Dossier generation and scoring pipeline
- Authentication and access control
- Existing crawl run data
Getting Started
The compliance layer is active automatically for all new crawl runs. No configuration is required to use the defaults. To adjust the rate limit for a specific run, set the rate limit parameter when initiating the crawl (see the API reference for crawl run parameters).