v0.1.6: Responsible Crawling with Rate Limits & Robots.txt Compliance
Version 0.1.6 adds a compliance and reliability layer to the crawler. These changes ensure the platform respects the crawling policies of every domain it visits, reduces the chance of IP blocks, and gives operators better visibility into why certain domains are skipped.
Why This Matters
Crawling proptech supplier directories at scale means touching dozens of domains in a single run. Without rate limiting and robots.txt compliance, the crawler risks being blocked, triggering abuse flags, or violating the terms of service of the sites it reads. This release addresses all three concerns.
What Changed
1. Per-Domain Rate Limiting
The crawler now enforces a request rate limit independently for each domain it visits.
- Default: 1 request per second per domain.
- Configurable: The rate can be adjusted per crawl run if a target domain is known to permit higher throughput — or requires a more conservative approach.
This prevents any single domain from being hammered and keeps the crawler's footprint light.
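The per-domain limiter can be sketched as a map from domain to the time of its last request; the class and method names below are illustrative, not the actual implementation:

```python
import time
from collections import defaultdict

class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, requests_per_second: float = 1.0):
        self.min_interval = 1.0 / requests_per_second
        # Maps domain -> monotonic timestamp of the last request (0.0 if none yet).
        self.last_request = defaultdict(float)

    def wait(self, domain: str) -> None:
        """Block until a request to `domain` is allowed, then record it."""
        elapsed = time.monotonic() - self.last_request[domain]
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request[domain] = time.monotonic()
```

Because each domain has its own timestamp, waiting on one domain never delays requests to another, which is what keeps a multi-domain run moving at full aggregate speed.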
2. Robots.txt Compliance
Before issuing any requests to a new domain, the crawler fetches and parses robots.txt.
- Disallowed paths are automatically excluded from the crawl queue.
- The user-agent declared in requests is the same one used to evaluate robots.txt rules, so site-specific bot policies are applied correctly.
- If a domain blocks the crawler entirely, that domain is skipped and logged; no requests are made.
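The standard library's `urllib.robotparser` covers this pattern. A minimal sketch follows; the bot name is hypothetical, and the policy is parsed inline here rather than fetched over the network as the crawler would do once per domain:

```python
from urllib import robotparser

# Hypothetical bot name; must match the User-Agent sent in requests.
USER_AGENT = "ExampleCrawler/0.1"

def build_parser(robots_lines):
    """Parse robots.txt content (as a list of lines) into a rule evaluator."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    rp.modified()  # mark the rules as loaded so can_fetch() evaluates them
    return rp

# Example policy: this bot may crawl everything except /private/.
rp = build_parser([
    "User-agent: ExampleCrawler",
    "Disallow: /private/",
])

def is_allowed(url: str) -> bool:
    # Evaluate the rules with the same user-agent the crawler declares.
    return rp.can_fetch(USER_AGENT, url)
```

A policy of `Disallow: /` under the crawler's user-agent makes `can_fetch` return `False` for every URL, which is the "blocked entirely" case: the domain is logged and no requests are issued.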
3. Named User-Agent Header
Every request now carries a User-Agent header that identifies the crawler by name. This is required for robots.txt rules to apply correctly and is standard practice for well-behaved bots.
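Attaching the header is a one-liner with the standard library; the user-agent string below is a placeholder, since the release notes do not specify the actual value:

```python
import urllib.request

# Hypothetical identifier; real bots often include a contact or info URL.
USER_AGENT = "ExampleCrawler/0.1 (+https://example.com/bot-info)"

def build_request(url: str) -> urllib.request.Request:
    """Attach the crawler's User-Agent so robots.txt policies can target it."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

The same string must be passed to the robots.txt evaluator, otherwise a site's rules for this bot would be matched against a different name than the one its server logs see.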
4. Exponential Back-off on 429 / 503
When a server signals that the crawler is moving too fast (429 Too Many Requests) or is temporarily unavailable (503 Service Unavailable), the crawler now backs off and retries using an exponential delay.
This means:
- Transient throttling no longer causes a crawl run to fail outright.
- The crawler self-regulates under load without manual intervention.
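The retry loop can be sketched as follows, assuming a `fetch` callable that returns a status code and body (the function names and delay schedule are illustrative):

```python
import time

RETRYABLE = {429, 503}  # Too Many Requests / Service Unavailable

def fetch_with_backoff(fetch, url, max_retries=5, base_delay=1.0):
    """Call `fetch(url)`, retrying on 429/503 with exponentially growing delays.

    Delays double on each attempt: base_delay, 2x, 4x, ... Returns the final
    (status, body) pair, whether or not the last attempt succeeded.
    """
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status not in RETRYABLE:
            return status, body
        if attempt < max_retries:
            time.sleep(base_delay * (2 ** attempt))
    return status, body
```

A production version would typically also honour a `Retry-After` header when the server sends one, and add jitter so concurrent workers do not retry in lockstep.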
5. Blocked Domain Logging
Any domain blocked by robots.txt is recorded in the crawl run record. You can inspect the run after completion to see exactly which domains were skipped and why — useful for auditing and for deciding whether to revisit a domain's policy manually.
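Conceptually, the run record accumulates one entry per skipped domain. A sketch of that shape, with purely illustrative field names:

```python
from dataclasses import dataclass, field

@dataclass
class CrawlRunRecord:
    """Hypothetical per-run record; real field names may differ."""
    run_id: str
    blocked_domains: list = field(default_factory=list)

    def log_blocked(self, domain: str, reason: str = "robots.txt") -> None:
        """Record a domain that was skipped without issuing any requests."""
        self.blocked_domains.append({"domain": domain, "reason": reason})
```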
Behaviour Summary
| Scenario | Behaviour |
|---|---|
| Domain allows crawling | Proceeds at configured rate (default 1 req/s) |
| Domain disallows specific paths | Those paths are skipped silently |
| Domain blocks crawler entirely | Domain logged as blocked; no requests made |
| Server returns 429 or 503 | Exponential back-off applied; request retried |
No Changes To
- Dashboard ranking and sorting
- Dossier generation and scoring pipeline
- Authentication and access control
- Existing crawl run data
Getting Started
The compliance layer is active automatically for all new crawl runs. No configuration is required to use the defaults. To adjust the rate limit for a specific run, set the rate limit parameter when initiating the crawl (see the API reference for crawl run parameters).