Problem Description
A self-hosted Forgejo instance (500+ repositories, of which 484 are GitHub mirrors) was under sustained high load: CPU usage at 1220% and a load average above 19.
Troubleshooting Process
1. Identifying the Offending Process
Container monitoring revealed that the Forgejo container was consuming 1220% CPU and 936 MB of memory.
2. Analyzing Request Logs
Container logs showed ~272 requests per minute, of which 261 (~96%) targeted “git-heavy” endpoints (e.g., /commit, /blame, /src). Requests originated from numerous distinct IPs, all crawling commit/blame/src pages of mirrored repositories; each request took 100–1100 ms.
3. Confirming a Distributed Crawler
Each IP appeared only once or twice in the logs, a hallmark of distributed crawlers or botnets.
Key characteristics:
- Each IP issued only 1–2 requests, evading per-IP rate limiting.
- Used legitimate User-Agents (e.g., Chrome 100+), bypassing UA-based detection.
- Targeted exclusively git-heavy paths (/commit, /blame, /src, etc.).
- Focused solely on mirrored repositories (e.g., Discourse: 501, WooCommerce: 220, Mautic: 121…); zero requests to original (non-mirrored) repositories.
4. Why Existing Protections Failed
Three layers of protection had already been deployed: robots.txt disallow directives, per-IP rate limiting (rate=3r/s), and UA-based bot detection (via Nginx map).
All failed because:
- robots.txt: Crawlers simply ignore it.
- Per-IP rate limiting: With only 1–2 requests per IP, thresholds were never triggered.
- UA detection: Legitimate browser UAs were used.
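For reference, the per-IP and UA layers described above might look roughly like the following. This is a sketch, not the actual configuration: only rate=3r/s and the use of an Nginx map come from the setup described here, while the zone name per_ip, the burst value, the $is_bot variable, the token list, the server name, and the upstream address are all assumptions made for illustration.

```nginx
# http context

# Per-IP rate limit. The key is the client address, so every new IP
# starts with a fresh allowance.
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=3r/s;

# UA-based detection: flag requests whose User-Agent contains a bot-like token.
# (Token list is illustrative.)
map $http_user_agent $is_bot {
    default                    0;
    ~*(bot|crawler|spider)     1;
}

server {
    listen 80;
    server_name git.example.com;              # hypothetical host name

    location ^~ / {
        if ($is_bot) {
            return 403;
        }
        limit_req zone=per_ip burst=10 nodelay;   # burst value is an assumption
        proxy_pass http://127.0.0.1:3000;         # assumed Forgejo upstream
    }
}
```

Neither rule can trigger against this traffic: a client that sends one or two requests never approaches 3 r/s for its own key, and a Chrome-like User-Agent contains none of the tokens that set $is_bot.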
Solution
Core Strategy: Differentiate Treatment by Repository Type
Git-heavy pages (e.g., /commit, /blame) for mirrored repositories provide no real value to users (historical browsing should be done on the upstream GitHub repository); thus, they should return HTTP 403 immediately. Original (non-mirrored) repositories remain fully accessible.
Pitfall Encountered: Nginx location Matching Priority
The initial implementation used a dedicated regex location block (e.g., `location ~ ^/(.+)/commit`), but it never matched. Reason: Nginx first selects the longest matching prefix location, and if that location carries the `^~` modifier, regex locations are not checked at all. Because the catch-all `location ^~ /` was the longest matching prefix for these paths, every request was captured by it and no regex location ever ran.
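The pitfall can be reproduced with a minimal server block like the one below. It is a sketch under assumptions: the upstream address is hypothetical, and the regex is the example pattern quoted above rather than the full production rule.

```nginx
server {
    listen 80;

    # Catch-all prefix location. The ^~ modifier tells Nginx to skip regex
    # locations whenever this is the longest matching prefix.
    location ^~ / {
        proxy_pass http://127.0.0.1:3000;     # assumed Forgejo upstream
    }

    # Never reached: for /owner/repo/commit/<sha> the longest matching
    # prefix is "/", and its ^~ modifier stops the regex check.
    location ~ ^/(.+)/commit {
        return 403;
    }
}
```

Dropping the ^~ modifier from the catch-all would let the regex location run, but it would also change how every other regex location in the server interacts with the catch-all, which is why the rule was moved inside the existing block instead.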
Correct Implementation: Use if Inside the ^~ / Block
Within the catch-all ^~ / location block, use if statements to:
- Match request URIs against patterns for git-heavy paths (/blame, /commit, /src/commit, /commits/commit, /raw/commit, etc.),
- Exclude original (non-mirrored) repository owners via negative lookahead in the regex,
- Immediately return 403 upon match.
Key points:
- Negative lookahead must exclude usernames of original repositories.
- The logic must reside inside the `^~ /` block; standalone regex locations will never fire.
- URI matching must cover all variants: /blame, /commit, /src/<commit>, /commits/<commit>, /raw/<commit>, etc.
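Put together, the rule might be sketched as follows. This is an illustration, not the exact production regex: the owner names alice and myorg are hypothetical placeholders for the accounts that hold original repositories, the path list covers only the variants named above, and the upstream address is assumed.

```nginx
location ^~ / {
    # Block git-heavy pages unless the first path segment (the repository
    # owner) is one of the accounts holding original repositories.
    #   (?!(?:alice|myorg)/)   negative lookahead: skip the hypothetical original owners
    #   [^/]+/[^/]+            owner/repo
    #   (?:blame|commit|...)   the git-heavy path variants
    if ($uri ~ "^/(?!(?:alice|myorg)/)[^/]+/[^/]+/(?:blame|commit|(?:src|commits|raw)/commit)/") {
        return 403;
    }

    proxy_pass http://127.0.0.1:3000;         # assumed Forgejo upstream
}
```

Using return inside if is one of the few if patterns that Nginx documents as safe within a location. Matching on $uri rather than $request_uri means the pattern sees the normalized path without the query string. A request such as /mirrors/discourse/commit/<sha> (with mirrors as a hypothetical mirror owner) would receive a 403, while /alice/project/commit/<sha> would still be proxied to Forgejo.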
Results
| Metric | Before | After |
|---|---|---|
| Load Average | >19 | ~12 |
| Forgejo CPU Usage | 1220% | ~850% |
| Git-heavy Requests/min (HTTP 200) | 261 | ~13 |
| 403 Interception Rate | 0% | ~95% |
A small number of remaining unmatched requests stem from edge-case path variants (e.g., .diff suffixes); their impact is now negligible.
Lessons Learned
- Core signature of distributed crawlers: Each IP makes only 1–2 requests and uses legitimate UAs — making per-IP rate limiting and UA-based detection completely ineffective.
- Nginx location matching priority: When the longest matching prefix location carries the `^~` modifier, Nginx skips regex (`~`) locations entirely. If a `^~ /` catch-all exists and no longer prefix location matches, standalone regex locations will never match.
- Mirrored repositories don't need git history browsing: Returning HTTP 403 directly is the most effective mitigation; users needing historical context should visit the upstream repository.
- robots.txt only deters well-behaved crawlers: It offers zero protection against malicious bots, so hard blocking at the Nginx layer is essential.
- Global rate limiting also has limits against distributed crawlers: Even with burst allowances, high-volume distributed traffic quickly exhausts the burst pool, while new requests continue arriving relentlessly.