Forgejo Mirror Repository Hit by Distributed Web Crawlers: Full Investigation from Discovery to NGINX Blocking

Problem Description

A self-hosted Forgejo instance (500+ repositories, 484 of them GitHub mirrors) was under sustained high load: CPU usage at 1220% (roughly 12 cores fully busy) and a load average above 19.

Troubleshooting Process

1. Identifying the Offending Process

Container monitoring revealed that the Forgejo container was consuming 1220% CPU and 936 MB of memory.

2. Analyzing Request Logs

Container logs showed ~272 requests per minute, of which 261 (~96%) targeted “git-heavy” endpoints (e.g., /commit, /blame, /src). Requests originated from numerous distinct IPs, all crawling commit/blame/src pages of mirrored repositories; each request took 100–1100 ms.

3. Confirming a Distributed Crawler

Each IP appeared only once or twice in the logs, a hallmark of distributed crawlers or botnets.

Key characteristics:

  • Each IP issued only 1–2 requests, evading per-IP rate limiting.
  • Sent genuine-looking browser User-Agents (e.g., Chrome 100+), bypassing UA-based detection.
  • Targeted exclusively git-heavy paths (/commit, /blame, /src, etc.).
  • Focused solely on mirrored repositories (e.g., Discourse: 501, WooCommerce: 220, Mautic: 121…); zero requests to original (non-mirrored) repositories.

4. Why Existing Protections Failed

Three layers of protection had already been deployed: robots.txt disallow directives, per-IP rate limiting (rate=3r/s), and UA-based bot detection (via Nginx map).

All failed because:

  • robots.txt: Crawlers simply ignore it.
  • Per-IP rate limiting: With only 1–2 requests per IP, thresholds were never triggered.
  • UA detection: Legitimate browser UAs were used.
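
For reference, the per-IP rate limit and UA map described above are commonly wired up along the following lines. This is a sketch under stated assumptions, not the instance's actual configuration: the zone name per_ip, the UA patterns, the burst value, the host name, and the upstream address are all illustrative. Only rate=3r/s comes from the setup described here.

```nginx
# Sketch only. These directives belong in the http {} context.

# Per-IP rate limiting at 3 requests/second (rate taken from the article;
# the zone name and size are assumptions).
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=3r/s;

# UA-based bot detection: flag clients that identify themselves as bots.
# The patterns are illustrative, not the article's actual list.
map $http_user_agent $is_bot {
    default                    0;
    "~*(bot|crawler|spider)"   1;
}

server {
    listen 80;
    server_name git.example.com;               # hypothetical host name

    location ^~ / {
        # Rejects only clients whose User-Agent matched the map above.
        if ($is_bot) {
            return 403;
        }

        # Triggers only when a single IP exceeds 3 r/s plus the burst.
        limit_req zone=per_ip burst=10 nodelay;   # burst is an assumption

        proxy_pass http://127.0.0.1:3000;         # assumed Forgejo address
    }
}
```

Both checks key on a property of the individual client (its IP or its User-Agent string), which is why a crawler that rotates IPs and sends a browser UA passes through them untouched.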

Solution

Core Strategy: Differentiate Treatment by Repository Type

For mirrored repositories, git-heavy pages (e.g., /commit, /blame) provide no real value to users, since history can be browsed on the upstream GitHub repository. Requests for these pages should therefore receive an immediate HTTP 403. Original (non-mirrored) repositories remain fully accessible.

Pitfall Encountered: Nginx location Matching Priority

The initial implementation used a dedicated regex location block (e.g., location ~ ^/(.+)/commit), but it never matched. Reason: Nginx first selects the longest matching prefix location, and if that location carries the ^~ modifier, regex locations are not evaluated at all. Because the catch-all ^~ / was the longest prefix match for these URIs, every request was captured by it and no regex location ever ran.
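
The layout that fails can be sketched as follows. The upstream address is an assumption; the two location lines mirror the ones quoted above.

```nginx
server {
    # Catch-all prefix location. The ^~ modifier tells Nginx to skip
    # regex locations whenever this is the longest matching prefix.
    location ^~ / {
        proxy_pass http://127.0.0.1:3000;   # assumed Forgejo address
    }

    # Never selected here: "^~ /" wins the prefix phase for every URI
    # that has no longer prefix location, so regex checks are skipped.
    location ~ ^/(.+)/commit {
        return 403;
    }
}
```

One alternative, not the route taken here, would be to declare the catch-all as a plain location / without the ^~ modifier; Nginx then still evaluates regex locations after the prefix phase. Whether that is safe depends on what else in the configuration relies on ^~.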

Correct Implementation: Use if Inside the ^~ / Block

Within the catch-all ^~ / location block, use if statements to:

  • Match request URIs against patterns for git-heavy paths (/blame, /commit, /src/commit, /commits/commit, /raw/commit, etc.),
  • Exclude original (non-mirrored) repository owners via negative lookahead in the regex,
  • Immediately return 403 upon match.

Key points:

  • Negative lookahead must exclude usernames of original repositories.
  • The logic must reside inside the ^~ / block; standalone regex locations are skipped as long as ^~ / is the winning prefix match.
  • URI matching must cover all variants: /blame, /commit, /src/<commit>, /commits/<commit>, /raw/<commit>, etc.
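
A minimal sketch of this approach is shown below. It is not the exact rule deployed on the instance: alice and bob are hypothetical owners of original repositories, the /<owner>/<repo>/<action>/ path layout is assumed, and the upstream address is a placeholder.

```nginx
location ^~ / {
    # Block git-heavy pages unless the owner is on the exempt list.
    # (?!(?:alice|bob)/) is a negative lookahead: the match fails when
    # the first path segment is one of the hypothetical original owners.
    # The pattern is quoted so Nginx parses it as a single argument.
    if ($uri ~ "^/(?!(?:alice|bob)/)[^/]+/[^/]+/(?:blame|commit|commits/commit|src/commit|raw/commit)/") {
        return 403;
    }

    proxy_pass http://127.0.0.1:3000;   # assumed Forgejo address
}
```

Using return inside an if block is one of the few uses of if in a location that is generally regarded as safe. After nginx -t and a reload, a request for a mirrored repository's /commit/ page should come back as 403, while the same path under alice or bob should still be proxied to Forgejo. The match is case-sensitive, so an exempt owner name typed with different capitalization would be blocked rather than let through.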

Results

Metric                              Before   After
Load Average                        >19      ~12
Forgejo CPU Usage                   1220%    ~850%
Git-heavy Requests/min (HTTP 200)   261      ~13
403 Interception Rate               0%       ~95%

A small number of remaining unmatched requests stem from edge-case path variants (e.g., .diff suffixes); their impact is now negligible.

Lessons Learned

  1. Core signature of distributed crawlers: Each IP makes only 1–2 requests and uses legitimate UAs — making per-IP rate limiting and UA-based detection completely ineffective.
  2. Nginx location matching priority: when the longest matching prefix location carries the ^~ modifier, regex (~) locations are not evaluated. If a ^~ / catch-all is the winning prefix match, standalone regex locations will never match.
  3. Mirrored repositories don’t need git history browsing: Returning HTTP 403 directly is the most effective mitigation — users needing historical context should visit the upstream repository.
  4. robots.txt only deters well-behaved crawlers: It offers zero protection against malicious bots — hard blocking at the Nginx layer is essential.
  5. Global rate limiting is also a weak defense against distributed crawlers: even with a burst allowance, sustained high-volume traffic quickly exhausts the burst pool while new requests keep arriving.
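
For context on the last point, a global (per-server rather than per-IP) limit is usually expressed by keying the zone on a value shared by all clients. The sketch below is illustrative only: the zone name, rate, burst, and upstream address are assumptions, not values from this incident.

```nginx
# http {} context: one shared bucket for the whole virtual server,
# because every request has the same $server_name key.
limit_req_zone $server_name zone=global_git:1m rate=50r/s;   # rate is an assumption

server {
    server_name git.example.com;                # hypothetical host name

    location ^~ / {
        # Requests beyond the rate queue up to the burst size; anything
        # beyond that is rejected with the status set below.
        limit_req zone=global_git burst=100;    # burst is an assumption
        limit_req_status 429;

        proxy_pass http://127.0.0.1:3000;       # assumed Forgejo address
    }
}
```

Because the bucket is shared, crawler traffic and legitimate visitors compete for the same allowance, so once the burst is consumed real users are rejected alongside the bots.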