Extracting 4,000 Terms from 830,000 Human Translations: A Practical Four-Stage SQL + AI Pipeline

Background

The WordPress ecosystem features a large-scale translation platform called GlotPress, which has accumulated hundreds of thousands of human-reviewed English-to-Chinese translations. These translations are scattered across thousands of plugins and themes—valuable linguistic resources that have never been systematically extracted into a terminology database.

Our goal: Automatically extract high-quality English–Chinese term pairs from 830,000 human translations to supplement our existing glossary of 1,167 entries.

Data Overview

  • Source: Local clone of GlotPress (MySQL 5.7)
  • Total translations: 833,926 (status = 'current')
  • After deduplication: ~360,000 unique source–translation pairs
  • Characteristics: High-frequency terms like Edit编辑 appear 1,651 times; ambiguous terms like Settings, however, have 34 distinct translations.

Core Metric: Dominance Ratio

The entire approach hinges on a simple yet effective statistical metric:

dominance = frequency of this translation / total frequency of the source term

For example, Edit appears 1,651 times across all projects, with 编辑 accounting for over 1,600 occurrences → dominance ≈ 0.97. This means 97% of translators chose the same rendering—highly trustworthy.

In contrast, Settings has 34 variants; its most frequent translation, 设置, accounts for only 72% → dominance < 0.8, indicating low consensus among translators and requiring further adjudication.

Four-Phase Pipeline

Phase 1: SQL-Based Statistical Filtering (No AI Required)

Direct SQL aggregation from the database, with filtering criteria:

  • Source term ≤ 50 characters, starting with an English letter
  • Translation variant appears ≥ 10 times
  • Dominance ratio ≥ 0.8

:warning: Note: MySQL 5.7 lacks window functions, so subqueries are used:

-- Step 1: Frequency per (source, translation)
SELECT o.singular, t.translation_0, COUNT(*) AS freq
FROM wp_gp_translations t
JOIN wp_gp_originals o ON t.original_id = o.id
WHERE t.status = 'current'
  AND CHAR_LENGTH(o.singular) <= 50
  AND o.singular REGEXP '^[A-Za-z]'
GROUP BY o.singular, t.translation_0
HAVING freq >= 10

-- Step 2: Total frequency per source (computed separately in Python to derive dominance)

Then, in Python, filter out entries unsuitable as terminology:

Filter Type Count Example
Full sentences 149 "Comments are closed."
Placeholders 93 "Page %s"
Brand names 145 Facebook = Facebook
HTML entities 25 "Next &rarr;"
Overly long phrases 42 >5 words
Extremely short tokens 12 ≤2 characters

Phase 1 output: 2,933 high-confidence terms.

After ingestion, a cleanup pass archived 625 entries (geographic names, currencies, language names—e.g., Quebec魁北克, Kenyan Shilling肯尼亚先令), leaving 2,274 active terms.

Phase 2: Ambiguous Term Export

Terms with dominance < 0.8 reflect translator disagreement:

Settings → 设置(72%) | 设定(14%) | 配置(8%)
Email    → 电子邮件(59%) | 邮箱地址(18%) | 邮箱(10%)
Home     → 首页(75%) | 主页(25%)
None     → 无(77%) | 無(12%) | 没有(5%)

Filtering criteria:

  • ≥2 distinct translations for the same source term
  • Each variant appears ≥3 times
  • Total frequency ≥10

Exported in JSONL format—one JSON object per line—with source term, all variants, frequencies, and ratios:

{
  "source": "Settings",
  "variants": [
    {"translation": "设置", "freq": 798, "ratio": 0.72},
    {"translation": "设定", "freq": 156, "ratio": 0.14},
    {"translation": "配置", "freq": 89, "ratio": 0.08}
  ],
  "total_freq": 1105
}

Phase 2 output: 1,797 ambiguous terms.

Phase 3: Batch AI Classification

Ambiguous terms are submitted to an LLM (gemma3:12b) for batch adjudication: 50 terms per batch × 36 batches.

Prompt template:

You are a WordPress Chinese localization expert using Mainland Simplified Chinese standards.
The following source terms have multiple Chinese translations. Please evaluate each one:

{batch_items}

For each term, output JSON:
{
  "source": "original term",
  "standard": "Mainland Simplified Chinese standard translation",
  "taiwan_variants": ["Taiwanese terms"],
  "context_variants": [{"translation": "variant", "context": "applicable context"}],
  "discard": ["clearly incorrect translations"],
  "confidence": 5
}

The AI performs three tasks:

  1. Standard translation adjudication: Select the authoritative Mainland Simplified Chinese rendering.
  2. Taiwanese terminology tagging: Identify traditional characters and Taiwan-specific terms.
  3. Context-sensitive polysemy preservation: e.g., post文章 (content) / 帖子 (forum).

First run succeeded for 1,750/1,797 terms (97.4%). The remaining 47 failed cases were re-run in smaller batches (10 terms/batch), achieving 100% completion.

Result distribution:

  • confidence = 5: 1,482 terms
  • confidence = 4: 259 terms
  • confidence = 3: 9 terms

Ingestion rules:

  • Standard translations with confidence ≥ 4 go into active.
  • Contextual variants go into review for manual verification.

Phase 3 output: 1,666 standard translations + 1,333 contextual variants.

Phase 4: Taiwanese Terminology Filtering (Three-Layer Defense)

A cross-cutting filtering mechanism applied throughout the pipeline:

  1. Character-level: Detect traditional characters not used in Mainland Simplified Chinese (e.g., 設, 資, 檔, 儲, 預, 佈).
  2. Term-level: Maintain a Taiwan→Mainland mapping table (e.g., 伺服器服务器, 程式程序, 儲存保存, 外掛插件 — 20+ pairs).
  3. AI fallback: Taiwan-related tags generated during Phase 3 (937 entries flagged).

Final Results

Raw data:           833,926 human translations
    ↓ SQL deduplication
Unique pairs:       ~360,000
    ↓ Frequency + consistency filtering (Phase 1)
High-confidence terms: 2,933 → 2,274 after cleanup
    ↓ Ambiguous term export (Phase 2)
Ambiguous terms:    1,797
    ↓ AI classification & adjudication (Phase 3)
AI-adjudicated:     1,666 standard translations ingested + 1,333 variants pending review
    ↓
Glossary size:      1,167 → 5,107 active (+1,333 in review)

Lessons Learned

  1. SQL does the heavy lifting; AI handles judgment: Of 830,000 entries, SQL statistics alone resolved 2,274 high-confidence terms (dominance ≥ 0.8). AI only needed to process the remaining 1,797 ambiguous ones—extremely cost-efficient.

  2. Dominance ratio is highly effective: A simple frequency-based metric cleanly separates high-confidence terms from ambiguous ones. The 0.8 threshold proved robust in practice—terms below it truly require human or AI intervention.

  3. Filtering matters more than extraction: Of the initial 2,933 candidates, 149 were full sentences, 145 were brand names, and 625 were geographic/currency/language names. Without rigorous filtering, the glossary would be severely polluted.

  4. Taiwanese terminology is a persistent challenge: GlotPress data contains substantial contributions from Taiwanese translators. All three defense layers—character-level, term-level, and AI—are essential and non-redundant.

  5. Batch AI calls must be fault-tolerant: At 50 terms/batch, failure rate was 2.6%; reducing to 10 terms/batch achieved 100% success. Best practice: Run large batches first, then reprocess failures at smaller scale.

Tech Stack

  • Python 3 + pymysql + sqlite3
  • MySQL 5.7 (GlotPress data)
  • SQLite (glossary database glossary.db, layered architecture)
  • Ollama + gemma3:12b (AI classification)
  • Cross-VM collaboration: NATS messaging + NAS file sharing

Full implementation in extract_gp_terms.py, supporting subcommands: phase1, phase2, import.