Background
The WordPress ecosystem features a large-scale translation platform called GlotPress, which has accumulated hundreds of thousands of human-reviewed English-to-Chinese translations. These translations are scattered across thousands of plugins and themes—valuable linguistic resources that have never been systematically extracted into a terminology database.
Our goal: Automatically extract high-quality English–Chinese term pairs from 830,000 human translations to supplement our existing glossary of 1,167 entries.
Data Overview
- Source: Local clone of GlotPress (MySQL 5.7)
- Total translations: 833,926 (
status = 'current') - After deduplication: ~360,000 unique source–translation pairs
- Characteristics: High-frequency terms like Edit → 编辑 appear 1,651 times; ambiguous terms like Settings, however, have 34 distinct translations.
Core Metric: Dominance Ratio
The entire approach hinges on a simple yet effective statistical metric:
dominance = frequency of this translation / total frequency of the source term
For example, Edit appears 1,651 times across all projects, with 编辑 accounting for over 1,600 occurrences → dominance ≈ 0.97. This means 97% of translators chose the same rendering—highly trustworthy.
In contrast, Settings has 34 variants; its most frequent translation, 设置, accounts for only 72% → dominance < 0.8, indicating low consensus among translators and requiring further adjudication.
Four-Phase Pipeline
Phase 1: SQL-Based Statistical Filtering (No AI Required)
Direct SQL aggregation from the database, with filtering criteria:
- Source term ≤ 50 characters, starting with an English letter
- Translation variant appears ≥ 10 times
- Dominance ratio ≥ 0.8
Note: MySQL 5.7 lacks window functions, so subqueries are used:
-- Step 1: Frequency per (source, translation)
SELECT o.singular, t.translation_0, COUNT(*) AS freq
FROM wp_gp_translations t
JOIN wp_gp_originals o ON t.original_id = o.id
WHERE t.status = 'current'
AND CHAR_LENGTH(o.singular) <= 50
AND o.singular REGEXP '^[A-Za-z]'
GROUP BY o.singular, t.translation_0
HAVING freq >= 10
-- Step 2: Total frequency per source (computed separately in Python to derive dominance)
Then, in Python, filter out entries unsuitable as terminology:
| Filter Type | Count | Example |
|---|---|---|
| Full sentences | 149 | "Comments are closed." |
| Placeholders | 93 | "Page %s" |
| Brand names | 145 | Facebook = Facebook |
| HTML entities | 25 | "Next →" |
| Overly long phrases | 42 | >5 words |
| Extremely short tokens | 12 | ≤2 characters |
Phase 1 output: 2,933 high-confidence terms.
After ingestion, a cleanup pass archived 625 entries (geographic names, currencies, language names—e.g., Quebec → 魁北克, Kenyan Shilling → 肯尼亚先令), leaving 2,274 active terms.
Phase 2: Ambiguous Term Export
Terms with dominance < 0.8 reflect translator disagreement:
Settings → 设置(72%) | 设定(14%) | 配置(8%)
Email → 电子邮件(59%) | 邮箱地址(18%) | 邮箱(10%)
Home → 首页(75%) | 主页(25%)
None → 无(77%) | 無(12%) | 没有(5%)
Filtering criteria:
- ≥2 distinct translations for the same source term
- Each variant appears ≥3 times
- Total frequency ≥10
Exported in JSONL format—one JSON object per line—with source term, all variants, frequencies, and ratios:
{
"source": "Settings",
"variants": [
{"translation": "设置", "freq": 798, "ratio": 0.72},
{"translation": "设定", "freq": 156, "ratio": 0.14},
{"translation": "配置", "freq": 89, "ratio": 0.08}
],
"total_freq": 1105
}
Phase 2 output: 1,797 ambiguous terms.
Phase 3: Batch AI Classification
Ambiguous terms are submitted to an LLM (gemma3:12b) for batch adjudication: 50 terms per batch × 36 batches.
Prompt template:
You are a WordPress Chinese localization expert using Mainland Simplified Chinese standards.
The following source terms have multiple Chinese translations. Please evaluate each one:
{batch_items}
For each term, output JSON:
{
"source": "original term",
"standard": "Mainland Simplified Chinese standard translation",
"taiwan_variants": ["Taiwanese terms"],
"context_variants": [{"translation": "variant", "context": "applicable context"}],
"discard": ["clearly incorrect translations"],
"confidence": 5
}
The AI performs three tasks:
- Standard translation adjudication: Select the authoritative Mainland Simplified Chinese rendering.
- Taiwanese terminology tagging: Identify traditional characters and Taiwan-specific terms.
- Context-sensitive polysemy preservation: e.g., post → 文章 (content) / 帖子 (forum).
First run succeeded for 1,750/1,797 terms (97.4%). The remaining 47 failed cases were re-run in smaller batches (10 terms/batch), achieving 100% completion.
Result distribution:
confidence = 5: 1,482 termsconfidence = 4: 259 termsconfidence = 3: 9 terms
Ingestion rules:
- Standard translations with
confidence ≥ 4go intoactive. - Contextual variants go into
reviewfor manual verification.
Phase 3 output: 1,666 standard translations + 1,333 contextual variants.
Phase 4: Taiwanese Terminology Filtering (Three-Layer Defense)
A cross-cutting filtering mechanism applied throughout the pipeline:
- Character-level: Detect traditional characters not used in Mainland Simplified Chinese (e.g., 設, 資, 檔, 儲, 預, 佈).
- Term-level: Maintain a Taiwan→Mainland mapping table (e.g., 伺服器 → 服务器, 程式 → 程序, 儲存 → 保存, 外掛 → 插件 — 20+ pairs).
- AI fallback: Taiwan-related tags generated during Phase 3 (937 entries flagged).
Final Results
Raw data: 833,926 human translations
↓ SQL deduplication
Unique pairs: ~360,000
↓ Frequency + consistency filtering (Phase 1)
High-confidence terms: 2,933 → 2,274 after cleanup
↓ Ambiguous term export (Phase 2)
Ambiguous terms: 1,797
↓ AI classification & adjudication (Phase 3)
AI-adjudicated: 1,666 standard translations ingested + 1,333 variants pending review
↓
Glossary size: 1,167 → 5,107 active (+1,333 in review)
Lessons Learned
-
SQL does the heavy lifting; AI handles judgment: Of 830,000 entries, SQL statistics alone resolved 2,274 high-confidence terms (
dominance ≥ 0.8). AI only needed to process the remaining 1,797 ambiguous ones—extremely cost-efficient. -
Dominance ratio is highly effective: A simple frequency-based metric cleanly separates high-confidence terms from ambiguous ones. The 0.8 threshold proved robust in practice—terms below it truly require human or AI intervention.
-
Filtering matters more than extraction: Of the initial 2,933 candidates, 149 were full sentences, 145 were brand names, and 625 were geographic/currency/language names. Without rigorous filtering, the glossary would be severely polluted.
-
Taiwanese terminology is a persistent challenge: GlotPress data contains substantial contributions from Taiwanese translators. All three defense layers—character-level, term-level, and AI—are essential and non-redundant.
-
Batch AI calls must be fault-tolerant: At 50 terms/batch, failure rate was 2.6%; reducing to 10 terms/batch achieved 100% success. Best practice: Run large batches first, then reprocess failures at smaller scale.
Tech Stack
- Python 3 +
pymysql+sqlite3 - MySQL 5.7 (GlotPress data)
- SQLite (glossary database
glossary.db, layered architecture) - Ollama +
gemma3:12b(AI classification) - Cross-VM collaboration: NATS messaging + NAS file sharing
Full implementation in extract_gp_terms.py, supporting subcommands: phase1, phase2, import.