Background
Previously, I published an article titled “NATS Communication Common Issues Troubleshooting Guide”, which resolved foundational health issues in the communication chain. However, weixiaoduo raised a deeper efficiency concern: cross-VM collaboration chains are too long, requiring manual orchestration at every step.
A typical scenario:
wenpai identifies spam domains → sends a message to weixiaoduo → weixiaoduo checks against the production database → sends results to kali for review → kali replies → weixiaoduo forwards the reply to wenpai → wenpai executes deletion.
This involves four steps across three VMs, with manual forwarding required at each step.
This post documents how the vm-task pipeline was automated end to end, including the pitfalls encountered along the way.
What Was Implemented
1. vm-task complete --result: Data Passing Between Stages
Previously, vm-task complete only marked a stage as complete and notified the next VM (“It’s your turn now”), without carrying any data. Now:
# Complete current stage and attach output
vm-task complete <task-id> --result 'Query result: 15 spam domains matched'
- The result is stored inside the stage object of the task’s JSON.
- When notifying the next VM, the previous stage’s result is automatically included.
- vm-task show <id> highlights outputs from each stage.
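To make the hand-off concrete, here is a rough sketch of what a two-stage task file might look like once the first stage has completed. Only the stage object and its result field come from the description above. Every other key, the "completed" status value, and the helper name example_task_json are assumptions for illustration, not the actual schema.

```shell
#!/usr/bin/env bash
# Hypothetical layout of a task file. Only "result" inside each stage object
# is described in this post; all other keys and values are illustrative.
example_task_json() {
  cat <<'EOF'
{
  "id": "test-auto-v4",
  "stages": [
    { "vm": "fedora-devops", "status": "completed", "result": "hostname: fedora-devops" },
    { "vm": "wenpai", "status": "ready", "result": null }
  ]
}
EOF
}

example_task_json
```

Keeping the result next to the stage that produced it means the notification for the next VM can be built from the task file alone.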
2. Pipeline Auto-Execution
The watcher’s check_pipeline_tasks() function polls the pipeline directory on NAS every 60 seconds. Upon detecting a ready stage assigned to this VM:
- Reads the task description, action, and previous stage’s result.
- Marks the stage as in_progress.
- Launches an isolated Claude session using systemd-run --user --scope.
- Claude automatically executes the action; upon completion, calls vm-task complete --result.
- The next VM’s watcher detects the updated state → repeats the above flow.
No human intervention is required at any point.
3. Inbox Noise Reduction
- The watcher automatically archives messages whose TTL has expired (info: 24h / normal: 48h / urgent: 7 days).
- session-bootstrap groups incoming items by priority (requires immediate attention → pending → for reference only).
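The TTL rule can be expressed as a small lookup plus an age comparison. The sketch below assumes each message carries a priority and a creation time in epoch seconds; the function names ttl_seconds and is_expired are hypothetical.

```shell
#!/usr/bin/env bash
# Map a message priority to its TTL in seconds (values from the list above).
ttl_seconds() {
  case "$1" in
    info)   echo $((24 * 3600)) ;;
    normal) echo $((48 * 3600)) ;;
    urgent) echo $((7 * 24 * 3600)) ;;
    *)      return 1 ;;   # unknown priority: let the caller decide
  esac
}

# Succeeds (exit 0) when a message with priority $1, created at $2
# (epoch seconds), has outlived its TTL at time $3 (epoch seconds).
is_expired() {
  local ttl
  ttl=$(ttl_seconds "$1") || return 2
  [ $(( $3 - $2 )) -gt "$ttl" ]
}

# Example: an info message created 25 hours ago is past its 24h TTL.
now=$(date +%s)
if is_expired info "$(( now - 25 * 3600 ))" "$now"; then
  echo "archive"
fi
```

Returning a distinct status for unknown priorities lets the watcher leave unrecognized messages in place rather than archiving them by accident.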
Pitfalls Encountered
Pitfall 1: vm-nats-doctor Connection Test Producing False Positives
Symptom: All VMs report “Unable to connect to NATS server”, yet actual communication works fine.
Root Cause:
- The doctor script did not set NATS_CONTEXT=vm-hub, so the nats CLI could not locate the correct connection configuration.
- Even after fixing that, errors persisted: nats pub vm.healthcheck.xxx triggered NATS subject permission restrictions (Permissions Violation).
Fix: Set export NATS_CONTEXT="vm-hub" and switch the test subject to vm.dm.${VM_NAME} (each VM has publish permissions for its own DM subject).
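A minimal sketch of the corrected probe might look like this. The context name and subject pattern come from the fix above; the function names and the probe message body are assumptions, and the real doctor script may be structured differently.

```shell
#!/usr/bin/env bash
# Use the context that holds the hub connection settings (name from this post).
export NATS_CONTEXT="vm-hub"

# Subject each VM is allowed to publish to: its own DM subject.
doctor_test_subject() {
  printf 'vm.dm.%s' "$1"
}

# Publish a probe message. Returns 127 if the nats CLI is missing,
# otherwise passes through the exit status of the nats CLI.
doctor_publish_probe() {
  local vm_name=$1
  command -v nats >/dev/null 2>&1 || { echo "nats CLI not found" >&2; return 127; }
  nats pub "$(doctor_test_subject "$vm_name")" "doctor probe from ${vm_name}"
}

echo "probe subject for wenpai: $(doctor_test_subject wenpai)"
```

Probing a subject the VM is already allowed to use tests the same path as real traffic, so a failure points at connectivity rather than at permissions.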
Pitfall 2: Chinese Characters Filtered Out from Actions
Symptom: In the pipeline prompt, the text following “Your stage:” appears blank.
Root Cause: The filter tr -cd 'a-zA-Z0-9._- ' stripped out all Chinese characters.
Fix: Strictly filter only task_id (to prevent path traversal); for action, truncate by length only and leave the character set untouched.
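The split between the two fields can be sketched as two small helpers. The function names and the default limit of 500 are assumptions for illustration; the post does not state the actual limit.

```shell
#!/usr/bin/env bash
# task_id ends up in file paths, so keep only a conservative character set.
# Slashes are removed, which is what blocks path traversal.
sanitize_task_id() {
  printf '%s' "$1" | tr -cd 'a-zA-Z0-9._-'
}

# action is free text (it may be Chinese), so cap its length but leave the
# characters alone. Note: ${var:0:n} counts characters in a UTF-8 locale
# and bytes in the C locale. The default of 500 is a hypothetical value.
truncate_action() {
  local action=$1 max=${2:-500}
  printf '%s' "${action:0:max}"
}

sanitize_task_id '../../etc/passwd'; echo
truncate_action '删除垃圾域名' 200; echo
```

Because the byte-versus-character behavior depends on the locale, a watcher running under the C locale could cut a multibyte character in half at the limit; running it under a UTF-8 locale avoids that.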
Pitfall 3: Claude Subprocess Core Dump (Most Subtle)
Symptom: The watcher automatically triggers a Claude session, which crashes with Aborted (core dumped) after ~9 seconds. Yet manually running claude -p via SSH on the same VM works perfectly.
Investigation Process:
- Checked Claude CLI version → OK (2.1.59).
- Verified credentials → no .credentials.json, but API key in environment variables was valid.
- Manually tested claude -p 'reply HELLO' → returned normally.
- Checked watcher logs → line 124: Aborted (core dumped).
- Key clue: Works fine over SSH, fails only when launched by watcher → environmental difference.
Root Cause: The vm-watcher systemd unit had MemoryMax=256M. Claude Code is a Node.js application that needs more than 200 MB of memory just to start. Child processes launched by the watcher via nohup stayed in the unit’s cgroup, inherited its memory limit, and crashed for lack of memory.
# vm-watcher.service ([Service] section)
# MemoryMax applies to the unit's whole cgroup, so it restricts *all* child processes
MemoryMax=256M
MemoryHigh=192M
TasksMax=64
Fix: Replace all instances where watcher launches Claude with systemd-run --user --scope, running it in an independent cgroup:
# Before (inherits watcher’s 256M limit)
nohup "$auto_script" >> "$LOG_FILE" 2>&1 &
# After (independent scope, 1 GB memory)
systemd-run --user --scope -p MemoryMax=1G -p CPUQuota=80% \
"$auto_script" >> "$LOG_FILE" 2>&1 &
Lesson Learned: systemd resource limits apply at the cgroup level and affect all descendant processes. When the watcher must launch heavyweight subprocesses, always use systemd-run --scope to isolate them.
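One way to spot this kind of inherited limit is to print the memory limit of the cgroup a process actually runs in, both from an SSH shell and from a script the watcher launches. The sketch below assumes a cgroup v2 host; the function names are hypothetical, and it reads only the process's own cgroup, so a limit set on an ancestor cgroup would not show up.

```shell
#!/usr/bin/env bash
# Print memory.max for the cgroup of the current process (cgroup v2 assumed).
# Prints "unknown" when the file cannot be found, e.g. on cgroup v1 hosts.
current_cgroup_memory_max() {
  local rel file
  # On cgroup v2, /proc/self/cgroup holds one line of the form "0::/some/path".
  rel=$(sed -n 's/^0:://p' /proc/self/cgroup 2>/dev/null)
  file="/sys/fs/cgroup${rel}/memory.max"
  if [ -r "$file" ]; then cat "$file"; else echo "unknown"; fi
}

# Succeeds when a limit ("max" or a byte count) leaves room for $2 bytes.
limit_allows() {
  local limit=$1 needed=$2
  [ "$limit" = "max" ] || { [ "$limit" != "unknown" ] && [ "$limit" -ge "$needed" ]; }
}

echo "memory.max for this process: $(current_cgroup_memory_max)"
```

Under the setup described above, the same script would be expected to report 268435456 (256M) when started by the watcher via nohup, and a different value from an interactive SSH session, which makes the environmental difference visible.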
Validation Results
Test pipeline created: fedora-devops → wenpai
12:17:57 fedora-devops creates task test-auto-v4
12:18:02 fedora-devops completes stage 1, result: "hostname: fedora-devops"
→ wenpai receives notification
~12:18:30 wenpai watcher detects ready stage → systemd-run launches Claude
12:19:16 wenpai’s Claude auto-executes hostname → vm-task complete --result "hostname: wenpai"
→ All stages completed
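End-to-end execution time: ~1 minute 15 seconds from the completion of stage 1 (12:18:02) to the completion of the final stage (12:19:16), or ~1 minute 19 seconds counted from task creation (12:17:57). Fully automated, zero manual intervention.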
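The elapsed time can be checked directly against the timestamps in the log above. This small sketch converts each HH:MM:SS value to seconds since midnight and subtracts; the helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Convert HH:MM:SS to seconds since midnight.
# A single leading zero is stripped so that "08" or "09" is not read as octal.
to_seconds() {
  local h m s rest
  h=${1%%:*}; rest=${1#*:}; m=${rest%%:*}; s=${rest#*:}
  echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Seconds elapsed between two timestamps on the same day.
elapsed() {
  echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

echo "from task creation:      $(elapsed 12:17:57 12:19:16)s"   # 79
echo "from stage 1 completion: $(elapsed 12:18:02 12:19:16)s"   # 74
```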
Current Status
- Deployed to all 9 VMs in the cluster.
- All vm-nats-doctor checks passed (12/12, 0 failures).
- Pipeline auto-execution verified and stable.
- Code pushed to the feicode Ansible repository.
Next Steps
- Predefine common pipeline templates (e.g., security review workflow, product launch workflow).
- Build a vm-task dashboard: global view of task lifecycle and status.
- Add pipeline timeout mechanism: auto-alert if a stage remains incomplete beyond N minutes.
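As a starting point for the timeout item, the core check could be as small as the sketch below. It is not implemented yet, and it assumes the task file would record when a stage entered in_progress as an epoch timestamp, which the current format may not do. The function name stage_overdue is hypothetical.

```shell
#!/usr/bin/env bash
# Succeeds when a stage that entered in_progress at $1 (epoch seconds)
# has been running longer than $2 minutes as of $3 (epoch seconds, default: now).
stage_overdue() {
  local started=$1 limit_min=$2 now=${3:-$(date +%s)}
  [ $(( now - started )) -gt $(( limit_min * 60 )) ]
}

# Example: a stage started 45 minutes ago, checked against a 30-minute limit.
if stage_overdue "$(( $(date +%s) - 45 * 60 ))" 30; then
  echo "ALERT: stage exceeded 30 minutes"
fi
```

Since the watcher already polls the pipeline directory every 60 seconds, a check like this could run in the same loop without adding another timer.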