vm-task Pipeline Automation: From Manual Chaining to Fully Hands-off End-to-End Execution

Background

Previously, I published an article titled “NATS Communication Common Issues Troubleshooting Guide”, which resolved foundational health issues in the communication chain. However, weixiaoduo raised a deeper efficiency concern: cross-VM collaboration chains are too long, requiring manual orchestration at every step.

A typical scenario:
wenpai identifies spam domains → sends a message to weixiaoduo → weixiaoduo checks against the production database → sends results to kali for review → kali replies → weixiaoduo forwards the reply to wenpai → wenpai executes deletion.
This takes four handoffs across three VMs, and each one had to be forwarded by hand.

This post documents how the vm-task pipeline was automated end to end, including the pitfalls encountered along the way.


What Was Implemented

1. vm-task complete --result: Data Passing Between Stages

Previously, vm-task complete only marked a stage as complete and notified the next VM (“It’s your turn now”), without carrying any data. Now:

# Complete current stage and attach output
vm-task complete <task-id> --result 'Query result: 15 spam domains matched'
  • The result is stored inside the stage object of the task’s JSON.
  • When notifying the next VM, the previous stage’s result is automatically included.
  • vm-task show <id> highlights outputs from each stage.
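As a rough illustration of the first bullet, the sketch below writes a result into a stage object with jq. The JSON layout (a stages array whose entries carry status and result fields), the status value "completed", the function name store_stage_result, and the use of jq are all assumptions made for this example; the real vm-task schema is not shown in this post.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: attach a result to one stage of a task JSON file.
# Assumed layout: {"stages": [{"vm": "...", "status": "...", "result": "..."}]}

store_stage_result() {
    local task_file="$1" stage_index="$2" result="$3" tmp
    tmp="$(mktemp)" || return 1
    # --argjson / --arg pass values as data, so quotes in the result cannot break the filter.
    if jq --argjson i "$stage_index" --arg r "$result" \
          '.stages[$i].status = "completed" | .stages[$i].result = $r' \
          "$task_file" > "$tmp"; then
        mv "$tmp" "$task_file"   # replace the file only after jq succeeded
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Under these assumptions, store_stage_result task.json 0 'Query result: 15 spam domains matched' would update the first stage and leave the other stages untouched.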

2. Pipeline Auto-Execution

The watcher’s check_pipeline_tasks() function polls the pipeline directory on the NAS every 60 seconds. When it finds a ready stage assigned to this VM, it:

  1. Reads the task description, action, and previous stage’s result.
  2. Marks the stage as in_progress.
  3. Launches an isolated Claude session using systemd-run --user --scope.
  4. Claude automatically executes the action; upon completion, calls vm-task complete --result.
  5. The next VM’s watcher detects the updated state → repeats the above flow.

No human intervention is required at any point.
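Steps 1 and 3 can be pictured roughly as follows. This is a minimal sketch, not the watcher's actual code: the function names build_stage_prompt and launch_stage and the prompt wording are hypothetical (only the "Your stage:" label is taken from the real prompt), and the real watcher launches a helper script rather than calling claude directly.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: build the stage prompt, then launch Claude in its own scope.

build_stage_prompt() {
    local task_id="$1" description="$2" action="$3" prev_result="$4"
    printf 'Pipeline task: %s\n' "$task_id"
    printf 'Description: %s\n' "$description"
    printf 'Previous stage result: %s\n' "${prev_result:-(none)}"
    printf 'Your stage: %s\n' "$action"
    printf 'When finished, run: vm-task complete %s --result "<summary of your output>"\n' "$task_id"
}

launch_stage() {
    local prompt
    prompt="$(build_stage_prompt "$@")"
    # A separate scope keeps the session out of the watcher's cgroup.
    systemd-run --user --scope -p MemoryMax=1G -p CPUQuota=80% \
        claude -p "$prompt" >> "${LOG_FILE:-/dev/null}" 2>&1 &
}
```

Keeping the prompt builder separate from the launcher makes it possible to inspect the exact text Claude will receive without starting a session.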

3. Inbox Noise Reduction

  • The watcher automatically archives messages whose TTL has expired (info: 24h / normal: 48h / urgent: 7 days).
  • session-bootstrap groups incoming items by priority (requires immediate attention → pending → for reference only).
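The TTL rule in the first bullet could be expressed along these lines. The function names ttl_seconds and is_expired, the epoch-seconds timestamp format, and the fallback for unknown priorities are assumptions for illustration; only the three TTL values come from the setup described above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: decide whether an inbox message has outlived its TTL.

ttl_seconds() {
    case "$1" in
        info)   echo $((24 * 3600)) ;;       # 24 hours
        normal) echo $((48 * 3600)) ;;       # 48 hours
        urgent) echo $((7 * 24 * 3600)) ;;   # 7 days
        *)      echo $((48 * 3600)) ;;       # assumption: unknown priorities are treated as normal
    esac
}

# is_expired PRIORITY CREATED_EPOCH [NOW_EPOCH]; returns 0 once the TTL has passed.
is_expired() {
    local priority="$1" created="$2" now="${3:-$(date +%s)}"
    [ $((now - created)) -gt "$(ttl_seconds "$priority")" ]
}
```

A watcher loop could then call is_expired "$priority" "$created_at" for each message and move the expired ones to an archive directory.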

Pitfalls Encountered

Pitfall 1: vm-nats-doctor Connection Test Producing False Positives

Symptom: All VMs report “Unable to connect to NATS server”, yet actual communication works fine.

Root Cause:

  1. The doctor script did not set NATS_CONTEXT=vm-hub, so the nats CLI could not locate the correct connection configuration.
  2. Even after that fix, errors persisted: nats pub vm.healthcheck.xxx was rejected by the NATS subject permissions (Permissions Violation).

Fix: Add export NATS_CONTEXT="vm-hub" to the doctor script and switch the test subject to vm.dm.${VM_NAME}, since each VM has publish permission on its own DM subject.
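A corrected check might look roughly like this. It is a sketch under assumptions: the function names doctor_subject and check_nats_publish, the doctor-ping payload, and the PASS/FAIL output format are hypothetical, and the real vm-nats-doctor script may be structured differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the corrected connectivity check.

export NATS_CONTEXT="vm-hub"   # lets the nats CLI pick up the hub connection settings

doctor_subject() {
    # Each VM may publish to its own DM subject, so the test publishes there.
    printf 'vm.dm.%s' "$1"
}

check_nats_publish() {
    local vm_name="$1" subject
    subject="$(doctor_subject "$vm_name")"
    if nats pub "$subject" "doctor-ping" >/dev/null 2>&1; then
        echo "PASS: publish to $subject"
    else
        echo "FAIL: publish to $subject"
        return 1
    fi
}
```

Calling check_nats_publish "$VM_NAME" then exercises both halves of the fix: the context selects the right server, and the subject is one the VM is allowed to publish to.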

Pitfall 2: Chinese Characters Filtered Out from Actions

Symptom: In the pipeline prompt, the text following “Your stage:” appears blank.

Root Cause: The filter tr -cd 'a-zA-Z0-9._- ' stripped out all Chinese characters.

Fix: Apply the strict character filter only to task_id (to prevent path traversal). For action, truncate by length only and leave the character set unfiltered.
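The split between the two fields can be sketched as below. The function names and the 500-character cap are assumptions; the post does not state the real limit.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: strict filter for task_id, length cap only for action.

MAX_ACTION_LEN=500   # assumption: the real limit is not given in the post

sanitize_task_id() {
    # Drop everything except safe filename characters; removing "/" blocks path traversal.
    # The "-" is placed last so tr treats it literally rather than as a range.
    printf '%s' "$1" | tr -cd 'a-zA-Z0-9._-'
}

truncate_action() {
    # Keep the text as-is (including non-ASCII) and only cap its length.
    # In a UTF-8 locale bash counts characters here; in the C locale it counts bytes.
    local action="$1"
    printf '%s' "${action:0:MAX_ACTION_LEN}"
}
```

Dots are still allowed in the ID, so a caller may also want to reject IDs that are empty or consist only of dots before using them in a path.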

Pitfall 3: Claude Subprocess Core Dump (Most Subtle)

Symptom: The watcher automatically triggers a Claude session, which crashes with Aborted (core dumped) after ~9 seconds. Yet manually running claude -p via SSH on the same VM works perfectly.

Investigation Process:

  1. Checked Claude CLI version → OK (2.1.59).
  2. Verified credentials → no .credentials.json, but API key in environment variables was valid.
  3. Manually tested claude -p 'reply HELLO' → returned normally.
  4. Checked watcher logs → line 124: Aborted (core dumped).
  5. Key clue: Works fine over SSH, fails only when launched by watcher → environmental difference.

Root Cause: The vm-watcher systemd unit set MemoryMax=256M. Claude Code is a Node.js application that needs more than 200 MB of memory just to start. Child processes that the watcher launched via nohup stayed in the watcher's cgroup, shared its memory limit, and crashed for lack of memory.

# vm-watcher.service
MemoryMax=256M    # ← This restricts *all* child processes
MemoryHigh=192M
TasksMax=64

Fix: Everywhere the watcher launches Claude, use systemd-run --user --scope so that the session runs in its own cgroup:

# Before (inherits watcher’s 256M limit)
nohup "$auto_script" >> "$LOG_FILE" 2>&1 &

# After (independent scope, 1 GB memory)
systemd-run --user --scope -p MemoryMax=1G -p CPUQuota=80% \
    "$auto_script" >> "$LOG_FILE" 2>&1 &

Lesson Learned: systemd resource limits apply at the cgroup level and affect all descendant processes. When the watcher must launch heavyweight subprocesses, always use systemd-run --scope to isolate them.
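One way to confirm whether a child process really left the watcher's cgroup is to compare cgroup paths. The helper below is a hypothetical diagnostic, not part of the watcher; it assumes cgroup v2, where /proc/<pid>/cgroup holds a single line beginning with 0::, and the CGROUP_FILE override exists only to make the function easy to test.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic: print the cgroup v2 path of a process.

cgroup_path() {
    # $1 = PID (defaults to the current shell).
    # CGROUP_FILE can point at a different file, which is useful for testing.
    local pid="${1:-$$}"
    local file="${CGROUP_FILE:-/proc/$pid/cgroup}"
    # On cgroup v2 the file has the form "0::/path/in/hierarchy".
    sed -n 's/^0:://p' "$file"
}
```

If cgroup_path for the child prints the same path as for the watcher, the child is still subject to the watcher's MemoryMax; a child started with systemd-run --scope should instead show a separate .scope unit.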


Validation Results

Test pipeline created: fedora-devops → wenpai

12:17:57  fedora-devops creates task test-auto-v4
12:18:02  fedora-devops completes stage 1, result: "hostname: fedora-devops"
          → wenpai receives notification
~12:18:30 wenpai watcher detects ready stage → systemd-run launches Claude
12:19:16  wenpai’s Claude auto-executes hostname → vm-task complete --result "hostname: wenpai"
          → All stages completed

End-to-end execution time: about 1 minute 19 seconds from task creation (12:17:57) to final completion (12:19:16), fully automated, with zero manual intervention.


Current Status

  • Deployed across all 9 VMs cluster-wide.
  • All vm-nats-doctor checks passed (12/12, 0 failures).
  • Pipeline auto-execution verified and stable.
  • Code pushed to the feicode Ansible repository.

Next Steps

  • Predefine common pipeline templates (e.g., security review workflow, product launch workflow).
  • Build a vm-task dashboard: global view of task lifecycle and status.
  • Add pipeline timeout mechanism: auto-alert if a stage remains incomplete beyond N minutes.