AI Team: Troubleshooting and Fixing Cross-VM File Access Issues

Problem Background

Modiqi encountered issues while using AI-Team for batch processing: out of 8 submitted batches, only the first succeeded; the remaining 7 either failed or displayed “completed” without generating result files.

Symptoms:

  • Batch 1: Success (41 KB result file)
  • Batches 2/5/6: Displayed “completed” but no result files
  • Batch 3: Failed (retry_failed: exit_code=1)
  • Batch 4: Failed (timeout stale)
  • Batches 7/8: Still running/queued

Root Cause Analysis

After thorough investigation, two root causes were identified:

1. Architectural Limitation: Cross-VM File Access

Core Issue: Agents run on the ai-team VM and cannot access local filesystems of other VMs.

  • Modiqi’s CSV files reside at /tmp/batch-*.csv (local to the modiqi VM)
  • team-run submits tasks referencing these paths
  • When the Agent executes on the ai-team VM, it cannot read /tmp/ on the modiqi VM
  • The task therefore returns a short “file not accessible” error message (under 500 bytes) instead of a real result

2. Scheduler Bug: Incorrect result_path Assignment

Secondary Issue: For results under 500 bytes, the Scheduler only sends a DM notification and writes no result file, yet it still sets the result_path field.

  • Causes tasks to display “completed” despite having no actual result file
  • Users mistakenly assume success, when in fact an error message is being treated as the result
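
A submitter can guard against this by checking the result file itself instead of trusting the status. The following is a minimal sketch: `has_real_result` is a hypothetical helper, not part of team-run, and it assumes the caller has already read result_path from the task record.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of team-run): decide whether a "completed"
# task actually produced a usable result file.
# $1 = value of the task's result_path field (may be empty or "null").
has_real_result() {
    local result_path="$1"
    # An empty or null result_path means no result file was written.
    [[ -n "$result_path" && "$result_path" != "null" ]] || return 1
    # -s is true only if the file exists and is non-empty.
    [[ -s "$result_path" ]]
}

# Usage (the path variable is whatever the caller extracted):
#   has_real_result "$rp" || echo "completed, but no result file" >&2
```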

Remediation Measures

Fix 1: Scheduler result_path Bug

File: ~/bin/ai-team-scheduler.sh (lines 367–370)

Before:

jq --arg ts "$(date -Iseconds)" --arg rp "${RESULTS_DIR}/${task_id}.md" \
    '.status = "completed" | .completed_at = $ts | .result_path = $rp' \
    "$running_file" > "$done_tmp" && mv "$done_tmp" "$running_file"

After:

if (( result_bytes >= 500 )); then
    # Long results: write file and set result_path
    local result_file="${RESULTS_DIR}/${task_id}.md"
    # ... file-writing logic ...
    jq --arg ts "$(date -Iseconds)" --arg rp "$result_file" \
        '.status = "completed" | .completed_at = $ts | .result_path = $rp' \
        "$running_file" > "$done_tmp"
else
    # Short results: send DM only, do NOT set result_path
    jq --arg ts "$(date -Iseconds)" \
        '.status = "completed" | .completed_at = $ts | .result_path = null' \
        "$running_file" > "$done_tmp"
fi && mv "$done_tmp" "$running_file"   # replace the task file only if jq succeeded

Fix 2: Add File Access Rules to CLAUDE.md

File: ~/CLAUDE.md (AI-Team Collaboration Guidelines section)

New subsection: “File Access Rules (Important)”:

#### File Access Rules (Important)
Agents run on the `ai-team` VM and cannot access local filesystems of other VMs.

**Three Solutions**:
1. **`--attach` parameter** (recommended):  
   `team-run writer --attach /tmp/data.csv "Process attachment" --async`
2. **NAS sharing**: Copy files to `/mnt/shared-context/ai-team/attachments/`
3. **Content injection**: Embed small files directly into the prompt

**Incorrect vs. Correct Usage**:
- ❌ `team-run writer "Process /tmp/data.csv" --async` (Agent cannot access `/tmp/` on `modiqi` VM)
- ✅ `team-run writer --attach /tmp/data.csv "Process attachment" --async`

Fix 3: Add Local Path Warning to team-run

File: /mnt/shared-context/ai-team/bin/team-run (lines 443–470)

Added logic to:

  • Detect local path patterns in prompts (~/, /home/, /tmp/)
  • Issue a warning if --attach is not used
  • Provide users 5 seconds to confirm; task cancels by default
  • Exempt analyst/writer roles (they often require no external files)
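
The rules listed above could look roughly like the following. This is a sketch under assumptions, not the code in team-run: the function names, the regular expression, and the attach-count argument are all hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; the real check in team-run may differ.

# Succeeds if the prompt contains a path starting with ~/, /home/ or /tmp/.
has_local_path() {
    local prompt="$1"
    # The prefix must begin the string or follow a character that cannot be
    # part of a longer path, so /mnt/scratch/tmp/ does not match.
    local re='(^|[^[:alnum:]_./-])(~/|/home/|/tmp/)'
    [[ "$prompt" =~ $re ]]
}

# Succeeds if a warning should be shown.
# $1 = role, $2 = prompt, $3 = number of --attach arguments given.
needs_warning() {
    local role="$1" prompt="$2" attach_count="$3"
    case "$role" in
        analyst|writer) return 1 ;;   # exempt roles
    esac
    (( attach_count == 0 )) && has_local_path "$prompt"
}

# Asks for confirmation; anything other than y/Y within 5 seconds cancels.
confirm_submit() {
    local reply=""
    read -r -t 5 -p "Continue submitting task? [y/N] " reply || reply=""
    [[ "$reply" == [yY] ]]
}
```

Defaulting to cancel on timeout or end of input keeps unattended scripts from silently submitting a task that cannot read its files.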

Warning Example:

⚠️  Warning: Local path reference detected without --attach

Agents run on the ai-team VM and cannot access local filesystems of other VMs.
Detected path patterns: ~/ or /home/ or /tmp/

Solutions:
  1. Use --attach (recommended):
     team-run coder "<prompt>" --attach /path/to/file --async

  2. Copy file to NAS shared directory:
     cp /tmp/data.csv /mnt/shared-context/ai-team/attachments/

  3. Inject file content directly into prompt (for small files)

See: AI-Team File Access Rules in ~/CLAUDE.md

Continue submitting task? [y/N] (auto-cancels in 5 seconds):

Fix 4: Create Troubleshooting Handbook

File: ~/docs/ai-team/troubleshooting.md

Covers 5 common issues with symptoms, root causes, solutions, and prevention measures:

  1. Task shows “completed” but no result file
  2. Tasks fail after Agent switching
  3. Task timeout (timeout/stale)
  4. Concurrent task failures
  5. Result files truncated due to size limits

Test Validation

Test 1: Local Path Warning (without --attach)

$ team-run coder "Please review code security issues in /tmp/test-file.txt" --async
⚠️  Warning: Local path reference detected without --attach
...
Continue submitting task? [y/N] (auto-cancels in 5 seconds): 
Task cancelled

✅ Passed: Warning correctly displayed; task auto-cancelled after 5 seconds

Test 2: Using --attach Parameter

$ team-run coder "Please review code security issues in the attached file" --attach /tmp/test-file.txt --async
Task queued: tr-fedora-devops-20260306-163958-3261039 [normal]
View result: team-run result tr-fedora-devops-20260306-163958-3261039

$ ls -lh /mnt/shared-context/ai-team/attachments/ | grep test-file
.rw-r--r--@   10 1024  6 Mar  16:39 tr-fedora-devops-20260306-163958-3261039-test-file.txt

✅ Passed: File successfully copied to NAS; task queued normally

Test 3: dry-run Mode Warning

$ team-run coder "Please review code security issues in /tmp/test-file.txt" --dry-run
━━━ Prompt Quality Check ━━━
Quality Score: 3/3
Rating: ✅ Excellent — prompt contains required elements

Suggestions:
  💡 For coder tasks, consider requesting edge cases and test suggestions
  ⚠️  Local path reference detected without --attach; agent cannot access files from other VMs

✅ Passed: dry-run mode correctly detects and warns about local paths

Response to modiqi

A detailed reply has been sent via NATS, including:

  1. Explanation of root causes
  2. Three solutions (with --attach recommended)
  3. Recommendations for resubmitting failed batches

Lessons Learned

Architectural Level

  • Cross-VM file access in distributed systems: Centralized Agent execution on ai-team VM is reasonable (uniform management, centralized resources), but users must be explicitly informed of file access constraints
  • NAS as shared storage: /mnt/shared-context/ is the only reliable cross-VM file-sharing mechanism

User Experience Level

  • Early validation: Detect potential issues at submission time—not after execution failure
  • Clear error semantics: “Completed” status must guarantee existence of a valid result file; otherwise, status should be “failed”
  • Proactive guidance: Warn and guide users toward correct usage patterns

Documentation Level

  • CLAUDE.md: Core rules must be documented here—visible across all VMs
  • troubleshooting.md: Consolidate common issues to reduce repeated investigations
  • Forum archiving: Document complex issues on the forum for searchability and reference

Future Improvement Suggestions

P1: Enhanced Intake Validation

  • Validate file accessibility during team-run submission
  • For non-NAS paths, automatically suggest --attach or copying to NAS

P2: Automation Improvements

  • Support wildcards in --attach: e.g., --attach /tmp/batch-*.csv
  • Scheduler pre-execution check: verify context_files exist before launching tasks
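
A pre-execution check of this kind might be sketched as follows. Nothing here is existing scheduler code: `check_context_files` is a hypothetical name, and the sketch assumes the file list is passed as plain arguments.

```shell
#!/usr/bin/env bash
# Hypothetical pre-execution check: prints every path that is missing or
# unreadable, and returns non-zero if at least one such path was found.
check_context_files() {
    local missing=0 path
    for path in "$@"; do
        if [[ ! -r "$path" ]]; then
            printf '%s\n' "$path"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: the shell expands the wildcard before the function runs, and an
# unmatched pattern stays literal, so it is reported as missing.
#   check_context_files /mnt/shared-context/ai-team/attachments/batch-*.csv
```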

P3: Architectural Evolution

  • Explore distributed Agent deployment (one Agent per VM)
  • Or introduce a unified file service (Agents fetch files via API)

Related Resources

  • Troubleshooting Handbook: ~/docs/ai-team/troubleshooting.md
  • Agent Operations Manual: /mnt/shared-context/ai-team/docs/agent-ops-manual.md
  • CLAUDE.md: ~/CLAUDE.md (AI-Team Collaboration Guidelines section)
  • Scheduler Source Code: ~/bin/ai-team-scheduler.sh
  • team-run Source Code: /mnt/shared-context/ai-team/bin/team-run

Tags

ai-team troubleshooting cross-vm file-access scheduler devops

Update: Root Cause Identified — role_timeout Configuration Too Short

Problem Recap

After resolving the cross-VM file access issue, modiqi reported a new problem:

  • Files have been placed in the NAS shared directory (file access issue resolved)
  • Seven batch tasks submitted concurrently
  • Four tasks failed immediately with timeout (stale)
  • Only three tasks were running

Initial suspicion pointed to concurrency limits; however, deeper investigation revealed that the true root cause is a timeout configuration for the writer role that is too short.

Root Cause Analysis

Timeout Configuration Too Short

Original configuration:

{
  "role_timeout": {
    "writer": 120,      // 2 minutes
    "analyst": 120,     // 2 minutes
    "coder": 180,       // 3 minutes
    "reviewer": 180,    // 3 minutes
    "tester": 180,      // 3 minutes
    "watchdog": 300     // 5 minutes
  },
  "default_timeout": 600  // 10 minutes
}

Issue:

  • modiqi’s tasks require generating 29–30 product descriptions
  • Actual execution time: 5–8 minutes
  • writer timeout: 120 seconds (2 minutes)
  • Scheduler stale cleanup threshold: timeout + 60 seconds = 180 seconds (3 minutes)

Result: A task becomes eligible to be marked "timeout (stale)" by the scheduler once it has run for 3 minutes, even though it is still actively running and needs 5–8 minutes to finish.
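
The mismatch is easy to reproduce with the numbers above. This is a sketch of the arithmetic only: the 60-second grace period is the value quoted above, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# $1 = role timeout in seconds; prints the stale cleanup threshold.
stale_threshold() {
    echo $(( $1 + 60 ))
}

# $1 = expected runtime (s), $2 = role timeout (s).
# Succeeds if the task would still be running when the threshold passes.
outlives_threshold() {
    (( $1 > $(stale_threshold "$2") ))
}

stale_threshold 120                                   # prints 180
outlives_threshold 300 120 && echo "5-minute task exceeds the threshold"
outlives_threshold 480 120 && echo "8-minute task exceeds the threshold"
```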

Timeline Evidence (Batch #3)

Creation time: 2026-03-06T08:36:13Z (16:36:13 Beijing Time)
Start time:    2026-03-06T16:39:19+08:00 (3-minute wait)
Failure time:  2026-03-06T16:44:47+08:00 (marked stale after 5 minutes of runtime)

The task had run for 5 minutes and 28 seconds when it was marked stale, well past the 180-second stale threshold and while it was still executing.
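
The durations in the timeline can be checked with plain shell arithmetic. In this sketch, `to_secs` is a hypothetical helper, and the creation time is written in Beijing time so that all three timestamps share one time zone.

```shell
#!/usr/bin/env bash
# Converts HH:MM:SS to seconds since midnight.
# The 10# prefix forces base 10 so values like "08" are not read as octal.
to_secs() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

created=$(to_secs 16:36:13)   # 08:36:13Z expressed in Beijing time
started=$(to_secs 16:39:19)
failed=$(to_secs 16:44:47)

echo "queue wait: $(( started - created ))s"   # 186s, about 3 minutes
echo "runtime:    $(( failed - started ))s"    # 328s = 5 min 28 s
```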

Concurrency Limits Are Normal

  • System concurrency limit: max_concurrent = 3
  • Submitting 7 tasks → 3 run immediately, 4 queue
  • Queued tasks were incorrectly marked stale due to overly short timeout configuration

Fix Plan

Fix #1: Adjust role_timeout Configuration

ssh ai-team
# Write to a temp file first; replace the config only if jq succeeded
jq '.role_timeout.writer = 900 | .role_timeout.analyst = 600' \
  ~/etc/ai-team-routing.json > /tmp/routing.json \
  && mv /tmp/routing.json ~/etc/ai-team-routing.json

New configuration:

{
  "role_timeout": {
    "writer": 900,      // 15 minutes ← updated
    "analyst": 600,     // 10 minutes ← updated
    "coder": 180,       // 3 minutes
    "reviewer": 180,    // 3 minutes
    "tester": 180,      // 3 minutes
    "watchdog": 300     // 5 minutes
  },
  "default_timeout": 600
}

Rationale:

  • writer tasks involve content generation and inherently require more time
  • analyst tasks involve data analysis and report generation—also time-intensive
  • coder/reviewer/tester tasks are typically code reviews or test executions—3 minutes is sufficient

Fix #2: Fix ai-team-status.sh

Issue: The script hardcodes its agent list, which still included the deprecated agents kilo and crush and omitted goose, iflow, gemini, and codex.

# Before
for agent in claude qwen grok kimi aider kilo crush; do

# After
for agent in claude qwen grok kimi aider goose iflow gemini codex; do

Verification:

$ team-run status
Agent: claude✓ qwen✓ grok✓ kimi✓ aider✓ goose✓ iflow✓ gemini✓ codex✓

Lessons Learned

1. Timeout Configurations Must Be Role-Aware

Different roles have distinct task characteristics:

  • Content-generation roles (writer/analyst): Require longer timeouts (10–15 minutes)
  • Code-review roles (reviewer/coder): Typically fast (3–5 minutes)
  • Test-execution roles (tester): Varies by test complexity (3–10 minutes)

2. Stale Cleanup Logic Must Account for Queuing Time

Current scheduler stale logic:

if (( now_epoch - started_epoch > local_timeout + 60 )); then
    # Mark as stale
fi

Problem: Only checks started_at, not whether the task is truly executing.

Improvement recommendations:

  • Distinguish between "queued" and "running" states
  • Only mark actively executing tasks as stale upon timeout
  • Never mark queued tasks as stale

3. Concurrency Limits Should Be Transparent to Users

Current max_concurrent = 3, but users are unaware of this constraint.

Improvement recommendations:

  • Display concurrency limit in team-run status output
  • Notify users when submitting tasks while the queue is full
  • Document concurrency limits and queuing behavior clearly in official docs

Best Practices for Batch Tasks

Recommended Approach: Concurrent Submission + Automatic Queuing

# Submit all batches at once; system handles queuing automatically
for i in {2..8}; do
    team-run writer "Batch $i: $(cat /mnt/shared-context/ai-team/attachments/batch-$i-products.csv)" --async
done

Advantages:

  • Simple and direct—submit all at once
  • System manages queue automatically
  • First 3 run immediately; remaining tasks auto-queue

Expected outcome (7 batches):

  • First 3 run immediately (due to concurrency limit)
  • Each task takes ~8–10 minutes
  • Total wall-clock time: ~30–40 minutes
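
As a rough cross-check, the number of waves follows from ceiling division of the task count by max_concurrent. The sketch assumes back-to-back waves with no scheduling gap, so it gives an idealized lower bound of 24–30 minutes; the ~30–40 minute figure above leaves room for queuing overhead. `waves` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# $1 = number of tasks, $2 = max_concurrent; prints the number of waves.
waves() {
    echo $(( ($1 + $2 - 1) / $2 ))   # ceiling division
}

n=$(waves 7 3)                       # 3 waves: 3 + 3 + 1 tasks
echo "waves: $n"
echo "lower bound: $(( n * 8 ))-$(( n * 10 )) minutes"   # 24-30 minutes
```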

Monitoring Task Progress

# View queue status
team-run status

# View specific task result
team-run result tr-modiqi-20260306-xxxxxx

# View today’s metrics
ssh ai-team
tail -20 ~/.cache/vm-watcher/ai-metrics.tsv | column -t -s$'\t'

Related Changes

  • Configuration file: ai-team:~/etc/ai-team-routing.json
  • Status script: ai-team:~/bin/ai-team-status.sh
  • Documentation update: ~/docs/ai-team/troubleshooting.md (added timeout troubleshooting section)

Summary

This investigation uncovered two independent issues:

  1. Cross-VM File Access (fixed in original post)

    • Symptom: Unable to access files
    • Root cause: Agents run on the ai-team VM and cannot access local files from other VMs
    • Solution: Use --attach flag or NAS-shared directories
  2. Insufficient role_timeout Configuration (fixed in this update)

    • Symptom: Tasks show timeout (stale)
    • Root cause: writer role timeout set to 120 seconds, while batch tasks require 5–8 minutes
    • Solution: Updated writer=900s, analyst=600s

Together, these two issues caused most of modiqi’s batch tasks to fail. Both are now resolved, and batch tasks can be processed successfully.