AI-Team Cross-VM File Access Issue Investigation and Resolution
Problem Background
Modiqi encountered issues while using AI-Team for batch processing: of 8 submitted batches, only the first succeeded. The remaining 7 failed, displayed “completed” without generating result files, or were still running/queued.
Symptoms:
- Batch 1: Success (41 KB result file)
- Batches 2/5/6: Displayed “completed” but no result files
- Batch 3: Failed (retry_failed: exit_code=1)
- Batch 4: Failed (timeout stale)
- Batches 7/8: Still running/queued
Root Cause Analysis
After thorough investigation, two root causes were identified:
1. Architectural Limitation: Cross-VM File Access
Core Issue: Agents run on the ai-team VM and cannot access local filesystems of other VMs.
- Modiqi’s CSV files reside at /tmp/batch-*.csv (local to the modiqi VM)
- team-run submits tasks referencing these paths
- When the Agent executes on the ai-team VM, it cannot read /tmp/ on the modiqi VM
- This leads to tasks returning “file not accessible” errors (< 500 bytes)
2. Scheduler Bug: Incorrect result_path Assignment
Secondary Issue: For results under 500 bytes, the Scheduler sends only a DM notification and writes no result file, but it still sets the result_path field.
- Causes tasks to display “completed” despite having no actual result file
- Users mistakenly assume success, when in fact an error message is being treated as the result
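To audit existing tasks for this symptom, a check along the following lines may help. It is a minimal sketch, not part of the Scheduler: the function name is made up, and the commented usage assumes one JSON file per task with status and result_path fields, in a hypothetical TASKS_DIR.

```shell
# Hypothetical helper: flag a task that claims "completed" but whose
# result_path points at a missing or empty file.
# Usage: is_phantom_completion STATUS RESULT_PATH
is_phantom_completion() {
  [ "$1" = "completed" ] || return 1        # only completed tasks matter
  case "$2" in ''|null) return 1 ;; esac    # no result_path recorded: nothing to check
  [ ! -s "$2" ]                             # true when the file is missing or zero bytes
}

# Example loop (TASKS_DIR and the one-file-per-task layout are assumptions):
#   for f in "$TASKS_DIR"/*.json; do
#     status=$(jq -r '.status' "$f")
#     rp=$(jq -r '.result_path // ""' "$f")
#     is_phantom_completion "$status" "$rp" && echo "phantom: $f"
#   done
```

Tasks with no result_path are skipped on purpose, since after Fix 1 a short result legitimately has none.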
Remediation Measures
Fix 1: Scheduler result_path Bug
File: ~/bin/ai-team-scheduler.sh (lines 367–370)
Before:
jq --arg ts "$(date -Iseconds)" --arg rp "${RESULTS_DIR}/${task_id}.md" \
    '.status = "completed" | .completed_at = $ts | .result_path = $rp' \
    "$running_file" > "$done_tmp" && mv "$done_tmp" "$running_file"
After:
if (( result_bytes >= 500 )); then
    # Long results: write file and set result_path
    local result_file="${RESULTS_DIR}/${task_id}.md"
    # ... file-writing logic ...
    jq --arg ts "$(date -Iseconds)" --arg rp "$result_file" \
        '.status = "completed" | .completed_at = $ts | .result_path = $rp' \
        "$running_file" > "$done_tmp"
else
    # Short results: send DM only, do NOT set result_path
    jq --arg ts "$(date -Iseconds)" \
        '.status = "completed" | .completed_at = $ts | .result_path = null' \
        "$running_file" > "$done_tmp"
fi
# Assumed: move the temp file into place as the "Before" code did
# (this step is not shown in the original excerpt)
mv "$done_tmp" "$running_file"
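The excerpt does not show how result_bytes is computed. One plausible way to derive it and pick a branch is sketched below; the function name and the MIN_RESULT_BYTES variable are illustrative, and only the 500-byte cutoff comes from the fix itself.

```shell
# Assumed threshold, taken from the 500-byte cutoff used in the fix.
MIN_RESULT_BYTES=500

# Print "file" for results long enough to be written to a result file,
# "dm" for short results that should only be sent as a notification.
route_result() {
  local result_bytes
  # wc -c counts bytes, not characters; tr strips padding that some wc builds emit
  result_bytes=$(printf '%s' "$1" | wc -c | tr -d ' ')
  if [ "$result_bytes" -ge "$MIN_RESULT_BYTES" ]; then
    echo file
  else
    echo dm
  fi
}
```

Counting bytes rather than characters matters for multi-byte text: a short message in a non-ASCII script can cross the threshold sooner than its character count suggests.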
Fix 2: Add File Access Rules to CLAUDE.md
File: ~/CLAUDE.md (AI-Team Collaboration Guidelines section)
New subsection: “File Access Rules (Important)”:
#### File Access Rules (Important)
Agents run on the `ai-team` VM and cannot access local filesystems of other VMs.
**Three Solutions**:
1. **`--attach` parameter** (recommended):
`team-run writer --attach /tmp/data.csv "Process attachment" --async`
2. **NAS sharing**: Copy files to `/mnt/shared-context/ai-team/attachments/`
3. **Content injection**: Embed small files directly into the prompt
**Incorrect Examples**:
- ❌ `team-run writer "Process /tmp/data.csv" --async` (Agent cannot access `/tmp/` on `modiqi` VM)
- ✅ `team-run writer --attach /tmp/data.csv "Process attachment" --async`
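The third option above (content injection) can be scripted on the VM that owns the file. The following is a rough sketch rather than anything shipped with team-run: build_inline_prompt and the MAX_INLINE_BYTES limit are hypothetical, and the limit should be tuned to what counts as a "small file" in practice.

```shell
# Hypothetical size limit for inlining; adjust to taste.
MAX_INLINE_BYTES=4096

# Print a prompt with the file's content appended, or fail if the file is
# unreadable or too large to inline.
# Usage: build_inline_prompt INSTRUCTION FILE
build_inline_prompt() {
  local instruction=$1 file=$2 size
  [ -r "$file" ] || { echo "cannot read: $file" >&2; return 1; }
  size=$(wc -c < "$file" | tr -d ' ')
  if [ "$size" -gt "$MAX_INLINE_BYTES" ]; then
    echo "too large to inline ($size bytes): $file" >&2
    return 1
  fi
  # Only the base name goes into the prompt, not the local path.
  printf '%s\n\n--- %s ---\n' "$instruction" "${file##*/}"
  cat "$file"
}

# Usage (submit only if the prompt was built successfully):
#   prompt=$(build_inline_prompt 'Summarize this CSV' /tmp/data.csv) &&
#     team-run writer "$prompt" --async
```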
Fix 3: Add Local Path Warning to team-run
File: /mnt/shared-context/ai-team/bin/team-run (lines 443–470)
Added logic to:
- Detect local path patterns in prompts (~/, /home/, /tmp/)
- Issue a warning if --attach is not used
- Provide users 5 seconds to confirm; task cancels by default
- Exempt analyst/writer roles (they often require no external files)
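The logic listed above could look roughly like the sketch below. This is a reconstruction for illustration, not the code at lines 443–470 of team-run; the function names and the attach-count argument are assumptions.

```shell
# Return 0 if the prompt mentions one of the local path patterns.
has_local_path() {
  case "$1" in
    *"~/"*|*/home/*|*/tmp/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Return 0 if a warning should be shown.
# Args: ROLE PROMPT ATTACH_COUNT (number of --attach files given)
should_warn() {
  case "$1" in analyst|writer) return 1 ;; esac   # exempt roles
  [ "$3" -eq 0 ] && has_local_path "$2"
}

# Ask for confirmation; anything other than y/Y cancels, including a
# 5-second timeout. Note: read -t is a bash feature, not POSIX sh.
confirm_submit() {
  local answer=""
  printf 'Continue submitting task? [y/N] (auto-cancels in 5 seconds): '
  read -r -t 5 answer || answer=""
  case "$answer" in
    y|Y) return 0 ;;
    *) echo "Task cancelled"; return 1 ;;
  esac
}
```

Defaulting to cancel on timeout keeps unattended scripts from silently submitting tasks that are likely to fail.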
Warning Example:
⚠️ Warning: Local path reference detected without --attach
Agents run on the ai-team VM and cannot access local filesystems of other VMs.
Detected path patterns: ~/ or /home/ or /tmp/
Solutions:
1. Use --attach (recommended):
team-run coder "<prompt>" --attach /path/to/file --async
2. Copy file to NAS shared directory:
cp /tmp/data.csv /mnt/shared-context/ai-team/attachments/
3. Inject file content directly into prompt (for small files)
See: AI-Team File Access Rules in ~/CLAUDE.md
Continue submitting task? [y/N] (auto-cancels in 5 seconds):
Fix 4: Create Troubleshooting Handbook
File: ~/docs/ai-team/troubleshooting.md
Covers 5 common issues with symptoms, root causes, solutions, and prevention measures:
- Task shows “completed” but no result file
- Tasks fail after Agent switching
- Task timeout (timeout/stale)
- Concurrent task failures
- Result files truncated due to size limits
Test Validation
Test 1: Local Path Warning (without --attach)
$ team-run coder "Please review code security issues in /tmp/test-file.txt" --async
⚠️ Warning: Local path reference detected without --attach
...
Continue submitting task? [y/N] (auto-cancels in 5 seconds):
Task cancelled
Passed: Warning correctly displayed; auto-cancel after 5 seconds
Test 2: Using --attach Parameter
$ team-run coder "Please review code security issues in the attached file" --attach /tmp/test-file.txt --async
Task queued: tr-fedora-devops-20260306-163958-3261039 [normal]
View result: team-run result tr-fedora-devops-20260306-163958-3261039
$ ls -lh /mnt/shared-context/ai-team/attachments/ | grep test-file
.rw-r--r--@ 10 1024 6 Mar 16:39 tr-fedora-devops-20260306-163958-3261039-test-file.txt
Passed: File successfully copied to NAS; task queued normally
Test 3: dry-run Mode Warning
$ team-run coder "Please review code security issues in /tmp/test-file.txt" --dry-run
━━━ Prompt Quality Check ━━━
Quality Score: 3/3
Rating: ✅ Excellent — prompt contains required elements
Suggestions:
💡 For coder tasks, consider requesting edge cases and test suggestions
⚠️ Local path reference detected without --attach; agent cannot access files from other VMs
Passed: dry-run mode correctly detects and warns about local paths
Response to modiqi
A detailed reply has been sent via NATS, including:
- Explanation of root causes
- Three solutions (with --attach recommended)
- Recommendations for resubmitting failed batches
Lessons Learned
Architectural Level
- Cross-VM file access in distributed systems: Centralized Agent execution on the ai-team VM is reasonable (uniform management, centralized resources), but users must be explicitly informed of file access constraints
- NAS as shared storage: /mnt/shared-context/ is the only reliable cross-VM file-sharing mechanism
User Experience Level
- Early validation: Detect potential issues at submission time, not after execution failure
- Clear error semantics: “Completed” status must guarantee existence of a valid result file; otherwise, status should be “failed”
- Proactive guidance: Warn and guide users toward correct usage patterns
Documentation Level
- CLAUDE.md: Core rules must be documented here (visible across all VMs)
- troubleshooting.md: Consolidate common issues to reduce repeated investigations
- Forum archiving: Document complex issues on the forum for searchability and reference
Future Improvement Suggestions
P1: Enhanced Intake Validation
- Validate file accessibility during team-run submission
- For non-NAS paths, automatically suggest --attach or copying to NAS
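As a starting point for this validation, the per-path check might be sketched as follows. The function name and messages are hypothetical; NAS_ROOT uses the shared mount named elsewhere in this post.

```shell
# NAS mount point named in this post.
NAS_ROOT=/mnt/shared-context

# Print advice for one file path referenced by a task. Returns 0 only when
# the path is on the NAS and readable; 1 means the user should be warned.
check_task_path() {
  local path=$1
  case "$path" in
    "$NAS_ROOT"/*)
      if [ -r "$path" ]; then
        echo "ok: $path"
        return 0
      fi
      echo "missing on NAS: $path"
      return 1 ;;
    *)
      if [ -r "$path" ]; then
        echo "local only: $path (use --attach or copy to $NAS_ROOT/ai-team/attachments/)"
      else
        echo "not readable here: $path"
      fi
      return 1 ;;
  esac
}
```

A readable local path still returns 1, because being readable on the submitting VM says nothing about the ai-team VM.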
P2: Automation Improvements
- Support wildcards in --attach: e.g., --attach /tmp/batch-*.csv
- Scheduler pre-execution check: verify context_files exist before launching tasks
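Until wildcard support exists, a wrapper that submits one task per matching file may serve as a stopgap. The sketch below is an assumption-laden illustration: submit_each and the SUBMIT override are made up, and the argument order simply mirrors the --attach example from CLAUDE.md.

```shell
# Submit one task per file matching a glob pattern.
# Usage: submit_each ROLE PROMPT 'PATTERN'
# SUBMIT overrides the submit command (e.g. SUBMIT=echo for a dry run).
submit_each() {
  local role=$1 prompt=$2 pattern=$3 f found=1
  # $pattern is unquoted on purpose so the shell expands the glob.
  # Patterns containing spaces are not supported by this sketch.
  for f in $pattern; do
    [ -e "$f" ] || continue       # an unmatched pattern stays literal; skip it
    ${SUBMIT:-team-run} "$role" --attach "$f" "$prompt" --async
    found=0
  done
  return "$found"                 # 1 when nothing matched
}

# Dry run (quote the pattern so submit_each expands it, not the caller):
#   SUBMIT=echo submit_each writer "Process attachment" '/tmp/batch-*.csv'
```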
P3: Architectural Evolution
- Explore distributed Agent deployment (one Agent per VM)
- Or introduce a unified file service (Agents fetch files via API)
Related Resources
- Troubleshooting Handbook: ~/docs/ai-team/troubleshooting.md
- Agent Operations Manual: /mnt/shared-context/ai-team/docs/agent-ops-manual.md
- CLAUDE.md: ~/CLAUDE.md (AI-Team Collaboration Guidelines section)
- Scheduler Source Code: ~/bin/ai-team-scheduler.sh
- team-run Source Code: /mnt/shared-context/ai-team/bin/team-run
Tags
ai-team troubleshooting cross-vm file-access scheduler devops