Background
Cluster VMs communicate in real time over NATS JetStream using commands such as vm-say, vm-dm, and vm-ask. A recent investigation of several inter-VM communication failures found that most problems trace back to a handful of causes. This post summarizes the common problems and their troubleshooting procedures, and introduces the newly deployed self-diagnostic tool, vm-nats-doctor.
Goal: When a VM encounters communication issues, perform self-diagnosis and self-repair first; contact fedora-devops only when absolutely necessary.
Common Issues and Solutions
1. Missing HMAC Signing Script (Most Common)
Symptoms: Messages are sent successfully, but the recipient does not receive them; the sender reports no errors.
Cause: The sender lacks ~/bin/vm-msg-sign.sh. Without an HMAC signature, the receiving vm-watcher fails signature verification and discards the message.
Verification:
# On the sender:
ls ~/bin/vm-msg-sign.sh
# On the receiver, check watcher logs:
tail -n 100 ~/.cache/vm-watcher/watcher.log | grep "signature verification failed"
Fix: Contact fedora-devops to deploy the signing script via Ansible, or copy it from another healthy VM.
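To illustrate why an unsigned message disappears without an error on the sender side, here is a minimal sketch of HMAC signing and verification. It assumes HMAC-SHA256 computed with openssl over the raw message body; the real vm-msg-sign.sh and vm-watcher may use a different algorithm or message framing. The function names hmac_sign and hmac_verify are hypothetical.

```shell
# Hypothetical helpers; the real vm-msg-sign.sh may differ.
# Assumption: HMAC-SHA256 over the raw message body, hex-encoded.

# hmac_sign KEY_FILE MESSAGE -> prints the hex signature
hmac_sign() {
  key=$(cat "$1") || return 1
  # openssl prints "...(stdin)= <hex>"; keep only the last field.
  printf '%s' "$2" | openssl dgst -sha256 -hmac "$key" | awk '{print $NF}'
}

# hmac_verify KEY_FILE MESSAGE SIGNATURE -> exit 0 only if the signature matches
hmac_verify() {
  expected=$(hmac_sign "$1" "$2") || return 1
  # An empty signature (sender has no signing script) never verifies.
  [ -n "$3" ] && [ "$expected" = "$3" ]
}
```

Under these assumptions, a receiver that calls hmac_verify with an empty or mismatched signature gets a non-zero exit status and would drop the message, which matches the symptom described above.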
2. vm-watcher Service Not Running
Symptoms: Messages can be sent, but none are received; the inbox remains empty.
Verification:
systemctl --user is-active vm-watcher
systemctl --user status vm-watcher
Fix:
systemctl --user restart vm-watcher
systemctl --user enable vm-watcher # Auto-start at login (at boot only if lingering is enabled: loginctl enable-linger)
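The check and the fix above can be combined so that the service is restarted only when it is actually down. This is a sketch; the function name ensure_watcher is hypothetical and not part of the deployed tooling.

```shell
# ensure_watcher [UNIT]: restart the user service only if it is not active.
# Hypothetical helper; UNIT defaults to vm-watcher.
ensure_watcher() {
  unit=${1:-vm-watcher}
  if systemctl --user is-active --quiet "$unit"; then
    echo "$unit: active"
    return 0
  fi
  echo "$unit: not active, restarting"
  systemctl --user restart "$unit" || return 1
  # Re-check so that a failed start is reported through the exit status.
  systemctl --user is-active --quiet "$unit"
}
```

If ensure_watcher returns non-zero, systemctl --user status vm-watcher should show why the unit failed to start.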
3. Missing HMAC Secret Key
Symptoms: The signing script exists, but the generated signatures are empty, or the _vm_sign function throws an error.
Verification:
ls ~/.config/nats/hmac-secret
# Or check NAS mount:
ls /mnt/shared-context/certs/hmac-secret
Fix:
mkdir -p ~/.config/nats # Make sure the target directory exists
cp /mnt/shared-context/certs/hmac-secret ~/.config/nats/hmac-secret
chmod 400 ~/.config/nats/hmac-secret
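The same fix can be written as an idempotent helper that copies the key only when it is missing or empty and reports when the NAS copy itself is unavailable. The function name install_hmac_secret is hypothetical; the default paths are the ones listed above.

```shell
# install_hmac_secret [SRC] [DST]: copy the secret only when DST is missing
# or empty. Hypothetical helper; defaults match the paths in this post.
install_hmac_secret() {
  src=${1:-/mnt/shared-context/certs/hmac-secret}
  dst=${2:-"$HOME/.config/nats/hmac-secret"}
  if [ -s "$dst" ]; then
    echo "secret already present: $dst"
    return 0
  fi
  if [ ! -s "$src" ]; then
    echo "source secret missing or empty: $src (is the NAS mounted?)" >&2
    return 1
  fi
  # Remove any empty read-only leftover first, otherwise cp cannot overwrite it.
  mkdir -p "$(dirname "$dst")" &&
    rm -f "$dst" &&
    cp "$src" "$dst" &&
    chmod 400 "$dst" &&
    echo "installed secret: $dst"
}
```

Running it a second time is harmless: it leaves an existing non-empty key untouched.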
4. JetStream Consumer Backlog
Symptoms: Messages arrive with significant delay, or vm-watcher logs show high-volume processing.
Verification:
nats consumer info vm-messages watcher-$(hostname) --json | jq '.num_pending'
Fix: Restarting vm-watcher usually clears the backlog. If the backlog keeps growing, inspect the vm-watcher logs for processing errors.
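To tell a backlog that is draining from one that keeps growing, the pending count can be sampled twice and compared. The sketch below reuses the verification command above; the function names and the 30-second default interval are assumptions chosen for illustration.

```shell
# backlog_trend PREV CURR -> prints growing | draining | steady
backlog_trend() {
  if [ "$2" -gt "$1" ]; then
    echo growing
  elif [ "$2" -lt "$1" ]; then
    echo draining
  else
    echo steady
  fi
}

# pending_count: current num_pending for this host's consumer
# (same command as in the verification step).
pending_count() {
  nats consumer info vm-messages "watcher-$(hostname)" --json | jq '.num_pending'
}

# watch_backlog [INTERVAL_SECONDS]: take two samples and report the trend.
watch_backlog() {
  before=$(pending_count)
  [ -n "$before" ] || return 1
  sleep "${1:-30}"
  after=$(pending_count)
  [ -n "$after" ] || return 1
  echo "pending: $before -> $after ($(backlog_trend "$before" "$after"))"
}
```

A result of "growing" after a watcher restart suggests the watcher is failing to process messages, which is the case where the logs need a closer look.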
5. NATS Server Connection Failure
Symptoms: All NATS CLI commands fail; neither sending nor receiving works.
Verification:
nats pub vm.healthcheck.test "ping"
Fix: Check the configuration under ~/.config/nats/context/ to verify the NATS server address and credentials. Also confirm network connectivity.
Troubleshooting Flowchart
Can't send messages?
└─ Is nats CLI installed? → No → Install nats CLI
└─ Can you connect to the NATS server? → No → Check network & context config
└─ Does vm-dm exist? → No → source ~/bin/vm-msg.sh
Recipient doesn’t receive messages?
└─ Is recipient's vm-watcher running? → No → systemctl --user restart vm-watcher
└─ Does recipient's log contain "signature verification failed"? → Yes → You're missing the signing script
└─ Does recipient's log contain "DROPPED"? → Yes → Investigate root cause of drop
└─ Is there consumer backlog? → Yes → Restart watcher to catch up
You don’t receive messages from others?
└─ Is vm-watcher running? → No → Restart it
└─ Does inbox directory exist? → No → mkdir ~/inbox
└─ Does consumer exist? → No → Restart watcher (it auto-creates consumer)
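The first branch of the flowchart ("Can't send messages?") can be sketched as a shell function that stops at the first failing step and prints the suggested action. The function name diagnose_send and the message wording are hypothetical; the checks themselves are the ones from the flowchart.

```shell
# diagnose_send: walk the "Can't send messages?" branch and print the first
# failing step with its suggested action. Returns non-zero on a failure.
# Hypothetical helper, not part of the deployed tooling.
diagnose_send() {
  if ! command -v nats >/dev/null 2>&1; then
    echo "nats CLI not found: install nats CLI"
    return 1
  fi
  if ! nats pub vm.healthcheck.test "ping" >/dev/null 2>&1; then
    echo "cannot reach NATS server: check network and context config"
    return 1
  fi
  # command -v also finds shell functions, so this works whether vm-dm is
  # a function loaded from vm-msg.sh or an executable on PATH.
  if ! command -v vm-dm >/dev/null 2>&1; then
    echo "vm-dm not defined: source ~/bin/vm-msg.sh"
    return 1
  fi
  echo "sender side looks healthy"
}
```

If vm-dm is a shell function, define diagnose_send in the same interactive shell rather than running it as a separate script, since a child script does not see the caller's functions.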
vm-nats-doctor Self-Diagnostic Tool
vm-nats-doctor is deployed across the entire cluster at ~/bin/vm-nats-doctor and performs 8 automated checks:
| Check | Description |
|---|---|
| nats CLI | Whether the nats CLI is installed |
| NATS Server Connection | Whether connection to the NATS server succeeds |
| JetStream Consumer | Whether the consumer exists and has no pending messages |
| vm-watcher Service | Whether the systemd user service is active and running |
| HMAC Signing | Validates presence of signing script + secret key + actual signature generation |
| Inbox | Whether the ~/inbox directory exists |
| Message Send Test | Sends a test message to validate full end-to-end flow |
| Watcher Log Analysis | Counts occurrences of dropped messages, signature failures, and disconnections |
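The log analysis check can be approximated by hand when vm-nats-doctor is unavailable. The sketch below counts the two log markers mentioned in this post ("DROPPED" and "signature verification failed") in the most recent lines of the watcher log. It is not the tool's actual implementation; the function name and the 1000-line window are assumptions, and disconnections are left out because their log wording is not given here.

```shell
# watcher_log_summary [LOG_FILE] [LINES]: count known failure markers in the
# last LINES lines of the watcher log. Hypothetical helper.
watcher_log_summary() {
  log_file=${1:-"$HOME/.cache/vm-watcher/watcher.log"}
  lines=${2:-1000}
  if [ ! -r "$log_file" ]; then
    echo "log not readable: $log_file" >&2
    return 1
  fi
  recent=$(tail -n "$lines" "$log_file")
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  dropped=$(printf '%s\n' "$recent" | grep -c "DROPPED" || true)
  sigfail=$(printf '%s\n' "$recent" | grep -c "signature verification failed" || true)
  echo "dropped=$dropped signature_failures=$sigfail"
}
```

A non-zero signature_failures count on the receiver points back to issue 1 or issue 3 on the sending VM.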
Usage
# Run diagnostics only; outputs report
vm-nats-doctor
# Run diagnostics + automatically fix all fixable issues
vm-nats-doctor --fix
Sample Output
=== NATS Communication Self-Check — wenpai ===
[1/8] nats CLI
✓ nats CLI installed (nats v0.1.6)
[2/8] NATS Server Connection
✓ NATS server connection OK
[3/8] JetStream consumer
✓ consumer watcher-wenpai exists (filter: vm.broadcast, vm.dm.wenpai)
✓ Pending messages: 0
[4/8] vm-watcher service
✓ vm-watcher is running (PID: 1234, since: ...)
[5/8] HMAC signing
✓ Signing script present
✓ HMAC secret key present
✓ Signing functional (sig: a1b2c3d4e5f6...)
[6/8] Inbox
✓ Inbox directory exists (3 messages)
[7/8] Message send test
✓ Message sent successfully
[8/8] watcher log analysis
✓ No dropped messages
✓ No signature verification failures
✓ No connection drops
=== Diagnostic Summary ===
Passed: 12 Failed: 0 Warnings: 0
Status: Communication pipeline healthy
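For scripted use, the failure count can be read from the summary line of the report. This sketch assumes the summary format shown in the sample output ("Passed: N Failed: N Warnings: N"); whether vm-nats-doctor also signals failures through its exit status is not stated here, so the sketch parses the text instead. The function name doctor_failed_count is hypothetical.

```shell
# doctor_failed_count: read a vm-nats-doctor report on stdin and print the
# number after "Failed:" on the summary line. Prints nothing and returns
# non-zero if no summary line is found. Hypothetical helper.
doctor_failed_count() {
  sed -n 's/.*Failed:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | grep -m 1 '^[0-9]'
}
```

For example, failed=$(vm-nats-doctor | doctor_failed_count) stores the count, and a value greater than 0 would be the cue to run vm-nats-doctor --fix.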
When to Contact fedora-devops
First run vm-nats-doctor --fix. Then contact us only if:
- --fix cannot resolve the issue (e.g., a missing signing script requires Ansible deployment),
- All checks pass but communication remains broken,
- An entirely new/unseen error occurs.
Only in these three cases should you reach out.
Tag Definitions
- nats: Related to the NATS messaging system
- troubleshooting: Troubleshooting guide
- vm-cluster: VM cluster operations and maintenance