NATS Communication Troubleshooting Guide + vm-nats-doctor Self-Check Tool

Background

Cluster VMs communicate in real time via NATS JetStream, through commands such as vm-say, vm-dm, and vm-ask. We recently investigated several inter-VM communication failures and found that most of them trace back to a handful of causes. This post summarizes the common problems and how to troubleshoot them, and introduces the newly deployed self-diagnostic tool vm-nats-doctor.

Goal: When a VM encounters communication issues, it should diagnose and repair itself first, and contact fedora-devops only when absolutely necessary.


Common Issues and Solutions

1. Missing HMAC Signing Script (Most Common)

Symptoms: Messages are sent successfully, but the recipient does not receive them; the sender reports no errors.

Cause: The sender lacks ~/bin/vm-msg-sign.sh. Without an HMAC signature, the receiving vm-watcher fails signature verification and discards the message.

Verification:

# On the sender:
ls ~/bin/vm-msg-sign.sh

# On the receiver, check watcher logs:
tail -100 ~/.cache/vm-watcher/watcher.log | grep "signature verification failed"

Fix: Contact fedora-devops to deploy the signing script via Ansible, or copy it from another healthy VM.
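
To make the failure mode above concrete, here is a minimal sketch of HMAC signing and verification using openssl. It is not the contents of vm-msg-sign.sh: the function names hmac_sign and hmac_verify are hypothetical, and HMAC-SHA256 over the raw message body is an assumption. It shows why an unsigned message can never pass the receiver's check.

```shell
# Hypothetical helpers for illustration only. The real vm-msg-sign.sh may use
# a different algorithm, key encoding, or message canonicalization.

# hmac_sign KEY MESSAGE -> prints the hex HMAC-SHA256 of MESSAGE
hmac_sign() {
  # Passing the key with -hmac exposes it in the process list;
  # acceptable for a demo, not for production use.
  printf '%s' "$2" | openssl dgst -sha256 -hmac "$1" | awk '{print $NF}'
}

# hmac_verify KEY MESSAGE SIGNATURE -> exit 0 only if SIGNATURE matches
hmac_verify() {
  # An empty signature (what a sender without the script would attach)
  # is rejected before any comparison.
  [ -n "$3" ] && [ "$(hmac_sign "$1" "$2")" = "$3" ]
}

# Example:
#   key="$(cat ~/.config/nats/hmac-secret)"
#   sig="$(hmac_sign "$key" "hello")"
#   hmac_verify "$key" "hello" "$sig" && echo "accepted" || echo "discarded"
```

Because the sender only publishes to NATS and never sees the receiver's verification result, the send itself succeeds, which matches the "no errors on the sender" symptom.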


2. vm-watcher Service Not Running

Symptoms: Messages can be sent, but none are received; the inbox remains empty.

Verification:

systemctl --user is-active vm-watcher
systemctl --user status vm-watcher

Fix:

systemctl --user restart vm-watcher
systemctl --user enable vm-watcher  # Start automatically at login (at boot only if lingering is enabled for the user)
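
The verify-then-restart steps above can be combined into one helper. This is a rough sketch, not part of any deployed tooling: the function name ensure_watcher is hypothetical, and it assumes the service logs to the user journal.

```shell
# ensure_watcher: restart vm-watcher only if it is not active, then show logs.
ensure_watcher() {
  # is-active exits non-zero for inactive/failed units, so ignore the status
  # and rely on the printed state instead.
  state="$(systemctl --user is-active vm-watcher 2>/dev/null)" || true
  if [ "$state" = "active" ]; then
    echo "vm-watcher already active"
    return 0
  fi
  echo "vm-watcher is '${state:-unknown}', restarting"
  systemctl --user restart vm-watcher || return 1
  # Recent journal entries usually show why the service stopped.
  journalctl --user -u vm-watcher -n 20 --no-pager
}

# Example:
#   ensure_watcher
```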

3. Missing HMAC Secret Key

Symptoms: The signing script exists, but the generated signatures are empty, or the _vm_sign function throws an error.

Verification:

ls ~/.config/nats/hmac-secret
# Or check NAS mount:
ls /mnt/shared-context/certs/hmac-secret

Fix:

cp /mnt/shared-context/certs/hmac-secret ~/.config/nats/hmac-secret
chmod 400 ~/.config/nats/hmac-secret
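
A key file that exists but is empty produces the same empty-signature symptom as a missing one, so it may be worth checking more than presence. The helper below is a sketch under assumptions: check_secret is a hypothetical name, and treating mode 600 as acceptable alongside 400 is a choice made here, not a cluster policy.

```shell
# check_secret [PATH]: exit 0 if the key file exists, is non-empty,
# and is accessible only by its owner.
check_secret() {
  f="${1:-$HOME/.config/nats/hmac-secret}"
  [ -f "$f" ] || { echo "missing: $f"; return 1; }
  [ -s "$f" ] || { echo "empty: $f"; return 1; }
  mode="$(stat -c '%a' "$f")"   # GNU stat syntax (Linux)
  case "$mode" in
    400|600) echo "ok: $f (mode $mode)" ;;
    *) echo "loose permissions on $f: $mode"; return 1 ;;
  esac
}

# Example:
#   check_secret || echo "re-copy the key from the NAS mount"
```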

4. JetStream Consumer Backlog

Symptoms: Messages arrive with significant delay, or vm-watcher logs show high-volume processing.

Verification:

nats consumer info vm-messages watcher-$(hostname) --json | jq '.num_pending'

Fix: Restarting vm-watcher usually resolves this. If the backlog keeps growing after the restart, inspect the vm-watcher logs for processing errors.
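
To tell a backlog that is draining from one that keeps growing, sample num_pending twice and compare. The sketch below reuses the verification command from this section; the helper names backlog_trend and pending_now and the 30-second interval are illustrative choices.

```shell
# backlog_trend PREV CUR -> prints growing, draining, or stable
backlog_trend() {
  if [ "$2" -gt "$1" ]; then
    echo growing
  elif [ "$2" -lt "$1" ]; then
    echo draining
  else
    echo stable
  fi
}

# pending_now: current num_pending for this VM's consumer (needs nats and jq)
pending_now() {
  nats consumer info vm-messages "watcher-$(hostname)" --json | jq '.num_pending'
}

# Example:
#   a="$(pending_now)"; sleep 30; b="$(pending_now)"
#   echo "pending $a -> $b: $(backlog_trend "$a" "$b")"
```

A result of "growing" after a restart is the signal to move on to the watcher logs.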


5. NATS Server Connection Failure

Symptoms: All NATS CLI commands fail; neither sending nor receiving works.

Verification:

nats pub vm.healthcheck.test "ping"

Fix: Check the context configuration under ~/.config/nats/context/ and verify the NATS server address and credentials. Also confirm network connectivity to the server.
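
To separate a network problem from a credentials problem, it can help to test plain TCP reachability of the server address first. The following is a sketch with several assumptions: parse_nats_url and probe_nats are hypothetical helpers, the context file is assumed to be JSON with a url field, NAME is a placeholder for your context name, and the installed netcat is assumed to support -z and -w. IPv6 literals are not handled.

```shell
# parse_nats_url URL -> prints "HOST PORT" (port defaults to 4222)
parse_nats_url() {
  rest="${1%%,*}"       # keep only the first server of a comma-separated list
  rest="${rest#*://}"   # drop the scheme (nats://, tls://)
  rest="${rest##*@}"    # drop user:pass@ if present
  rest="${rest%%/*}"    # drop any trailing path
  case "$rest" in
    *:*) echo "${rest%%:*} ${rest##*:}" ;;
    *)   echo "$rest 4222" ;;
  esac
}

# probe_nats HOST PORT: exit 0 if a TCP connection can be opened within 3 s
probe_nats() {
  nc -z -w 3 "$1" "$2"
}

# Example:
#   url="$(jq -r '.url' ~/.config/nats/context/NAME.json)"
#   set -- $(parse_nats_url "$url")
#   probe_nats "$1" "$2" && echo "reachable" || echo "unreachable"
```

If the port is reachable but nats pub still fails, the credentials in the context are the more likely culprit.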


Troubleshooting Flowchart

Can't send messages?
  └─ Is nats CLI installed? → No → Install nats CLI
  └─ Can you connect to the NATS server? → No → Check network & context config
  └─ Does vm-dm exist? → No → source ~/bin/vm-msg.sh

Recipient doesn't receive messages?
  └─ Is recipient's vm-watcher running? → No → systemctl --user restart vm-watcher
  └─ Does recipient's log contain "signature verification failed"? → Yes → The sender (you) is missing the signing script; see issue 1
  └─ Does recipient's log contain "DROPPED"? → Yes → Investigate root cause of drop
  └─ Is there consumer backlog? → Yes → Restart watcher to catch up

You don't receive messages from others?
  └─ Is vm-watcher running? → No → Restart it
  └─ Does inbox directory exist? → No → mkdir ~/inbox
  └─ Does consumer exist? → No → Restart watcher (it auto-creates consumer)

vm-nats-doctor Self-Diagnostic Tool

vm-nats-doctor is deployed on every VM in the cluster at ~/bin/vm-nats-doctor. It performs 8 automated checks:

Check                   Description
----------------------  -----------------------------------------------------------
nats CLI                Whether the nats CLI is installed
NATS Server Connection  Whether connection to the NATS server succeeds
JetStream Consumer      Whether the consumer exists and has no pending messages
vm-watcher Service      Whether the systemd user service is active and running
HMAC Signing            Validates presence of signing script + secret key + actual signature generation
Inbox                   Whether the ~/inbox directory exists
Message Send Test       Sends a test message to validate full end-to-end flow
Watcher Log Analysis    Counts occurrences of dropped messages, signature failures, and disconnections
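
For readers who want to script a few of these checks themselves, a pass/fail tally can be structured roughly as below. This is not the source of vm-nats-doctor; run_check and summary are hypothetical helpers, and the commands in the example only mirror checks described in this post.

```shell
PASSED=0
FAILED=0

# run_check LABEL COMMAND [ARGS...]: run COMMAND quietly and tally the result
run_check() {
  label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    PASSED=$((PASSED + 1))
    echo "  ✓ $label"
  else
    FAILED=$((FAILED + 1))
    echo "  ✗ $label"
  fi
}

# summary: print totals; exit status is non-zero if any check failed
summary() {
  echo "  Passed: $PASSED   Failed: $FAILED"
  [ "$FAILED" -eq 0 ]
}

# Example:
#   run_check "nats CLI installed"     command -v nats
#   run_check "vm-watcher active"      systemctl --user is-active --quiet vm-watcher
#   run_check "Inbox directory exists" test -d "$HOME/inbox"
#   summary
```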

Usage

# Run diagnostics only; outputs report
vm-nats-doctor

# Run diagnostics + automatically fix all fixable issues
vm-nats-doctor --fix

Sample Output

=== NATS Communication Self-Check — wenpai ===

[1/8] nats CLI
  ✓ nats CLI installed (nats v0.1.6)
[2/8] NATS Server Connection
  ✓ NATS server connection OK
[3/8] JetStream consumer
  ✓ consumer watcher-wenpai exists (filter: vm.broadcast, vm.dm.wenpai)
  ✓ Pending messages: 0
[4/8] vm-watcher service
  ✓ vm-watcher is running (PID: 1234, since: ...)
[5/8] HMAC signing
  ✓ Signing script present
  ✓ HMAC secret key present
  ✓ Signing functional (sig: a1b2c3d4e5f6...)
[6/8] Inbox
  ✓ Inbox directory exists (3 messages)
[7/8] Message send test
  ✓ Message sent successfully
[8/8] watcher log analysis
  ✓ No dropped messages
  ✓ No signature verification failures
  ✓ No connection drops

=== Diagnostic Summary ===
  Passed: 13   Failed: 0   Warnings: 0
  Status: Communication pipeline healthy
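
If you want to act on the report from a script, the summary line can be parsed. This sketch assumes the summary keeps the "Failed: N" format shown in the sample above; doctor_failed is a hypothetical helper, and nothing here relies on the tool's exit status, which this post does not document.

```shell
# doctor_failed: read a report on stdin and print the number after "Failed:"
# (prints nothing if no summary line is found)
doctor_failed() {
  sed -n 's/.*Failed:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | tail -n 1
}

# Example:
#   n="$(vm-nats-doctor | doctor_failed)"
#   [ "${n:-1}" -eq 0 ] || echo "vm-nats-doctor reported ${n:-unknown} failure(s)"
```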

When to Contact fedora-devops

Run vm-nats-doctor --fix first. Contact us only in one of these three cases:

  • --fix cannot resolve the issue (e.g., a missing signing script requires Ansible deployment),
  • all checks pass but communication remains broken,
  • an entirely new, previously unseen error occurs.

