Agent Monitoring Guide¶

This comprehensive guide covers monitoring distributed agents in KrakenHashes, including real-time metrics, health checks, performance monitoring, and troubleshooting strategies for administrators managing agent fleets.

Table of Contents¶

Overview
Real-time Agent Status
Heartbeat and Connection Monitoring
Metrics Collection and Monitoring
Device Performance Monitoring
Agent Health Indicators
Multi-Agent Fleet Monitoring
Status Indicators and States
Performance Analysis and Trends
Troubleshooting with Monitoring Data
Best Practices

Overview¶

KrakenHashes provides comprehensive monitoring capabilities for distributed agents, enabling administrators to track agent health, performance, and operational status across their entire fleet. The monitoring system includes real-time metrics collection, WebSocket-based status reporting, and detailed performance analytics.

Key Monitoring Features¶

Real-time Agent Status: Live connection status and heartbeat monitoring
Device Metrics: GPU temperature, utilization, fan speed, and hash rate tracking
WebSocket Communication: Persistent connections with automatic reconnection
Performance Analytics: Historical data and trend analysis
Multi-Agent Dashboard: Fleet-wide visibility and management
Automated Health Checks: Connection validation and failure detection

Real-time Agent Status¶

Agent Status Overview¶

The system continuously monitors agent status through multiple channels:

Active Agents: Online and ready for work
├── Connected: WebSocket connection established
├── Idle: Available for task assignment
├── Busy: Currently executing tasks
└── Reconnecting: Temporary disconnection with recovery in progress

Inactive Agents: Not currently operational
├── Offline: No recent heartbeat or connection
├── Disabled: Administratively disabled
├── Error: Failed connection or system error
└── Pending: Newly registered, awaiting first connection

Agent Connection States¶

The monitoring system tracks detailed connection states:

State	Description	Color Indicator	Action Required
`active`	Connected and operational	🟢 Green	None
`inactive`	Disconnected or offline	🔴 Red	Check agent service
`pending`	Registration in progress	🟡 Yellow	Wait for completion
`disabled`	Administratively disabled	⚫ Gray	Manual re-enable
`error`	System or hardware error	🔴 Red	Investigate error

Agent Dashboard View¶

Access the agent monitoring dashboard at: - Frontend: https://your-server:31337/agents - Agent Details: https://your-server:31337/agents/{agent_id}

The dashboard provides: - Real-time connection status - Last activity timestamps
- Device configuration and status - Performance metrics and charts - Task assignment history

Heartbeat and Connection Monitoring¶

WebSocket Heartbeat System¶

KrakenHashes uses a robust WebSocket-based heartbeat system for monitoring agent connectivity:

Backend ←→ Agent WebSocket Connection
├── Ping/Pong Messages: Every 54 seconds (configurable)
├── Agent Status Updates: Every 60 seconds
├── Heartbeat Timeout: 60 seconds maximum
└── Automatic Reconnection: Exponential backoff (1s to 30s)

Connection Timing Configuration¶

The system uses configurable timing parameters:

# Backend WebSocket Settings (environment variables)
KH_WRITE_WAIT=10s      # Write operation timeout
KH_PONG_WAIT=60s       # Pong response timeout  
KH_PING_PERIOD=54s     # Ping interval

# Agent automatically fetches these from backend
# No manual configuration required

Heartbeat Monitoring Queries¶

Monitor agent heartbeat status using database queries:

-- Agents with recent heartbeats (last 5 minutes)
SELECT 
    id,
    name,
    status,
    last_heartbeat,
    EXTRACT(EPOCH FROM (NOW() - last_heartbeat)) AS seconds_since_heartbeat
FROM agents 
WHERE last_heartbeat > NOW() - INTERVAL '5 minutes'
ORDER BY last_heartbeat DESC;

-- Agents with stale heartbeats (potential issues)
SELECT 
    id,
    name, 
    status,
    last_heartbeat,
    EXTRACT(EPOCH FROM (NOW() - last_heartbeat)) AS seconds_since_heartbeat
FROM agents
WHERE last_heartbeat < NOW() - INTERVAL '5 minutes'
  AND status = 'active'
ORDER BY last_heartbeat ASC;

-- Connection status distribution
SELECT status, COUNT(*) as count
FROM agents
GROUP BY status
ORDER BY count DESC;

Metrics Collection and Monitoring¶

System Metrics Collection¶

Agents automatically collect and report system metrics:

CPU and Memory Metrics¶

CPU Usage: Overall processor utilization percentage
Memory Usage: System memory utilization percentage
Collection Interval: 5 seconds (configurable)
Data Retention: Based on monitoring settings

Agent Metrics Structure¶

type MetricsData struct {
    AgentID     int                `json:"agent_id"`
    CollectedAt time.Time          `json:"collected_at"`
    CPUs        []CPUMetrics       `json:"cpus"`
    GPUs        []GPUMetrics       `json:"gpus"`
    Memory      MemoryMetrics      `json:"memory"`
    Disk        []DiskMetrics      `json:"disk"`
    Network     []NetworkMetrics   `json:"network"`
    Process     []ProcessMetrics   `json:"process"`
}

GPU Metrics Integration¶

GPU metrics are obtained from hashcat's JSON status output during job execution:

GPU Utilization: Device usage percentage
GPU Temperature: Operating temperature in Celsius
GPU Memory: Memory utilization
Power Usage: Power consumption in watts
Hash Rate: Real-time hashing performance

Metrics Storage and Retention¶

The system implements a cascading retention policy:

Real-time Metrics (Fine-grained)
├── Retention: 7 days (configurable)  
├── Resolution: 5-second intervals
└── Use: Recent activity monitoring

Daily Aggregates (Medium-term)
├── Retention: 30 days (configurable)
├── Resolution: Daily summaries
└── Use: Performance analysis

Weekly Aggregates (Long-term)  
├── Retention: 365 days (configurable)
├── Resolution: Weekly summaries
└── Use: Historical trends

Monitoring Settings Configuration¶

Configure metrics retention through the admin interface:

# Access monitoring settings
https://your-server:31337/admin/monitoring

# Available settings:
- Real-time Data Retention: 7 days
- Daily Aggregates Retention: 30 days  
- Weekly Aggregates Retention: 365 days
- Enable Aggregation: true
- Aggregation Interval: daily

Device Performance Monitoring¶

Real-time Device Metrics¶

The Agent Details page provides comprehensive device monitoring with live charts:

Available Device Charts¶

Temperature Monitoring
Real-time GPU temperature tracking
Temperature threshold alerts
Historical temperature trends
Multi-device comparison
Utilization Tracking
GPU utilization percentage
Device workload distribution
Efficiency analysis
Performance optimization insights
Fan Speed Monitoring
Cooling system performance
Fan curve analysis
Thermal management tracking
Hardware health indicators
Hash Rate Performance
Real-time hashing performance
Per-device contribution
Cumulative hash rate
Performance benchmarking

Chart Configuration Options¶

// Time Range Options
const timeRanges = [
    '10m',  // 10 minutes
    '20m',  // 20 minutes  
    '1h',   // 1 hour
    '5h',   // 5 hours
    '24h'   // 24 hours
];

// Metric Types
const metricTypes = [
    'temperature',  // GPU temperature in °C
    'utilization',  // GPU utilization %
    'fanspeed',     // Fan speed %
    'hashrate'      // Hash rate (varies by algorithm)
];

Device Metrics API Endpoints¶

# Get device metrics for agent
GET /api/agents/{agent_id}/metrics?timeRange=1h&metrics=temperature,utilization,fanspeed,hashrate

# Response format
{
    "devices": [
        {
            "deviceId": 0,
            "deviceName": "NVIDIA RTX 4090",
            "metrics": {
                "temperature": [
                    {"timestamp": 1640995200000, "value": 65.0},
                    {"timestamp": 1640995205000, "value": 67.2}
                ],
                "utilization": [
                    {"timestamp": 1640995200000, "value": 98.5},
                    {"timestamp": 1640995205000, "value": 99.1}
                ]
            }
        }
    ]
}

Device Enable/Disable Monitoring¶

Monitor device status changes through the interface:

# Update device status
PUT /api/agents/{agent_id}/devices/{device_id}
{
    "enabled": true
}

# Monitor device state changes in logs
grep -i "device.*update.*enabled" logs/backend/*.log

Agent Health Indicators¶

Connection Health Metrics¶

Monitor agent connection health through multiple indicators:

Connection Status Indicators¶

WebSocket State: Connected/Disconnected
Last Heartbeat: Timestamp of last communication
Response Time: WebSocket ping-pong latency
Reconnection Attempts: Failed connection retry count

System Health Indicators¶

CPU Load: System processor utilization
Memory Usage: Available system memory
Disk Space: Storage availability
Network Latency: Communication delays

Hardware Health Indicators¶

GPU Temperature: Thermal status and limits
GPU Utilization: Device workload efficiency
Power Consumption: Electrical usage monitoring
Fan Performance: Cooling system operation

Agent Status Reporting¶

Agents automatically report detailed status information:

{
    "status": "active",
    "version": "v0.15.7",  
    "updated_at": "2025-09-11T10:30:00Z",
    "environment": {
        "os": "linux",
        "arch": "amd64",
        "hostname": "worker-01"
    },
    "os_info": {
        "platform": "linux",
        "hostname": "worker-01", 
        "os_name": "Ubuntu",
        "os_version": "22.04.3 LTS",
        "kernel_version": "Linux version 6.5.0",
        "go_version": "go1.21.0"
    }
}

Health Check Queries¶

-- Agent health summary
SELECT 
    a.name,
    a.status,
    a.last_heartbeat,
    a.version,
    CASE 
        WHEN a.last_heartbeat > NOW() - INTERVAL '2 minutes' THEN 'Healthy'
        WHEN a.last_heartbeat > NOW() - INTERVAL '5 minutes' THEN 'Warning'
        ELSE 'Critical'
    END as health_status,
    COUNT(ad.id) as device_count,
    SUM(CASE WHEN ad.enabled THEN 1 ELSE 0 END) as enabled_devices
FROM agents a
LEFT JOIN agent_devices ad ON a.id = ad.agent_id
GROUP BY a.id, a.name, a.status, a.last_heartbeat, a.version
ORDER BY a.last_heartbeat DESC;

-- Agents with hardware issues
SELECT 
    a.name,
    ad.device_name,
    apm.metric_type,
    apm.value,
    apm.timestamp
FROM agents a
JOIN agent_devices ad ON a.id = ad.agent_id
JOIN agent_performance_metrics apm ON ad.agent_id = apm.agent_id
WHERE (apm.metric_type = 'temperature' AND apm.value > 85)
   OR (apm.metric_type = 'utilization' AND apm.value < 50)
ORDER BY apm.timestamp DESC;

Multi-Agent Fleet Monitoring¶

Fleet Overview Dashboard¶

Monitor your entire agent fleet from the main agents page:

# Access fleet monitoring
https://your-server:31337/agents

# Key fleet metrics:
- Total Agents: Active + Inactive count
- Agent Distribution: By status and location  
- Hardware Summary: Total GPU count and types
- Performance Metrics: Combined hash rates
- Health Status: Overall fleet health

Fleet Status Categories¶

Active Agents¶

Online and Ready: Available for job assignment
Busy: Currently executing tasks
Idle: Connected but not actively working

Inactive Agents¶

Offline: No recent heartbeat
Disabled: Administratively disabled
Error State: Hardware or connection issues

Pending Agents¶

Registering: New agent setup in progress
Authenticating: Certificate and API key validation

Fleet-wide Monitoring Queries¶

-- Fleet status summary
SELECT 
    status,
    COUNT(*) as agent_count,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
FROM agents
GROUP BY status
ORDER BY agent_count DESC;

-- Fleet hardware summary
SELECT 
    ad.device_type,
    COUNT(DISTINCT a.id) as agents_with_device,
    COUNT(ad.id) as total_devices,
    SUM(CASE WHEN ad.enabled THEN 1 ELSE 0 END) as enabled_devices
FROM agents a
JOIN agent_devices ad ON a.id = ad.agent_id
GROUP BY ad.device_type
ORDER BY total_devices DESC;

-- Fleet performance overview
SELECT 
    COUNT(DISTINCT a.id) as total_agents,
    COUNT(DISTINCT CASE WHEN a.status = 'active' THEN a.id END) as active_agents,
    AVG(CASE WHEN apm.metric_type = 'utilization' THEN apm.value END) as avg_gpu_utilization,
    AVG(CASE WHEN apm.metric_type = 'temperature' THEN apm.value END) as avg_gpu_temperature
FROM agents a
LEFT JOIN agent_performance_metrics apm ON a.id = apm.agent_id
WHERE apm.timestamp > NOW() - INTERVAL '1 hour';

Geographic and Organizational Fleet Monitoring¶

-- Agents by network location (using IP metadata)
SELECT 
    SUBSTRING(metadata->>'ipAddress', 1, 
              POSITION('.' IN metadata->>'ipAddress' || '.')) as network_prefix,
    COUNT(*) as agent_count,
    COUNT(CASE WHEN status = 'active' THEN 1 END) as active_count
FROM agents
WHERE metadata ? 'ipAddress'
GROUP BY network_prefix
ORDER BY agent_count DESC;

-- Agents by owner (team/user assignments)
SELECT 
    COALESCE(u.username, 'Unassigned') as owner,
    COUNT(a.id) as agent_count,
    COUNT(CASE WHEN a.status = 'active' THEN 1 END) as active_count
FROM agents a
LEFT JOIN users u ON a.owner_id = u.id  
GROUP BY u.username
ORDER BY agent_count DESC;

Status Indicators and States¶

Agent Status State Machine¶

[pending] → [active] → [inactive]
    ↓         ↓           ↓
[error]   [busy/idle]  [disabled]
    ↓         ↓           ↓
[active]  [active]   [active]

Detailed Status Descriptions¶

Status	Description	Typical Causes	Recovery Actions
`pending`	New agent registration	First-time setup	Wait for completion
`active`	Fully operational	Normal state	None required
`busy`	Executing tasks	Job assignment	Monitor progress
`idle`	Connected, available	Between jobs	None required
`inactive`	Disconnected	Network/service issues	Restart agent service
`disabled`	Manually disabled	Admin action	Re-enable through UI
`error`	System/hardware error	Hardware failure, config error	Check logs, fix issues

Status Transition Triggers¶

Automatic Transitions¶

pending → active: Successful device detection and registration
active → inactive: Heartbeat timeout (>5 minutes)
active → error: Device detection failure or system error
error → active: Successful reconnection after error resolution

Manual Transitions¶

Any → disabled: Administrative disable action
disabled → active: Administrative enable action
Any → active: Force status change through API

Visual Status Indicators¶

The web interface uses consistent visual indicators:

/* Status indicator colors */
.status-active     { color: #4caf50; }  /* Green */
.status-inactive   { color: #f44336; }  /* Red */
.status-pending    { color: #ff9800; }  /* Orange */
.status-disabled   { color: #9e9e9e; }  /* Gray */
.status-error      { color: #f44336; }  /* Red */

Performance Analysis and Trends¶

Historical Performance Tracking¶

The system maintains comprehensive historical performance data for trend analysis:

Performance Metrics Database Schema¶

-- Agent performance metrics table structure
CREATE TABLE agent_performance_metrics (
    id SERIAL PRIMARY KEY,
    agent_id INTEGER REFERENCES agents(id),
    device_name VARCHAR(255),
    metric_type VARCHAR(50),  -- 'temperature', 'utilization', 'fanspeed', 'hashrate'
    value NUMERIC(10,2),
    timestamp TIMESTAMP DEFAULT NOW()
);

-- Benchmark results tracking
CREATE TABLE agent_benchmarks (
    id SERIAL PRIMARY KEY,
    agent_id INTEGER REFERENCES agents(id),
    attack_mode INTEGER,
    hash_type INTEGER, 
    speed BIGINT,
    device_speeds JSONB,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

Performance Analysis Queries¶

-- Agent performance trends over time
SELECT 
    DATE_TRUNC('hour', timestamp) as hour,
    AVG(CASE WHEN metric_type = 'utilization' THEN value END) as avg_utilization,
    AVG(CASE WHEN metric_type = 'temperature' THEN value END) as avg_temperature,
    AVG(CASE WHEN metric_type = 'hashrate' THEN value END) as avg_hashrate
FROM agent_performance_metrics
WHERE agent_id = $1
  AND timestamp > NOW() - INTERVAL '24 hours'
GROUP BY hour
ORDER BY hour;

-- Top performing agents by hash rate
SELECT 
    a.name,
    AVG(apm.value) as avg_hashrate,
    MAX(apm.value) as peak_hashrate,
    COUNT(apm.id) as measurement_count
FROM agents a
JOIN agent_performance_metrics apm ON a.id = apm.agent_id
WHERE apm.metric_type = 'hashrate'
  AND apm.timestamp > NOW() - INTERVAL '1 week'
GROUP BY a.id, a.name
ORDER BY avg_hashrate DESC
LIMIT 10;

-- Performance degradation detection
WITH recent_performance AS (
    SELECT 
        agent_id,
        AVG(value) as recent_avg
    FROM agent_performance_metrics
    WHERE metric_type = 'hashrate'
      AND timestamp > NOW() - INTERVAL '24 hours'
    GROUP BY agent_id
),
baseline_performance AS (
    SELECT 
        agent_id,
        AVG(value) as baseline_avg
    FROM agent_performance_metrics  
    WHERE metric_type = 'hashrate'
      AND timestamp BETWEEN NOW() - INTERVAL '1 week' AND NOW() - INTERVAL '2 days'
    GROUP BY agent_id
)
SELECT 
    a.name,
    r.recent_avg,
    b.baseline_avg,
    ROUND(((r.recent_avg - b.baseline_avg) / b.baseline_avg * 100), 2) as percent_change
FROM agents a
JOIN recent_performance r ON a.id = r.agent_id
JOIN baseline_performance b ON a.id = b.agent_id
WHERE ABS((r.recent_avg - b.baseline_avg) / b.baseline_avg) > 0.15  -- >15% change
ORDER BY percent_change;

Benchmark Performance Tracking¶

Monitor agent benchmark performance over time:

-- Benchmark history for agent
SELECT 
    attack_mode,
    hash_type,
    speed,
    created_at,
    LAG(speed) OVER (PARTITION BY attack_mode, hash_type ORDER BY created_at) as previous_speed,
    speed - LAG(speed) OVER (PARTITION BY attack_mode, hash_type ORDER BY created_at) as speed_change
FROM agent_benchmarks
WHERE agent_id = $1
ORDER BY created_at DESC;

-- Fleet benchmark comparison
SELECT 
    a.name,
    ab.attack_mode,
    ab.hash_type,
    ab.speed,
    RANK() OVER (PARTITION BY ab.attack_mode, ab.hash_type ORDER BY ab.speed DESC) as rank
FROM agents a
JOIN agent_benchmarks ab ON a.id = ab.agent_id
WHERE ab.updated_at > NOW() - INTERVAL '1 week'
ORDER BY ab.attack_mode, ab.hash_type, ab.speed DESC;

Troubleshooting with Monitoring Data¶

Common Issues and Diagnostic Approaches¶

1. Agent Connection Issues¶

Symptoms: - Agent status shows "inactive" or "error" - Missing from active agents list - Stale heartbeat timestamps

Diagnostic Steps:

-- Check agent connection history
SELECT 
    id,
    name,
    status,
    last_heartbeat,
    last_error,
    EXTRACT(EPOCH FROM (NOW() - last_heartbeat)) as seconds_offline
FROM agents
WHERE name = 'problem-agent'
OR id = 123;

-- Check for recent WebSocket errors

Log Analysis:

# Check agent logs for connection issues
grep -i "connection\|websocket\|heartbeat" /path/to/agent.log

# Check backend logs for agent-related errors
grep -i "agent.*error\|websocket.*close\|connection.*failed" logs/backend/*.log

# Look for certificate or authentication issues
grep -i "certificate\|auth.*fail\|tls.*error" logs/backend/*.log

2. Performance Degradation¶

Symptoms: - Decreased hash rates - Higher GPU temperatures - Reduced utilization

Diagnostic Queries:

-- Performance comparison (current vs historical)
WITH current_perf AS (
    SELECT AVG(value) as current_hashrate
    FROM agent_performance_metrics
    WHERE agent_id = $1 
      AND metric_type = 'hashrate'
      AND timestamp > NOW() - INTERVAL '1 hour'
),
historical_perf AS (
    SELECT AVG(value) as historical_hashrate
    FROM agent_performance_metrics
    WHERE agent_id = $1
      AND metric_type = 'hashrate' 
      AND timestamp BETWEEN NOW() - INTERVAL '1 week' AND NOW() - INTERVAL '1 day'
)
SELECT 
    c.current_hashrate,
    h.historical_hashrate,
    ((c.current_hashrate - h.historical_hashrate) / h.historical_hashrate * 100) as percent_change
FROM current_perf c, historical_perf h;

-- Temperature analysis
SELECT 
    DATE_TRUNC('hour', timestamp) as hour,
    AVG(value) as avg_temp,
    MAX(value) as max_temp,
    COUNT(*) as measurements
FROM agent_performance_metrics
WHERE agent_id = $1
  AND metric_type = 'temperature'
  AND timestamp > NOW() - INTERVAL '24 hours'
GROUP BY hour
ORDER BY hour DESC;

3. Hardware Health Issues¶

Symptoms: - High GPU temperatures (>85°C) - Fan speed abnormalities - Utilization inconsistencies

Monitoring Approach:

-- Hardware health check
SELECT 
    device_name,
    metric_type,
    value,
    timestamp,
    CASE 
        WHEN metric_type = 'temperature' AND value > 85 THEN 'CRITICAL'
        WHEN metric_type = 'temperature' AND value > 75 THEN 'WARNING'
        WHEN metric_type = 'utilization' AND value < 80 THEN 'LOW_UTIL'
        WHEN metric_type = 'fanspeed' AND value > 90 THEN 'HIGH_FAN'
        ELSE 'NORMAL'
    END as status
FROM agent_performance_metrics
WHERE agent_id = $1
  AND timestamp > NOW() - INTERVAL '1 hour'
  AND (
    (metric_type = 'temperature' AND value > 75)
    OR (metric_type = 'utilization' AND value < 80)
    OR (metric_type = 'fanspeed' AND value > 90)
  )
ORDER BY timestamp DESC;

4. Task Assignment Issues¶

Symptoms: - Agents remain idle during jobs - Uneven task distribution - Tasks stuck in "reconnect_pending"

Diagnostic Queries:

-- Check agent busy status and current tasks
SELECT 
    a.name,
    a.status,
    a.metadata->>'busy_status' as busy_status,
    a.metadata->>'current_task_id' as current_task,
    jt.status as task_status,
    je.name as job_name
FROM agents a
LEFT JOIN job_tasks jt ON jt.id = a.metadata->>'current_task_id'  
LEFT JOIN job_executions je ON jt.job_execution_id = je.id
WHERE a.id = $1;

-- Check for reconnect_pending tasks
SELECT 
    jt.id,
    jt.status,
    jt.agent_id,
    a.name,
    jt.keyspace_start,
    jt.keyspace_end,
    jt.updated_at
FROM job_tasks jt
JOIN agents a ON jt.agent_id = a.id
WHERE jt.status = 'reconnect_pending'
ORDER BY jt.updated_at DESC;

Log Analysis Techniques¶

Structured Log Search¶

# Find connection events for specific agent
grep -i "agent.*123\|Agent 123" logs/backend/*.log | grep -i "connect"

# Track WebSocket message flow
grep -i "websocket\|message.*type" logs/backend/*.log | tail -50

# Monitor heartbeat activity
grep -i "heartbeat\|ping\|pong" logs/backend/*.log | tail -20

# Check for error patterns
grep -i "error\|fail\|timeout" logs/backend/*.log | grep -i "agent" | tail -10

Performance Issue Detection¶

# Find performance-related messages
grep -i "performance\|slow\|timeout\|benchmark" logs/backend/*.log

# Check for resource issues
grep -i "memory\|cpu\|disk\|resource" logs/backend/*.log

# Monitor cleanup operations
grep -i "cleanup\|maintenance\|retention" logs/backend/*.log

Automated Issue Detection¶

Set up monitoring alerts for common issues:

# Script: monitor_agents.sh
#!/bin/bash

# Check for agents with stale heartbeats
STALE_AGENTS=$(psql -t -c "SELECT COUNT(*) FROM agents WHERE last_heartbeat < NOW() - INTERVAL '5 minutes' AND status = 'active';")

if [ "$STALE_AGENTS" -gt 0 ]; then
    echo "ALERT: $STALE_AGENTS agents have stale heartbeats"
    # Send notification
fi

# Check for high GPU temperatures
HOT_GPUS=$(psql -t -c "SELECT COUNT(*) FROM agent_performance_metrics WHERE metric_type = 'temperature' AND value > 85 AND timestamp > NOW() - INTERVAL '5 minutes';")

if [ "$HOT_GPUS" -gt 0 ]; then
    echo "ALERT: $HOT_GPUS GPUs running hot (>85°C)"
    # Send notification
fi

# Check for agents with no enabled devices
NO_DEVICE_AGENTS=$(psql -t -c "SELECT COUNT(DISTINCT a.id) FROM agents a LEFT JOIN agent_devices ad ON a.id = ad.agent_id WHERE a.status = 'active' AND NOT EXISTS (SELECT 1 FROM agent_devices ad2 WHERE ad2.agent_id = a.id AND ad2.enabled = true);")

if [ "$NO_DEVICE_AGENTS" -gt 0 ]; then
    echo "ALERT: $NO_DEVICE_AGENTS active agents have no enabled devices"
    # Send notification  
fi

Best Practices¶

Monitoring Strategy¶

1. Proactive Monitoring¶

Set up automated alerts for critical metrics (heartbeat failures, high temperatures, low utilization)
Establish performance baselines for each agent to detect degradation
Monitor trends rather than just current values
Use multiple monitoring approaches (real-time dashboard + historical analysis)

2. Alert Thresholds¶

# Recommended alert thresholds:
Agent Heartbeat: > 5 minutes offline
GPU Temperature: > 85°C sustained  
GPU Utilization: < 50% during jobs
Connection Failures: > 3 consecutive failures
Performance Degradation: > 20% decrease from baseline

3. Regular Health Checks¶

Daily: Review agent status dashboard, check for offline agents
Weekly: Analyze performance trends, identify degradation patterns
Monthly: Review hardware utilization, plan capacity changes
Quarterly: Update performance baselines, optimize configurations

Operational Best Practices¶

1. Fleet Management¶

Group agents by location, hardware type, or team assignment
Use naming conventions that reflect agent purpose and location
Document agent configurations including hardware specs and special parameters
Maintain agent inventory with ownership and responsibility assignments

2. Performance Optimization¶

Monitor benchmark results and update when hardware changes
Balance task distribution across agents based on capability
Track device utilization to identify underused resources
Optimize extra parameters for each agent's hardware configuration

3. Maintenance Scheduling¶

Plan maintenance windows during low-activity periods
Coordinate updates to minimize impact on running jobs
Test configuration changes on non-critical agents first
Document maintenance activities and their impact on performance

Monitoring Data Retention¶

1. Storage Management¶

-- Configure retention policies based on needs:
Real-time metrics: 7-14 days (high-frequency data)
Daily aggregates: 30-90 days (performance analysis)  
Weekly aggregates: 1-2 years (long-term trends)
Benchmark results: Indefinite (configuration reference)

2. Data Cleanup Automation¶

Enable automatic aggregation to reduce storage requirements
Monitor database growth and adjust retention as needed
Archive historical data for compliance or analysis needs
Use monitoring settings UI to adjust retention policies

Security and Access Control¶

1. Monitoring Access¶

Restrict monitoring access to appropriate administrators
Use role-based permissions for different monitoring functions
Audit monitoring activities and configuration changes
Secure monitoring endpoints with proper authentication

2. Agent Communication Security¶

Monitor certificate status and renewal schedules
Track authentication failures and suspicious activity
Use secure WebSocket connections (WSS) in production
Regularly rotate API keys and monitor key usage

Disaster Recovery and Continuity¶

1. Monitoring System Availability¶

Implement monitoring redundancy to avoid single points of failure
Backup monitoring configurations and historical data
Test monitoring system recovery procedures
Document escalation procedures for monitoring system failures

2. Agent Recovery Procedures¶

Automate agent reconnection with exponential backoff
Implement graceful degradation when agents are offline
Buffer critical messages during disconnections for recovery
Track task recovery and automatic redistribution

This comprehensive monitoring guide enables administrators to effectively manage distributed KrakenHashes agent fleets, ensuring optimal performance, rapid issue detection, and reliable operation across diverse hardware configurations and network environments.