System Monitoring Guide

This guide covers comprehensive monitoring strategies for KrakenHashes, including system health indicators, performance metrics, logging, and alerting configurations.

Table of Contents

  1. System Health Indicators
  2. Job Monitoring and Statistics
  3. Agent Performance Metrics
  4. Database Monitoring
  5. Log Analysis and Alerting
  6. Performance Baselines
  7. Monitoring Dashboards and Tools

System Health Indicators

Health Check Endpoint

The system provides a basic health check endpoint for monitoring service availability:

# Check system health
curl https://localhost:31337/api/health

# Expected response
200 OK
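
For unattended checks, this endpoint can be polled from a small script. A minimal sketch, assuming a self-signed certificate (hence the -k flag); replace the echo with your alerting command of choice. It alerts after three consecutive failures, in line with the best practices later in this guide.

#!/usr/bin/env bash
# Poll the health endpoint every 30 seconds and alert after 3 consecutive failures
URL="https://localhost:31337/api/health"
FAILURES=0

while true; do
    STATUS=$(curl -sk -o /dev/null -w '%{http_code}' "$URL")
    if [ "$STATUS" != "200" ]; then
        FAILURES=$((FAILURES + 1))
    else
        FAILURES=0
    fi

    if [ "$FAILURES" -ge 3 ]; then
        echo "ALERT: health check failed $FAILURES times in a row (last status: $STATUS)"
    fi

    sleep 30
done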

Service Status Monitoring

Monitor the following key services:

  1. Backend API Service
     • Port: 31337 (HTTPS), 1337 (HTTP for CA cert)
     • Health endpoint: /api/health
     • Version endpoint: /api/version

  2. PostgreSQL Database
     • Port: 5432
     • Connection pool status
     • Active connections

  3. WebSocket Service
     • Agent connections
     • Heartbeat status
     • Connection count

Docker Container Health

Monitor container status using Docker commands:

# Check container status
docker-compose ps

# Monitor resource usage
docker stats

# Check container logs
docker-compose logs -f backend
docker-compose logs -f postgres
docker-compose logs -f app
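
Container status can also be checked non-interactively. A small sketch that flags any container Docker reports as unhealthy (it assumes the services define HEALTHCHECKs; otherwise the filter simply returns nothing):

# Flag containers whose Docker health status is "unhealthy"
UNHEALTHY=$(docker ps --filter "health=unhealthy" --format '{{.Names}}')

if [ -n "$UNHEALTHY" ]; then
    echo "ALERT: unhealthy containers detected:"
    echo "$UNHEALTHY"
fi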

Job Monitoring and Statistics

Job Execution Metrics

The system tracks comprehensive job execution metrics:

  1. Job Status Distribution
     • Pending jobs count
     • Running jobs count
     • Completed jobs count
     • Failed jobs count
     • Cancelled jobs count

  2. Job Performance Indicators
     • Average job completion time
     • Job success rate
     • Hash cracking rate
     • Keyspace coverage

Job Monitoring Endpoints

# List all jobs with pagination
GET /api/jobs?page=1&page_size=20

# Get specific job details
GET /api/jobs/{job_id}

# Get job statistics
GET /api/jobs/stats
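
These endpoints require an authenticated session. A hedged example, assuming a bearer token is accepted in the Authorization header; substitute whatever session mechanism your deployment actually uses.

# Hypothetical token-based example; adjust the auth header to your setup
TOKEN="<your-api-token>"

# List the first page of jobs
curl -sk -H "Authorization: Bearer $TOKEN" \
    "https://localhost:31337/api/jobs?page=1&page_size=20"

# Fetch aggregate job statistics
curl -sk -H "Authorization: Bearer $TOKEN" \
    "https://localhost:31337/api/jobs/stats"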

Job Progress Tracking

Monitor job progress through these metrics:

  • Dispatched Percentage: Portion of keyspace distributed to agents
  • Searched Percentage: Portion of keyspace actually processed
  • Overall Progress: Combined metric considering rule splitting
  • Cracked Count: Number of successfully cracked hashes
  • Total Speed: Combined hash rate across all agents
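
These values can also be pulled from the job detail endpoint and scripted against. A sketch, assuming the response exposes fields named dispatched_percentage, searched_percentage, and cracked_count; the real field names may differ in your version, so verify against an actual response first.

# Hypothetical field names -- verify against the real /api/jobs/{job_id} response
TOKEN="<your-api-token>"
JOB_ID="<job-uuid>"

curl -sk -H "Authorization: Bearer $TOKEN" \
    "https://localhost:31337/api/jobs/$JOB_ID" |
    jq '{dispatched: .dispatched_percentage, searched: .searched_percentage, cracked: .cracked_count}'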

Database Queries for Job Monitoring

-- Active jobs by status
SELECT status, COUNT(*) as count 
FROM job_executions 
GROUP BY status;

-- Jobs with high failure rate
SELECT je.id, je.name, je.error_message,
       COUNT(jt.id) as total_tasks,
       SUM(CASE WHEN jt.status = 'failed' THEN 1 ELSE 0 END) as failed_tasks
FROM job_executions je
JOIN job_tasks jt ON je.id = jt.job_execution_id
WHERE je.status = 'failed'
GROUP BY je.id, je.name, je.error_message
HAVING SUM(CASE WHEN jt.status = 'failed' THEN 1 ELSE 0 END) > 0;

-- Job performance over time
SELECT 
    DATE_TRUNC('hour', created_at) as hour,
    COUNT(*) as jobs_created,
    AVG(EXTRACT(EPOCH FROM (completed_at - created_at))) as avg_duration_seconds
FROM job_executions
WHERE completed_at IS NOT NULL
GROUP BY hour
ORDER BY hour DESC;

Agent Performance Metrics

Agent Metrics Collection

The agent collects and reports the following metrics:

  1. System Metrics
     • CPU usage percentage
     • Memory usage percentage
     • GPU utilization
     • GPU temperature
     • GPU memory usage

  2. Performance Metrics
     • Hash rate per device
     • Power usage

Device Monitoring Dashboard: real-time temperature, utilization, fan speed, and hash rate metrics across multiple agents.

Device Monitoring Alternative View: detailed performance graphs and timeline data per device, including fan speed and temperature.

Agent Monitoring Endpoints

# List all agents
GET /api/admin/agents

# Get agent details with devices
GET /api/admin/agents/{agent_id}

# Get agent performance metrics
GET /api/admin/agents/{agent_id}/metrics?timeRange=1h&metrics=temperature,utilization,fanspeed,hashrate
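
The metrics endpoint is useful for quick spot checks from the command line. A sketch that pulls the last hour of temperature data for one agent; the shape of the JSON payload is not documented here, so start by inspecting the raw response before building alerts on specific fields.

# Fetch an hour of temperature metrics for one agent and pretty-print the payload
TOKEN="<your-api-token>"
AGENT_ID="<agent-id>"

curl -sk -H "Authorization: Bearer $TOKEN" \
    "https://localhost:31337/api/admin/agents/$AGENT_ID/metrics?timeRange=1h&metrics=temperature" |
    jq '.'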

Agent Health Monitoring

Monitor agent health through:

  1. Heartbeat Status
     • Last heartbeat timestamp
     • Connection status (active/inactive)
     • Heartbeat interval (30 seconds default)

  2. Error Tracking
     • Last error message
     • Error frequency
     • Recovery status

Database Queries for Agent Monitoring

-- Agents with stale heartbeats
SELECT id, name, last_heartbeat, status
FROM agents
WHERE last_heartbeat < NOW() - INTERVAL '5 minutes'
  AND status = 'active';

-- Agent performance metrics
SELECT 
    a.name as agent_name,
    apm.metric_type,
    AVG(apm.value) as avg_value,
    MAX(apm.value) as max_value,
    MIN(apm.value) as min_value
FROM agents a
JOIN agent_performance_metrics apm ON a.id = apm.agent_id
WHERE apm.timestamp > NOW() - INTERVAL '1 hour'
GROUP BY a.name, apm.metric_type;

-- GPU device utilization
SELECT 
    a.name as agent_name,
    apm.device_name,
    apm.metric_type,
    AVG(apm.value) as avg_utilization
FROM agents a
JOIN agent_performance_metrics apm ON a.id = apm.agent_id
WHERE apm.metric_type = 'utilization'
  AND apm.timestamp > NOW() - INTERVAL '1 hour'
GROUP BY a.name, apm.device_name, apm.metric_type
ORDER BY avg_utilization DESC;

Database Monitoring

Connection Pool Monitoring

Monitor database connection health:

-- Active connections by state
SELECT state, COUNT(*) 
FROM pg_stat_activity 
GROUP BY state;

-- Long-running queries
SELECT 
    pid,
    now() - pg_stat_activity.query_start AS duration,
    query,
    state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes';

-- Database size growth
SELECT 
    pg_database.datname,
    pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
ORDER BY pg_database_size(pg_database.datname) DESC;

Table Statistics

Monitor table growth and performance:

-- Table sizes
SELECT
    schemaname AS table_schema,
    tablename AS table_name,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
    pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS data_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

-- Index usage
SELECT
    schemaname,
    tablename,
    indexname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY idx_scan DESC;

Performance Metrics Tables

The system maintains dedicated tables for performance tracking:

  1. agent_metrics - Real-time agent system metrics
  2. agent_performance_metrics - Detailed performance data with aggregation
  3. job_performance_metrics - Job execution performance tracking
  4. agent_benchmarks - Hashcat benchmark results per agent
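
To see how quickly these tables grow, and whether the cleanup services are keeping up, a simple row-count check can be run against the database. A sketch that reuses the container, user, and database names from the maintenance examples later in this guide; adjust them to match your compose file.

# Row counts for the performance-tracking tables
docker exec krakenhashes_postgres_1 psql -U postgres -d krakenhashes -c "
SELECT 'agent_metrics' AS table_name, COUNT(*) AS row_count FROM agent_metrics
UNION ALL SELECT 'agent_performance_metrics', COUNT(*) FROM agent_performance_metrics
UNION ALL SELECT 'job_performance_metrics', COUNT(*) FROM job_performance_metrics
UNION ALL SELECT 'agent_benchmarks', COUNT(*) FROM agent_benchmarks;"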

Log Analysis and Alerting

Log Configuration

Configure logging through environment variables:

# Enable debug logging
export DEBUG=true

# Set log level (DEBUG, INFO, WARNING, ERROR)
export LOG_LEVEL=INFO
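
When running under Docker Compose, exporting these variables on the host is not enough; they have to reach the backend container. A sketch, assuming your compose file passes DEBUG and LOG_LEVEL through to the backend service (for example via an env_file or environment entry):

# Enable debug logging for the backend service and recreate it
cat >> .env <<'EOF'
DEBUG=true
LOG_LEVEL=DEBUG
EOF

docker-compose up -d backend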

Log Locations

When running with Docker, logs are written to the log directory mounted into the containers (the host path depends on your deployment), laid out as follows:

logs/krakenhashes/
├── backend/      # Backend application logs
├── postgres/     # PostgreSQL logs
└── nginx/        # Nginx/frontend logs

Log Format

The system uses structured logging with the following format:

[LEVEL] [TIMESTAMP] [FILE:LINE] [FUNCTION] MESSAGE

Example:

[INFO] [2025-08-01 15:04:05.000] [/path/to/file.go:42] [FunctionName] Processing job execution

Key Log Patterns to Monitor

  1. Error Patterns

    # Find all errors across logs
    grep -i "error" logs/krakenhashes/*/*.log
    
    # Find database connection errors
    grep -i "database.*error\|connection.*failed" logs/backend/*.log
    
    # Find agent disconnections
    grep -i "agent.*disconnect\|websocket.*close" logs/backend/*.log
    

  2. Performance Issues

    # Find slow queries
    grep -i "slow query\|query took" logs/backend/*.log
    
    # Find memory issues
    grep -i "out of memory\|memory.*limit" logs/*/*.log
    

  3. Security Events

    # Find authentication failures
    grep -i "auth.*fail\|login.*fail\|unauthorized" logs/backend/*.log
    
    # Find suspicious activity
    grep -i "invalid.*token\|forbidden\|suspicious" logs/backend/*.log
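
These patterns can be folded into a periodic check, for example from cron. A minimal sketch that counts error and authentication-failure lines in the backend logs and prints an alert when any are found; adjust LOG_DIR to wherever your deployment stores logs.

#!/usr/bin/env bash
# Count error and auth-failure lines in the backend logs and alert if any exist
LOG_DIR="logs/krakenhashes"

ERRORS=$(cat "$LOG_DIR"/backend/*.log 2>/dev/null | grep -ci "error")
AUTH_FAILS=$(cat "$LOG_DIR"/backend/*.log 2>/dev/null | grep -ci "auth.*fail\|unauthorized")

if [ "$ERRORS" -gt 0 ] || [ "$AUTH_FAILS" -gt 0 ]; then
    echo "ALERT: $ERRORS error lines and $AUTH_FAILS auth failures in backend logs"
fi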
    

Alert Configuration

Set up alerts for critical events:

  1. System Health Alerts
     • Service down (health check fails)
     • Database connection pool exhausted
     • High error rate (>5% of requests)

  2. Performance Alerts (see the scripted threshold check after this list)
     • CPU usage > 90% for 5 minutes
     • Memory usage > 85%
     • Database query time > 5 seconds
     • Job queue backlog > 100 jobs

  3. Security Alerts
     • Multiple failed login attempts
     • Unauthorized API access attempts
     • Agent registration anomalies
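
Whatever alerting stack you use, thresholds like these reduce to simple scripted checks run on a schedule. A sketch of two of them; the container, user, database, and status values are assumptions drawn from the examples elsewhere in this guide.

#!/usr/bin/env bash
# Threshold checks for job queue backlog and database connection pressure
PSQL="docker exec krakenhashes_postgres_1 psql -U postgres -d krakenhashes -tAc"

# Job queue backlog > 100 jobs (assumes pending jobs carry status 'pending')
PENDING=$($PSQL "SELECT COUNT(*) FROM job_executions WHERE status = 'pending';")
if [ "$PENDING" -gt 100 ]; then
    echo "ALERT: job queue backlog is $PENDING"
fi

# Connection pressure (80 active connections used as an example threshold)
CONNS=$($PSQL "SELECT COUNT(*) FROM pg_stat_activity;")
if [ "$CONNS" -gt 80 ]; then
    echo "ALERT: $CONNS database connections open"
fi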

Performance Baselines

Establishing Baselines

Monitor and document normal operating parameters:

  1. System Resource Baselines
     • Normal CPU usage: 20-40% (idle), 60-80% (active jobs)
     • Memory usage: 2-4 GB (base), +1 GB per 1M hashes
     • Database connections: 10-20 (normal load)

  2. Job Performance Baselines
     • Job creation rate: 10-50 jobs/hour
     • Average job duration: varies by attack type
     • Hash processing rate: device-dependent

  3. Agent Performance Baselines
     • Heartbeat interval: 30 seconds
     • Benchmark cache duration: 24 hours
     • GPU utilization: 90-100% during jobs

Benchmark Tracking

The system automatically tracks agent benchmarks:

-- View agent benchmarks
SELECT 
    a.name as agent_name,
    ab.attack_mode,
    ab.hash_type,
    ab.speed,
    ab.updated_at
FROM agents a
JOIN agent_benchmarks ab ON a.id = ab.agent_id
ORDER BY a.name, ab.attack_mode, ab.hash_type;

Performance Degradation Detection

Monitor for performance degradation:

-- Compare current vs historical performance
WITH current_metrics AS (
    SELECT 
        agent_id,
        AVG(value) as current_avg
    FROM agent_performance_metrics
    WHERE metric_type = 'hash_rate'
      AND timestamp > NOW() - INTERVAL '1 hour'
    GROUP BY agent_id
),
historical_metrics AS (
    SELECT 
        agent_id,
        AVG(value) as historical_avg
    FROM agent_performance_metrics
    WHERE metric_type = 'hash_rate'
      AND timestamp BETWEEN NOW() - INTERVAL '1 week' AND NOW() - INTERVAL '1 day'
    GROUP BY agent_id
)
SELECT 
    a.name,
    cm.current_avg,
    hm.historical_avg,
    ((cm.current_avg - hm.historical_avg) / hm.historical_avg * 100) as percent_change
FROM agents a
JOIN current_metrics cm ON a.id = cm.agent_id
JOIN historical_metrics hm ON a.id = hm.agent_id
WHERE ABS((cm.current_avg - hm.historical_avg) / hm.historical_avg) > 0.1;

Monitoring Dashboards and Tools

Built-in Monitoring Endpoints

  1. System Status Dashboard
     • Real-time agent status
     • Active job count
     • System resource usage
     • Recent errors

  2. Job Monitoring Dashboard
     • Job queue status
     • Job progress tracking
     • Success/failure rates
     • Performance trends

  3. Agent Performance Dashboard
     • Agent availability
     • Device utilization
     • Temperature monitoring
     • Hash rate tracking

External Monitoring Integration

The system can be integrated with external monitoring tools:

  1. Prometheus Integration
     • Export metrics via /metrics endpoint (if implemented)
     • Custom metric exporters (see the sketch after this list)
     • Alertmanager integration

  2. Grafana Dashboards
     • PostgreSQL data source
     • Custom dashboard templates
     • Alert visualization

  3. Log Aggregation
     • ELK Stack (Elasticsearch, Logstash, Kibana)
     • Fluentd/Fluent Bit
     • Centralized log analysis
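
Because the /metrics endpoint is optional, one low-effort bridge to Prometheus is the node_exporter textfile collector: a scheduled script writes gauges to a .prom file that node_exporter scrapes. A sketch in which the output path, container name, and credentials are all assumptions.

#!/usr/bin/env bash
# Export a few KrakenHashes gauges for the node_exporter textfile collector
OUT="/var/lib/node_exporter/textfile/krakenhashes.prom"
PSQL="docker exec krakenhashes_postgres_1 psql -U postgres -d krakenhashes -tAc"

RUNNING=$($PSQL "SELECT COUNT(*) FROM job_executions WHERE status = 'running';")
ACTIVE_AGENTS=$($PSQL "SELECT COUNT(*) FROM agents WHERE status = 'active';")

cat > "$OUT" <<EOF
# HELP krakenhashes_running_jobs Job executions currently running
# TYPE krakenhashes_running_jobs gauge
krakenhashes_running_jobs $RUNNING
# HELP krakenhashes_active_agents Agents currently marked active
# TYPE krakenhashes_active_agents gauge
krakenhashes_active_agents $ACTIVE_AGENTS
EOF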

Monitoring Best Practices

  1. Regular Health Checks
     • Automated health check every 30 seconds
     • Alert on 3 consecutive failures
     • Include dependency checks

  2. Capacity Planning
     • Monitor growth trends
     • Plan for peak usage
     • Scale resources proactively

  3. Performance Optimization
     • Regular benchmark updates
     • Query optimization based on metrics
     • Resource allocation tuning

  4. Security Monitoring
     • Audit log analysis
     • Anomaly detection
     • Access pattern monitoring

Troubleshooting Guide

Common issues and monitoring approaches:

  1. High CPU Usage
     • Check active job count
     • Verify agent task distribution
     • Monitor database query performance

  2. Memory Leaks
     • Track memory usage over time
     • Identify growing processes
     • Check for unclosed connections

  3. Slow Job Processing
     • Verify agent benchmarks
     • Check network latency
     • Monitor file I/O performance

  4. Database Performance
     • Analyze slow queries
     • Check index usage
     • Monitor connection pool

Maintenance and Cleanup

Automated Cleanup Services

The system includes several cleanup services:

  1. Metrics Cleanup Service
     • Aggregates real-time metrics to daily/weekly summaries
     • Removes old metrics based on retention policy
     • Runs automatically on schedule

  2. Agent Cleanup Service
     • Marks stale agents as inactive
     • Cleans up orphaned resources
     • Maintains agent health status

  3. Job Cleanup Service
     • Archives completed jobs
     • Removes temporary files
     • Updates job statistics

Manual Maintenance Tasks

# Force cleanup of old metrics
curl -X POST https://localhost:31337/api/admin/force-cleanup

# Vacuum database
docker exec -it krakenhashes_postgres_1 psql -U postgres -d krakenhashes -c "VACUUM ANALYZE;"

# Check database bloat
docker exec -it krakenhashes_postgres_1 psql -U postgres -d krakenhashes -c "
SELECT 
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;"

Conclusion

Effective monitoring is crucial for maintaining a healthy KrakenHashes deployment. Regular monitoring of system health, job performance, and agent metrics ensures optimal operation and early detection of issues. Implement automated alerting for critical metrics and maintain historical data for trend analysis and capacity planning.