System Monitoring Guide

This guide covers comprehensive monitoring strategies for KrakenHashes, including system health indicators, performance metrics, logging, and alerting configurations.

Table of Contents

  1. System Health Indicators
  2. Job Monitoring and Statistics
  3. Agent Performance Metrics
  4. Database Monitoring
  5. Log Analysis and Alerting
  6. Performance Baselines
  7. Monitoring Dashboards and Tools

System Health Indicators

Health Check Endpoint

The system provides a basic health check endpoint for monitoring service availability:

# Check system health
curl https://localhost:31337/api/health

# Expected response
200 OK
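
For unattended checks, this endpoint can be polled from a small script. A minimal sketch, assuming a self-signed certificate (hence the -k flag); replace the echo with your alerting command of choice. It alerts after three consecutive failures, in line with the best practices later in this guide.

#!/usr/bin/env bash
# Poll the health endpoint every 30 seconds and alert after 3 consecutive failures
URL="https://localhost:31337/api/health"
FAILURES=0

while true; do
    STATUS=$(curl -sk -o /dev/null -w '%{http_code}' "$URL")
    if [ "$STATUS" != "200" ]; then
        FAILURES=$((FAILURES + 1))
    else
        FAILURES=0
    fi

    if [ "$FAILURES" -ge 3 ]; then
        echo "ALERT: health check failed $FAILURES times in a row (last status: $STATUS)"
    fi

    sleep 30
done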

Service Status Monitoring

Monitor the following key services:

  1. Backend API Service
     • Port: 31337 (HTTPS), 1337 (HTTP for CA cert)
     • Health endpoint: /api/health
     • Version endpoint: /api/version

  2. PostgreSQL Database
     • Port: 5432
     • Connection pool status
     • Active connections

  3. WebSocket Service
     • Agent connections
     • Heartbeat status
     • Connection count

Docker Container Health

Monitor container status using Docker commands:

# Check container status
docker-compose ps

# Monitor resource usage
docker stats

# Check container logs
docker-compose logs -f backend
docker-compose logs -f postgres
docker-compose logs -f app
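
Container status can also be checked non-interactively. A small sketch that flags any container Docker reports as unhealthy (it assumes the services define HEALTHCHECKs; otherwise the filter simply returns nothing):

# Flag containers whose Docker health status is "unhealthy"
UNHEALTHY=$(docker ps --filter "health=unhealthy" --format '{{.Names}}')

if [ -n "$UNHEALTHY" ]; then
    echo "ALERT: unhealthy containers detected:"
    echo "$UNHEALTHY"
fi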

Job Monitoring and Statistics

Job Execution Metrics

The system tracks comprehensive job execution metrics:

  1. Job Status Distribution
     • Pending jobs count
     • Running jobs count
     • Completed jobs count
     • Failed jobs count
     • Cancelled jobs count

  2. Job Performance Indicators
     • Average job completion time
     • Job success rate
     • Hash cracking rate
     • Keyspace coverage

Job Monitoring Endpoints

# List all jobs with pagination
GET /api/jobs?page=1&page_size=20

# Get specific job details
GET /api/jobs/{job_id}

# Get job statistics
GET /api/jobs/stats
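
These endpoints require an authenticated session. A hedged example, assuming a bearer token is accepted in the Authorization header; substitute whatever session mechanism your deployment actually uses.

# Hypothetical token-based example; adjust the auth header to your setup
TOKEN="<your-api-token>"

# List the first page of jobs
curl -sk -H "Authorization: Bearer $TOKEN" \
    "https://localhost:31337/api/jobs?page=1&page_size=20"

# Fetch aggregate job statistics
curl -sk -H "Authorization: Bearer $TOKEN" \
    "https://localhost:31337/api/jobs/stats"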

Job Progress Tracking

Monitor job progress through these metrics:

  • Dispatched Percentage: Portion of keyspace distributed to agents
  • Searched Percentage: Portion of keyspace actually processed
  • Overall Progress: Combined metric considering rule splitting
  • Cracked Count: Number of successfully cracked hashes
  • Total Speed: Combined hash rate across all agents
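
These values can also be pulled from the job detail endpoint and scripted against. A sketch, assuming the response exposes fields named dispatched_percentage, searched_percentage, and cracked_count; the real field names may differ in your version, so verify against an actual response first.

# Hypothetical field names -- verify against the real /api/jobs/{job_id} response
TOKEN="<your-api-token>"
JOB_ID="<job-uuid>"

curl -sk -H "Authorization: Bearer $TOKEN" \
    "https://localhost:31337/api/jobs/$JOB_ID" |
    jq '{dispatched: .dispatched_percentage, searched: .searched_percentage, cracked: .cracked_count}'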

Database Queries for Job Monitoring

-- Active jobs by status
SELECT status, COUNT(*) as count 
FROM job_executions 
GROUP BY status;

-- Jobs with high failure rate
SELECT je.id, je.name, je.error_message,
       COUNT(jt.id) as total_tasks,
       SUM(CASE WHEN jt.status = 'failed' THEN 1 ELSE 0 END) as failed_tasks
FROM job_executions je
JOIN job_tasks jt ON je.id = jt.job_execution_id
WHERE je.status = 'failed'
GROUP BY je.id, je.name, je.error_message
HAVING SUM(CASE WHEN jt.status = 'failed' THEN 1 ELSE 0 END) > 0;

-- Job performance over time
SELECT 
    DATE_TRUNC('hour', created_at) as hour,
    COUNT(*) as jobs_created,
    AVG(EXTRACT(EPOCH FROM (completed_at - created_at))) as avg_duration_seconds
FROM job_executions
WHERE completed_at IS NOT NULL
GROUP BY hour
ORDER BY hour DESC;

Agent Performance Metrics

Agent Metrics Collection

The agent collects and reports the following metrics:

  1. System Metrics
     • CPU usage percentage
     • Memory usage percentage
     • GPU utilization
     • GPU temperature
     • GPU memory usage

  2. Performance Metrics
     • Hash rate per device
     • Power usage

Device Monitoring Dashboard: real-time temperature, utilization, fan speed, and hash rate metrics across multiple agents.

Device Monitoring Alternative View: detailed performance graphs and timeline data per device, including fan speed and temperature.

Agent Monitoring Endpoints

# List all agents
GET /api/admin/agents

# Get agent details with devices
GET /api/admin/agents/{agent_id}

# Get agent performance metrics
GET /api/admin/agents/{agent_id}/metrics?timeRange=1h&metrics=temperature,utilization,fanspeed,hashrate
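
The metrics endpoint is useful for quick spot checks from the command line. A sketch that pulls the last hour of temperature data for one agent; the shape of the JSON payload is not documented here, so start by inspecting the raw response before building alerts on specific fields.

# Fetch an hour of temperature metrics for one agent and pretty-print the payload
TOKEN="<your-api-token>"
AGENT_ID="<agent-id>"

curl -sk -H "Authorization: Bearer $TOKEN" \
    "https://localhost:31337/api/admin/agents/$AGENT_ID/metrics?timeRange=1h&metrics=temperature" |
    jq '.'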

Agent Health Monitoring

Monitor agent health through:

  1. Heartbeat Status
     • Last heartbeat timestamp
     • Connection status (active/inactive)
     • Heartbeat interval (30 seconds default)

  2. Error Tracking
     • Last error message
     • Error frequency
     • Recovery status

Database Queries for Agent Monitoring

-- Agents with stale heartbeats
SELECT id, name, last_heartbeat, status
FROM agents
WHERE last_heartbeat < NOW() - INTERVAL '5 minutes'
  AND status = 'active';

-- Agent performance metrics
SELECT 
    a.name as agent_name,
    apm.metric_type,
    AVG(apm.value) as avg_value,
    MAX(apm.value) as max_value,
    MIN(apm.value) as min_value
FROM agents a
JOIN agent_performance_metrics apm ON a.id = apm.agent_id
WHERE apm.timestamp > NOW() - INTERVAL '1 hour'
GROUP BY a.name, apm.metric_type;

-- GPU device utilization
SELECT 
    a.name as agent_name,
    apm.device_name,
    apm.metric_type,
    AVG(apm.value) as avg_utilization
FROM agents a
JOIN agent_performance_metrics apm ON a.id = apm.agent_id
WHERE apm.metric_type = 'utilization'
  AND apm.timestamp > NOW() - INTERVAL '1 hour'
GROUP BY a.name, apm.device_name, apm.metric_type
ORDER BY avg_utilization DESC;

Database Monitoring

Connection Pool Monitoring

Monitor database connection health:

-- Active connections by state
SELECT state, COUNT(*) 
FROM pg_stat_activity 
GROUP BY state;

-- Long-running queries
SELECT 
    pid,
    now() - pg_stat_activity.query_start AS duration,
    query,
    state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes';

-- Database size growth
SELECT 
    pg_database.datname,
    pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
ORDER BY pg_database_size(pg_database.datname) DESC;

Table Statistics

Monitor table growth and performance:

-- Table sizes
SELECT
    schemaname AS table_schema,
    tablename AS table_name,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
    pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS data_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

-- Index usage
SELECT
    schemaname,
    tablename,
    indexname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY idx_scan DESC;

Performance Metrics Tables

The system maintains dedicated tables for performance tracking:

  1. agent_metrics - Real-time agent system metrics
  2. agent_performance_metrics - Detailed performance data with aggregation
  3. job_performance_metrics - Job execution performance tracking
  4. agent_benchmarks - Hashcat benchmark results per agent
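
To see how quickly these tables grow, and whether the cleanup services are keeping up, a simple row-count check can be run against the database. A sketch that reuses the container, user, and database names from the maintenance examples later in this guide; adjust them to match your compose file.

# Row counts for the performance-tracking tables
docker exec krakenhashes_postgres_1 psql -U postgres -d krakenhashes -c "
SELECT 'agent_metrics' AS table_name, COUNT(*) AS row_count FROM agent_metrics
UNION ALL SELECT 'agent_performance_metrics', COUNT(*) FROM agent_performance_metrics
UNION ALL SELECT 'job_performance_metrics', COUNT(*) FROM job_performance_metrics
UNION ALL SELECT 'agent_benchmarks', COUNT(*) FROM agent_benchmarks;"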

Log Analysis and Alerting

Log Configuration

Configure logging through environment variables:

# Enable debug logging
export DEBUG=true

# Set log level (DEBUG, INFO, WARNING, ERROR)
export LOG_LEVEL=INFO
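
When running under Docker Compose, exporting these variables on the host is not enough; they have to reach the backend container. A sketch, assuming your compose file passes DEBUG and LOG_LEVEL through to the backend service (for example via an env_file or environment entry):

# Enable debug logging for the backend service and recreate it
cat >> .env <<'EOF'
DEBUG=true
LOG_LEVEL=DEBUG
EOF

docker-compose up -d backend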

Log Locations

When running with Docker, logs are written to the log directory mounted into the containers (the host path depends on your deployment), laid out as follows:

logs/krakenhashes/
├── backend/      # Backend application logs
├── postgres/     # PostgreSQL logs
└── nginx/        # Nginx/frontend logs

Log Format

The system uses structured logging with the following format:

[LEVEL] [TIMESTAMP] [FILE:LINE] [FUNCTION] MESSAGE

Example:

[INFO] [2025-08-01 15:04:05.000] [/path/to/file.go:42] [FunctionName] Processing job execution

Key Log Patterns to Monitor

  1. Error Patterns

    # Find all errors across logs
    grep -i "error" logs/krakenhashes/*/*.log
    
    # Find database connection errors
    grep -i "database.*error\|connection.*failed" logs/backend/*.log
    
    # Find agent disconnections
    grep -i "agent.*disconnect\|websocket.*close" logs/backend/*.log
    

  2. Performance Issues

    # Find slow queries
    grep -i "slow query\|query took" logs/backend/*.log
    
    # Find memory issues
    grep -i "out of memory\|memory.*limit" logs/*/*.log
    

  3. Security Events

    # Find authentication failures
    grep -i "auth.*fail\|login.*fail\|unauthorized" logs/backend/*.log
    
    # Find suspicious activity
    grep -i "invalid.*token\|forbidden\|suspicious" logs/backend/*.log
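
These patterns can be folded into a periodic check, for example from cron. A minimal sketch that counts error and authentication-failure lines in the backend logs and prints an alert when any are found; adjust LOG_DIR to wherever your deployment stores logs.

#!/usr/bin/env bash
# Count error and auth-failure lines in the backend logs and alert if any exist
LOG_DIR="logs/krakenhashes"

ERRORS=$(cat "$LOG_DIR"/backend/*.log 2>/dev/null | grep -ci "error")
AUTH_FAILS=$(cat "$LOG_DIR"/backend/*.log 2>/dev/null | grep -ci "auth.*fail\|unauthorized")

if [ "$ERRORS" -gt 0 ] || [ "$AUTH_FAILS" -gt 0 ]; then
    echo "ALERT: $ERRORS error lines and $AUTH_FAILS auth failures in backend logs"
fi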
    

Alert Configuration

Set up alerts for critical events:

  1. System Health Alerts
     • Service down (health check fails)
     • Database connection pool exhausted
     • High error rate (>5% of requests)

  2. Performance Alerts (see the scripted threshold check after this list)
     • CPU usage > 90% for 5 minutes
     • Memory usage > 85%
     • Database query time > 5 seconds
     • Job queue backlog > 100 jobs

  3. Security Alerts
     • Multiple failed login attempts
     • Unauthorized API access attempts
     • Agent registration anomalies
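
Whatever alerting stack you use, thresholds like these reduce to simple scripted checks run on a schedule. A sketch of two of them; the container, user, database, and status values are assumptions drawn from the examples elsewhere in this guide.

#!/usr/bin/env bash
# Threshold checks for job queue backlog and database connection pressure
PSQL="docker exec krakenhashes_postgres_1 psql -U postgres -d krakenhashes -tAc"

# Job queue backlog > 100 jobs (assumes pending jobs carry status 'pending')
PENDING=$($PSQL "SELECT COUNT(*) FROM job_executions WHERE status = 'pending';")
if [ "$PENDING" -gt 100 ]; then
    echo "ALERT: job queue backlog is $PENDING"
fi

# Connection pressure (80 active connections used as an example threshold)
CONNS=$($PSQL "SELECT COUNT(*) FROM pg_stat_activity;")
if [ "$CONNS" -gt 80 ]; then
    echo "ALERT: $CONNS database connections open"
fi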

Performance Baselines

Establishing Baselines

Monitor and document normal operating parameters:

  1. System Resource Baselines
     • Normal CPU usage: 20-40% (idle), 60-80% (active jobs)
     • Memory usage: 2-4 GB (base), +1 GB per 1M hashes
     • Database connections: 10-20 (normal load)

  2. Job Performance Baselines
     • Job creation rate: 10-50 jobs/hour
     • Average job duration: varies by attack type
     • Hash processing rate: device-dependent

  3. Agent Performance Baselines
     • Heartbeat interval: 30 seconds
     • Benchmark cache duration: 24 hours
     • GPU utilization: 90-100% during jobs

Benchmark Tracking

The system automatically tracks agent benchmarks:

-- View agent benchmarks
SELECT 
    a.name as agent_name,
    ab.attack_mode,
    ab.hash_type,
    ab.speed,
    ab.updated_at
FROM agents a
JOIN agent_benchmarks ab ON a.id = ab.agent_id
ORDER BY a.name, ab.attack_mode, ab.hash_type;

Performance Degradation Detection

Monitor for performance degradation:

-- Compare current vs historical performance
WITH current_metrics AS (
    SELECT 
        agent_id,
        AVG(value) as current_avg
    FROM agent_performance_metrics
    WHERE metric_type = 'hash_rate'
      AND timestamp > NOW() - INTERVAL '1 hour'
    GROUP BY agent_id
),
historical_metrics AS (
    SELECT 
        agent_id,
        AVG(value) as historical_avg
    FROM agent_performance_metrics
    WHERE metric_type = 'hash_rate'
      AND timestamp BETWEEN NOW() - INTERVAL '1 week' AND NOW() - INTERVAL '1 day'
    GROUP BY agent_id
)
SELECT 
    a.name,
    cm.current_avg,
    hm.historical_avg,
    ((cm.current_avg - hm.historical_avg) / hm.historical_avg * 100) as percent_change
FROM agents a
JOIN current_metrics cm ON a.id = cm.agent_id
JOIN historical_metrics hm ON a.id = hm.agent_id
WHERE ABS((cm.current_avg - hm.historical_avg) / hm.historical_avg) > 0.1;

Monitoring Dashboards and Tools

Built-in Monitoring Endpoints

  1. System Status Dashboard
     • Real-time agent status
     • Active job count
     • System resource usage
     • Recent errors

  2. Job Monitoring Dashboard
     • Job queue status
     • Job progress tracking
     • Success/failure rates
     • Performance trends

  3. Agent Performance Dashboard
     • Agent availability
     • Device utilization
     • Temperature monitoring
     • Hash rate tracking

External Monitoring Integration

The system can be integrated with external monitoring tools:

  1. Prometheus Integration
     • Export metrics via /metrics endpoint (if implemented)
     • Custom metric exporters (see the sketch after this list)
     • Alertmanager integration

  2. Grafana Dashboards
     • PostgreSQL data source
     • Custom dashboard templates
     • Alert visualization

  3. Log Aggregation
     • ELK Stack (Elasticsearch, Logstash, Kibana)
     • Fluentd/Fluent Bit
     • Centralized log analysis
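
Because the /metrics endpoint is optional, one low-effort bridge to Prometheus is the node_exporter textfile collector: a scheduled script writes gauges to a .prom file that node_exporter scrapes. A sketch in which the output path, container name, and credentials are all assumptions.

#!/usr/bin/env bash
# Export a few KrakenHashes gauges for the node_exporter textfile collector
OUT="/var/lib/node_exporter/textfile/krakenhashes.prom"
PSQL="docker exec krakenhashes_postgres_1 psql -U postgres -d krakenhashes -tAc"

RUNNING=$($PSQL "SELECT COUNT(*) FROM job_executions WHERE status = 'running';")
ACTIVE_AGENTS=$($PSQL "SELECT COUNT(*) FROM agents WHERE status = 'active';")

cat > "$OUT" <<EOF
# HELP krakenhashes_running_jobs Job executions currently running
# TYPE krakenhashes_running_jobs gauge
krakenhashes_running_jobs $RUNNING
# HELP krakenhashes_active_agents Agents currently marked active
# TYPE krakenhashes_active_agents gauge
krakenhashes_active_agents $ACTIVE_AGENTS
EOF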

Monitoring Best Practices

  1. Regular Health Checks
     • Automated health check every 30 seconds
     • Alert on 3 consecutive failures
     • Include dependency checks

  2. Capacity Planning
     • Monitor growth trends
     • Plan for peak usage
     • Scale resources proactively

  3. Performance Optimization
     • Regular benchmark updates
     • Query optimization based on metrics
     • Resource allocation tuning

  4. Security Monitoring
     • Audit log analysis
     • Anomaly detection
     • Access pattern monitoring

Troubleshooting Guide

Common issues and monitoring approaches:

  1. High CPU Usage
     • Check active job count
     • Verify agent task distribution
     • Monitor database query performance

  2. Memory Leaks
     • Track memory usage over time
     • Identify growing processes
     • Check for unclosed connections

  3. Slow Job Processing
     • Verify agent benchmarks
     • Check network latency
     • Monitor file I/O performance

  4. Database Performance
     • Analyze slow queries
     • Check index usage
     • Monitor connection pool

Maintenance and Cleanup

Automated Cleanup Services

The system includes several cleanup services:

  1. Metrics Cleanup Service
     • Aggregates real-time metrics to daily/weekly summaries
     • Removes old metrics based on retention policy
     • Runs automatically on schedule

  2. Agent Cleanup Service
     • Marks stale agents as inactive
     • Cleans up orphaned resources
     • Maintains agent health status

  3. Job Cleanup Service
     • Archives completed jobs
     • Removes temporary files
     • Updates job statistics

Manual Maintenance Tasks

# Force cleanup of old metrics
curl -X POST https://localhost:31337/api/admin/force-cleanup

# Vacuum database
docker exec -it krakenhashes_postgres_1 psql -U postgres -d krakenhashes -c "VACUUM ANALYZE;"

# Check database bloat
docker exec -it krakenhashes_postgres_1 psql -U postgres -d krakenhashes -c "
SELECT 
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;"

Conclusion

Effective monitoring is crucial for maintaining a healthy KrakenHashes deployment. Regular monitoring of system health, job performance, and agent metrics ensures optimal operation and early detection of issues. Implement automated alerting for critical metrics and maintain historical data for trend analysis and capacity planning.