System Monitoring Guide¶
This guide covers comprehensive monitoring strategies for KrakenHashes, including system health indicators, performance metrics, logging, and alerting configurations.
Table of Contents¶
- System Health Indicators
- Job Monitoring and Statistics
- Agent Performance Metrics
- Database Monitoring
- Log Analysis and Alerting
- Performance Baselines
- Monitoring Dashboards and Tools
System Health Indicators¶
Health Check Endpoint¶
The system provides a basic health check endpoint for monitoring service availability.
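A minimal availability probe from a shell might look like the following (a sketch: -k assumes the default self-signed certificate, and the port is the default HTTPS port listed below):
# Check overall service availability (expects HTTP 200)
curl -k -s -o /dev/null -w "%{http_code}\n" https://localhost:31337/api/health
# Report the running backend version
curl -k -s https://localhost:31337/api/version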
Service Status Monitoring¶
Monitor the following key services:
- Backend API Service
  - Port: 31337 (HTTPS), 1337 (HTTP for CA cert)
  - Health endpoint: `/api/health`
  - Version endpoint: `/api/version`
- PostgreSQL Database (a quick availability check follows this list)
  - Port: 5432
  - Connection pool status
  - Active connections
- WebSocket Service
  - Agent connections
  - Heartbeat status
  - Connection count
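PostgreSQL availability can be spot-checked with pg_isready inside the database container (the container name matches the maintenance examples later in this guide; adjust it to your compose project):
# Returns "accepting connections" and exit code 0 when the database is healthy
docker exec -it krakenhashes_postgres_1 pg_isready -U postgres -d krakenhashes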
Docker Container Health¶
Monitor container status using Docker commands:
# Check container status
docker-compose ps
# Monitor resource usage
docker stats
# Check container logs
docker-compose logs -f backend
docker-compose logs -f postgres
docker-compose logs -f app
Job Monitoring and Statistics¶
Job Execution Metrics¶
The system tracks comprehensive job execution metrics:
- Job Status Distribution
  - Pending jobs count
  - Running jobs count
  - Completed jobs count
  - Failed jobs count
  - Cancelled jobs count
- Job Performance Indicators
  - Average job completion time
  - Job success rate
  - Hash cracking rate
  - Keyspace coverage
Job Monitoring Endpoints¶
# List all jobs with pagination
GET /api/jobs?page=1&page_size=20
# Get specific job details
GET /api/jobs/{job_id}
# Get job statistics
GET /api/jobs/stats
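These endpoints can be polled from a script, for example (a sketch: it assumes an API token in the KH_TOKEN environment variable sent as a Bearer header, which may differ from your deployment's authentication scheme):
# List the first page of jobs
curl -k -s -H "Authorization: Bearer $KH_TOKEN" "https://localhost:31337/api/jobs?page=1&page_size=20"
# Fetch aggregate job statistics
curl -k -s -H "Authorization: Bearer $KH_TOKEN" "https://localhost:31337/api/jobs/stats"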
Job Progress Tracking¶
Monitor job progress through these metrics:
- Dispatched Percentage: Portion of keyspace distributed to agents
- Searched Percentage: Portion of keyspace actually processed
- Overall Progress: Combined metric considering rule splitting
- Cracked Count: Number of successfully cracked hashes
- Total Speed: Combined hash rate across all agents
Database Queries for Job Monitoring¶
-- Active jobs by status
SELECT status, COUNT(*) as count
FROM job_executions
GROUP BY status;
-- Jobs with high failure rate
SELECT je.id, je.name, je.error_message,
COUNT(jt.id) as total_tasks,
SUM(CASE WHEN jt.status = 'failed' THEN 1 ELSE 0 END) as failed_tasks
FROM job_executions je
JOIN job_tasks jt ON je.id = jt.job_execution_id
WHERE je.status = 'failed'
GROUP BY je.id, je.name, je.error_message
HAVING SUM(CASE WHEN jt.status = 'failed' THEN 1 ELSE 0 END) > 0;
-- Job performance over time
SELECT
DATE_TRUNC('hour', created_at) as hour,
COUNT(*) as jobs_created,
AVG(EXTRACT(EPOCH FROM (completed_at - created_at))) as avg_duration_seconds
FROM job_executions
WHERE completed_at IS NOT NULL
GROUP BY hour
ORDER BY hour DESC;
Agent Performance Metrics¶
Agent Metrics Collection¶
The agent collects and reports the following metrics:
- System Metrics
  - CPU usage percentage
  - Memory usage percentage
  - GPU utilization
  - GPU temperature
  - GPU memory usage
- Performance Metrics
  - Hash rate per device
  - Power usage
  - Fan speed
  - Device temperature

Figure: Device monitoring dashboard showing real-time temperature, utilization, fan speed, and hash rate metrics across multiple agents.

Figure: Alternative view of the device monitoring interface with detailed performance graphs and timeline data.
Agent Monitoring Endpoints¶
# List all agents
GET /api/admin/agents
# Get agent details with devices
GET /api/admin/agents/{agent_id}
# Get agent performance metrics
GET /api/admin/agents/{agent_id}/metrics?timeRange=1h&metrics=temperature,utilization,fanspeed,hashrate
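The query parameters can be combined into a single request, for example (same Bearer-token assumption as the job endpoints above; replace 42 with a real agent ID):
# Pull the last hour of temperature, utilization, fan speed, and hash rate samples for one agent
curl -k -s -H "Authorization: Bearer $KH_TOKEN" "https://localhost:31337/api/admin/agents/42/metrics?timeRange=1h&metrics=temperature,utilization,fanspeed,hashrate"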
Agent Health Monitoring¶
Monitor agent health through:
- Heartbeat Status
  - Last heartbeat timestamp
  - Connection status (active/inactive)
  - Heartbeat interval (30 seconds default)
- Error Tracking
  - Last error message
  - Error frequency
  - Recovery status
Database Queries for Agent Monitoring¶
-- Agents with stale heartbeats
SELECT id, name, last_heartbeat, status
FROM agents
WHERE last_heartbeat < NOW() - INTERVAL '5 minutes'
AND status = 'active';
-- Agent performance metrics
SELECT
a.name as agent_name,
apm.metric_type,
AVG(apm.value) as avg_value,
MAX(apm.value) as max_value,
MIN(apm.value) as min_value
FROM agents a
JOIN agent_performance_metrics apm ON a.id = apm.agent_id
WHERE apm.timestamp > NOW() - INTERVAL '1 hour'
GROUP BY a.name, apm.metric_type;
-- GPU device utilization
SELECT
a.name as agent_name,
apm.device_name,
apm.metric_type,
AVG(apm.value) as avg_utilization
FROM agents a
JOIN agent_performance_metrics apm ON a.id = apm.agent_id
WHERE apm.metric_type = 'utilization'
AND apm.timestamp > NOW() - INTERVAL '1 hour'
GROUP BY a.name, apm.device_name, apm.metric_type
ORDER BY avg_utilization DESC;
Database Monitoring¶
Connection Pool Monitoring¶
Monitor database connection health:
-- Active connections by state
SELECT state, COUNT(*)
FROM pg_stat_activity
GROUP BY state;
-- Long-running queries
SELECT
pid,
now() - pg_stat_activity.query_start AS duration,
query,
state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes';
-- Database size growth
SELECT
pg_database.datname,
pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
ORDER BY pg_database_size(pg_database.datname) DESC;
Table Statistics¶
Monitor table growth and performance:
-- Table sizes
SELECT
schemaname AS table_schema,
tablename AS table_name,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS data_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
-- Index usage
SELECT
schemaname,
tablename,
indexname,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY idx_scan DESC;
Performance Metrics Tables¶
The system maintains dedicated tables for performance tracking:
- agent_metrics - Real-time agent system metrics
- agent_performance_metrics - Detailed performance data with aggregation
- job_performance_metrics - Job execution performance tracking
- agent_benchmarks - Hashcat benchmark results per agent
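A quick way to gauge how much data these tables hold (and whether the cleanup services described later are keeping up) is a row-count check, run here through the same psql invocation used in the maintenance section:
docker exec -it krakenhashes_postgres_1 psql -U postgres -d krakenhashes -c "
SELECT relname AS table_name, n_live_tup AS approx_rows
FROM pg_stat_user_tables
WHERE relname IN ('agent_metrics', 'agent_performance_metrics',
                  'job_performance_metrics', 'agent_benchmarks')
ORDER BY n_live_tup DESC;"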
Log Analysis and Alerting¶
Log Configuration¶
Configure logging through environment variables:
# Enable debug logging
export DEBUG=true
# Set log level (DEBUG, INFO, WARNING, ERROR)
export LOG_LEVEL=INFO
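With the Docker deployment these variables are typically supplied through the compose environment, for example (a sketch, assuming the compose file passes them through to the backend container):
# Recreate the backend with verbose logging enabled
DEBUG=true LOG_LEVEL=DEBUG docker-compose up -d backend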
Log Locations¶
When running with Docker, logs are written to the host directory mounted for log output. The layout below shows an example installation path; substitute your own log directory:
/home/zerkereod/Programming/passwordCracking/kh-backend/logs/krakenhashes/
├── backend/ # Backend application logs
├── postgres/ # PostgreSQL logs
└── nginx/ # Nginx/frontend logs
Log Format¶
The system uses structured logging.
Key Log Patterns to Monitor¶
- Error Patterns

# Find all errors across logs
grep -i "error" /home/zerkereod/Programming/passwordCracking/kh-backend/logs/krakenhashes/*/*.log
# Find database connection errors
grep -i "database.*error\|connection.*failed" logs/backend/*.log
# Find agent disconnections
grep -i "agent.*disconnect\|websocket.*close" logs/backend/*.log

- Performance Issues (see the example patterns below)

- Security Events (see the example patterns below)
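The following patterns are only starting points for the performance and security categories; the exact strings depend on the messages your build emits, so verify them against your own logs before relying on them:
# Possible performance-related entries (slow queries, timeouts); patterns are assumptions
grep -iE "slow|timeout|timed out" logs/backend/*.log
# Possible security-related entries (authentication and authorization failures); patterns are assumptions
grep -iE "failed login|login failed|unauthorized|forbidden" logs/backend/*.log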
Alert Configuration¶
Set up alerts for critical events:
- System Health Alerts
  - Service down (health check fails; see the example script after this list)
  - Database connection pool exhausted
  - High error rate (>5% of requests)
- Performance Alerts
  - CPU usage > 90% for 5 minutes
  - Memory usage > 85%
  - Database query time > 5 seconds
  - Job queue backlog > 100 jobs
- Security Alerts
  - Multiple failed login attempts
  - Unauthorized API access attempts
  - Agent registration anomalies
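As a minimal example of a "service down" alert, a cron-driven script can probe the health endpoint and raise an alert only after repeated failures (a sketch; wire the final action into whatever notification channel you use):
#!/usr/bin/env bash
# Probe the health endpoint three times; alert only if every attempt fails.
FAILURES=0
for attempt in 1 2 3; do
  if ! curl -k -s -f -o /dev/null https://localhost:31337/api/health; then
    FAILURES=$((FAILURES + 1))
  fi
  sleep 10
done

if [ "$FAILURES" -eq 3 ]; then
  # Replace with your alerting mechanism (email, Slack webhook, PagerDuty, etc.)
  logger -p user.err "KrakenHashes health check failed 3 consecutive times"
fi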
Performance Baselines¶
Establishing Baselines¶
Monitor and document normal operating parameters:
- System Resource Baselines
  - Normal CPU usage: 20-40% (idle), 60-80% (active jobs)
  - Memory usage: 2-4GB (base), +1GB per 1M hashes
  - Database connections: 10-20 (normal load)
- Job Performance Baselines
  - Job creation rate: 10-50 jobs/hour
  - Average job duration: varies by attack type
  - Hash processing rate: device-dependent
- Agent Performance Baselines
  - Heartbeat interval: 30 seconds
  - Benchmark cache duration: 24 hours
  - GPU utilization: 90-100% during jobs
Benchmark Tracking¶
The system automatically tracks agent benchmarks:
-- View agent benchmarks
SELECT
a.name as agent_name,
ab.attack_mode,
ab.hash_type,
ab.speed,
ab.updated_at
FROM agents a
JOIN agent_benchmarks ab ON a.id = ab.agent_id
ORDER BY a.name, ab.attack_mode, ab.hash_type;
Performance Degradation Detection¶
Monitor for performance degradation:
-- Compare current vs historical performance
WITH current_metrics AS (
SELECT
agent_id,
AVG(value) as current_avg
FROM agent_performance_metrics
WHERE metric_type = 'hash_rate'
AND timestamp > NOW() - INTERVAL '1 hour'
GROUP BY agent_id
),
historical_metrics AS (
SELECT
agent_id,
AVG(value) as historical_avg
FROM agent_performance_metrics
WHERE metric_type = 'hash_rate'
AND timestamp BETWEEN NOW() - INTERVAL '1 week' AND NOW() - INTERVAL '1 day'
GROUP BY agent_id
)
SELECT
a.name,
cm.current_avg,
hm.historical_avg,
((cm.current_avg - hm.historical_avg) / hm.historical_avg * 100) as percent_change
FROM agents a
JOIN current_metrics cm ON a.id = cm.agent_id
JOIN historical_metrics hm ON a.id = hm.agent_id
WHERE ABS((cm.current_avg - hm.historical_avg) / hm.historical_avg) > 0.1;
Monitoring Dashboards and Tools¶
Built-in Monitoring Endpoints¶
- System Status Dashboard
  - Real-time agent status
  - Active job count
  - System resource usage
  - Recent errors
- Job Monitoring Dashboard
  - Job queue status
  - Job progress tracking
  - Success/failure rates
  - Performance trends
- Agent Performance Dashboard
  - Agent availability
  - Device utilization
  - Temperature monitoring
  - Hash rate tracking
External Monitoring Integration¶
The system can be integrated with external monitoring tools:
- Prometheus Integration
  - Export metrics via a /metrics endpoint (if implemented)
  - Custom metric exporters (an example exporter setup follows this list)
  - Alert manager integration
- Grafana Dashboards
  - PostgreSQL data source
  - Custom dashboard templates
  - Alert visualization
- Log Aggregation
  - ELK Stack (Elasticsearch, Logstash, Kibana)
  - Fluentd/Fluent Bit
  - Centralized log analysis
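If the backend does not expose a /metrics endpoint, one practical starting point is the community PostgreSQL exporter pointed at the KrakenHashes database (a sketch; the credentials, host networking, and image choice are assumptions to adapt to your environment):
# Expose PostgreSQL metrics on :9187 for Prometheus to scrape
docker run -d --name kh-postgres-exporter \
  -e DATA_SOURCE_NAME="postgresql://postgres:<password>@localhost:5432/krakenhashes?sslmode=disable" \
  --network host \
  quay.io/prometheuscommunity/postgres-exporter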
Monitoring Best Practices¶
- Regular Health Checks
  - Automated health check every 30 seconds
  - Alert on 3 consecutive failures
  - Include dependency checks
- Capacity Planning
  - Monitor growth trends
  - Plan for peak usage
  - Scale resources proactively
- Performance Optimization
  - Regular benchmark updates
  - Query optimization based on metrics
  - Resource allocation tuning
- Security Monitoring
  - Audit log analysis
  - Anomaly detection
  - Access pattern monitoring
Troubleshooting Guide¶
Common issues and monitoring approaches:
- High CPU Usage
  - Check active job count
  - Verify agent task distribution
  - Monitor database query performance
- Memory Leaks
  - Track memory usage over time
  - Identify growing processes
  - Check for unclosed connections
- Slow Job Processing
  - Verify agent benchmarks
  - Check network latency
  - Monitor file I/O performance
- Database Performance
  - Analyze slow queries (see the query sketch after this list)
  - Check index usage
  - Monitor connection pool
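For the slow-query analysis, the pg_stat_statements extension records cumulative timings per statement. This only works if the extension is installed and enabled in your PostgreSQL instance, and the column names below assume PostgreSQL 13 or newer:
docker exec -it krakenhashes_postgres_1 psql -U postgres -d krakenhashes -c "
SELECT query, calls,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       round(total_exec_time::numeric, 2) AS total_ms
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;"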
Maintenance and Cleanup¶
Automated Cleanup Services¶
The system includes several cleanup services:
- Metrics Cleanup Service
  - Aggregates real-time metrics to daily/weekly
  - Removes old metrics based on retention policy
  - Runs automatically on schedule
- Agent Cleanup Service
  - Marks stale agents as inactive
  - Cleans up orphaned resources
  - Maintains agent health status
- Job Cleanup Service
  - Archives completed jobs
  - Removes temporary files
  - Updates job statistics
Manual Maintenance Tasks¶
# Force cleanup of old metrics
curl -X POST https://localhost:31337/api/admin/force-cleanup
# Vacuum database
docker exec -it krakenhashes_postgres_1 psql -U postgres -d krakenhashes -c "VACUUM ANALYZE;"
# Check database bloat
docker exec -it krakenhashes_postgres_1 psql -U postgres -d krakenhashes -c "
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;"
Conclusion¶
Effective monitoring is crucial for maintaining a healthy KrakenHashes deployment. Regular monitoring of system health, job performance, and agent metrics ensures optimal operation and early detection of issues. Implement automated alerting for critical metrics and maintain historical data for trend analysis and capacity planning.