# Storage Architecture

## Overview
KrakenHashes implements a centralized file storage system with intelligent deduplication, hash verification, and performance optimizations. This guide covers the storage architecture, capacity planning, and maintenance procedures.
## Storage Directory Structure

The system organizes files into a hierarchical structure under the configured data directory (default: `/var/lib/krakenhashes` in Docker, `~/.krakenhashes-data` locally):

```text
/var/lib/krakenhashes/                  # Root data directory (KH_DATA_DIR)
├── binaries/                           # Hashcat/John binaries
│   ├── hashcat_7.6.0_linux64.tar.gz
│   └── john_1.9.0_linux64.tar.gz
├── wordlists/                          # Wordlist files by category
│   ├── general/                        # Common wordlists
│   ├── specialized/                    # Domain-specific lists
│   ├── targeted/                       # Custom targeted lists
│   └── custom/                         # User-uploaded lists
├── rules/                              # Rule files by type
│   ├── hashcat/                        # Hashcat-compatible rules
│   ├── john/                           # John-compatible rules
│   └── custom/                         # Custom rule sets
├── hashlists/                          # Processed hashlist files
│   ├── 1.hash                          # Uncracked hashes for job ID 1
│   └── 2.hash                          # Uncracked hashes for job ID 2
├── hashlist_uploads/                   # Temporary upload storage
│   └── <user-id>/                      # User-specific upload directories
└── local/                              # Extracted binaries (server-side)
```
### Directory Permissions

All directories are created with mode `0750` (`rwxr-x---`) to ensure:

- Owner has full access
- Group has read and execute access
- Others have no access
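If directories are created or copied outside the application (for example, during a manual restore), the expected mode can be reapplied by hand. A minimal sketch, assuming the default data directory and that the files are owned by the account your deployment runs as:

```bash
# Reapply the expected directory mode under the data root (adjust the path/owner for your install)
sudo find /var/lib/krakenhashes -type d -exec chmod 0750 {} +

# Spot-check the result
stat -c '%a %U:%G %n' /var/lib/krakenhashes /var/lib/krakenhashes/*
```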
## File Deduplication and Hash Verification

### MD5-Based Deduplication

KrakenHashes uses MD5 hashes for file deduplication across all resource types:

1. **Upload Processing**
   - Calculate MD5 hash of uploaded file
   - Check database for existing file with same hash
   - If exists, reference existing file instead of storing duplicate
   - If new, store file and record hash in database

2. **Verification States**
   - `pending` - File uploaded but not yet verified
   - `verified` - File hash matches database record
   - `failed` - Hash mismatch or file corrupted
   - `deleted` - File removed from storage

3. **Database Schema** (see the illustrative query below)
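The deduplication check can be reproduced by hand when investigating duplicate uploads. A minimal sketch, assuming the dockerized Postgres from the compose setup; the `wordlists` table also appears in the backup example later in this guide, but the `md5_hash` column name is an illustrative assumption, not the authoritative schema:

```bash
# Compute the MD5 of a file before upload
HASH=$(md5sum /path/to/custom-wordlist.txt | awk '{print $1}')

# Check whether a wordlist with the same hash is already registered
# (column name is an illustrative placeholder, not the authoritative schema)
docker-compose exec -T postgres psql -U krakenhashes -d krakenhashes -c \
  "SELECT id, name FROM wordlists WHERE md5_hash = '${HASH}';"
```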
### File Synchronization

The agent file sync system ensures consistency across distributed agents:

1. **Sync Protocol**
   - Agent reports current files with MD5 hashes
   - Server compares against master file list
   - Server sends list of files to download
   - Agent downloads only missing/changed files

2. **Hash Verification**
   - Files are verified after download
   - Failed verifications trigger re-download
   - Corrupted files are automatically replaced
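When debugging sync problems, the same verification can be repeated manually on an agent. A minimal sketch, assuming the agent's local store is exported as `AGENT_DATA_DIR` (a placeholder; the real location depends on how the agent was installed):

```bash
# Recompute MD5 hashes for everything in the agent's local store and write a manifest
find "$AGENT_DATA_DIR" -type f -exec md5sum {} + > /tmp/agent_manifest.txt

# Compare one file against the hash the server expects for it
md5sum "$AGENT_DATA_DIR/wordlists/general/rockyou.txt"
```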
## Storage Requirements and Capacity Planning

### Estimating Storage Needs

Calculate storage requirements based on:

1. **Wordlists**
   - Common wordlists: 10-50 GB
   - Specialized lists: 50-200 GB
   - Large collections: 500+ GB

2. **Rules**
   - Basic rule sets: 1-10 MB
   - Comprehensive sets: 100-500 MB

3. **Hashlists**
   - Original uploads: Variable
   - Processed files: ~32 bytes per hash
   - Example: 1M hashes ≈ 32 MB (see the quick check below)

4. **Binaries**
   - Hashcat package: ~100 MB
   - John package: ~50 MB
   - Multiple versions: Plan for 3-5 versions
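The per-hash figure makes rough sizing easy to script. A minimal sketch that turns an expected hash count into a size estimate using the ~32 bytes per hash stated above:

```bash
# Rough size estimate for a processed hashlist (~32 bytes per hash)
HASH_COUNT=1000000
echo "$((HASH_COUNT * 32 / 1000000)) MB"   # 32 MB for 1M hashes

# Measure what is already on disk
du -sh /var/lib/krakenhashes/hashlists
```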
### Recommended Minimums

| Deployment Size | Storage | Rationale                          |
|-----------------|---------|------------------------------------|
| Development     | 50 GB   | Basic wordlists and testing        |
| Small Team      | 200 GB  | Standard wordlists + custom data   |
| Enterprise      | 1 TB+   | Comprehensive wordlists + history  |
### Growth Considerations

- Hashlist accumulation: ~10-20% monthly growth typical
- Wordlist expansion: New lists added periodically
- Binary versions: Keep 3-5 recent versions
- Backup overhead: 2x storage for full backups
## Backup Considerations

### What to Backup

1. **Critical Data**
   - PostgreSQL database (contains all metadata)
   - Custom wordlists and rules
   - Configuration files (`/etc/krakenhashes`)

2. **Recoverable Data**
   - Standard wordlists (can be re-downloaded)
   - Binaries (can be re-downloaded)
   - Processed hashlists (can be regenerated)
### Backup Strategy

```bash
#!/bin/bash
# Example backup script

# Backup database
pg_dump -h postgres -U krakenhashes krakenhashes > backup/db_$(date +%Y%m%d).sql

# Backup custom data
rsync -av /var/lib/krakenhashes/wordlists/custom/ backup/wordlists/
rsync -av /var/lib/krakenhashes/rules/custom/ backup/rules/
rsync -av /etc/krakenhashes/ backup/config/

# Backup file metadata (-T avoids TTY allocation so the redirect stays clean)
docker-compose exec -T backend \
    psql -c "COPY (SELECT * FROM wordlists) TO STDOUT CSV" > backup/wordlists_meta.csv
```
### Restore Procedures

1. **Database Restore** (load the SQL dump produced by the backup script; see the sketch below)
2. **File Restore** (copy custom wordlists, rules, and configuration back into place; see the sketch below)
3. **Verify Integrity**
   - Run file verification for all restored files
   - Check MD5 hashes against database records
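A minimal restore sketch that mirrors the backup script above (the `backup/` layout, hostnames, and credentials are assumptions carried over from that example):

```bash
#!/bin/bash
# Restore the database dump produced by the backup script (substitute the dated file you need)
psql -h postgres -U krakenhashes krakenhashes < backup/db_20240101.sql

# Restore custom data and configuration
rsync -av backup/wordlists/ /var/lib/krakenhashes/wordlists/custom/
rsync -av backup/rules/ /var/lib/krakenhashes/rules/custom/
rsync -av backup/config/ /etc/krakenhashes/
```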
## Performance Optimization

### File System Considerations

1. **File System Choice**
   - ext4: Good general performance
   - XFS: Better for large files
   - ZFS: Built-in deduplication and compression

2. **Mount Options** (see the example below)

3. **Storage Layout**
   - Use separate volumes for different data types
   - Consider SSD for hashlists (frequent reads)
   - HDDs acceptable for wordlists (sequential reads)
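For the mount options item above, a common starting point is disabling access-time updates on the data volume. A minimal sketch, assuming the data directory sits on its own partition (the device, filesystem type, and `/etc/fstab` entry are placeholders to adapt):

```bash
# Example /etc/fstab entry for a dedicated data volume (device and fs type are placeholders)
# /dev/sdb1  /var/lib/krakenhashes  ext4  defaults,noatime,nodiratime  0  2

# Apply changed mount options without a reboot
sudo mount -o remount /var/lib/krakenhashes
```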
### Caching Strategy

1. **Application-Level Caching**
   - Recently used wordlists kept in memory
   - Hash type definitions cached
   - File metadata cached for 15 minutes

2. **File System Caching**
   - Linux page cache handles frequently accessed files
   - Monitor with `free -h` and adjust `vm.vfs_cache_pressure`
### I/O Optimization

```bash
# Tune kernel parameters for better I/O
echo 'vm.dirty_ratio = 5' >> /etc/sysctl.conf
echo 'vm.dirty_background_ratio = 2' >> /etc/sysctl.conf
echo 'vm.vfs_cache_pressure = 50' >> /etc/sysctl.conf
sysctl -p
```
## Docker Volume Management

### Volume Configuration

Docker Compose creates named volumes for persistent storage:

```yaml
volumes:
  krakenhashes_data:            # Main data directory
    name: krakenhashes_app_data
  postgres_data:                # Database storage
    name: krakenhashes_postgres_data
```
### Volume Operations

1. **Inspect Volumes**
2. **Backup Volumes**
3. **Restore Volumes**

Example commands for each operation are sketched below.
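A minimal sketch of the three operations using standard Docker tooling, assuming the volume names from the compose file above (stop the stack before backing up or restoring the database volume):

```bash
# 1. Inspect volumes: show mountpoint and driver details
docker volume inspect krakenhashes_app_data krakenhashes_postgres_data

# 2. Backup a volume to a tarball using a throwaway container
docker run --rm \
  -v krakenhashes_app_data:/data:ro \
  -v "$(pwd)":/backup \
  alpine tar czf /backup/krakenhashes_app_data.tar.gz -C /data .

# 3. Restore the tarball into a (new or existing) volume
docker run --rm \
  -v krakenhashes_app_data:/data \
  -v "$(pwd)":/backup \
  alpine tar xzf /backup/krakenhashes_app_data.tar.gz -C /data
```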
### Storage Driver Optimization

For production deployments, verify that Docker is using a modern storage driver (overlay2 is the default on current kernels) and keep volume data on a filesystem tuned as described above.
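A quick way to check, using the standard Docker CLI:

```bash
# Show the storage driver currently in use by the Docker daemon
docker info --format '{{.Driver}}'

# Show where Docker keeps its data (move this to fast storage if needed)
docker info --format '{{.DockerRootDir}}'
```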
## File Cleanup and Maintenance

### Automated Cleanup

The system includes automated cleanup for:

1. **Temporary Upload Files**
   - Deleted after successful processing
   - Orphaned files cleaned after 24 hours

2. **Old Hashlist Files**
   - Configurable retention period
   - Default: Keep for job lifetime + 30 days
### Manual Cleanup Procedures

1. **Remove Orphaned Files**
2. **Clean Old Hashlists**
3. **Vacuum Database**

Example commands for each step are sketched below.
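A minimal sketch of each step, assuming the default data directory and the dockerized Postgres from the compose file. The `find` commands only print candidates; review the output (and confirm the owning jobs are finished) before adding `-delete`:

```bash
# 1. Remove orphaned temporary uploads older than a day
find /var/lib/krakenhashes/hashlist_uploads -type f -mtime +1 -print

# 2. List processed hashlist files untouched for 30+ days
find /var/lib/krakenhashes/hashlists -name '*.hash' -mtime +30 -print

# 3. Reclaim dead space and refresh planner statistics in Postgres
docker-compose exec -T postgres psql -U krakenhashes -d krakenhashes -c "VACUUM (ANALYZE, VERBOSE);"
```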
### Storage Monitoring

1. **Disk Usage Monitoring**
2. **File Count Monitoring**

Example commands are sketched below.
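A minimal sketch using standard coreutils, assuming the default data directory:

```bash
# Disk usage: overall and per resource type
df -h /var/lib/krakenhashes
du -sh /var/lib/krakenhashes/*

# File counts: useful for spotting runaway temp uploads or orphaned hashlists
find /var/lib/krakenhashes/hashlists -type f | wc -l
find /var/lib/krakenhashes/hashlist_uploads -type f | wc -l
```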
## Best Practices

1. **Regular Maintenance Schedule**
   - Weekly: Check disk usage and clean temp files
   - Monthly: Verify file integrity and clean old hashlists
   - Quarterly: Full backup and storage audit

2. **Monitoring Alerts**
   - Set up alerts for >80% disk usage
   - Monitor file verification failures
   - Track deduplication efficiency

3. **Documentation**
   - Document custom wordlist sources
   - Maintain changelog for rule modifications
   - Record storage growth trends
## Troubleshooting

### Common Issues

1. **Disk Space Exhaustion**
2. **File Verification Failures**
3. **Permission Issues**

Diagnostic starting points for each issue are sketched below.
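A minimal set of first checks, assuming the default paths used throughout this guide; exact remediation depends on your deployment, and `<file>` is a placeholder for whichever file failed verification:

```bash
# Disk space exhaustion: find what is consuming the data volume
df -h /var/lib/krakenhashes && du -sh /var/lib/krakenhashes/* | sort -h

# File verification failures: recompute the file's MD5 and compare it to the database record
md5sum /var/lib/krakenhashes/wordlists/custom/<file>

# Permission issues: confirm ownership and the expected 0750 directory mode
ls -ld /var/lib/krakenhashes /var/lib/krakenhashes/*
```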