# File Hash Cache System

## Overview
The File Hash Cache is an in-memory caching system that optimizes the directory monitor service by eliminating redundant MD5 hash calculations for unchanged files. This dramatically reduces disk I/O and prevents SSD wear in deployments with large wordlists and rule files.
## Problem Statement
The directory monitor service runs every 30 seconds to detect changes in wordlist and rule files. Before the cache implementation:
- Every scan calculated MD5 hashes for ALL files regardless of whether they changed
- For large wordlists (10-15GB+), this caused ~500MB/s constant disk I/O
- SSD wear: Continuous reading would rapidly wear out solid-state drives
- Resource waste: CPU cycles spent hashing unchanged files
## Solution Architecture

### File Hash Cache (`backend/internal/cache/filehash/cache.go`)
The cache stores file metadata alongside cached hash values:
```go
type CachedFileInfo struct {
    Path    string
    ModTime time.Time
    Size    int64
    MD5Hash string
}

type Cache struct {
    entries map[string]CachedFileInfo
    mu      sync.RWMutex
}
```
Key Features:
- ModTime+Size Validation: Before recalculating MD5, the cache checks if the file's modification time and size have changed
- Thread-Safe: Uses RWMutex for concurrent read access with exclusive writes
- Self-Populating: Cache entries are created on first access via `GetOrCalculate()`
- Background Population: Asynchronous startup population to avoid blocking server start
### Cache Lookup Flow
```
GetOrCalculate(filePath)
│
├── os.Stat(filePath) → Get current modTime, size
│
├── RLock → Check cache
│   │
│   └── Cache hit? (modTime AND size match)
│       │
│       ├── YES → Return cached hash (no disk read)
│       │
│       └── NO  → Calculate MD5, update cache
│
└── Return hash
```
### Integration Points
```
┌─────────────────────┐
│      main.go        │
│  (creates cache)    │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   MonitorService    │
│  (receives cache)   │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  DirectoryMonitor   │
│  (uses cache for    │
│   hash lookups)     │
└─────────────────────┘
```
## Potfile Hash History

### The Problem
During heavy crack ingestion (e.g., processing 24 million cracked passwords over several hours):
- Potfile MD5 changes every few seconds as batches are written
- Agent downloads potfile (takes ~30 seconds for large files)
- By the time download completes, the potfile has changed
- Agent's hash doesn't match current hash → triggers re-download
- Infinite loop: Agent continuously re-downloads the potfile
### The Solution: Rolling Hash History
The PotfileHistory maintains a 5-minute window of recent potfile hashes:
```go
type PotfileHashEntry struct {
    MD5Hash   string
    Timestamp time.Time
    Size      int64
}

type PotfileHistory struct {
    entries []PotfileHashEntry
    maxAge  time.Duration // 5 minutes
    mu      sync.RWMutex
}
```
### How It Works
- Recording: After each potfile update, the new MD5 hash is added to the history
- Validation: When an agent reports its potfile hash, the system checks if it matches ANY hash in the 5-minute window
- Acceptance: If the agent's hash is in the history, the potfile is considered "in sync enough"
- Expiration: After ingestion stops, old hashes expire, ensuring eventual consistency
### Flow During Heavy Ingestion
```
Heavy Ingestion Scenario:

t=0:   Batch N written → MD5_N added to history
t=5:   Agent starts downloading potfile
t=10:  Batch N+1 written → MD5_N+1 added to history
t=35:  Agent finishes download with MD5_N
t=35:  File sync check: "Is MD5_N valid?"
       → potfileHistory.IsValid(MD5_N) = TRUE (within 5-min window)
       → Agent skips re-download

After ingestion stops (5+ minutes idle):

t=340: Old hashes expire from history
t=345: Next agent sync: only current MD5 in history
t=350: Agent with old MD5 → IsValid() = FALSE → downloads latest
```
### WebSocket Handler Integration

In `determineFilesToSync()`, the potfile check runs before the standard MD5 comparison:
```go
for _, file := range backendFiles {
    // key construction omitted in this excerpt
    agentFile, exists := agentFileMap[key]

    // Special handling for potfile during heavy ingestion
    if file.FileType == "wordlist" && strings.HasSuffix(file.Name, "potfile.txt") {
        if exists && h.potfileHistory.IsValid(agentFile.MD5Hash) {
            // Agent has a recent valid potfile - skip re-download
            continue
        }
    }

    // Normal comparison for all other files
    if !exists || agentFile.MD5Hash != file.MD5Hash {
        filesToSync = append(filesToSync, file)
    }
}
```
## Implementation Details

### Files Created
| File | Purpose |
|---|---|
| `backend/internal/cache/filehash/cache.go` | File hash cache with modTime+size validation |
| `backend/internal/cache/filehash/potfile_history.go` | Rolling 5-minute potfile hash history |
### Files Modified
| File | Changes |
|---|---|
| `backend/internal/monitor/directory_monitor.go` | Inject cache, use `GetOrCalculate()` |
| `backend/internal/services/monitor_service.go` | Accept and pass cache to DirectoryMonitor |
| `backend/cmd/server/main.go` | Create cache and history, wire dependencies |
| `backend/internal/services/potfile_service.go` | Add hash to history after updates |
| `backend/internal/handlers/websocket/handler.go` | Check potfile history during sync |
| `backend/internal/routes/websocket_with_jobs.go` | Pass potfileHistory to handler |
## Performance Metrics

### Before vs After
| Metric | Before | After | Improvement |
|---|---|---|---|
| Disk I/O (steady state) | ~500 MB/s | Near zero | 99%+ reduction |
| MD5 calculations per cycle | All files | Only changed files | Variable |
| Memory usage | N/A | ~100 bytes/file | Minimal |
| Agent potfile re-downloads during ingestion | Continuous | None | 100% reduction |
### Memory Footprint
- File hash cache: ~100 bytes per file entry
- Potfile history: ~50 bytes per entry, pruned every 5 minutes
- Typical deployment: <10MB total memory overhead
## Configuration
No configuration required. The file hash cache and potfile history are:
- Automatically initialized at server startup
- Self-managing (automatic population and expiration)
- Transparent to users and administrators
### Startup Behavior
- Cache is created empty
- Background goroutine populates cache by walking directories
- Server continues starting without waiting for population
- Cache entries are also created on-demand during directory scans
### Skip Patterns
The following patterns are excluded from cache population:
- `potfile.txt` - Handled separately via potfile history
- `association/` - Association attack wordlists are job-specific
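A minimal sketch of the exclusion check, assuming a suffix match for the potfile and a prefix match for the association directory; the helper name and exact matching rules are guesses, not the real implementation.

```go
package main

import "strings"

// shouldSkip reports whether a path (relative to the monitored root)
// is excluded from cache population.
func shouldSkip(relPath string) bool {
	if strings.HasSuffix(relPath, "potfile.txt") {
		return true // handled via the rolling potfile hash history instead
	}
	if strings.HasPrefix(relPath, "association/") {
		return true // job-specific association wordlists are not cached
	}
	return false
}
```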
## Debugging

### Log Messages
Cache activity is logged at DEBUG level:
```
DEBUG: Skipping wordlist with unchanged hash: general/crackstation.txt
DEBUG: Skipping rule with unchanged hash: hashcat/best64.rule
DEBUG: Agent has valid recent potfile hash 8ef087e..., skipping sync
```
### Verifying Cache is Working
Check backend logs for "Skipping ... with unchanged hash" messages during directory monitor cycles.
### Verifying Potfile History
Look for "Agent has valid recent potfile hash" messages during agent file sync operations.
## Risk Assessment
| Risk | Likelihood | Mitigation |
|---|---|---|
| Cache returns stale hash | Low | modTime+size checked on every access |
| Memory exhaustion | Very Low | ~100 bytes per file, bounded by filesystem |
| Concurrency issues | Low | RWMutex pattern, proven in production |
| Potfile sync issues | Low | 5-minute window ensures eventual consistency |
## Related Documentation
- Job Update System - How file changes trigger job updates
- Potfile Management - Potfile operational guide
- Performance Tuning - General performance optimization