Observability and Monitoring¶
Overview¶
This document defines the observability strategy for the ALPR platform, covering logging, metrics, alerting, dashboards, and log retention across the central server and edge collectors.
Implementation Details: See Logging PRP for logging implementation patterns, code examples, and best practices.
Log Aggregation Platform¶
TBD: Log aggregation platform not yet selected.
Options under consideration:
- Grafana Loki (lightweight, pairs with Prometheus)
- ELK Stack (Elasticsearch, Logstash, Kibana)
- Datadog (SaaS, APM integration)
- AWS CloudWatch (if deploying to AWS)
Until a platform is selected, keep logs platform-agnostic by following the structured logging standards defined in this document.
Logging Standards¶
Format: Single-Line JSON¶
All logs MUST be single-line JSON with timestamp as the first field:
{"timestamp":"2025-01-15T12:34:56.789Z","level":"INFO","service":"central","src_ip":"192.168.1.100","path":"/api/v1/events/batch","method":"POST","status_code":201,"duration_ms":45,"request_id":"550e8400-e29b-41d4-a716-446655440000","collector_id":"col_abc123","message":"Batch accepted","events_count":50}
Why Single-Line JSON:
- ELK: Direct JSON parsing without grok patterns
- Splunk: Automatic field extraction
- Datadog: Native JSON parsing
- Loki: LogQL JSON field extraction
- CloudWatch: Filterable dimensions
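As an illustration of the format rule, the following is a minimal standard-library sketch of a formatter that emits one JSON object per line with `timestamp` first. It is not the project's implementation (the checklist names structlog, and the Logging PRP holds the actual patterns); the class name `SingleLineJsonFormatter` and the `extra={"fields": ...}` convention are assumptions made for this example.

```python
import json
import logging
from datetime import datetime, timezone


class SingleLineJsonFormatter(logging.Formatter):
    """Render each record as one JSON line with `timestamp` as the first key."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        moment = datetime.fromtimestamp(record.created, tz=timezone.utc)
        # Dicts keep insertion order, so `timestamp` is serialized first.
        event = {
            "timestamp": moment.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Hypothetical convention: context arrives via extra={"fields": {...}}.
        event.update(getattr(record, "fields", {}))
        # json.dumps escapes newlines inside values, keeping the output on one line.
        return json.dumps(event, separators=(",", ":"), default=str)


# Usage: attach the formatter to a handler and log with structured context.
handler = logging.StreamHandler()
handler.setFormatter(SingleLineJsonFormatter(service="central"))
logger = logging.getLogger("alpr.example")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Batch accepted", extra={"fields": {"status_code": 201, "events_count": 50}})
```

Relying on dictionary insertion order keeps the timestamp-first guarantee in one place, so callers cannot reorder it by accident.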
Required Fields¶
| Field | When Required | Type | Description |
|---|---|---|---|
| `timestamp` | Always (FIRST) | string | ISO 8601 with timezone |
| `level` | Always | string | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| `service` | Always | string | "central", "edge", "worker" |
| `message` | Always | string | Human-readable description |
| `src_ip` | HTTP requests | string | Client IP address |
| `path` | HTTP requests | string | Request path |
| `method` | HTTP requests | string | HTTP method |
| `status_code` | HTTP requests | number | Response status code |
| `duration_ms` | Timed operations | number | Duration in milliseconds |
| `request_id` | When available | string | UUID for correlation |
| `user_id` | When authenticated | string | User identifier |
| `collector_id` | Edge operations | string | Collector identifier |
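The required-field rules above can be checked mechanically, for example in a unit test that inspects emitted log lines. The sketch below is one possible checker, not part of the platform; `validate_event` and its `http_request` flag are hypothetical names, and only the "Always" and "HTTP requests" rows are covered.

```python
from numbers import Number

# Field names and types are taken from the Required Fields table.
ALWAYS_REQUIRED = {"timestamp": str, "level": str, "service": str, "message": str}
HTTP_REQUIRED = {"src_ip": str, "path": str, "method": str, "status_code": Number}
LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def validate_event(event: dict, http_request: bool = False) -> list:
    """Return a list of problems; an empty list means the event conforms."""
    problems = []
    required = dict(ALWAYS_REQUIRED)
    if http_request:
        required.update(HTTP_REQUIRED)
    for field, expected in required.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int, so reject it explicitly for numeric fields.
        elif isinstance(event[field], bool) or not isinstance(event[field], expected):
            problems.append(f"wrong type for {field}")
    # Key order is preserved when a JSON line is parsed into a dict.
    if event and next(iter(event)) != "timestamp":
        problems.append("timestamp must be the first field")
    if "level" in event and event["level"] not in LEVELS:
        problems.append(f"invalid level: {event['level']}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a test report every violation in a single run.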
Log Levels¶
| Level | When to Use | Examples |
|---|---|---|
| DEBUG | Development diagnostics | Query details, internal state |
| INFO | Routine operations | Request completed, batch accepted |
| WARNING | Recoverable issues | Backpressure activated, retry attempt |
| ERROR | Operation failures | Database error, upload failed |
| CRITICAL | System failures | Service crash, data corruption |
Timestamp Format¶
GOOD: 2025-01-15T12:34:56.789Z
GOOD: 2025-01-15T12:34:56.789+00:00
BAD: 1736944496 (Unix epoch - requires transformation)
BAD: 01/15/2025 12:34:56 (non-standard - requires parsing)
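A minimal sketch of producing the accepted format with Python's standard library follows. The helper names `iso_timestamp` and `from_epoch` are illustrative, and the choice to reject naive datetimes is an assumption of this example rather than a stated requirement.

```python
from datetime import datetime, timezone
from typing import Optional


def iso_timestamp(moment: Optional[datetime] = None) -> str:
    """Format a timezone-aware datetime as ISO 8601 UTC with millisecond precision."""
    moment = moment or datetime.now(timezone.utc)
    if moment.tzinfo is None:
        # A naive datetime carries no offset, so the result would be ambiguous.
        raise ValueError("naive datetime: attach a timezone before logging")
    utc = moment.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def from_epoch(seconds: float) -> str:
    """Convert a Unix epoch value into the accepted string form."""
    return iso_timestamp(datetime.fromtimestamp(seconds, tz=timezone.utc))
```

Both the `Z` and `+00:00` suffixes are listed as acceptable above; the sketch picks `Z` only to match the log examples in this document.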
Field Naming Convention¶
Use snake_case consistently:
GOOD: request_id, user_id, status_code, duration_ms
BAD: requestId, userId, statusCode (camelCase)
BAD: request-id, user-id (kebab-case)
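Where fields arrive from code or libraries that use another convention, keys can be normalized before the event is written. The following is a small sketch under that assumption; `to_snake_case` and `normalize_keys` are hypothetical helpers, and runs of capitals such as `HTTPStatus` are not split by this simple rule.

```python
import re

# Zero-width boundary between a lowercase letter or digit and an uppercase letter.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def to_snake_case(name: str) -> str:
    """Convert camelCase or kebab-case field names to snake_case."""
    return _CAMEL_BOUNDARY.sub("_", name).replace("-", "_").lower()


def normalize_keys(event: dict) -> dict:
    """Return a copy of the event with every top-level key in snake_case."""
    return {to_snake_case(key): value for key, value in event.items()}
```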
Log Output Examples¶
Implementation Details: See Logging PRP for implementation code and patterns.
Central Server¶
HTTP Request:
{"timestamp":"2025-01-15T12:34:56.789Z","level":"INFO","service":"central","src_ip":"192.168.1.100","path":"/api/v1/events/batch","method":"POST","status_code":201,"duration_ms":145,"request_id":"uuid-here","collector_id":"col_abc123","message":"Request completed","events_count":50}
Error with Stack Trace:
{"timestamp":"2025-01-15T12:34:56.789Z","level":"ERROR","service":"central","src_ip":"192.168.1.100","path":"/api/v1/events/batch","method":"POST","status_code":500,"duration_ms":5023,"request_id":"uuid-here","collector_id":"col_abc123","message":"Database write failed","error_type":"OperationalError","error_message":"connection refused","exc_info":"Traceback..."}
Audit Trail:
{"timestamp":"2025-01-15T12:34:56.789Z","level":"INFO","service":"central","src_ip":"192.168.1.50","path":"/api/v1/watchlist","method":"POST","status_code":201,"request_id":"uuid-here","user_id":"usr_admin_001","message":"Audit: CREATE watchlist_entry","action":"CREATE","entity_type":"watchlist_entry","entity_id":"wl_xyz789","changes":{"plate_number":{"old":null,"new":"ABC123"}}}
Storage Operation:
{"timestamp":"2025-01-15T12:34:56.789Z","level":"INFO","service":"central","path":"s3://alpr-images/2025/01/15/abc123_full.jpg","status_code":200,"duration_ms":89,"request_id":"uuid-here","message":"Storage upload completed","operation":"upload","bucket":"alpr-images","object_name":"2025/01/15/abc123_full.jpg","file_size_bytes":85432}
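The `changes` object in the Audit Trail example records an old and a new value per modified field. One way to derive it is to diff the entity state before and after the operation, as sketched below; `diff_changes` is a hypothetical helper, and treating a missing key as `None` is an assumption that matches the `"old":null` shown for a CREATE.

```python
def diff_changes(old: dict, new: dict) -> dict:
    """Return {field: {"old": ..., "new": ...}} for every field whose value differs."""
    changes = {}
    # Sort the union of keys so the output order is deterministic.
    for field in sorted(old.keys() | new.keys()):
        before, after = old.get(field), new.get(field)
        if before != after:
            changes[field] = {"old": before, "new": after}
    return changes
```

For a newly created watchlist entry, `diff_changes({}, {"plate_number": "ABC123"})` yields the same structure as the example above.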
Edge Collector¶
Detection Capture:
{"timestamp":"2025-01-15T12:34:56.789Z","level":"INFO","service":"edge","collector_id":"col_site01","camera_id":"cam_entrance","message":"Plate detected","plate_number":"ABC123","confidence":0.95,"direction":"entering"}
Upload Status:
{"timestamp":"2025-01-15T12:34:56.789Z","level":"INFO","service":"edge","collector_id":"col_site01","path":"/api/v1/events/batch","method":"POST","status_code":201,"duration_ms":245,"message":"Batch uploaded","events_count":50}
Connectivity Warning:
{"timestamp":"2025-01-15T12:34:56.789Z","level":"WARNING","service":"edge","collector_id":"col_site01","message":"Central server unreachable","retry_attempt":3,"backoff_seconds":120,"buffer_depth":1500}
Metrics¶
Central Server Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| `alpr_detections_received_total` | Counter | collector_id | Total detections received |
| `alpr_detections_batch_size` | Histogram | collector_id | Detections per batch |
| `alpr_request_duration_ms` | Histogram | path, method | Request latency |
| `alpr_storage_upload_duration_ms` | Histogram | bucket | MinIO upload latency |
| `alpr_storage_size_bytes` | Histogram | bucket | Object sizes |
| `alpr_alert_matches_total` | Counter | watchlist_id | Watchlist matches |
| `alpr_alert_processing_duration_ms` | Histogram | - | Alert engine latency |
| `alpr_alert_queue_depth` | Gauge | - | Pending alerts |
| `alpr_websocket_connections` | Gauge | - | Active WebSocket clients |
| `alpr_db_query_duration_ms` | Histogram | query_type | Database latency |
| `alpr_collectors_online` | Gauge | - | Connected collectors |
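Several of the metrics above are histograms. To show what that type records, here is a dependency-free sketch of cumulative-bucket counting in the style Prometheus uses. In practice the Prometheus client library from the checklist would supply this type; the `Histogram` class and the bucket bounds used with it are assumptions for illustration only.

```python
from bisect import bisect_left


class Histogram:
    """Count observations into buckets defined by inclusive upper bounds."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot holds values above the largest bound (the +Inf bucket).
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0

    def observe(self, value):
        # bisect_left finds the first bound >= value, so a value equal to a
        # bound lands in that bound's bucket ("less than or equal" semantics).
        self.bucket_counts[bisect_left(self.upper_bounds, value)] += 1
        self.total += value

    def cumulative(self):
        """Return (upper_bound, count of observations <= upper_bound) pairs."""
        running, result = 0, []
        bounds = self.upper_bounds + [float("inf")]
        for bound, count in zip(bounds, self.bucket_counts):
            running += count
            result.append((bound, running))
        return result
```

Cumulative buckets are what allow percentiles such as p95 or p99 to be estimated at query time without storing every observation.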
Edge Collector Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| `edge_detections_total` | Counter | camera_id | Total detections |
| `edge_buffer_depth` | Gauge | - | Detections in SQLite buffer |
| `edge_buffer_oldest_seconds` | Gauge | - | Age of oldest buffered detection |
| `edge_upload_success_total` | Counter | - | Successful uploads |
| `edge_upload_failure_total` | Counter | error_type | Failed uploads |
| `edge_upload_duration_ms` | Histogram | - | Upload latency |
| `edge_camera_online` | Gauge | camera_id | Camera connectivity |
| `edge_backpressure_active` | Gauge | - | Backpressure status |
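The two buffer gauges can be derived from the SQLite buffer with a single query. The sketch below assumes a hypothetical table `detection_buffer` with a `captured_at` column holding Unix epoch seconds; the real schema is not specified in this document.

```python
import sqlite3


def buffer_gauges(conn: sqlite3.Connection, now_epoch: float) -> dict:
    """Compute buffer depth and the age of the oldest buffered detection."""
    depth, oldest = conn.execute(
        "SELECT COUNT(*), MIN(captured_at) FROM detection_buffer"
    ).fetchone()
    # MIN() returns NULL on an empty table; report an age of zero in that case.
    oldest_seconds = 0.0 if oldest is None else max(0.0, now_epoch - oldest)
    return {"edge_buffer_depth": depth, "edge_buffer_oldest_seconds": oldest_seconds}
```

Passing the current time in as a parameter keeps the function deterministic and easy to test.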
System Metrics¶
Collected via node_exporter or equivalent:
| Metric | Source | Alert Threshold |
|---|---|---|
| CPU usage | node_exporter | >80% sustained |
| Memory usage | node_exporter | >85% |
| Disk usage | node_exporter | >80% |
| Network I/O | node_exporter | Anomaly detection |
Alerting Rules¶
Critical Alerts (Immediate Response)¶
| Alert | Condition | Response |
|---|---|---|
| Central server down | No heartbeat response for 5 min | Page on-call |
| Database unreachable | Connection failures for 2 min | Page on-call |
| MinIO unreachable | Upload failures for 5 min | Page on-call |
| Alert queue overflow | Queue depth > 10,000 for 10 min | Page on-call |
| Disk full | >95% usage | Page on-call |
High Priority Alerts (1-Hour Response)¶
| Alert | Condition | Response |
|---|---|---|
| Collector offline | No heartbeat for 10 min | Investigate |
| High error rate | >5% of requests failing | Investigate |
| Slow requests | p99 latency > 5s | Investigate |
| Backpressure sustained | Active for > 30 min | Scale resources |
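To make the "High error rate" and "Slow requests" conditions concrete, the sketch below evaluates both over one window of request samples. It assumes that a failing request means a 5xx status and uses the nearest-rank method for p99; neither choice is specified above, and a production setup would normally express these as rules in the alerting system rather than in application code.

```python
import math


def error_rate(status_codes) -> float:
    """Fraction of requests in the window that failed (assumed: status >= 500)."""
    if not status_codes:
        return 0.0
    return sum(1 for code in status_codes if code >= 500) / len(status_codes)


def percentile(values, pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("percentile of an empty window is undefined")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def high_priority_alerts(status_codes, durations_ms) -> list:
    """Return the names of the high-priority conditions that currently hold."""
    alerts = []
    if error_rate(status_codes) > 0.05:
        alerts.append("High error rate")
    if durations_ms and percentile(durations_ms, 99) > 5000:
        alerts.append("Slow requests")
    return alerts
```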
Warning Alerts (Next Business Day)¶
| Alert | Condition | Response |
|---|---|---|
| Collector degraded | Heartbeat but cameras offline | Schedule maintenance |
| Disk usage high | >80% | Plan cleanup/expansion |
| Certificate expiring | <30 days | Renew certificate |
| Unusual traffic | 10x normal volume | Review for issues |
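The "Collector offline" and "Collector degraded" conditions map onto the online/degraded/offline states. A minimal sketch of that classification follows. It assumes that any single offline camera makes a collector degraded, which the table leaves open, and `collector_status` is a hypothetical function name.

```python
from datetime import datetime, timedelta, timezone

# Threshold from the "Collector offline" rule: no heartbeat for 10 minutes.
HEARTBEAT_TIMEOUT = timedelta(minutes=10)


def collector_status(last_heartbeat: datetime, cameras_online: dict, now: datetime) -> str:
    """Classify a collector as "online", "degraded", or "offline"."""
    if now - last_heartbeat >= HEARTBEAT_TIMEOUT:
        return "offline"
    # Assumption: one or more offline cameras is enough to count as degraded.
    if not all(cameras_online.values()):
        return "degraded"
    return "online"
```

Here `cameras_online` maps a `camera_id` to a boolean, mirroring the `edge_camera_online` gauge.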
Dashboards¶
Operations Dashboard¶
Panels:
- Events ingested (rate, 24h total)
- Active collectors (count, map view)
- Request latency (p50, p95, p99)
- Error rate (%)
- Alert queue depth
- Storage usage
Collector Health Dashboard¶
Panels:
- Collector status table (online/degraded/offline)
- Buffer depth per collector
- Upload success rate per collector
- Camera status per collector
- Last heartbeat times
Security Dashboard¶
Panels:
- Authentication failures (rate, by IP)
- Authorization denials (rate, by user)
- API key usage (requests per key)
- Unusual access patterns
Log Retention¶
| Log Type | Retention | Storage |
|---|---|---|
| Application logs | 30 days | Log aggregator |
| Audit logs | 7 years | PostgreSQL + log aggregator |
| Security logs | 1 year | Log aggregator |
| Debug logs | 7 days | Local only |
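If retention is enforced by a cleanup job rather than by the aggregator itself, the table can be encoded as data. The sketch below is one way to do that; the dictionary keys and the `is_expired` helper are hypothetical, and the 7-year period is approximated as 7 × 365 days, ignoring leap days.

```python
from datetime import datetime, timedelta, timezone

# Retention periods from the Log Retention table.
RETENTION = {
    "application": timedelta(days=30),
    "audit": timedelta(days=7 * 365),  # approximation: ignores leap days
    "security": timedelta(days=365),
    "debug": timedelta(days=7),
}


def is_expired(log_type: str, logged_at: datetime, now: datetime) -> bool:
    """Return True when a log entry is older than its retention period."""
    return now - logged_at > RETENTION[log_type]
```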
Implementation Checklist¶
Logging Setup¶
Implementation Details: See Logging PRP for implementation patterns and code examples.
- [ ] Configure structlog with JSON renderer
- [ ] Implement timestamp-first processor
- [ ] Set up request ID generation/propagation
- [ ] Create logging middleware for FastAPI
- [ ] Wrap storage operations with logging
- [ ] Implement audit logging service
Metrics Setup¶
- [ ] Add Prometheus client library
- [ ] Instrument HTTP endpoints
- [ ] Instrument database queries
- [ ] Instrument storage operations
- [ ] Instrument alert processing
- [ ] Add edge collector metrics
Alerting Setup¶
- [ ] Define alerting rules (Prometheus/Grafana)
- [ ] Configure notification channels
- [ ] Document runbooks for each alert
- [ ] Test alert delivery
Dashboard Setup¶
- [ ] Create operations dashboard
- [ ] Create collector health dashboard
- [ ] Create security dashboard
- [ ] Configure dashboard refresh rates
Decision Date: 2025-12-29
Status: Draft (log aggregation platform TBD)
Rationale: Platform-agnostic logging standards enable flexibility in tooling selection.