Observability and Monitoring

Overview

This document defines the observability strategy for the ALPR platform: logging, metrics, alerting, dashboards, and log retention across the central server and edge collectors.

Implementation Details: See Logging PRP for logging implementation patterns, code examples, and best practices.

Log Aggregation Platform

TBD: Log aggregation platform not yet selected.

Options under consideration:

- Grafana Loki (lightweight, pairs with Prometheus)
- ELK Stack (Elasticsearch, Logstash, Kibana)
- Datadog (SaaS, APM integration)
- AWS CloudWatch (if deploying to AWS)

Until a platform is selected, keep logs platform-agnostic by following the structured logging standards in this document, which every candidate platform can ingest.

Logging Standards

Format: Single-Line JSON

All logs MUST be single-line JSON with timestamp as the first field:

{"timestamp":"2025-01-15T12:34:56.789Z","level":"INFO","service":"central","src_ip":"192.168.1.100","path":"/api/v1/events/batch","method":"POST","status_code":201,"duration_ms":45,"request_id":"550e8400-e29b-41d4-a716-446655440000","collector_id":"col_abc123","message":"Batch accepted","events_count":50}

Why Single-Line JSON:

- ELK: Direct JSON parsing without grok patterns
- Splunk: Automatic field extraction
- Datadog: Native JSON parsing
- Loki: LogQL JSON field extraction
- CloudWatch: Filterable dimensions
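The format rule can be illustrated with a minimal, standard-library-only sketch. It is not the structlog configuration named in the implementation checklist; the function name `render_log_line` is hypothetical and only shows the target output shape (one line, `timestamp` first).

```python
import json
from datetime import datetime, timezone


def render_log_line(level: str, service: str, message: str, **fields) -> str:
    """Render one log event as single-line JSON with `timestamp` first."""
    timestamp = (
        datetime.now(timezone.utc)
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z")
    )
    # Python dicts preserve insertion order, so creating the dict with
    # `timestamp` first keeps it first in the serialized output.
    event = {"timestamp": timestamp, "level": level, "service": service}
    event.update(fields)
    event["message"] = message
    # No indent plus compact separators yields a single line; json.dumps
    # escapes newlines inside string values as \n.
    return json.dumps(event, separators=(",", ":"), default=str)


line = render_log_line("INFO", "central", "Batch accepted", events_count=50)
print(line)
```

Because embedded newlines are escaped during serialization, multi-line values such as stack traces still occupy a single log line.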

Required Fields

Field	When Required	Type	Description
`timestamp`	Always (FIRST)	string	ISO 8601 with timezone
`level`	Always	string	DEBUG, INFO, WARNING, ERROR, CRITICAL
`service`	Always	string	"central", "edge", "worker"
`message`	Always	string	Human-readable description
`src_ip`	HTTP requests	string	Client IP address
`path`	HTTP requests	string	Request path
`method`	HTTP requests	string	HTTP method
`status_code`	HTTP requests	number	Response status code
`duration_ms`	Timed operations	number	Duration in milliseconds
`request_id`	When available	string	UUID for correlation
`user_id`	When authenticated	string	User identifier
`collector_id`	Edge operations	string	Collector identifier
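A conformance check for the table above might look like the following sketch. The function name `validate_event` and the `is_http_request` flag are hypothetical, and the flag is assumed to mean an inbound HTTP request (the only case where `src_ip` is known). Only the "Always" and "HTTP requests" rows are enforced here.

```python
ALWAYS_REQUIRED = ("timestamp", "level", "service", "message")
HTTP_REQUIRED = ("src_ip", "path", "method", "status_code")
VALID_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}
VALID_SERVICES = {"central", "edge", "worker"}


def validate_event(event: dict, *, is_http_request: bool = False) -> list:
    """Return a list of problems; an empty list means the event conforms."""
    problems = []
    required = ALWAYS_REQUIRED + (HTTP_REQUIRED if is_http_request else ())
    for field in required:
        if field not in event:
            problems.append(f"missing field: {field}")
    # Dict order mirrors the serialized field order.
    if event and next(iter(event)) != "timestamp":
        problems.append("timestamp must be the first field")
    if "level" in event and event["level"] not in VALID_LEVELS:
        problems.append(f"invalid level: {event['level']}")
    if "service" in event and event["service"] not in VALID_SERVICES:
        problems.append(f"invalid service: {event['service']}")
    return problems
```

A check like this is probably most useful in unit tests for the logging layer rather than on the hot path of every log call.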

Log Levels

Level	When to Use	Examples
DEBUG	Development diagnostics	Query details, internal state
INFO	Routine operations	Request completed, batch accepted
WARNING	Recoverable issues	Backpressure activated, retry attempt
ERROR	Operation failures	Database error, upload failed
CRITICAL	System failures	Service crash, data corruption

Timestamp Format

GOOD: 2025-01-15T12:34:56.789Z
GOOD: 2025-01-15T12:34:56.789+00:00
BAD:  1736944496 (Unix epoch - requires transformation)
BAD:  01/15/2025 12:34:56 (non-standard - requires parsing)
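As a sketch of how a service might produce and check the accepted format, the helpers below convert a Unix epoch value to ISO 8601 UTC with millisecond precision and reject timestamps that lack a timezone. Both function names are hypothetical.

```python
from datetime import datetime, timezone


def to_log_timestamp(epoch_seconds: float) -> str:
    """Convert a Unix epoch value to ISO 8601 UTC with milliseconds."""
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return dt.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def is_valid_log_timestamp(value: str) -> bool:
    """Accept only ISO 8601 timestamps that carry a timezone."""
    try:
        # Older Python versions do not parse a trailing "Z", so normalize it.
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return False
    return parsed.tzinfo is not None


print(to_log_timestamp(1736944496))  # 2025-01-15T12:34:56.000Z
```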

Field Naming Convention

Use snake_case consistently:

GOOD: request_id, user_id, status_code, duration_ms
BAD:  requestId, userId, statusCode (camelCase)
BAD:  request-id, user-id (kebab-case)
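The naming rule is easy to enforce mechanically. The sketch below checks a field name against a snake_case pattern and normalizes simple camelCase or kebab-case names. The helper names are hypothetical, and acronym-heavy names (for example `HTTPStatus`) are not split specially.

```python
import re

_SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def is_snake_case(name: str) -> bool:
    """Return True when the field name follows the snake_case convention."""
    return bool(_SNAKE_CASE.match(name))


def to_snake_case(name: str) -> str:
    """Normalize simple camelCase and kebab-case names to snake_case."""
    name = name.replace("-", "_")
    # Insert an underscore at each lowercase-or-digit to uppercase boundary.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return name.lower()


print(to_snake_case("statusCode"))  # status_code
print(to_snake_case("request-id"))  # request_id
```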

Log Output Examples

Implementation Details: See Logging PRP for implementation code and patterns.

Central Server

HTTP Request:

{"timestamp":"2025-01-15T12:34:56.789Z","level":"INFO","service":"central","src_ip":"192.168.1.100","path":"/api/v1/events/batch","method":"POST","status_code":201,"duration_ms":145,"request_id":"uuid-here","collector_id":"col_abc123","message":"Request completed","events_count":50}

Error with Stack Trace:

{"timestamp":"2025-01-15T12:34:56.789Z","level":"ERROR","service":"central","src_ip":"192.168.1.100","path":"/api/v1/events/batch","method":"POST","status_code":500,"duration_ms":5023,"request_id":"uuid-here","collector_id":"col_abc123","message":"Database write failed","error_type":"OperationalError","error_message":"connection refused","exc_info":"Traceback..."}

Audit Trail:

{"timestamp":"2025-01-15T12:34:56.789Z","level":"INFO","service":"central","src_ip":"192.168.1.50","path":"/api/v1/watchlist","method":"POST","status_code":201,"request_id":"uuid-here","user_id":"usr_admin_001","message":"Audit: CREATE watchlist_entry","action":"CREATE","entity_type":"watchlist_entry","entity_id":"wl_xyz789","changes":{"plate_number":{"old":null,"new":"ABC123"}}}
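The `changes` object in the audit example records an old and new value per modified field. One way to build it is sketched below. The helper names `diff_changes` and `audit_fields` are hypothetical, and the field set simply mirrors the audit example rather than a confirmed schema.

```python
from typing import Optional


def diff_changes(old: Optional[dict], new: Optional[dict]) -> dict:
    """Map each changed field to its old and new value."""
    old = old or {}
    new = new or {}
    changes = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = {"old": before, "new": after}
    return changes


def audit_fields(action: str, entity_type: str, entity_id: str,
                 old: Optional[dict] = None, new: Optional[dict] = None) -> dict:
    """Build the audit-specific fields to merge into a log event."""
    return {
        "message": f"Audit: {action} {entity_type}",
        "action": action,
        "entity_type": entity_type,
        "entity_id": entity_id,
        "changes": diff_changes(old, new),
    }


fields = audit_fields("CREATE", "watchlist_entry", "wl_xyz789",
                      new={"plate_number": "ABC123"})
```

For a CREATE, every field has `old` set to null; for a DELETE, every field has `new` set to null.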

Storage Operation:

{"timestamp":"2025-01-15T12:34:56.789Z","level":"INFO","service":"central","path":"s3://alpr-images/2025/01/15/abc123_full.jpg","status_code":200,"duration_ms":89,"request_id":"uuid-here","message":"Storage upload completed","operation":"upload","bucket":"alpr-images","object_name":"2025/01/15/abc123_full.jpg","file_size_bytes":85432}

Edge Collector

Detection Capture:

{"timestamp":"2025-01-15T12:34:56.789Z","level":"INFO","service":"edge","collector_id":"col_site01","camera_id":"cam_entrance","message":"Plate detected","plate_number":"ABC123","confidence":0.95,"direction":"entering"}

Upload Status:

{"timestamp":"2025-01-15T12:34:56.789Z","level":"INFO","service":"edge","collector_id":"col_site01","path":"/api/v1/events/batch","method":"POST","status_code":201,"duration_ms":245,"message":"Batch uploaded","events_count":50}

Connectivity Warning:

{"timestamp":"2025-01-15T12:34:56.789Z","level":"WARNING","service":"edge","collector_id":"col_site01","message":"Central server unreachable","retry_attempt":3,"backoff_seconds":120,"buffer_depth":1500}

Metrics

Central Server Metrics

Metric	Type	Labels	Description
`alpr_detections_received_total`	Counter	collector_id	Total detections received
`alpr_detections_batch_size`	Histogram	collector_id	Detections per batch
`alpr_request_duration_ms`	Histogram	path, method	Request latency
`alpr_storage_upload_duration_ms`	Histogram	bucket	MinIO upload latency
`alpr_storage_size_bytes`	Histogram	bucket	Object sizes
`alpr_alert_matches_total`	Counter	watchlist_id	Watchlist matches
`alpr_alert_processing_duration_ms`	Histogram	-	Alert engine latency
`alpr_alert_queue_depth`	Gauge	-	Pending alerts
`alpr_websocket_connections`	Gauge	-	Active WebSocket clients
`alpr_db_query_duration_ms`	Histogram	query_type	Database latency
`alpr_collectors_online`	Gauge	-	Connected collectors

Edge Collector Metrics

Metric	Type	Labels	Description
`edge_detections_total`	Counter	camera_id	Total detections
`edge_buffer_depth`	Gauge	-	Detections in SQLite buffer
`edge_buffer_oldest_seconds`	Gauge	-	Age of oldest buffered detection
`edge_upload_success_total`	Counter	-	Successful uploads
`edge_upload_failure_total`	Counter	error_type	Failed uploads
`edge_upload_duration_ms`	Histogram	-	Upload latency
`edge_camera_online`	Gauge	camera_id	Camera connectivity
`edge_backpressure_active`	Gauge	-	Backpressure status
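The two buffer gauges can be derived directly from the SQLite buffer. The sketch below assumes a hypothetical table `detection_buffer` with a `captured_at` column holding Unix epoch seconds; the real buffer schema is not specified in this document. Exporting the values to a metrics library is left out.

```python
import sqlite3


def buffer_gauges(conn: sqlite3.Connection, now: float) -> dict:
    """Compute buffer depth and age of the oldest buffered detection."""
    depth, oldest = conn.execute(
        "SELECT COUNT(*), MIN(captured_at) FROM detection_buffer"
    ).fetchone()
    # MIN() returns NULL on an empty table; report an age of zero then.
    oldest_age = 0.0 if oldest is None else max(0.0, now - oldest)
    return {
        "edge_buffer_depth": depth,
        "edge_buffer_oldest_seconds": oldest_age,
    }


# Demonstration against an in-memory database with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE detection_buffer ("
    "id INTEGER PRIMARY KEY, captured_at REAL NOT NULL)"
)
now = 1_000_000.0
conn.executemany(
    "INSERT INTO detection_buffer (captured_at) VALUES (?)",
    [(now - 90,), (now - 30,)],
)
gauges = buffer_gauges(conn, now)
```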

System Metrics

Collected via node_exporter or equivalent:

Metric	Source	Alert Threshold
CPU usage	node_exporter	>80% sustained
Memory usage	node_exporter	>85%
Disk usage	node_exporter	>80%
Network I/O	node_exporter	Anomaly detection

Alerting Rules

Critical Alerts (Immediate Response)

Alert	Condition	Response
Central server down	No heartbeat response for 5 min	Page on-call
Database unreachable	Connection failures for 2 min	Page on-call
MinIO unreachable	Upload failures for 5 min	Page on-call
Alert queue overflow	Queue depth > 10,000 for 10 min	Page on-call
Disk full	>95% usage	Page on-call
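Most conditions in this table are of the form "X for N minutes": the alert should fire only when the condition holds continuously for the whole window, similar in spirit to the `for` clause in Prometheus alerting rules. The sketch below shows that logic in isolation. The class name `SustainedCondition` is hypothetical, and in practice the alerting backend would normally evaluate this rather than application code.

```python
from typing import Optional


class SustainedCondition:
    """Fires once a condition has held continuously for `hold_seconds`."""

    def __init__(self, hold_seconds: float) -> None:
        self.hold_seconds = hold_seconds
        self._breach_started: Optional[float] = None

    def update(self, breached: bool, now: float) -> bool:
        """Record one observation; return True when the alert should fire."""
        if not breached:
            # Any healthy observation resets the window.
            self._breach_started = None
            return False
        if self._breach_started is None:
            self._breach_started = now
        return now - self._breach_started >= self.hold_seconds


# "Queue depth > 10,000 for 10 min", sampled with timestamps in seconds.
queue_overflow = SustainedCondition(hold_seconds=600)
samples = [(0, 12_000), (300, 11_500), (400, 9_000), (500, 10_500), (1100, 10_200)]
fired = [queue_overflow.update(depth > 10_000, ts) for ts, depth in samples]
```

In this run the dip below the threshold at t=400 resets the window, so the alert fires only at t=1100, 600 seconds after the breach resumed at t=500.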

High Priority Alerts (1-Hour Response)

Alert	Condition	Response
Collector offline	No heartbeat for 10 min	Investigate
High error rate	>5% requests failing	Investigate
Slow requests	p99 latency > 5s	Investigate
Backpressure sustained	Active for > 30 min	Scale resources
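To make the error-rate and latency conditions concrete, the sketch below evaluates them over a window of request samples. It assumes that a "failing" request means a 5xx response and uses a nearest-rank percentile; neither choice is specified in this document, and a metrics backend would normally compute these from histograms instead.

```python
import math


def error_rate(status_codes: list) -> float:
    """Fraction of requests that failed (assumed here to mean 5xx)."""
    if not status_codes:
        return 0.0
    failures = sum(1 for code in status_codes if code >= 500)
    return failures / len(status_codes)


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of values."""
    if not values:
        raise ValueError("percentile of empty sequence")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


statuses = [200] * 94 + [500] * 6
durations_ms = [120] * 97 + [5200] * 3

high_error_rate = error_rate(statuses) > 0.05           # >5% requests failing
slow_requests = percentile(durations_ms, 99) > 5000     # p99 latency > 5s
```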

Warning Alerts (Next Business Day)

Alert	Condition	Response
Collector degraded	Heartbeat but cameras offline	Schedule maintenance
Disk usage high	>80%	Plan cleanup/expansion
Certificate expiring	<30 days	Renew certificate
Unusual traffic	10x normal volume	Review for issues

Dashboards

Operations Dashboard

Panels:

- Events ingested (rate, 24h total)
- Active collectors (count, map view)
- Request latency (p50, p95, p99)
- Error rate (%)
- Alert queue depth
- Storage usage

Collector Health Dashboard

Panels:

- Collector status table (online/degraded/offline)
- Buffer depth per collector
- Upload success rate per collector
- Camera status per collector
- Last heartbeat times
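The online/degraded/offline status in the collector table can be derived from the alerting rules: a collector is offline after 10 minutes without a heartbeat, and degraded when it still sends heartbeats but has cameras offline. A minimal sketch, with a hypothetical function name and inputs:

```python
OFFLINE_AFTER_SECONDS = 600  # "No heartbeat for 10 min"


def collector_status(heartbeat_age_seconds: float,
                     cameras_online: int,
                     cameras_total: int) -> str:
    """Classify a collector as online, degraded, or offline."""
    if heartbeat_age_seconds >= OFFLINE_AFTER_SECONDS:
        return "offline"
    if cameras_online < cameras_total:
        return "degraded"
    return "online"
```

Checking the heartbeat first means a collector that has gone silent is reported as offline regardless of its last known camera state.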

Security Dashboard

Panels:

- Authentication failures (rate, by IP)
- Authorization denials (rate, by user)
- API key usage (requests per key)
- Unusual access patterns

Log Retention

Log Type	Retention	Storage
Application logs	30 days	Log aggregator
Audit logs	7 years	PostgreSQL + log aggregator
Security logs	1 year	Log aggregator
Debug logs	7 days	Local only
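If retention is enforced by a cleanup job rather than by the log aggregator itself, the table above could be encoded as follows. This is a sketch only: the log-type keys and the function name are hypothetical, and a year is approximated as 365 days.

```python
from datetime import datetime, timedelta, timezone

RETENTION = {
    "application": timedelta(days=30),
    "audit": timedelta(days=7 * 365),   # 7 years, approximated
    "security": timedelta(days=365),    # 1 year, approximated
    "debug": timedelta(days=7),
}


def is_expired(log_type: str, logged_at: datetime, now: datetime) -> bool:
    """Return True when a log record is past its retention period."""
    return now - logged_at > RETENTION[log_type]


now = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
eight_days_ago = now - timedelta(days=8)
```

With these values, a debug record from `eight_days_ago` is expired while an application record from the same time is not.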

Implementation Checklist

Logging Setup

Implementation Details: See Logging PRP for implementation patterns and code examples.

[ ] Configure structlog with JSON renderer
[ ] Implement timestamp-first processor
[ ] Set up request ID generation/propagation
[ ] Create logging middleware for FastAPI
[ ] Wrap storage operations with logging
[ ] Implement audit logging service
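For the request ID item in the checklist above, one possible approach is a context variable that is set once per request and read by the logging layer. The sketch below uses only the standard library; the function names are hypothetical, and wiring it into FastAPI middleware or a structlog processor is not shown. An incoming ID (for example from a request header chosen by the team) is reused when present so that correlation survives across services.

```python
import contextvars
import uuid
from typing import Optional

_request_id = contextvars.ContextVar("request_id", default=None)


def bind_request_id(incoming: Optional[str] = None) -> str:
    """Reuse an incoming request ID or generate a new UUID."""
    request_id = incoming or str(uuid.uuid4())
    _request_id.set(request_id)
    return request_id


def with_request_id(event: dict) -> dict:
    """Return a copy of the event with `request_id` added when bound."""
    request_id = _request_id.get()
    if request_id is None:
        return dict(event)
    return {**event, "request_id": request_id}


generated = bind_request_id()
event = with_request_id({"message": "Request completed"})
```

Context variables are scoped per asyncio task, which is why they are a common fit for async request handlers.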

Metrics Setup

[ ] Add Prometheus client library
[ ] Instrument HTTP endpoints
[ ] Instrument database queries
[ ] Instrument storage operations
[ ] Instrument alert processing
[ ] Add edge collector metrics

Alerting Setup

[ ] Define alerting rules (Prometheus/Grafana)
[ ] Configure notification channels
[ ] Document runbooks for each alert
[ ] Test alert delivery

Dashboard Setup

[ ] Create operations dashboard
[ ] Create collector health dashboard
[ ] Create security dashboard
[ ] Configure dashboard refresh rates

Decision Date: 2025-12-29
Status: Draft (log aggregation platform TBD)
Rationale: Platform-agnostic logging standards enable flexibility in tooling selection