Monitoring
Monitoring and observability for Lager Guru deployments.
Overview
Comprehensive monitoring is essential for maintaining system health, performance, and reliability. This document covers what to measure, which tools to use, how to configure alerts and dashboards, and how to troubleshoot common monitoring problems.
Monitoring Stack
Application Monitoring
Performance Metrics
- Response times
- Request rates
- Error rates
- Throughput
Business Metrics
- Order processing rates
- Inventory movements
- Worker productivity
- System utilization
Infrastructure Monitoring
Server Metrics
- CPU usage
- Memory consumption
- Disk I/O
- Network traffic
Database Metrics
- Query performance
- Connection pool usage
- Replication lag
- Storage usage
Key Metrics
Application Metrics
Request Metrics
- Requests per second: API request rate
- Response time: P50, P95, P99 latencies
- Error rate: Percentage of failed requests
- Success rate: Percentage of successful requests
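The latency percentiles and error rate above can be derived from raw request samples. The following is a minimal sketch using the nearest-rank method; the function names and the sample values are illustrative assumptions, not part of Lager Guru.

```typescript
// Nearest-rank percentile over a list of latency samples (milliseconds).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1];
}

// Error rate as a fraction of all requests (0 when there is no traffic).
function errorRate(failed: number, total: number): number {
  return total === 0 ? 0 : failed / total;
}

// Made-up latency samples for illustration only.
const latenciesMs = [120, 80, 95, 300, 110, 105, 90, 1500, 100, 115];

console.log({
  p50: percentile(latenciesMs, 50),
  p95: percentile(latenciesMs, 95),
  p99: percentile(latenciesMs, 99),
  errorRate: errorRate(3, 200),
});
```

With only a handful of samples, a single slow request dominates P95 and P99, which is why percentiles are usually computed over a window with enough traffic.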
Database Metrics
- Query duration: Average query execution time
- Query rate: Queries per second
- Connection count: Active database connections
- Transaction rate: Transactions per second
Business Metrics
- Orders processed: Orders per hour/day
- Inventory movements: Movements per hour
- Task completion: Tasks completed per hour
- Worker activity: Number of currently active workers
Infrastructure Metrics
Server Health
- CPU utilization: Average CPU usage
- Memory usage: RAM consumption
- Disk usage: Storage utilization
- Network I/O: Network throughput
Database Health
- Database size: Total database size
- Table sizes: Individual table sizes
- Index usage: Index hit rates
- Cache hit ratio: Share of reads served from cache instead of disk
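As a rough illustration, a cache hit ratio can be computed from two counters: blocks found in cache and blocks read from disk. The sketch below assumes such counters are available (PostgreSQL exposes them, for example, as `heap_blks_hit` and `heap_blks_read` in `pg_statio_user_tables`); the function name is hypothetical.

```typescript
// Fraction of block requests served from cache rather than disk.
// Returns null when there has been no read activity at all.
function cacheHitRatio(blocksHit: number, blocksRead: number): number | null {
  const total = blocksHit + blocksRead;
  return total === 0 ? null : blocksHit / total;
}

console.log(cacheHitRatio(990, 10)); // ratio for 990 cache hits and 10 disk reads
console.log(cacheHitRatio(0, 0));    // no activity yet
```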
Monitoring Tools
Application Performance Monitoring (APM)
Supabase Dashboard
- Real-time metrics
- Query performance
- API usage statistics
- Error tracking
Custom Dashboards
- Grafana dashboards
- Custom metrics collection
- Business intelligence tools
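Custom metrics collection can start as a small in-memory registry before a real backend is wired in. The sketch below is a hypothetical stand-in that mirrors the `increment`, `timing`, and `gauge` calls used in the example later in this document; the class name and key format are assumptions, not the actual `./monitoring` module.

```typescript
type Tags = Record<string, string>;

// Build a stable series key such as "api.requests{endpoint=/api/orders,method=POST}".
function seriesKey(name: string, tags: Tags = {}): string {
  const parts = Object.keys(tags).sort().map((k) => `${k}=${tags[k]}`);
  return parts.length > 0 ? `${name}{${parts.join(",")}}` : name;
}

class InMemoryMetrics {
  readonly counters = new Map<string, number>();
  readonly gauges = new Map<string, number>();
  readonly timings = new Map<string, number[]>();

  // Add to a monotonically increasing counter.
  increment(name: string, tags?: Tags, by = 1): void {
    const key = seriesKey(name, tags);
    this.counters.set(key, (this.counters.get(key) ?? 0) + by);
  }

  // Record one duration sample in milliseconds.
  timing(name: string, ms: number, tags?: Tags): void {
    const key = seriesKey(name, tags);
    const samples = this.timings.get(key) ?? [];
    samples.push(ms);
    this.timings.set(key, samples);
  }

  // Overwrite the current value of a gauge.
  gauge(name: string, value: number, tags?: Tags): void {
    this.gauges.set(seriesKey(name, tags), value);
  }
}

const registry = new InMemoryMetrics();
registry.increment("api.requests", { endpoint: "/api/orders", method: "POST" });
registry.increment("api.requests", { method: "POST", endpoint: "/api/orders" });
registry.timing("api.response_time", 42, { endpoint: "/api/orders" });
registry.gauge("orders.processed", 17);
console.log(registry.counters, registry.timings, registry.gauges);
```

Sorting the tag keys means the same tags in a different order map to the same series, so the two `increment` calls above update one counter.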
Logging
Application Logs
- Error logs
- Access logs
- Audit logs
- Performance logs
Log Aggregation
- Centralized log collection
- Log search and analysis
- Alerting on log patterns
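Alerting on log patterns can be sketched as counting how many lines in a batch match a rule. The rule shape, rule names, and sample log lines below are made up for illustration; a real log aggregation tool would provide its own rule syntax.

```typescript
interface LogPatternRule {
  name: string;
  pattern: RegExp;    // avoid the "g" flag: it makes RegExp.test() stateful
  maxMatches: number; // alert when the match count exceeds this value
}

// Return the names of rules whose pattern matched more than maxMatches lines.
function matchLogRules(lines: string[], rules: LogPatternRule[]): string[] {
  return rules
    .filter((rule) => lines.filter((line) => rule.pattern.test(line)).length > rule.maxMatches)
    .map((rule) => rule.name);
}

// Hypothetical log lines for illustration only.
const sampleLines = [
  "2024-01-01T10:00:00Z ERROR order sync timeout",
  "2024-01-01T10:00:01Z INFO request completed",
  "2024-01-01T10:00:02Z ERROR order sync timeout",
  "2024-01-01T10:00:03Z WARN slow query",
];

const logRules: LogPatternRule[] = [
  { name: "order_sync_errors", pattern: /ERROR order sync/, maxMatches: 1 },
  { name: "slow_queries", pattern: /WARN slow query/, maxMatches: 1 },
];

const triggered = matchLogRules(sampleLines, logRules);
console.log(triggered);
```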
Alerting
Critical Alerts
System Health
- High error rate: Error rate > 5%
- Slow responses: P95 latency > 1s
- Database down: Database unavailable
- High CPU: CPU usage > 80%
Business Critical
- Order processing failure: Orders not processing
- Inventory sync failure: Inventory not updating
- Worker offline: Critical workers offline
- Payment failures: Payment processing errors
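The System Health thresholds above can be expressed as a simple check over a metrics snapshot. This is a minimal sketch: the `HealthSnapshot` shape and the alert names are assumptions, and only the threshold values come from the list.

```typescript
interface HealthSnapshot {
  errorRate: number;    // fraction of failed requests, 0..1
  p95LatencyMs: number; // P95 latency in milliseconds
  databaseUp: boolean;
  cpuUsage: number;     // fraction of CPU in use, 0..1
}

// Return the names of all System Health alerts that the snapshot breaches.
function systemHealthAlerts(s: HealthSnapshot): string[] {
  const alerts: string[] = [];
  if (s.errorRate > 0.05) alerts.push("high_error_rate");   // error rate > 5%
  if (s.p95LatencyMs > 1000) alerts.push("slow_responses"); // P95 latency > 1s
  if (!s.databaseUp) alerts.push("database_down");          // database unavailable
  if (s.cpuUsage > 0.8) alerts.push("high_cpu");            // CPU usage > 80%
  return alerts;
}

const snapshot: HealthSnapshot = {
  errorRate: 0.08,
  p95LatencyMs: 750,
  databaseUp: true,
  cpuUsage: 0.91,
};

const activeAlerts = systemHealthAlerts(snapshot);
console.log(activeAlerts);
```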
Alert Configuration
```yaml
# Example alert configuration
alerts:
  - name: high_error_rate
    condition: error_rate > 0.05
    duration: 5m
    severity: critical
    notification: email, slack
  - name: slow_response_time
    condition: p95_latency > 1000ms
    duration: 10m
    severity: warning
    notification: slack
```
Dashboards
System Dashboard
Monitor overall system health:
- Request rates and latencies
- Error rates
- Database performance
- Server resource usage
Business Dashboard
Track business metrics:
- Orders processed
- Inventory levels
- Worker productivity
- Revenue metrics
Tenant Dashboard
Monitor per-tenant metrics:
- Tenant activity
- Resource usage
- Performance metrics
- Usage statistics
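Per-tenant metrics usually come from grouping request records by tenant. The sketch below assumes each record carries a `tenantId`, a duration, and a success flag; these field names and the sample records are hypothetical.

```typescript
interface RequestRecord {
  tenantId: string;
  durationMs: number;
  ok: boolean;
}

interface TenantSummary {
  requests: number;
  errors: number;
  avgDurationMs: number;
}

// Group request records by tenant and compute activity and performance totals.
function summarizeByTenant(records: RequestRecord[]): Map<string, TenantSummary> {
  const totals = new Map<string, { requests: number; errors: number; totalMs: number }>();
  for (const r of records) {
    const t = totals.get(r.tenantId) ?? { requests: 0, errors: 0, totalMs: 0 };
    t.requests += 1;
    if (!r.ok) t.errors += 1;
    t.totalMs += r.durationMs;
    totals.set(r.tenantId, t);
  }

  const summary = new Map<string, TenantSummary>();
  totals.forEach((t, tenantId) => {
    summary.set(tenantId, {
      requests: t.requests,
      errors: t.errors,
      avgDurationMs: t.totalMs / t.requests,
    });
  });
  return summary;
}

// Made-up records for illustration only.
const tenantSummary = summarizeByTenant([
  { tenantId: "tenant-a", durationMs: 100, ok: true },
  { tenantId: "tenant-a", durationMs: 300, ok: false },
  { tenantId: "tenant-b", durationMs: 50, ok: true },
]);
console.log(tenantSummary);
```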
Best Practices
Monitoring Strategy
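- Define SLAs: Establish service level agreements and the metrics that measure them
- Set Baselines: Establish normal operating ranges for each key metric
- Monitor Trends: Track metrics over time, not only their current values
- Alert Wisely: Alert only on actionable conditions to avoid alert fatigue
- Review Regularly: Revisit metrics, thresholds, and dashboards on a regular schedule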
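One common way to reduce alert fatigue is to fire only when a condition has held continuously for some time, which is what the `duration` field in the alert configuration example suggests. The following is a minimal sketch of that idea; the class name and the sample readings are assumptions.

```typescript
// Fires only after the value has stayed above the threshold for durationMs.
class SustainedAlert {
  private readonly threshold: number;
  private readonly durationMs: number;
  private breachStartedAtMs: number | null = null;

  constructor(threshold: number, durationMs: number) {
    this.threshold = threshold;
    this.durationMs = durationMs;
  }

  // Feed one observation; returns true when the alert should be firing.
  observe(value: number, nowMs: number): boolean {
    if (value <= this.threshold) {
      this.breachStartedAtMs = null; // condition cleared, restart the clock
      return false;
    }
    if (this.breachStartedAtMs === null) this.breachStartedAtMs = nowMs;
    return nowMs - this.breachStartedAtMs >= this.durationMs;
  }
}

const MINUTE = 60000;
const highErrorRate = new SustainedAlert(0.05, 5 * MINUTE);

// Made-up error-rate readings: a short spike, a recovery, then a sustained breach.
const firing = [
  highErrorRate.observe(0.08, 0 * MINUTE),
  highErrorRate.observe(0.09, 2 * MINUTE),
  highErrorRate.observe(0.02, 3 * MINUTE),
  highErrorRate.observe(0.07, 4 * MINUTE),
  highErrorRate.observe(0.07, 9 * MINUTE),
];
console.log(firing);
```

The brief spike in the first two readings never fires because the recovery at minute 3 resets the timer; only the breach that lasts from minute 4 to minute 9 does.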
Performance Monitoring
- Track Key Metrics: Monitor critical performance indicators
- Set Thresholds: Define acceptable performance ranges
- Investigate Anomalies: Look into patterns that deviate from the baseline
- Optimize Continuously: Make performance optimization a regular activity
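Thresholds and anomaly checks can both be derived from historical data. The sketch below computes a mean and standard deviation as a baseline and flags values more than a chosen number of standard deviations away; the three-deviation default and the sample history are assumptions, and other methods may suit skewed metrics such as latency better.

```typescript
interface Baseline {
  mean: number;
  stdDev: number;
}

// Compute the mean and population standard deviation of historical values.
function computeBaseline(history: number[]): Baseline {
  if (history.length === 0) throw new Error("history is empty");
  const mean = history.reduce((sum, v) => sum + v, 0) / history.length;
  const variance =
    history.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / history.length;
  return { mean, stdDev: Math.sqrt(variance) };
}

// Flag a value that lies more than maxDeviations standard deviations from the mean.
function isAnomalous(value: number, baseline: Baseline, maxDeviations = 3): boolean {
  if (baseline.stdDev === 0) return value !== baseline.mean;
  return Math.abs(value - baseline.mean) / baseline.stdDev > maxDeviations;
}

// Made-up response times (ms) for illustration only.
const baseline = computeBaseline([100, 102, 98, 101, 99]);
console.log(baseline, isAnomalous(103, baseline), isAnomalous(110, baseline));
```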
Incident Response
- Quick Detection: Fast alerting on issues
- Clear Escalation: Defined escalation paths
- Documentation: Document incidents and resolutions
- Post-Mortems: Learn from incidents
Example: Monitoring Setup
Metrics Collection
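```typescript
// Example metrics collection
// `metrics` is the project's own client from ./monitoring (not shown here).
import { metrics } from './monitoring';

const startedAt = Date.now();

// Track API request
metrics.increment('api.requests', {
  endpoint: '/api/orders',
  method: 'POST',
  status: '200'
});

// Track response time (elapsed milliseconds since the request started)
const duration = Date.now() - startedAt;
metrics.timing('api.response_time', duration, {
  endpoint: '/api/orders'
});

// Track business metric
// (orderCount is assumed to be maintained elsewhere in the application)
metrics.gauge('orders.processed', orderCount);
```
Dashboard Queries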
```sql
-- Orders processed per hour over the last 24 hours
SELECT
  DATE_TRUNC('hour', created_at) AS hour,
  COUNT(*) AS orders
FROM orders
WHERE created_at >= NOW() - INTERVAL '24 hours'
GROUP BY hour
ORDER BY hour;

-- Average and P95 response time over the last hour
SELECT
  AVG(response_time) AS avg_response,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time) AS p95
FROM api_logs
WHERE created_at >= NOW() - INTERVAL '1 hour';
```
Troubleshooting
Common Issues
High error rates
- Check application logs
- Review database performance
- Verify external service status
- Check system resources
Slow performance
- Analyze slow queries
- Check database indexes
- Review server resources
- Investigate network issues
Missing metrics
- Verify metric collection
- Check monitoring agent status
- Review configuration
- Validate data pipeline
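A quick way to verify metric collection is to check when each metric last reported. The sketch below assumes a map from metric name to the timestamp of its latest sample; the function name, metric names, and timestamps are hypothetical.

```typescript
// Return the names of metrics whose latest sample is older than maxAgeMs.
function findStaleMetrics(
  lastSeenMs: Record<string, number>,
  nowMs: number,
  maxAgeMs: number,
): string[] {
  return Object.keys(lastSeenMs)
    .filter((name) => nowMs - lastSeenMs[name] > maxAgeMs)
    .sort();
}

// Made-up timestamps (ms) for illustration only.
const now = 1000000;
const stale = findStaleMetrics(
  {
    "api.requests": now - 5000,       // reported 5 seconds ago
    "orders.processed": now - 600000, // reported 10 minutes ago
    "api.response_time": now - 90000, // reported 90 seconds ago
  },
  now,
  60000, // anything silent for more than 1 minute counts as stale
);
console.log(stale);
```

A metric that shows up here is a starting point for the checks above: either the collection code stopped emitting it or something between the agent and the dashboard dropped it.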
Related Documentation
- Logging - Logging strategies
- Backups - Backup procedures
- Deployment - Deployment guide
- Troubleshooting - Troubleshooting guide
Next Steps
- Logging - Logging configuration
- Backups - Backup setup
- Deployment Guide - Deployment documentation