Monitoring

Monitoring and observability for Lager Guru deployments.

Overview

Monitoring keeps a Lager Guru deployment healthy, fast, and reliable. This document covers which metrics to collect, which tools to use, how to configure alerts and dashboards, and how to troubleshoot common problems.

Monitoring Stack

Application Monitoring

Performance Metrics

  • Response times
  • Request rates
  • Error rates
  • Throughput

Business Metrics

  • Order processing rates
  • Inventory movements
  • Worker productivity
  • System utilization

Infrastructure Monitoring

Server Metrics

  • CPU usage
  • Memory consumption
  • Disk I/O
  • Network traffic

Database Metrics

  • Query performance
  • Connection pool usage
  • Replication lag
  • Storage usage

Key Metrics

Application Metrics

Request Metrics

  • Requests per second: API request rate
  • Response time: P50, P95, P99 latencies
  • Error rate: Percentage of failed requests
  • Success rate: Percentage of successful requests
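
The request metrics above can all be derived from a window of raw request samples. The following is a minimal sketch, not the Lager Guru implementation: the `RequestSample` shape, the `summarize` helper, and the choice to count only 5xx responses as failures are assumptions made for illustration.

```typescript
// Hypothetical shape of one recorded API request (not from the Lager Guru codebase).
interface RequestSample {
  durationMs: number;
  status: number; // HTTP status code
}

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarize one observation window of `windowSeconds` seconds.
function summarize(samples: RequestSample[], windowSeconds: number) {
  const durations = samples.map((s) => s.durationMs);
  // Assumption: only server errors (5xx) count as failed requests.
  const failed = samples.filter((s) => s.status >= 500).length;
  const errorRate = samples.length === 0 ? 0 : failed / samples.length;
  return {
    requestsPerSecond: samples.length / windowSeconds,
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
    errorRate,
    successRate: 1 - errorRate,
  };
}
```

Percentiles are reported alongside the request rate because an average alone hides the slow tail that P95 and P99 expose.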

Database Metrics

  • Query duration: Average query execution time
  • Query rate: Queries per second
  • Connection count: Active database connections
  • Transaction rate: Transactions per second

Business Metrics

  • Orders processed: Orders per hour/day
  • Inventory movements: Movements per hour
  • Task completion: Tasks completed per hour
  • Worker activity: Active workers

Infrastructure Metrics

Server Health

  • CPU utilization: Average CPU usage
  • Memory usage: RAM consumption
  • Disk usage: Storage utilization
  • Network I/O: Network throughput

Database Health

  • Database size: Total database size
  • Table sizes: Individual table sizes
  • Index usage: Index hit rates
  • Cache hit ratio: Share of reads served from cache instead of disk

Monitoring Tools

Application Performance Monitoring (APM)

Supabase Dashboard

  • Real-time metrics
  • Query performance
  • API usage statistics
  • Error tracking

Custom Dashboards

  • Grafana dashboards
  • Custom metrics collection
  • Business intelligence tools

Logging

Application Logs

  • Error logs
  • Access logs
  • Audit logs
  • Performance logs

Log Aggregation

  • Centralized log collection
  • Log search and analysis
  • Alerting on log patterns
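
Log search and pattern-based alerting are easier when each log line is structured. The sketch below assumes one JSON object per line; the `LogEntry` fields and both helper functions are hypothetical and only illustrate the idea.

```typescript
type LogLevel = "info" | "warn" | "error";

// Hypothetical structured log entry, written as one JSON object per line.
interface LogEntry {
  timestamp: string; // ISO 8601
  level: LogLevel;
  message: string;
  context: Record<string, unknown>;
}

// Serialize a log entry as a single JSON line.
function formatLogLine(
  level: LogLevel,
  message: string,
  context: Record<string, unknown> = {},
  now: Date = new Date()
): string {
  const entry: LogEntry = { timestamp: now.toISOString(), level, message, context };
  return JSON.stringify(entry);
}

// Count lines at the given level whose message matches the pattern.
// Pass a regex without the `g` flag, since `test` is stateful on global regexes.
function countMatching(lines: string[], level: LogLevel, pattern: RegExp): number {
  let count = 0;
  for (const line of lines) {
    try {
      const entry = JSON.parse(line) as Partial<LogEntry> | null;
      if (
        entry &&
        entry.level === level &&
        typeof entry.message === "string" &&
        pattern.test(entry.message)
      ) {
        count++;
      }
    } catch {
      // Skip lines that are not valid JSON.
    }
  }
  return count;
}
```

An alert on a log pattern then reduces to comparing `countMatching` over a recent window against a threshold.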

Alerting

Critical Alerts

System Health

  • High error rate: Error rate > 5%
  • Slow responses: P95 latency > 1s
  • Database down: Database unavailable
  • High CPU: CPU usage > 80%

Business Critical

  • Order processing failure: Orders not processing
  • Inventory sync failure: Inventory not updating
  • Worker offline: Critical workers offline
  • Payment failures: Payment processing errors

Alert Configuration

yaml
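# Example alert configuration (illustrative schema)
alerts:
  - name: high_error_rate
    condition: "error_rate > 0.05"     # more than 5% of requests failing
    duration: 5m                       # condition must hold this long before firing
    severity: critical
    notification: [email, slack]

  - name: slow_response_time
    condition: "p95_latency > 1000ms"  # P95 latency above 1 second
    duration: 10m
    severity: warning
    notification: [slack]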
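
The `duration` field in the configuration above implies that an alert fires only after its condition has held continuously for that long. A minimal sketch of that behavior follows; the `AlertRule` shape and the `AlertEvaluator` class are assumptions for illustration and do not describe a specific alerting tool.

```typescript
// Hypothetical in-code form of one alert rule: fire when the observed value
// stays above `threshold` for at least `durationMs`.
interface AlertRule {
  name: string;
  threshold: number;
  durationMs: number;
  severity: "critical" | "warning";
}

class AlertEvaluator {
  // Time at which each rule's condition first became true, keyed by rule name.
  private breachedSince = new Map<string, number>();

  // Feed one observation; returns true when the alert should be firing.
  evaluate(rule: AlertRule, value: number, nowMs: number): boolean {
    if (value <= rule.threshold) {
      // Condition cleared: reset the timer so a later breach starts over.
      this.breachedSince.delete(rule.name);
      return false;
    }
    const since = this.breachedSince.get(rule.name) ?? nowMs;
    this.breachedSince.set(rule.name, since);
    return nowMs - since >= rule.durationMs;
  }
}
```

Requiring the breach to persist filters out short spikes, which helps keep alert volume manageable.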

Dashboards

System Dashboard

Monitor overall system health:

  • Request rates and latencies
  • Error rates
  • Database performance
  • Server resource usage

Business Dashboard

Track business metrics:

  • Orders processed
  • Inventory levels
  • Worker productivity
  • Revenue metrics

Tenant Dashboard

Monitor per-tenant metrics:

  • Tenant activity
  • Resource usage
  • Performance metrics
  • Usage statistics
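
Per-tenant panels need the same request data grouped by tenant. The sketch below assumes each request record carries a `tenantId`; the record shape and the `statsByTenant` helper are hypothetical.

```typescript
// Hypothetical request record tagged with the tenant it belongs to.
interface TenantRequest {
  tenantId: string;
  durationMs: number;
  failed: boolean;
}

interface TenantStats {
  requests: number;
  errors: number;
  avgDurationMs: number;
}

// Group request records by tenant and compute basic activity statistics.
function statsByTenant(records: TenantRequest[]): Map<string, TenantStats> {
  const totals = new Map<string, { requests: number; errors: number; totalMs: number }>();
  for (const r of records) {
    const t = totals.get(r.tenantId) ?? { requests: 0, errors: 0, totalMs: 0 };
    t.requests += 1;
    if (r.failed) t.errors += 1;
    t.totalMs += r.durationMs;
    totals.set(r.tenantId, t);
  }
  const result = new Map<string, TenantStats>();
  totals.forEach((t, tenantId) => {
    result.set(tenantId, {
      requests: t.requests,
      errors: t.errors,
      avgDurationMs: t.totalMs / t.requests,
    });
  });
  return result;
}
```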

Best Practices

Monitoring Strategy

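  1. Define SLAs: Set explicit availability and latency targets
  2. Set Baselines: Record the normal operating range of each key metric
  3. Monitor Trends: Track metrics over time to catch gradual degradation
  4. Alert Wisely: Alert only on actionable conditions to avoid alert fatigue
  5. Review Regularly: Revisit metrics, thresholds, and dashboards on a fixed schedule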

Performance Monitoring

  1. Track Key Metrics: Monitor the critical performance indicators
  2. Set Thresholds: Define the acceptable range for each indicator
  3. Investigate Anomalies: Trace unusual patterns back to a cause
  4. Optimize Continuously: Feed the findings into ongoing performance work
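
One simple way to turn a baseline into a threshold is to flag values that fall too many standard deviations from the historical mean. This is a sketch of that single technique, with hypothetical helper names; other anomaly detection methods may suit a given metric better.

```typescript
interface Baseline {
  mean: number;
  stdDev: number;
}

// Compute the mean and population standard deviation of historical values.
function computeBaseline(history: number[]): Baseline {
  if (history.length === 0) throw new Error("history must not be empty");
  const mean = history.reduce((sum, v) => sum + v, 0) / history.length;
  const variance =
    history.reduce((sum, v) => sum + (v - mean) ** 2, 0) / history.length;
  return { mean, stdDev: Math.sqrt(variance) };
}

// Flag a value that lies more than `maxDeviations` standard deviations
// away from the baseline mean (default of 3 is an arbitrary starting point).
function isAnomalous(value: number, baseline: Baseline, maxDeviations = 3): boolean {
  return Math.abs(value - baseline.mean) > maxDeviations * baseline.stdDev;
}
```

For example, a baseline built from last week's P95 latencies can be compared against the current reading to decide whether it deserves investigation.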

Incident Response

  1. Quick Detection: Alert quickly when issues occur
  2. Clear Escalation: Define who is notified and in what order
  3. Documentation: Record each incident and its resolution
  4. Post-Mortems: Review incidents afterward and apply what was learned

Example: Monitoring Setup

Metrics Collection

typescript
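// Example metrics collection
// './monitoring' is a project-local module that exposes the metrics client.
import { metrics } from './monitoring';

// Placeholder values so the example is self-contained; in a real handler
// these come from the request lifecycle and the order pipeline.
const startedAt = Date.now();
// ... handle the request ...
const duration = Date.now() - startedAt; // milliseconds
const orderCount = 0; // placeholder: current number of processed orders

// Track API request
metrics.increment('api.requests', {
  endpoint: '/api/orders',
  method: 'POST',
  status: '200'
});

// Track response time
metrics.timing('api.response_time', duration, {
  endpoint: '/api/orders'
});

// Track business metric
metrics.gauge('orders.processed', orderCount);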
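
The example above relies on a `metrics` client with `increment`, `timing`, and `gauge` methods. The source does not show that module, so the following in-memory version is purely a hypothetical sketch of what such a client could look like, useful for local development or tests.

```typescript
type Tags = Record<string, string>;

// Hypothetical in-memory metrics client with the same three methods.
class InMemoryMetrics {
  readonly counters = new Map<string, number>();
  readonly timings = new Map<string, number[]>();
  readonly gauges = new Map<string, number>();

  // Build a stable series key from the metric name and its sorted tags.
  private key(name: string, tags: Tags = {}): string {
    const suffix = Object.keys(tags)
      .sort()
      .map((k) => `${k}=${tags[k]}`)
      .join(",");
    return suffix ? `${name}{${suffix}}` : name;
  }

  // Counters only ever accumulate.
  increment(name: string, tags?: Tags, by = 1): void {
    const k = this.key(name, tags);
    this.counters.set(k, (this.counters.get(k) ?? 0) + by);
  }

  // Timings keep every observation so percentiles can be computed later.
  timing(name: string, durationMs: number, tags?: Tags): void {
    const k = this.key(name, tags);
    const list = this.timings.get(k) ?? [];
    list.push(durationMs);
    this.timings.set(k, list);
  }

  // Gauges store only the most recent value.
  gauge(name: string, value: number, tags?: Tags): void {
    this.gauges.set(this.key(name, tags), value);
  }
}

const metrics = new InMemoryMetrics();
```

A production client would typically forward these values to a metrics backend instead of holding them in memory.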

Dashboard Queries

sql
-- Orders processed per hour
SELECT 
    DATE_TRUNC('hour', created_at) as hour,
    COUNT(*) as orders
FROM orders
WHERE created_at >= NOW() - INTERVAL '24 hours'
GROUP BY hour
ORDER BY hour;

-- Average response time
SELECT 
    AVG(response_time) as avg_response,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time) as p95
FROM api_logs
WHERE created_at >= NOW() - INTERVAL '1 hour';

Troubleshooting

Common Issues

High error rates

  • Check application logs
  • Review database performance
  • Verify external service status
  • Check system resources

Slow performance

  • Analyze slow queries
  • Check database indexes
  • Review server resources
  • Investigate network issues

Missing metrics

  • Verify metric collection
  • Check monitoring agent status
  • Review configuration
  • Validate data pipeline
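
Missing metrics are easiest to catch when the monitoring system checks its own inputs. The sketch below assumes the collector records the time each metric was last received; the function name and parameters are hypothetical.

```typescript
// Return the expected metrics that were never reported, or whose most recent
// report is older than `maxAgeMs`.
function findStaleMetrics(
  lastSeenMs: Map<string, number>, // metric name -> timestamp of last report
  expected: string[],
  nowMs: number,
  maxAgeMs: number
): string[] {
  return expected.filter((name) => {
    const seen = lastSeenMs.get(name);
    return seen === undefined || nowMs - seen > maxAgeMs;
  });
}
```

Running this check on a schedule and alerting on a non-empty result surfaces a stalled agent or broken pipeline before a dashboard quietly goes blank.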

Released under Commercial License