Overview
CloudGaming provides multi-layer monitoring across WebRTC streaming, signaling infrastructure, and host health. This guide covers all available metrics, health checks, and monitoring best practices.
WebRTC Statistics
Real-Time Transport Metrics
The Go/Pion WebRTC implementation tracks comprehensive transport statistics:
Packet Loss and Retransmission
- Packet Loss - Percentage of lost RTP packets
- RTT (Round-Trip Time) - Network latency in milliseconds
- Jitter - Packet arrival time variance
- NACK Count - Number of retransmission requests
- PLI Count - Number of keyframe requests
- Send Bitrate - Current video bitrate in kbps
- Pacer Queue Length - Number of frames waiting to send
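As a rough illustration of how two of these values can be derived from raw counters, the sketch below computes a packet loss percentage and a send bitrate in kbps. The function names and the choice of counters are assumptions made for this example. They are not taken from the Go/Pion implementation.

```python
def packet_loss_percent(packets_lost: int, packets_received: int) -> float:
    """Lost packets as a percentage of packets expected (lost + received)."""
    expected = packets_lost + packets_received
    if expected <= 0:
        # No traffic observed yet, so report no loss.
        return 0.0
    return 100.0 * packets_lost / expected


def bitrate_kbps(bytes_before: int, bytes_after: int, interval_s: float) -> float:
    """Average bitrate in kbps between two samples of a byte counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    bits_sent = (bytes_after - bytes_before) * 8
    return bits_sent / 1000.0 / interval_s
```

For example, 5 lost packets against 95 received gives a loss of 5%, and 125,000 bytes sent in one second corresponds to 1000 kbps.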
Stats Monitoring Implementation
RTCP Feedback
RTCP (RTP Control Protocol) provides real-time feedback from the receiver, including the NACK and PLI messages counted above.
Audio Queue Monitoring
Queue Health Metrics
Audio queue depth indicates network congestion:
- GOOD: Average queue depth < 2.0 packets
- WARNING: Average queue depth > 2.0 packets
- CRITICAL: Average queue depth > 2.8 packets
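The thresholds above can be expressed as a small classifier. This is a minimal sketch: the function name is hypothetical, and treating an average of exactly 2.0 as GOOD is an assumption, since the thresholds do not say which side that value falls on.

```python
from statistics import fmean
from typing import Sequence


def classify_queue_depth(samples: Sequence[float]) -> str:
    """Classify the average audio queue depth over a window of samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    average = fmean(samples)
    if average > 2.8:
        return "CRITICAL"
    if average > 2.0:
        return "WARNING"
    # Assumption: an average of exactly 2.0 is still treated as GOOD.
    return "GOOD"
```

Averaging over a window rather than classifying single samples keeps a brief spike from raising an alert on its own.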
Buffer Pool Health
Memory Management Monitoring
The tiered buffer pool tracks allocation efficiency:
- Hit Rate 95%+: Excellent - minimal allocations
- Hit Rate 90-95%: Good - some allocations expected
- Hit Rate 80-90%: Moderate - consider pool tuning
- Hit Rate below 80%: Poor - high GC pressure
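The hit rate bands can be computed from hit and miss counters as sketched below. The counter names and the function names are assumptions for illustration. The real pool may expose its statistics differently.

```python
from typing import Optional


def pool_hit_rate(hits: int, misses: int) -> Optional[float]:
    """Hit rate as a percentage, or None when the pool has seen no requests."""
    total = hits + misses
    if total == 0:
        return None
    return 100.0 * hits / total


def classify_hit_rate(rate_percent: float) -> str:
    """Map a hit rate percentage onto the bands listed above."""
    if rate_percent >= 95.0:
        return "Excellent"
    if rate_percent >= 90.0:
        return "Good"
    if rate_percent >= 80.0:
        return "Moderate"
    return "Poor"
```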
Signaling Server Metrics
Prometheus Metrics Endpoint
The signaling server exposes metrics at /metrics.
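A response from a Prometheus endpoint uses the text exposition format, which is simple enough to read without a client library. The sketch below parses sample lines into a dictionary. The sample payload is invented for illustration (only the metric names appear elsewhere in this guide), and the parser deliberately ignores optional timestamps.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus text-format samples into {series: value}."""
    samples: Dict[str, float] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue
        series, value = parts
        samples[series] = float(value)
    return samples


# Invented payload; the real endpoint output will differ.
SAMPLE = """\
# TYPE signaling_active_connections gauge
signaling_active_connections 42
# TYPE signaling_messages_forwarded_total counter
signaling_messages_forwarded_total 1234
"""
```

Calling `parse_metrics(SAMPLE)` yields one entry per metric, which is enough for quick scripted checks outside a full Prometheus setup.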
Available Metrics
Connection Metrics:
Implementation Reference
See Server/metrics.js:1-117 for the complete metrics implementation.
Matchmaker Monitoring
Host Health Tracking
The matchmaker monitors host heartbeats:
Heartbeat Endpoint:
Host TTL Monitoring
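Heartbeat-based TTL tracking can be sketched as follows: each heartbeat resets a host's TTL, and a host whose TTL reaches zero is no longer considered live. The class is a hypothetical in-memory stand-in, not the matchmaker's actual storage, and the TTL value used in the usage note is an assumption.

```python
import time
from typing import Callable, Dict, List, Optional


class HeartbeatTracker:
    """In-memory stand-in for host TTL tracking (illustrative only)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, host_id: str) -> None:
        # Each heartbeat resets the host's TTL to the full value.
        self._last_seen[host_id] = self._clock()

    def remaining_ttl(self, host_id: str) -> Optional[float]:
        seen = self._last_seen.get(host_id)
        if seen is None:
            return None
        return max(0.0, self._ttl - (self._clock() - seen))

    def live_hosts(self) -> List[str]:
        return sorted(h for h in self._last_seen if self.remaining_ttl(h) > 0)

    def expiring_soon(self, threshold_s: float = 10.0) -> List[str]:
        # Hosts that are still live but have less than threshold_s left.
        return sorted(h for h in self._last_seen
                      if 0 < self.remaining_ttl(h) < threshold_s)
```

With an assumed 30-second TTL, `HeartbeatTracker(30.0)` would report a host in `expiring_soon()` once more than 20 seconds pass without a heartbeat.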
Health Check Endpoints
Signaling Server
Liveness Probe:
Matchmaker
Health Endpoints:
The health endpoints return 200 OK immediately so that Railway does not kill the container.
Host Configuration Monitoring
Monitor these key settings from config.json:
Video Configuration
Capture Settings
Audio Configuration
Redis Monitoring
Circuit Breaker
Protects against Redis failures:
- New connections rejected with 1013 Service unavailable
- Existing connections continue working
- Circuit auto-closes after timeout
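The behavior described above follows the general circuit breaker pattern, sketched below. The failure threshold, the timeout value, and all names are assumptions for illustration. Only the 1013 rejection and the automatic close after a timeout come from the list above.

```python
import time
from typing import Callable, Optional, Tuple


class CircuitBreaker:
    """Opens after repeated failures and closes again after a timeout."""

    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold and self._opened_at is None:
            self._opened_at = self._clock()

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self._reset_timeout:
            # Timeout elapsed: close the circuit automatically.
            self.record_success()
            return False
        return True


def admit_connection(breaker: CircuitBreaker) -> Optional[Tuple[int, str]]:
    """Return None to admit a new connection, or a (code, reason) rejection."""
    if breaker.is_open():
        return (1013, "Service unavailable")
    return None
```

Only the admission path consults the breaker, which is why connections that are already established keep working while the circuit is open.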
Connection Status
Monitoring Best Practices
Alerting Thresholds
Critical Alerts:
- WebRTC packet loss > 5%
- RTT > 150ms for a sustained period
- Audio queue depth > 2.8 (CRITICAL)
- Buffer pool hit rate < 80%
- Redis circuit breaker open
- Signaling server Redis disconnected
Warning Alerts:
- WebRTC packet loss > 2%
- RTT > 100ms
- Audio queue depth > 2.0 (WARNING)
- Buffer pool hit rate < 90%
- Host heartbeat TTL < 10 seconds
- Rate limit drops increasing
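The numeric thresholds in this section can be collected into one rule table and evaluated against a snapshot of current values. This is a sketch under assumptions: the metric keys are hypothetical names chosen for the example, the "sustained period" condition on RTT is not modeled, and the rate limit trend is omitted because it needs history rather than a single snapshot.

```python
import operator
from typing import Dict

# (metric key, comparison, warning threshold, critical threshold or None)
NUMERIC_RULES = [
    ("packet_loss_percent", operator.gt, 2.0, 5.0),
    ("rtt_ms", operator.gt, 100.0, 150.0),
    ("audio_queue_depth", operator.gt, 2.0, 2.8),
    ("buffer_pool_hit_rate_percent", operator.lt, 90.0, 80.0),
    ("host_heartbeat_ttl_s", operator.lt, 10.0, None),
]

# Boolean conditions that are always critical when true.
CRITICAL_FLAGS = ["redis_circuit_open", "redis_disconnected"]


def evaluate_alerts(metrics: Dict[str, object]) -> Dict[str, str]:
    """Return {metric key: "warning" | "critical"} for every breached rule."""
    alerts: Dict[str, str] = {}
    for key, compare, warn, crit in NUMERIC_RULES:
        if key not in metrics:
            continue
        value = metrics[key]
        # Check the critical threshold first so it takes precedence.
        if crit is not None and compare(value, crit):
            alerts[key] = "critical"
        elif compare(value, warn):
            alerts[key] = "warning"
    for key in CRITICAL_FLAGS:
        if metrics.get(key):
            alerts[key] = "critical"
    return alerts
```

Keeping the thresholds in a table makes it straightforward to keep them in sync with the alerting rules configured in the monitoring system.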
Log Aggregation
Key Log Patterns:
Grafana Dashboard Example
Panels to Include:
- Active connections (signaling_active_connections)
- Message throughput (rate(signaling_messages_forwarded_total[1m]))
- Redis latency (signaling_redis_cmd_latency_seconds)
- WebRTC packet loss percentage
- Audio queue depth over time
- Buffer pool hit rate
- Host heartbeat count
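The message throughput panel relies on the PromQL rate() function, which turns a monotonically increasing counter into a per-second rate. The sketch below shows the core arithmetic on raw samples. It is a simplification: it handles counter resets but omits the extrapolation to window boundaries that Prometheus performs, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple


def per_second_rate(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Average per-second increase of a counter.

    samples holds (timestamp_seconds, counter_value) pairs, oldest first.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter dropped, so assume it restarted from zero.
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed
```

For instance, a counter that reads 100, 160, and 220 at 0, 30, and 60 seconds increased by 120 over 60 seconds, giving a rate of 2 messages per second.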
Next Steps
- Performance Tuning - Optimize performance using monitoring data
- Troubleshooting - Debug issues found in monitoring