Serving Signals
Key signals: queue age, batch size, timeout count, retry count, GPU memory, CPU preprocessing time, cold starts, autoscaler lag, and fallback rate.
First Comparison
Break the same signals down by model version and traffic cohort. A p99 latency spike while model execution time stays flat points to time spent outside the model: queueing, routing, autoscaling, batching, retries, or dependency calls. A spike in model time itself points to longer inputs, larger beam sizes, precision changes, cache misses, device placement, or a new model artifact.
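The comparison can be sketched as a small triage routine. This is a minimal sketch under stated assumptions. Each request record carries hypothetical fields `total_ms` (end-to-end latency) and `model_ms` (time inside model execution). p99 uses the nearest-rank method. `spike_ratio=1.5` is an illustrative threshold, not a recommended value. Requests are grouped by (model version, cohort) so that a spike in one cohort is not diluted by the rest of the traffic.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    # Hypothetical per-request record; field names are illustrative.
    model_version: str
    cohort: str
    total_ms: float  # end-to-end latency seen by the caller
    model_ms: float  # time spent inside model execution only


def p99(values):
    """Nearest-rank 99th percentile, using exact integer math for the rank."""
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def group(requests):
    """Bucket requests by (model_version, cohort) so like is compared with like."""
    buckets = defaultdict(list)
    for r in requests:
        buckets[(r.model_version, r.cohort)].append(r)
    return buckets


def triage(baseline, current, spike_ratio=1.5):
    """Classify each group's p99 change; spike_ratio is a hypothetical threshold.

    Assumes every group has positive baseline latencies (no division by zero).
    Groups present in only one window are skipped.
    """
    base, cur = group(baseline), group(current)
    verdicts = {}
    for key in sorted(base.keys() & cur.keys()):
        total_ratio = p99([r.total_ms for r in cur[key]]) / p99([r.total_ms for r in base[key]])
        model_ratio = p99([r.model_ms for r in cur[key]]) / p99([r.model_ms for r in base[key]])
        if model_ratio >= spike_ratio:
            # Inputs, beam size, precision, cache misses, placement, new artifact.
            verdicts[key] = "model-time"
        elif total_ratio >= spike_ratio:
            # Queues, routing, autoscaling, batching, retries, dependency calls.
            verdicts[key] = "outside-model"
        else:
            verdicts[key] = "no-spike"
    return verdicts


# Synthetic example: "web" slows down outside the model, "api" slows down inside it.
baseline = [Request("v7", c, 100.0 + i, 40.0) for c in ("api", "web") for i in range(100)]
current = (
    [Request("v7", "web", 300.0 + i, 41.0) for i in range(100)]
    + [Request("v7", "api", 160.0 + i, 90.0) for i in range(100)]
)
print(triage(baseline, current))
# {('v7', 'api'): 'model-time', ('v7', 'web'): 'outside-model'}
```

Model time is checked first because a model-time spike also raises end-to-end latency. If total latency were tested first, a slower model artifact would be misfiled as an outside-model problem.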