Lab 1: Load-Test Gate
Given aggregate load-test metrics by route, decide whether the release
can ship. Each route has p95 latency, error rate, cost per successful
turn, and task success delta versus baseline.
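The four per-route metrics can be derived from raw per-turn samples before the gate runs. The following is a minimal sketch of one way to do that, not a prescribed pipeline. The sample keys (`latency_ms`, `error`, `task_success`, `cost_usd`), the function name `summarize_route`, and the nearest-rank p95 method are all assumptions made for illustration.

```python
def summarize_route(samples, baseline_success_rate):
    """Collapse raw per-turn samples into the four gate metrics.

    The sample keys used here are hypothetical, not a fixed schema.
    """
    if not samples:
        raise ValueError("no samples for route")
    n = len(samples)
    latencies = sorted(s["latency_ms"] for s in samples)
    # Nearest-rank p95: smallest value with at least 95% of samples at or
    # below it. Integer arithmetic avoids float rounding in the rank.
    rank = (95 * n + 99) // 100
    errors = sum(1 for s in samples if s["error"])
    successes = sum(1 for s in samples if s["task_success"])
    total_cost = sum(s["cost_usd"] for s in samples)
    return {
        "p95_ms": latencies[rank - 1],
        "error_rate": errors / n,
        # Cost is spread over successful turns only, so failed turns make
        # each success more expensive.
        "cost_per_success": total_cost / successes if successes else float("inf"),
        "task_success_delta": successes / n - baseline_success_rate,
    }


# Twenty synthetic turns: one error, two task failures, latencies 100..2000 ms.
samples = [
    {"latency_ms": 100 * (i + 1), "error": i == 0, "task_success": i >= 2, "cost_usd": 0.01}
    for i in range(20)
]
chat_metrics = summarize_route(samples, baseline_success_rate=0.85)
```

The resulting dictionary has the same keys the gate reads for each route, so a `routes` mapping can be built by calling this once per route.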
Hidden answer: invariant, tests, and Python solution
Invariant: a route can only pass if latency and error rate are inside
budget and cost is either inside budget or offset by a sufficient
task-success gain. Test a clean pass, a latency fail, a cost overrun
with and without a quality gain, and missing route metrics.
def evaluate_load_gate(routes, budgets):
    failures = []
    # Iterate over budgets so a budgeted route with no metrics fails the gate
    # instead of being silently skipped.
    for name, budget in budgets.items():
        metrics = routes.get(name)
        if metrics is None:
            failures.append((name, "missing_metrics", None))
            continue
        if metrics["p95_ms"] > budget["p95_ms"]:
            failures.append((name, "latency", metrics["p95_ms"]))
        if metrics["error_rate"] > budget["error_rate"]:
            failures.append((name, "errors", metrics["error_rate"]))
        # A cost overrun is tolerated only when task success improves enough.
        cost_over = metrics["cost_per_success"] > budget["cost_per_success"]
        quality_gain = metrics["task_success_delta"] >= budget.get("min_gain_for_cost_overrun", 0.02)
        if cost_over and not quality_gain:
            failures.append((name, "cost_without_quality_gain", metrics["cost_per_success"]))
    return {"ship": not failures, "failures": failures}
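The test cases named in the invariant can be written as plain assertions. The sketch below restates the gate so the snippet runs on its own. The `missing_metrics` failure for a budgeted route with no metrics is an assumption about how that case should be handled, and the route name `chat`, the budget values, and the `make_metrics` helper are hypothetical.

```python
def evaluate_load_gate(routes, budgets):
    failures = []
    for name, budget in budgets.items():
        metrics = routes.get(name)
        if metrics is None:
            # Assumed behavior: a budgeted route with no metrics blocks the release.
            failures.append((name, "missing_metrics", None))
            continue
        if metrics["p95_ms"] > budget["p95_ms"]:
            failures.append((name, "latency", metrics["p95_ms"]))
        if metrics["error_rate"] > budget["error_rate"]:
            failures.append((name, "errors", metrics["error_rate"]))
        cost_over = metrics["cost_per_success"] > budget["cost_per_success"]
        quality_gain = metrics["task_success_delta"] >= budget.get("min_gain_for_cost_overrun", 0.02)
        if cost_over and not quality_gain:
            failures.append((name, "cost_without_quality_gain", metrics["cost_per_success"]))
    return {"ship": not failures, "failures": failures}


# Hypothetical budget for a single route.
BUDGETS = {"chat": {"p95_ms": 800, "error_rate": 0.01, "cost_per_success": 0.05}}


def make_metrics(**overrides):
    """Return in-budget metrics, with selected fields overridden per test."""
    base = {"p95_ms": 600, "error_rate": 0.002, "cost_per_success": 0.04, "task_success_delta": 0.0}
    base.update(overrides)
    return base


clean = evaluate_load_gate({"chat": make_metrics()}, BUDGETS)
slow = evaluate_load_gate({"chat": make_metrics(p95_ms=950)}, BUDGETS)
pricey_with_gain = evaluate_load_gate(
    {"chat": make_metrics(cost_per_success=0.07, task_success_delta=0.03)}, BUDGETS
)
pricey_no_gain = evaluate_load_gate({"chat": make_metrics(cost_per_success=0.07)}, BUDGETS)
missing = evaluate_load_gate({}, BUDGETS)

assert clean == {"ship": True, "failures": []}
assert slow["failures"] == [("chat", "latency", 950)]
assert pricey_with_gain["ship"] is True
assert pricey_no_gain["failures"] == [("chat", "cost_without_quality_gain", 0.07)]
assert missing["failures"] == [("chat", "missing_metrics", None)]
```

The two cost cases differ only in `task_success_delta`, which isolates the rule that a cost overrun is tolerated only when the quality gain reaches the threshold (0.02 by default).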
Lab 2: Retry Storm Detector
Detect when client or service retries are amplifying a dependency
failure during a voice-agent incident.
Hidden answer: common mistakes and Python solution
Common mistakes are alerting on request count alone, ignoring the
baseline, and missing the dependency error signal. Retry storms show
more attempts per original user turn plus rising dependency errors.
def retry_storm_windows(points, attempt_ratio_threshold=1.8, dependency_error_threshold=0.05):
    alerts = []
    for point in points:
        # Guard against division by zero in minutes with no user turns.
        user_turns = max(point["user_turns"], 1)
        attempt_ratio = point["service_attempts"] / user_turns
        # Require both signals: amplified attempts and a failing dependency.
        if attempt_ratio >= attempt_ratio_threshold and point["dependency_error_rate"] >= dependency_error_threshold:
            alerts.append({
                "minute": point["minute"],
                "attempt_ratio": round(attempt_ratio, 2),
                "dependency_error_rate": point["dependency_error_rate"],
                "action": "cap_retries_and_enable_fallback",
            })
    return alerts
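One of the common mistakes listed above is ignoring the baseline. The following sketch shows one possible way to make the attempt-ratio check relative to normal traffic instead of a fixed threshold. The function name `retry_storm_vs_baseline`, the `baseline_attempt_ratio` and `ratio_multiplier` parameters, and the sample data are hypothetical and not part of the lab solution.

```python
def retry_storm_vs_baseline(points, baseline_attempt_ratio, ratio_multiplier=1.5,
                            dependency_error_threshold=0.05):
    """Flag minutes where attempts per user turn exceed the baseline by a factor.

    baseline_attempt_ratio is assumed to be measured during normal operation,
    for example the typical attempts per turn over a quiet week.
    """
    if baseline_attempt_ratio <= 0:
        raise ValueError("baseline_attempt_ratio must be positive")
    alerts = []
    for point in points:
        user_turns = max(point["user_turns"], 1)
        attempt_ratio = point["service_attempts"] / user_turns
        # How many times more attempts per turn than in normal operation.
        amplification = attempt_ratio / baseline_attempt_ratio
        if (amplification >= ratio_multiplier
                and point["dependency_error_rate"] >= dependency_error_threshold):
            alerts.append({
                "minute": point["minute"],
                "attempt_ratio": round(attempt_ratio, 2),
                "amplification": round(amplification, 2),
                "dependency_error_rate": point["dependency_error_rate"],
            })
    return alerts


points = [
    {"minute": 0, "user_turns": 100, "service_attempts": 120, "dependency_error_rate": 0.01},
    # Many attempts per turn, but the dependency is healthy: not a retry storm.
    {"minute": 1, "user_turns": 100, "service_attempts": 300, "dependency_error_rate": 0.01},
    # Attempts doubled relative to baseline while the dependency fails: alert.
    {"minute": 2, "user_turns": 100, "service_attempts": 240, "dependency_error_rate": 0.12},
    # Traffic grew, but attempts per turn stayed near baseline: no alert.
    {"minute": 3, "user_turns": 200, "service_attempts": 250, "dependency_error_rate": 0.20},
]
alerts = retry_storm_vs_baseline(points, baseline_attempt_ratio=1.2)
```

Only minute 2 is flagged. Minute 3 has the highest raw attempt count and the highest dependency error rate, but its attempts per turn match the baseline, which is the case an alert on request count alone would get wrong.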
Lab 3: Capacity Step-Load Summary
Summarize a step-load test and find the first concurrency level where
the system violates its latency or error budget.
Hidden answer: invariant and Python solution
Invariant: the supported capacity is the last passing step before
the first sustained violation. This avoids claiming capacity from a
single lucky point after overload has started.
def first_capacity_break(steps, p95_budget_ms, error_budget):
    # Steps are expected in order of increasing concurrency.
    last_passing = None
    for step in steps:
        passes = step["p95_ms"] <= p95_budget_ms and step["error_rate"] <= error_budget
        if passes:
            last_passing = step["concurrency"]
            continue
        # Stop at the first failing step; later passing steps are ignored.
        return {
            "max_supported_concurrency": last_passing,
            "first_failing_concurrency": step["concurrency"],
            "p95_ms": step["p95_ms"],
            "error_rate": step["error_rate"],
        }
    # No step failed: the break point lies beyond the tested range.
    return {"max_supported_concurrency": last_passing, "first_failing_concurrency": None}
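The invariant speaks of a sustained violation, while the solution above stops at the first failing step. As a hedged variant, the sketch below shows one possible reading of "sustained": a break is declared only after a run of consecutive failing steps. The function name `first_sustained_break` and the `min_consecutive_failures` parameter are hypothetical, and treating an isolated failing step as noise is an assumption that may not suit every test.

```python
def first_sustained_break(steps, p95_budget_ms, error_budget, min_consecutive_failures=2):
    """Find the first run of failing steps at least min_consecutive_failures long.

    Steps are assumed to be ordered by increasing concurrency.
    """
    def report(last_passing, first_failed):
        return {
            "max_supported_concurrency": last_passing,
            "first_failing_concurrency": first_failed["concurrency"],
            "p95_ms": first_failed["p95_ms"],
            "error_rate": first_failed["error_rate"],
        }

    last_passing = None
    failing_run = []  # consecutive failing steps seen so far
    for step in steps:
        passes = step["p95_ms"] <= p95_budget_ms and step["error_rate"] <= error_budget
        if passes:
            # A pass after a short failing run resets it (treated as noise).
            last_passing = step["concurrency"]
            failing_run = []
            continue
        failing_run.append(step)
        if len(failing_run) >= min_consecutive_failures:
            return report(last_passing, failing_run[0])
    if failing_run:
        # The test ended during a failing run; report it conservatively.
        return report(last_passing, failing_run[0])
    return {"max_supported_concurrency": last_passing, "first_failing_concurrency": None}


steps = [
    {"concurrency": 10, "p95_ms": 300, "error_rate": 0.001},
    {"concurrency": 20, "p95_ms": 350, "error_rate": 0.001},
    {"concurrency": 30, "p95_ms": 900, "error_rate": 0.001},  # isolated blip
    {"concurrency": 40, "p95_ms": 400, "error_rate": 0.001},
    {"concurrency": 50, "p95_ms": 950, "error_rate": 0.001},
    {"concurrency": 60, "p95_ms": 1200, "error_rate": 0.001},
]
sustained = first_sustained_break(steps, p95_budget_ms=800, error_budget=0.01)
strict = first_sustained_break(steps, p95_budget_ms=800, error_budget=0.01,
                               min_consecutive_failures=1)
```

With the default of two consecutive failures, the blip at 30 is ignored and the supported capacity is 40. With `min_consecutive_failures=1` the function stops at the first failing step and reports 20. The stricter setting is the safer choice when a passing step after a failure might be the lucky point the invariant warns about.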