Week 9: Agents Post-Deployment

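What you will learn this week: the world after an AI agent ships to production, covering observability, incident response, and SRE (Site Reliability Engineering). When your service breaks, how does an AI agent automatically triage, debug, and even remediate? On Friday, Resolve.ai's CTO Mayank Agarwal and Member of Technical Staff Milind Ganjoo give the guest lecture (Resolve is building AI-native on-call).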

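💡 For readers without a CS background: many of this week's concepts come from the production engineering world (Kubernetes, observability, SRE). If you only build side projects and have never run anything in production, reading just "Core Concepts Guide" and "Applications for Vibe Coders" is enough. This week leans heavily on medical analogies (observability ≈ chart notes + vital signs + imaging, incident response ≈ the emergency resuscitation workflow, postmortem ≈ M&M conference, on-call ≈ night float) to help you build intuition.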

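Learning Objectives

After completing this week, you should be able to:

Explain the core SRE triad (SLI/SLO/SLA) and why observability ≠ monitoring
Identify the observability needs specific to AI agents (token usage, tool call latency, hallucination rate, context drift)
Design an incident response workflow that includes an AI auto-triage step
Evaluate how Resolve.ai, PagerDuty AI, and Datadog's AI agent differ in capability for on-call scenarios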

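Core Concepts Guide

1. Intro to SRE: treating operations as a software engineering problem

W1-W8 were all about "getting the code written." But the real challenge of a production system is not writing it; it is what happens after launch: running 24×7, traffic that rises and falls, dependencies that break, customers who complain, and pages in the middle of the night. In this world, SRE (Site Reliability Engineering), introduced by Google, is the industry-consensus answer. The core of SRE is not "hire a group of ops people" but "have software engineers design the operations team, and treat manual operations as bugs to be automated away."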

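The Google SRE Book lays out three foundational disciplines that apply to every production system:

SLI / SLO / SLA (service level indicator / objective / agreement): the SLI is the measured value (e.g., p99 latency = 187 ms), the SLO is the internal target (99.9% of requests < 200 ms), and the SLA is the contractual promise to customers (miss it and you pay). The three run from the inside out and from strict to loose: the internal SLO is set tighter than the external SLA, so you get a warning before you breach the contract.
Error budget: since 100% availability is neither achievable nor worth it (each additional 9 costs exponentially more), you set an explicit SLO (e.g., 99.99%), and the remaining 0.01% is "budget you are allowed to spend on risk." New feature launches consume budget; once the budget is used up, releases stop.
50% rule: engineers may spend no more than half of their time on toil (repetitive manual work); the other half must go to engineering improvements. Break this rule and the SRE team grows linearly with traffic, just like a traditional sysadmin team.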
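To make the error budget arithmetic concrete, here is a minimal sketch. The `SloWindow` type, the function names, and the request counts are all invented for illustration; they do not come from the Google SRE Book or any library.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO window (for example, 30 days)."""
    total_requests: int
    good_requests: int  # requests that met the SLI threshold, e.g. < 200 ms


def sli(window: SloWindow) -> float:
    """Measured service level: the fraction of good requests."""
    return window.good_requests / window.total_requests


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = (1.0 - slo_target) * window.total_requests
    actual_bad = window.total_requests - window.good_requests
    return 1.0 - actual_bad / allowed_bad


def can_release(window: SloWindow, slo_target: float) -> bool:
    """Release-freeze policy: ship only while some budget is left."""
    return error_budget_remaining(window, slo_target) > 0.0


# Hypothetical window: 1,000,000 requests, 600 of them bad, SLO of 99.9%.
window = SloWindow(total_requests=1_000_000, good_requests=999_400)
print(f"SLI={sli(window):.4%}, budget left={error_budget_remaining(window, 0.999):.0%}")
```

With a 99.9% SLO over one million requests, 1,000 bad requests are allowed; 600 have been spent, so roughly 40% of the budget remains and releases can continue.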

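💡 In plain terms (medical analogy): an SLO is like the "target vital signs" for an inpatient. The goal is not to pin BP at exactly 120/80; it is "BP anywhere within 100-140/60-90 counts as ok." The error budget is "the tolerance between the lower bound and the target": with BP at 105 the patient is still ok and can take on challenging things (early ambulation, medication adjustments), but once the budget is spent (BP already at 100), hold off. A blame-free postmortem is like an M&M (morbidity & mortality) conference: the point is not to find which resident signed the wrong order, but to find systemic gaps (why didn't the EHR block this dose? why was the allergy not mentioned at handoff?).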

Applying the SRE mindset to LLM agents: your agent will break too, through hallucination, tool call timeouts, context overflow, and prompt injection. The key is to first quantify what "ok" means (e.g., hallucination rate < 0.5%, p95 response time < 5 s, tool success rate > 95%), and then use the error budget to manage release risk (if a prompt change pushes the hallucination rate up to 1%, roll back immediately).
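The rollback rule above can be sketched as a simple release gate. The thresholds are the example SLOs from the text; the `AgentMetrics` type, the function names, and the measured numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AgentMetrics:
    hallucination_rate: float  # fraction of sampled answers judged hallucinated
    p95_response_s: float      # 95th-percentile end-to-end response time, seconds
    tool_success_rate: float   # fraction of tool calls that succeeded


# Example SLO thresholds taken from the text.
MAX_HALLUCINATION_RATE = 0.005
MAX_P95_RESPONSE_S = 5.0
MIN_TOOL_SUCCESS_RATE = 0.95


def slo_violations(m: AgentMetrics) -> list:
    """Return the names of the SLOs that the measured metrics violate."""
    violations = []
    if m.hallucination_rate >= MAX_HALLUCINATION_RATE:
        violations.append("hallucination_rate")
    if m.p95_response_s >= MAX_P95_RESPONSE_S:
        violations.append("p95_response_s")
    if m.tool_success_rate <= MIN_TOOL_SUCCESS_RATE:
        violations.append("tool_success_rate")
    return violations


def release_decision(m: AgentMetrics) -> str:
    """Roll back as soon as any SLO is violated, otherwise keep the release."""
    return "rollback" if slo_violations(m) else "keep"


# Hypothetical measurements before and after a prompt change.
before = AgentMetrics(hallucination_rate=0.003, p95_response_s=3.2, tool_success_rate=0.97)
after = AgentMetrics(hallucination_rate=0.010, p95_response_s=3.2, tool_success_rate=0.97)
print(release_decision(before), release_decision(after))
```

A real gate would also need a minimum sample size before trusting a rate; that is left out here to keep the sketch short.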

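2. The three pillars of observability: logs / metrics / traces

Observability Basics clears up a pair of terms that are often used interchangeably: monitoring only tells you "what broke" (CPU is high, 5xx count is up); observability has to answer "why it broke." In a microservices architecture, a single user request may cross 10+ services, and the metrics of any one service alone cannot tell you which hop is slow. The three pillars of observability each do a distinct job, and none can be dropped: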

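Pillar	What it is	Medical analogy	LLM agent example
Logs	Discrete text event records (who, when, did what)	Chart notes: the entry-by-entry record of every clinic visit and every procedure	"User asked X, agent called tool Y, got result Z"
Metrics	Time-series numbers (QPS, p99 latency, error rate)	Vital signs: continuous HR / BP / SpO₂ monitoring	"average tokens per request", "tool call success rate"
Traces	The full path of one request from entry point to response	Imaging (CT / MRI): one look at the whole-body structure and where the problem sits	"user query → LLM call → MCP tool → DB → second LLM call", the entire path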

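A trace is made up of multiple spans. Each span is one unit of work (one DB query, one RPC call, one LLM call), and spans are arranged into a hierarchy through parent-child relationships. The key is context propagation: the trace ID must be passed across services for the spans to join into one complete path. OpenTelemetry (OTel) is the vendor-neutral instrumentation standard, supported by the major vendors (Datadog, New Relic, Honeycomb, Grafana), which means you instrument once and can switch vendors at any time.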

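💡 In plain terms: say you want to debug why an LLM agent "came back with this weird answer." Reading logs is like paging through chart notes one page at a time looking for clues (slow, easy to miss things). Reading metrics is like looking at a vital sign trend chart (you know things got worse in a certain window, but not why). Reading traces is like pulling up a "whole-body CT" of that one request: every step from the user query coming in to the response going out, with each step's input, output, and duration visible. A production agent without traces is a black box to debug.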

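The three pillars are only useful when they are correlated with each other: the trace ID should be written into every log line, and metrics should carry trace exemplars. That way, when you see a metric spike you can jump in one click to the traces from that window, and when you see an abnormal trace you can jump in one click to the related logs.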
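The span hierarchy, context propagation, and trace-ID-in-every-log-line ideas can be sketched with a small hand-rolled model. This is deliberately not the OpenTelemetry API; `Span`, `start_trace`, `child_span`, and `log_line` are hypothetical names used only to show the data flow.

```python
import json
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str                    # shared by every span in one request
    name: str                        # unit of work, e.g. "llm_call"
    parent_id: Optional[str] = None  # None marks the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])


def start_trace(name: str) -> Span:
    """Create the root span and mint a new trace ID at the entry point."""
    return Span(trace_id=uuid.uuid4().hex, name=name)


def child_span(parent: Span, name: str) -> Span:
    """Context propagation: the child inherits the trace ID and records its parent."""
    return Span(trace_id=parent.trace_id, name=name, parent_id=parent.span_id)


def log_line(span: Span, message: str) -> str:
    """Write the trace ID into the log line so logs and traces can be joined."""
    return json.dumps({
        "trace_id": span.trace_id,
        "span_id": span.span_id,
        "span": span.name,
        "message": message,
    })


# One request: user query -> LLM call -> MCP tool -> DB query.
root = start_trace("user_query")
llm = child_span(root, "llm_call")
tool = child_span(llm, "mcp_tool")
db = child_span(tool, "db_query")
print(log_line(db, "slow query"))
```

Because every log line carries the same `trace_id`, a search for that ID returns the logs of all four hops, which is the correlation the paragraph above asks for.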

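3. Observability needs specific to AI agents

The traditional three pillars are enough for a web service. An LLM agent adds several dimensions that traditional tools cannot see:

Token usage: how many input / output tokens each request used, and the cumulative cost. Skip this and the OpenAI / Anthropic bill will shock you.
Tool call latency: one agent task may chain 5-10 tool calls, and each tool's duration, success rate, and failure modes must be instrumented individually. One slow tool drags down the whole agent's p95.
Hallucination rate: the hardest to quantify but the most important. Common approaches: (a) run a fact-check pipeline on critical claims, (b) sample for human review, (c) use LLM-as-judge (another model acting as reviewer).
Context drift: in long conversations or multi-turn agent loops, as the context window fills up the model starts to deviate from the system prompt's instructions (colloquially, the "agent going off the rails"). You need to instrument the prompt size on each turn and the share of the context taken by each part.
Tool selection accuracy: when the agent has N tools available, did it pick the right one? Common metrics: correct-selection rate, average number of attempts, and recovery rate after picking the wrong tool.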
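Several of the metrics in this list reduce to simple aggregations over per-call records. The sketch below assumes a hypothetical `ToolCall` record with a reviewer-supplied `expected_tool` label, and uses placeholder token prices rather than any vendor's real pricing.

```python
import math
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str           # tool the agent actually chose
    expected_tool: str  # tool a reviewer labeled as correct (hypothetical label)
    latency_ms: float
    ok: bool            # whether the call succeeded


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def request_cost_usd(input_tokens, output_tokens, usd_per_1k_in, usd_per_1k_out):
    """Token cost of one request; prices are supplied by the caller."""
    return input_tokens / 1000 * usd_per_1k_in + output_tokens / 1000 * usd_per_1k_out


def tool_report(calls):
    """Aggregate latency, success rate, and selection accuracy over tool calls."""
    return {
        "p95_latency_ms": percentile([c.latency_ms for c in calls], 95),
        "success_rate": sum(c.ok for c in calls) / len(calls),
        "selection_accuracy": sum(c.tool == c.expected_tool for c in calls) / len(calls),
    }


# Made-up records for illustration.
calls = [
    ToolCall("search", "search", 120, True),
    ToolCall("db", "db", 340, True),
    ToolCall("search", "db", 95, True),    # wrong tool picked
    ToolCall("db", "db", 2100, False),     # slow and failed
]
print(tool_report(calls))
print(request_cost_usd(2000, 500, usd_per_1k_in=0.5, usd_per_1k_out=1.5))
```

Hallucination rate and context drift need labeled samples or per-turn prompt sizes as inputs, so they are not covered by this sketch.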

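💡 In plain terms: observability for a traditional web service is like an ICU monitor: measure BP / HR / SpO₂ / EtCO₂ and you roughly know the patient's condition. An LLM agent adds a "mental status" axis: even with every vital sign normal, the patient may be confused, disoriented, or hallucinating. You need dedicated instruments such as the GCS (Glasgow Coma Scale) or CAM-ICU (Confusion Assessment Method for the ICU) to measure that. Hallucination rate and context drift are the "mental status exam" for an LLM agent.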

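On the tooling side, Langfuse, Helicone, and Arize Phoenix are open-source / commercial solutions dedicated to LLM observability. In essence they extend the trace model to cover LLM calls and tool calls, then add token, cost, and hallucination-eval dimensions.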

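4. Multi-agent systems in SRE: the MDT meeting model

Multi-Agent Systems for AI-Native Engineering makes a key observation: a single agent doing production debugging is a sequential bottleneck. It can only investigate one hypothesis at a time, while the time pressure of a production incident calls for testing hypotheses in parallel.

The answer is a multi-agent architecture in which specialized agents each handle their own area and run in parallel. A typical multi-agent system for incident response looks like this: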

Agent	Specialty	Output
Trace agent	Follows distributed traces across services	"The latency spike is in service B's DB call"
DB agent	Analyzes DB performance and slow queries	"An N+1 query has been running for the past 10 minutes"
Deployment agent	Reviews recent CI/CD deployment records	"Service B shipped commit abc123 30 minutes ago"
Code diff agent	Reads the code change in that commit	"abc123 removed the index hint from the query"
Customer impact agent	Estimates the scope of impact	"12,000 users were affected in the past 10 minutes"
Coordinator agent	Merges the findings above into a conclusion	"Root cause: query regression in commit abc123; recommend rollback"

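💡 In plain terms (MDT meeting analogy): this architecture is the software version of a hospital MDT (multidisciplinary team) meeting. For a lung cancer case, cardiology assesses surgical risk, oncology proposes the chemo regimen, radiation oncology draws up the RT plan, pathology confirms the diagnosis, and the social worker assesses family support; finally the lead physician (the coordinator) pulls all the opinions together in the meeting and merges them into a conclusion. Each specialist works independently and in parallel, with a single merge at the end, which is much faster than one generalist consulting each specialist in turn. A production incident is likewise a multi-domain problem, and parallel multi-agent investigation can be several times faster than a sequential single agent.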

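The price of parallelism is that you need a formal coordination protocol to avoid race conditions and deadlocks, which is also why an MDT meeting has a chair, an agenda, and timing rules. Making this production-ready takes a rare dual expertise: deep production domain knowledge plus sophisticated AI architecture. Miss either side and you end up with a system that "investigates, but in the wrong direction." Resolve.ai's pitch is that it has both.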
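The fan-out and merge pattern can be sketched with `asyncio`. This is not Resolve.ai's architecture; the agent functions are stubs with invented findings, and the per-agent timeout is one simple coordination rule that keeps a stuck investigator from blocking the merge.

```python
import asyncio


async def trace_agent(incident: str) -> dict:
    await asyncio.sleep(0.01)  # stands in for querying a tracing backend
    return {"agent": "trace", "finding": "latency spike in service B DB call"}


async def deployment_agent(incident: str) -> dict:
    await asyncio.sleep(0.01)  # stands in for reading CI/CD history
    return {"agent": "deployment", "finding": "service B deployed commit abc123"}


async def failing_agent(incident: str) -> dict:
    # Simulates an investigator whose backend is unavailable.
    raise TimeoutError("metrics backend timed out")


async def coordinate(incident: str, agents, timeout_s: float = 1.0) -> dict:
    """Run every agent in parallel, then merge whatever came back in time."""
    tasks = [asyncio.wait_for(agent(incident), timeout_s) for agent in agents]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    findings = [r for r in results if isinstance(r, dict)]
    failures = [r for r in results if isinstance(r, Exception)]
    return {"findings": findings, "failed_agents": len(failures)}


report = asyncio.run(
    coordinate("500 spike on DB queries", [trace_agent, deployment_agent, failing_agent])
)
print(report)
```

`return_exceptions=True` means one failed or timed-out agent shows up as a counted failure instead of cancelling the whole investigation, so the coordinator can still conclude from partial evidence.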

Monday Lecture (11/17): Incident response and DevOps

Slides: Google Slides public link
Speaker: Mihail Eric

The following is a summary based on the public Google Slides content (TXT export):

Mihail opens with the framing number: "Coding represents just 30% of engineering time. The hard 70% is running code in production, where complexity, tool silos, knowledge gaps, and interdependencies all collide."

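The old world: the SRE pain-point list
- Operational monitoring: on-call, troubleshooting, infrastructure management, security
- Incident resolution requires piecing together information from multiple sources and teams
- Maintaining runbooks that are often already out of date
- Cloud-native + Kubernetes + containerization push data, dependencies, and complexity to new highs
- SREs commonly burn out from on-call shifts
- One estimate: downtime and service degradation cost the Global 2000 about $400 billion annually
Core discipline of Infrastructure & DevOps: the Four Golden Signals of monitoring
- Latency: track successful and failed requests separately (so that HTTP 500 failures do not distort the average; slow errors deserve particular attention)
- Traffic: system demand, usually req/sec, defined per system (sessions for streaming, transactions for a DB)
- Errors: failure rate, including explicit (HTTP 500), implicit (200 OK but wrong content), and policy-based (exceeding the SLA). Catching the different failure types takes multiple layers of monitoring
- Saturation: "how full the system is," tracking CPU / memory / I/O. Systems usually slow down before reaching 100%, so set a safe threshold
- Plus: monitor production traces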
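The four golden signals can be computed from raw request records in a few lines. In this sketch the `Request` type, the function name, and the sample data are invented, and saturation is simply passed in as a CPU utilization figure because it comes from the host, not from request logs.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float


def mean(values):
    return sum(values) / len(values) if values else 0.0


def golden_signals(requests, window_s, cpu_utilization):
    """Compute latency (split by outcome), traffic, errors, and saturation."""
    ok = [r.latency_ms for r in requests if r.status < 500]
    failed = [r.latency_ms for r in requests if r.status >= 500]
    return {
        "latency_ok_ms": mean(ok),          # successful requests only
        "latency_failed_ms": mean(failed),  # failed requests tracked separately
        "traffic_rps": len(requests) / window_s,
        "error_rate": len(failed) / len(requests),
        "saturation": cpu_utilization,
    }


# Made-up 2-second window: one fast failure and one very slow failure.
requests = [
    Request(200, 100),
    Request(200, 140),
    Request(500, 5),
    Request(500, 3000),
]
signals = golden_signals(requests, window_s=2.0, cpu_utilization=0.82)
print(signals)
```

Splitting latency by outcome matters here: the blended mean of all four requests is about 811 ms, which describes neither the healthy requests (120 ms) nor the failing ones.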
The 3:12 a.m. live drill: PagerDuty pings you about a spike of 500s on DB queries. What do you do? Mihail gives an 8-step incident playbook:
- Acknowledge & assess: acknowledge in PagerDuty, confirm severity, check the app + DB dashboards, decide whether it is a partial or full outage
- Check Golden Signals (DB first): are connections maxed out, stuck, or suddenly dropping? P95 latency spike? Slow queries? Timeouts / refused connections / aborted transactions? CPU / IOPS / locks / memory / replication lag?
- Look for what changed: recent deploy? DB migration? Config / feature flag change? Autoscaling event? If there is a correlation, roll back / revert immediately
- Localize failure: are all queries failing or only specific ones? Reads or writes? Primary or replica? A single shard / instance or all of them? App-side (pool / timeout) or DB-side (load / lock)?
- Apply fast mitigation: connection issue → restart pods / lower concurrency; DB saturation → turn off heavy jobs / throttle traffic / route reads to a replica; bad slow query → disable that feature / turn on a cached degraded mode; replica lag → route reads to the primary / restart the replica; unhealthy node → restart replicas only, escalate for the primary
- Stabilize & monitor: watch the 500 rate / DB latency / traffic return to normal, confirm there is no retry storm or cascading failure, and that health checks recover
- Communicate: a short update every 10-15 minutes (issue → action → status)
- Close out: timeline, root cause, follow-ups (query fix, indexing, scaling, retry tuning, capacity review)
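The "look for what changed" step in the playbook above is the easiest to automate: list every change event that landed shortly before the incident. The sketch below assumes change events are available as dictionaries with an `at` timestamp; the event data, the dates, and the 60-minute lookback are made up.

```python
from datetime import datetime, timedelta


def recent_changes(changes, incident_start, lookback=timedelta(minutes=60)):
    """Return changes that landed within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    hits = [c for c in changes if window_start <= c["at"] <= incident_start]
    return sorted(hits, key=lambda c: c["at"], reverse=True)


# Hypothetical incident at 03:12 and three change events.
incident_start = datetime(2025, 1, 15, 3, 12)
changes = [
    {"kind": "deploy", "what": "service B commit abc123", "at": datetime(2025, 1, 15, 2, 42)},
    {"kind": "migration", "what": "add orders index", "at": datetime(2025, 1, 14, 14, 0)},
    {"kind": "feature_flag", "what": "enable new checkout", "at": datetime(2025, 1, 15, 3, 5)},
]
suspects = recent_changes(changes, incident_start)
for change in suspects:
    print(change["kind"], change["what"], change["at"].isoformat())
```

Proximity in time is only a correlation, so the output is a list of rollback candidates to check, not a root cause.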
Metrics tracked: MTTR (mean time to repair), the number of engineers pulled into an incident, and the SLA reported to customers.
The new AI world: Resolve AI (W9 Friday guest), Datadog Bits AI Agent, Splunk Observability Assistant.
Characteristics of an AI SRE:
- Dynamically maintains a knowledge graph
- An agentic system spanning the observability stack and the cloud
- Generates a real-time narrative of "what is happening now," pinpoints the likely root cause with evidence, and gives prescriptive remediation
- Heavy emphasis on explainability and auditability
What has changed: AI scales organizational and service-level knowledge. The undocumented dependencies, brittle legacy services, and quirks that only surface during high-stakes incidents are no longer locked in the heads of a few senior engineers.
AI SRE in action (preview): observability, working theory, span info, heavily evidence-based, a chat interface for dynamic queries.
Limitations:
- The complexity of incidents it can handle is limited
- Heterogeneity of modern production stacks
- Going from detection to actually remediating code is still far off (all providers are starting with root cause analysis)
- Good RCA requires good monitoring "gardening": if you do not weed your tooling, AI cannot save you either
- Security: the agent may become a new attack vector

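Key takeaway: the SRE workflow (detect → check golden signals → look for change → localize → mitigate → stabilize → communicate → close out) is a mature 8-stage playbook. AI agents are not meant to reinvent it; they are meant to sharply cut the time spent in the first 5 stages: from the 3:12 a.m. page to a root cause by 4:00, and from "many engineers + tool silos + pieced-together information" to "single agent + dynamic knowledge graph + evidence-first." The remediation boundary remains, though: actions with a large blast radius must be authorized by a human.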

Friday Lecture (11/21): Mayank Agarwal + Milind Ganjoo (Resolve)

Speakers:
- Mayank Agarwal, Founder & CTO of Resolve AI (OpenTelemetry creator)
- Milind Ganjoo, Member of Technical Staff at Resolve AI (ex-DeepMind Staff MLE)
Slides: Drive link (public)

The following is a summary based on the public Drive PDF content (extracted with pdftotext). The talk is titled "Agentic AI for software in production":

Part 1: What software engineering really looks like (not the idealized version)

Software engineering is more complex than it looks: it spans systems (code, AI, telemetry, cloud, knowledge, security), teams (application, infra, networking), and workflows (development, deployment, on-call, cost management, compliance, security vulnerability, documentation). Software engineers spend 70+% of their time on grunt work: building context, working across tools, evidence gathering, log queries, coordination, application, documentation, compliance, on-call, optimization, deployment, problem solving. Only a small share of time goes to design decisions, trade-offs, and creative work.

Part 2: What actually happens after an on-call engineer is paged at 3:04 a.m.

Walking the timeline: 03:04 AM page → 03:20 AM L1 support takes over → 04:00 AM escalation to multiple teams (infra / app & product / DB / engineering manager / incident commander) → 04:45 AM engineers finally mitigate the problem → ? AM postmortem. The whole process depends on piecing things together from runbooks, observability, code, and infra, with nearly all manual effort: the incident commander coordinates people, L1 looks at infra, the app & product team looks at the application, and the director / comms manager handles external communication.

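Part 3: What makes production hard for humans (and for models)?

Data silos across many systems and tools: 1000s of services across 100s of teams, complex and fast-changing infra (DBs, messaging services…), and low-level tools for logs / metrics / dashboards / feature flags / CI-CD, each with its own query language, access mechanism, and operational behavior
Coordination across teams and areas of expertise: application engineers, platform engineers, SREs, IT ops, security engineers, and support engineers each have their specialty, but the context is usually fragmented and undocumented
Direct impact on revenue and cost: incidents drag on for hours or days before being resolved, tools are expensive to maintain, incidents regularly involve 20+ engineers, changes are hard to make and trigger new issues, infrastructure spend keeps climbing, customers often notice problems first, only a handful of engineers truly understand the whole of production, onboarding a new hire takes 3-6 months, and AI-generated code amplifies all of these problems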

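Part 4: How should AI help engineers manage production?

Resolve's agent-first approach combines three things: (1) understand and operate all your production tools, (2) capture tribal knowledge of your unique system, (3) combine expertise of all your engineers across teams. Mayank breaks this into 3 design principles, each mapped to a product capability:

Design principle 1: Production systems are complex and constantly changing → AI must understand production deeply

Connect to code, infra, tools, and knowledge
Model "your system" (not a generic SRE, but your company's architecture)
Navigate the graph to the right node to gather evidence
Operate each tool / system like an expert
Connect multi-source signals such as alerts / metrics / dashboards / traces / logs / runbooks / change events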

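Design principle 2: Knowledge is fragmented or undocumented → AI must capture tribal knowledge and get smarter over time

Capture company and team knowledge
Remember feedback / teaching from humans in the loop
Retrieve context-specific information

Design principle 3: Investigation requires cross-team expertise → AI must combine the expertise of all engineers

Create an investigation plan
Pursue multiple hypotheses in parallel (multiple agents investigating different directions concurrently)
Keep refining the plan until it reaches the root cause
Let multiple people collaborate across org boundaries to reach an answer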

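Part 5: Lessons learned building AI for prod

This is not only a model problem: navigating production takes a great deal of domain expertise, which has to be hard-coded into the architecture. Prompt engineering alone cannot produce production AI
Context windows are finite; production context is infinite: 10M log lines will not fit into any context window. Intelligence is "knowing WHAT to query, WHEN, and HOW to filter" based on production understanding
Working with tools is a non-trivial problem: raw APIs are unusable as-is, because responses are too large, output is noisy, and they were designed for humans. You have to build an AI system that can filter noise, return structured summaries, handle errors gracefully, and work in parallel
Evals are as hard as the product: building evals means reproducing production complexity (services, dependencies, and so on), and without evals you cannot trust the output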
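The tool-wrapping lesson (filter noise, return a structured summary, handle errors gracefully) can be illustrated with a toy log tool. This is only a sketch of the idea, not how Resolve builds its tools; the log format, the function names, and the sample lines are assumptions.

```python
import re
from collections import Counter


def summarize_logs(lines, max_patterns=3):
    """Collapse raw log lines into a small structured summary for an LLM."""
    errors = [ln for ln in lines if " ERROR " in ln]
    # Replace digits so "timeout after 5003 ms" and "... 4870 ms" group together.
    patterns = Counter(
        re.sub(r"\d+", "<n>", ln.split(" ERROR ", 1)[1]) for ln in errors
    )
    return {
        "total_lines": len(lines),
        "error_lines": len(errors),
        "top_error_patterns": patterns.most_common(max_patterns),
    }


def safe_tool_call(fetch_logs):
    """Wrap a raw fetch so the agent always receives a structured result."""
    try:
        return {"ok": True, "summary": summarize_logs(fetch_logs())}
    except Exception as exc:  # report the failure instead of crashing the agent loop
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}


# Made-up log lines in an assumed "<time> <LEVEL> <message>" format.
sample_lines = [
    "03:12:01 INFO request ok",
    "03:12:02 ERROR db timeout after 5003 ms",
    "03:12:03 ERROR db timeout after 4870 ms",
    "03:12:04 ERROR connection refused on port 5432",
]


def broken_fetch():
    raise ConnectionError("backend unreachable")


print(safe_tool_call(lambda: sample_lines))
print(safe_tool_call(broken_fetch))
```

The agent sees a few counted patterns instead of raw lines, which is the "HOW to filter" part of the lesson; deciding which logs to fetch in the first place still needs production knowledge.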

Part 6: AI is changing software engineering

Mayank predicts that "By next year software engineering will look fundamentally different." The ratio of grunt work to creative work across three eras:

Models era (pure model API): grunt work dominates
Agents era: grunt work is still the larger share, but the creative-work share rises
Closed-loop agents era: grunt work shrinks sharply and creative work dominates

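Key takeaway: the biggest lesson from Resolve for vibe coders is not "how to use an AI agent" but "why production AI cannot be built with prompt engineering alone." It requires (a) hard-coding domain expertise into the architecture, (b) knowing when and how to query and filter rather than stuffing in raw data, (c) wrapping messy raw APIs into AI-friendly tools, and (d) accepting that evals are as hard to build as the product. For founders of future AI vertical startups this is an essential engineering reality check: production engineering is a killer app for LLM agents, but the bar is an order of magnitude higher than for a general chat agent. For your own side project, at least learn the first five steps of Monday's 8-step playbook (receive the alert → check the golden signals → look at what changed recently → localize → mitigate) and be your own SRE.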

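Reading Summary

Title	Source	One-sentence takeaway
Introduction to SRE	Google SRE Book	SRE = treating operations as a software engineering problem; the error budget quantifies the dev/ops conflict, and blame-free postmortems find systemic gaps
Observability Basics	last9.io	The three pillars of observability: logs (chart notes) + metrics (vital signs) + traces (imaging); context propagation is the key
Kubernetes Troubleshooting with AI	resolve.ai	The three big K8s pain points (alert fatigue, ephemeral context, observability fragmentation), addressed by an AI agent + knowledge graph
Your New Autonomous Teammate	resolve.ai	Deep tour of the Resolve product: dynamic knowledge graph + just-in-time runbooks + 1-minute root cause + automatic postmortems
Multi-Agent Systems for AI-Native Engineering	resolve.ai	A single agent is a sequential bottleneck; multiple agents investigating root cause in parallel is the core of AI-native engineering
Top 5 Benefits of Agentic AI in On-call	resolve.ai	Five benefits: eliminating alert fatigue, living knowledge, consistent investigations, evidence-based collaboration, proactively finding latent problems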

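Reading order: start with Introduction to SRE (to build the mental model) → Observability Basics (basic tooling concepts) → Multi-Agent Systems (agent architecture) → Your New Autonomous Teammate (a concrete commercial case). If you are short on time, the first three are required reading.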

Assignment

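The original syllabus lists no weekly assignment for this week. Self-learners are encouraged to practice by setting up observability for their own side project with Sentry or Better Stack: (1) create an account (both have a free tier) → (2) connect it to your side project (install the SDK via npm/pip, about 5 minutes) → (3) deliberately trigger an error and check whether the dashboard catches it → (4) set up an alert (e.g., error rate > 1%) that goes to Discord / email → (5) once it fires, run through one full incident response cycle (read the trace to find the root cause → fix → write a postmortem). Estimated time is 2-3 hr, and it builds the muscle memory for a full production workflow.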

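Applications for Vibe Coders

W9 is the week vibe coders are most likely to skip, and the one they most regret skipping. Most people ship side projects with no observability at all, and only when a bug hits do they discover they cannot see anything and have to guess by intuition. Here is how this week's concepts map onto a daily workflow: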

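Hook up Sentry on your very first production project: for vibe coders, Sentry is a top-priority (P1) pick. The free tier gives 5K events per month, enough for a hobby project. After npm install, about 3 lines of code wire it up; it automatically captures unhandled exceptions with stack traces and source maps (so you see the original TypeScript instead of minified JS). Running without Sentry is like driving without a dashboard
Pick one of three: Sentry / Better Stack / Vercel Analytics
- Sentry focuses on error tracking + performance traces; it is the all-rounder. Highest priority, the default choice
- Better Stack (formerly Logtail + Better Uptime) focuses on log aggregation + uptime monitoring, with a polished UI and strong SQL-like log queries. Choose it if you have a high log volume
- Vercel Analytics is only for projects deployed on Vercel and focuses on web vitals + page views; it is "minimum viable observability." It works alongside Sentry without conflict
Add structured logging on day 1: do not write console.log("user clicked"); emit JSON logs that include request ID / user ID / timestamp / event: logger.info({ user_id, action: "click", item: "checkout" }). Sentry and Better Stack both parse JSON log fields automatically into filterable metadata. The rewrite costs almost nothing and the payoff is surprisingly high
Add Langfuse / Helicone to LLM apps: if your side project uses the OpenAI / Anthropic API, standard observability cannot see token usage or hallucination. Langfuse is open source, free to self-host, and takes about 5 minutes to wire up via its SDK; it gives you a full record of prompt + response + tool calls + cost. Looking back a month later, you may well find that something like 80% of the cost comes from 5% of requests you never suspected, and without this data you cannot optimize
The trigger for production-readiness is not "having users" but "you would get paged at night": most vibe projects never reach that point, so do not over-engineer. Just (a) error tracking and (b) a basic uptime check (the Better Stack free tier gives 10 monitors) cover about 90% of production-ready needs. Kubernetes, Datadog, and PagerDuty are all over-engineering at this stage; wait until your side project actually has paying customers
Apply postmortem culture to your own side project: for every production bug, write a postmortem of about 30 words (root cause + fix + lesson) and save it in POSTMORTEMS.md in the repo. Half a year later you will see that you keep repeating the same class of bug; this doc is your personal version of a dynamic knowledge graph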
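For projects with a Python backend, the structured logging item above can be approximated with the standard `logging` module. This is a minimal sketch: the `JsonFormatter` class, the `fields` key passed through `extra`, and the sample IDs are conventions invented here, not part of Sentry's or Better Stack's SDKs.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Structured fields arrive via logger.info(..., extra={"fields": {...}}).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def make_logger(name="app"):
    """Build a logger that writes JSON lines to stdout."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = make_logger()
logger.info(
    "click",
    extra={"fields": {"request_id": "req-123", "user_id": "u-42", "item": "checkout"}},
)
```

Each call emits a single JSON line, so a log backend can filter on `user_id` or `request_id` without parsing free text.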

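💡 Day-1 quick win for vibe coders: go to sentry.io today and create an account (30 seconds), pick the side project you care about most, follow the 5-minute setup wizard to wire up the SDK, and deploy. Next time there is a bug you will not have to start debugging from a user's complaint that "I clicked and nothing happened": Sentry emails you the stack trace plus the user / request context at the time of the error, and you may have it fixed before any user complains. This ROI is among the highest of any dev tool.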
