AI-Assisted Assessment of Coding Practices in Modern Code Review

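One-sentence summary: AutoCommenter, an LLM-based code review assistant deployed internally at Google, covers 68% of the best practices that human reviewers cite; 66% of those fall outside what traditional linters can catch, and the measured comment-resolution rate is about 40%.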

Core argument

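This AIware '24 paper, written by a team from Google and the University of Washington, reports on developing, deploying, and evaluating AutoCommenter for tens of thousands of developers inside Google. AutoCommenter is an LLM-backed code review assistant covering four languages (C++, Java, Python, and Go). Its goal is to automate the detection of best practice violations so that human reviewers can focus on functionality. Three key findings: (1) Coverage: AutoCommenter can produce comments for 68% of the best practices that human reviewers frequently cite, and 66% of those are beyond the reach of traditional static analysis (linters), for example comment clarity and justified naming exceptions; (2) Resolution rate: about 40% of AutoCommenter comments are actually resolved by the author in a subsequent commit (in a manual sample of 40, 80% of the changes were confirmed to genuinely address the comment); (3) Lessons learned: intrinsic evaluation (offline metrics) and real-world performance often diverge; user trust, once damaged by a few false positives, is hard to regain; and a simple mechanism such as URL suppression can raise user acceptance above 80%.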

Key concepts

AutoCommenter: Google's internal LLM-backed code review assistant, covering 4 languages and used daily by tens of thousands of developers.
Best Practice Coverage Beyond Linter: 66% of best practice violations cannot be expressed with traditional static analysis.
Comment Resolution Rate: the share of comments that actually lead the author to change the code. AutoCommenter reaches about 40%, close to the ~50% of human reviewer comments.
URL Suppression: stop triggering comments for best practice URLs with a high false-positive rate. In practice this is the single most effective action for raising user acceptance.
Intrinsic vs Extrinsic Evaluation: good offline metrics do not guarantee acceptance in production; continuous monitoring of user feedback is needed after deployment.
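
These notes do not say how URL suppression is implemented, so the following is only a minimal sketch of the idea, assuming each generated comment carries the URL of the best practice it cites and that users can mark a comment as helpful or unhelpful. The class names, the `min_votes` and `min_useful_ratio` thresholds, and the example URLs are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class UrlStats:
    """Feedback counters for one best practice URL (hypothetical structure)."""
    helpful: int = 0
    unhelpful: int = 0

    @property
    def total(self) -> int:
        return self.helpful + self.unhelpful

    @property
    def useful_ratio(self) -> float:
        # With no feedback yet, treat the URL as fully useful.
        return self.helpful / self.total if self.total else 1.0


class UrlSuppressor:
    """Drops comments whose best practice URL has poor user feedback.

    Both thresholds are illustrative assumptions, not values from the paper.
    """

    def __init__(self, min_votes: int = 20, min_useful_ratio: float = 0.5):
        self.min_votes = min_votes
        self.min_useful_ratio = min_useful_ratio
        self.stats = defaultdict(UrlStats)

    def record_feedback(self, url: str, helpful: bool) -> None:
        if helpful:
            self.stats[url].helpful += 1
        else:
            self.stats[url].unhelpful += 1

    def is_suppressed(self, url: str) -> bool:
        stats = self.stats.get(url)
        if stats is None or stats.total < self.min_votes:
            return False  # not enough evidence to suppress
        return stats.useful_ratio < self.min_useful_ratio

    def filter_comments(self, comments: list[dict]) -> list[dict]:
        """Keep only comments whose cited URL is not suppressed."""
        return [c for c in comments if not self.is_suppressed(c["url"])]


if __name__ == "__main__":
    suppressor = UrlSuppressor(min_votes=4, min_useful_ratio=0.5)
    for _ in range(4):
        suppressor.record_feedback("style/naming", helpful=False)
    suppressor.record_feedback("style/comments", helpful=True)
    candidates = [
        {"url": "style/naming", "text": "Rename this variable."},
        {"url": "style/comments", "text": "Clarify this comment."},
    ]
    for comment in suppressor.filter_comments(candidates):
        print(comment["text"])
```

Requiring a minimum number of votes before suppressing is a design choice in this sketch: it avoids muting a URL on the basis of one or two negative reactions, at the cost of showing a noisy rule for longer.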

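Relevance to CS146S

This is the only peer-reviewed academic paper in Week 7, and it provides an academic baseline for commercial tools such as Diamond, CodeRabbit, and Copilot Code Review. Google demonstrates three things that matter for the course: (1) LLMs really can cover the nuanced practices that linters miss; (2) a 40% resolution rate is a reasonable baseline, which the class can use as a reference point when evaluating AI reviewers; (3) the key to a successful deployment is not model quality but system engineering problems such as false positive suppression and the user feedback loop.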

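Takeaways for vibe coders

When choosing an AI reviewer, what matters is not how many bugs it catches but how easily its false positives can be turned off (whether it offers a suppression mechanism or per-rule mute). A resolution rate of 40-50% is normal; do not expect 100%. Self-hosting an LLM reviewer costs too much, so start with an off-the-shelf tool such as Diamond, CodeRabbit, or GitHub Copilot Code Review, and spend your effort on tuning prompts and rules.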

Original paper

AI-Assisted Assessment of Coding Practices in Modern Code Review (arXiv 2405.13565)