{"id":240,"date":"2026-05-16T13:00:30","date_gmt":"2026-05-16T13:00:30","guid":{"rendered":"https:\/\/abrarqasim.com\/blog\/ai-code-review-tools-what-i-run-before-merging-ai-written-prs\/"},"modified":"2026-05-16T13:00:30","modified_gmt":"2026-05-16T13:00:30","slug":"ai-code-review-tools-what-i-run-before-merging-ai-written-prs","status":"publish","type":"post","link":"https:\/\/abrarqasim.com\/blog\/ai-code-review-tools-what-i-run-before-merging-ai-written-prs\/","title":{"rendered":"AI Code Review Tools: What I Run Before Merging AI-Written PRs"},"content":{"rendered":"<p>Short version for the impatient: I run two things on every AI-written PR before I even read it myself, plus one human pass. Semgrep with a small custom ruleset, a second LLM as reviewer with a structured prompt, and a hard rule that I read every external URL and import path with my own eyes. If you want to know why I bother, read on.<\/p>\n<p>I almost shipped a phishing URL last month. The PR was a small one, mostly written by Cursor, fixing a broken redirect in a side project. The code looked fine. Tests passed. I was about to merge when I noticed the redirect target had a Cyrillic &ldquo;\u0430&rdquo; inside the domain. I&rsquo;m not going to publish what the URL was. I will say that the Stack Overflow answer the model had clearly cribbed from did not contain that character.<\/p>\n<p>Two days later I read the <a href=\"https:\/\/arxiv.org\/abs\/2509.02372\" rel=\"nofollow noopener\" target=\"_blank\">Scam2Prompt paper<\/a> and felt extremely seen. The authors ran developer-style prompts through GPT-4o, GPT-4o-mini, Llama-4-Scout, and DeepSeek-V3 and found that 4.24% of the resulting code contained a live scam or phishing URL. The benchmark they built on top of that, Innoc2Scam, gets the rate higher across several other models. None of those prompts were adversarial. They were the kind of thing I type into Cursor every day.<\/p>\n<p>So I tightened my workflow. Here&rsquo;s what I actually do now.<\/p>\n<h2 id=\"the-30-second-human-scan-i-do-first\">The 30-second human scan I do first<\/h2>\n<p>Before any tool runs, I look at five things in the diff, in order:<\/p>\n<ol>\n<li>Every external URL in the patch. I literally read each <code>http:\/\/<\/code> and <code>https:\/\/<\/code> string. Stupid simple, but it would have caught my Cyrillic incident.<\/li>\n<li>New imports and dependencies, especially ones I don&rsquo;t recognize by name. The typosquatted-package attack is older than ChatGPT, but AI is great at hallucinating package names that look real.<\/li>\n<li>Any new network call. A <code>fetch<\/code>, <code>axios<\/code>, <code>http.Get<\/code>, or <code>requests.post<\/code> that wasn&rsquo;t there before. If the model added a network call I didn&rsquo;t ask for, I want to know why.<\/li>\n<li>Hardcoded secrets. Keys, tokens, passwords. Models love inlining placeholder secrets that turn out not to be placeholders.<\/li>\n<li>Comments that hand-wave away risk. Things like <code>\/\/ safe to ignore<\/code>, <code>\/\/ works locally<\/code>, or <code>\/\/ fix this later<\/code> in code that touches auth or payments.<\/li>\n<\/ol>\n<p>That&rsquo;s it. Maybe 30 seconds for a small PR. It catches the dumbest 20% of issues for free and primes my brain to be skeptical of the rest.<\/p>\n<h2 id=\"layer-one-semgrep-with-a-small-custom-ruleset\">Layer one: Semgrep with a small custom ruleset<\/h2>\n<p><a href=\"https:\/\/semgrep.dev\" rel=\"nofollow noopener\" target=\"_blank\">Semgrep<\/a> is my main static analyzer. The community rules are fine. The magic, for me, is writing a small number of my own rules tuned to the bugs I keep seeing. Two I find myself reaching for most often:<\/p>\n<pre><code class=\"language-yaml\">rules:\n  - id: suspicious-unicode-in-url\n    message: URL contains a non-ASCII character (possible homoglyph attack)\n    languages: [generic]\n    severity: WARNING\n    patterns:\n      - pattern-regex: 'https?:\/\/[^\\s&quot;'']*[^\\x00-\\x7F][^\\s&quot;'']*'\n\n  - id: hardcoded-redirect-domain\n    message: Hardcoded external domain in a redirect target\n    languages: [javascript, typescript, python, go]\n    severity: WARNING\n    pattern-either:\n      - pattern: redirect(&quot;$URL&quot;)\n      - pattern: location.href = &quot;$URL&quot;\n<\/code><\/pre>\n<p>These two rules took about an hour to write and tune, and they catch a class of issue that almost no off-the-shelf tool flags. Semgrep runs in pre-commit and again in CI on every PR. Total CI cost is a few seconds.<\/p>\n<p>I do not run &ldquo;AI-powered&rdquo; static analyzers as part of CI. I tried two of the big ones last year. Both had false-positive rates high enough that engineers learned to ignore the warnings, which is worse than not running them at all.<\/p>\n<h2 id=\"layer-two-a-second-llm-reviews-the-first-ones-code\">Layer two: a second LLM reviews the first one&rsquo;s code<\/h2>\n<p>This is the controversial layer. I have one AI assistant write the code (usually Cursor or Claude Code), and a different model review the diff before I look at it.<\/p>\n<p>The prompt I use is boring on purpose:<\/p>\n<pre><code>You are reviewing a code change written by another AI assistant.\nDo not rewrite the code. Do not summarize it.\n\nList only the following, with file paths and line numbers:\n1. External URLs added or modified. For each, state whether it\n   points to a domain you do not recognize.\n2. New dependencies or imports. Flag any package name that looks\n   like a typo of a popular package.\n3. Network calls or shell-out calls that were not present before.\n4. Inputs that flow to a sink (file write, DB query, exec) without\n   an obvious validation step.\n5. Tests that assert nothing or assert on the wrong thing.\n\nIf you find none of the above in a category, say &quot;none&quot; for that\ncategory. Do not invent issues.\n<\/code><\/pre>\n<p>The &ldquo;do not invent issues&rdquo; line cuts the rate of hallucinated bugs by a lot. The structured categories keep the reviewer focused on the boring stuff a human is bad at scanning for.<\/p>\n<p>Why a different model? Because the two systems make different mistakes. If Cursor is running on Claude in my setup, I will not ask the same Claude to review it. I use GPT-5 or DeepSeek for the second pass, depending on which I have credits with. Agreement between models on what&rsquo;s a real issue is a useful signal. If both flag the same line, I read it carefully.<\/p>\n<p>This is also why I don&rsquo;t fully trust <a href=\"https:\/\/docs.github.com\/en\/code-security\" rel=\"nofollow noopener\" target=\"_blank\">GitHub&rsquo;s built-in code security tooling<\/a>. The signal is real, but it&rsquo;s one signal from one provider trained on roughly the same internet as the model that wrote your code.<\/p>\n<h2 id=\"layer-three-i-still-read-the-diff\">Layer three: I still read the diff<\/h2>\n<p>I know. Boring. The layers above are pre-filters, not replacements.<\/p>\n<p>What I&rsquo;m actually doing when I read an AI-written diff is different from when I read a human one. With a human PR I&rsquo;m checking intent and style. With an AI PR I&rsquo;m doing forensics. Did the model invent an API that doesn&rsquo;t exist? Did it answer a question I didn&rsquo;t ask? Did it copy a pattern from a different version of the framework I&rsquo;m on?<\/p>\n<p>The thing that helps most is having a clear mental model of which parts of a codebase the AI is good at and which it isn&rsquo;t. Cursor is fine on glue code. It&rsquo;s bad on anything to do with auth, payments, or our internal queue abstraction. I&rsquo;m not going to merge a PR touching those without reading every line myself, no matter how green the CI run is. I wrote about how I pick between assistants in <a href=\"https:\/\/abrarqasim.com\/blog\/cursor-vs-copilot-vs-claude-code-2026-what-i-reach-for\" rel=\"noopener\">my Cursor vs. Copilot vs. Claude Code post<\/a> if you want the long version.<\/p>\n<h2 id=\"what-still-slips-through\">What still slips through<\/h2>\n<p>I want to be honest about the gaps, because this stuff sounds tidier on a blog than it does in practice.<\/p>\n<p>Tests that pass for the wrong reason are still a problem. The model writes a function, writes a test for it, and the test is shaped exactly like the function. Both are wrong in the same direction. CI is green and I don&rsquo;t catch it until something breaks later. My current half-fix is to ask the reviewer LLM to verify the test, but that&rsquo;s the same model-bias problem in a different coat.<\/p>\n<p>Prompt injection in code comments is another thing none of my layers catch reliably. If an AI assistant pulls context from an upstream README or issue that contains an instruction, the resulting code can get steered by an attacker. I haven&rsquo;t been bitten by it yet, but it feels like a matter of time.<\/p>\n<p>Subtle data-flow bugs across files are the third gap. Semgrep can do taint analysis for some languages, but in Go and TypeScript with our messy codebase it&rsquo;s noisy enough that I&rsquo;ve turned it off for now. I just accept that this is the class of bug that will reach production and we&rsquo;ll catch it with observability instead of review.<\/p>\n<h2 id=\"a-pr-checklist-i-actually-use\">A PR checklist I actually use<\/h2>\n<p>For my own sanity, I added the following to the PR template on a couple of projects. The line that matters most is the second one.<\/p>\n<pre><code class=\"language-markdown\">## AI-assisted PR checklist\n- [ ] I have read every external URL in this diff\n- [ ] I have eyeballed every new import or dependency\n- [ ] I ran Semgrep locally (`make sec`)\n- [ ] I ran a second-model review and pasted the output below\n- [ ] I have read the diff end-to-end myself\n<\/code><\/pre>\n<p>Reviewers do not approve until the checkboxes are real. Yes, you can lie on a checkbox. I have to trust my team. But forcing the second-model output to be pasted into the PR makes it inconvenient to skip the step, which is most of what a checklist does anyway.<\/p>\n<h2 id=\"what-id-do-this-week-if-i-were-you\">What I&rsquo;d do this week if I were you<\/h2>\n<p>Pick one rule from the Semgrep snippet above, drop it into your CI, and see what fires on your last month of merged PRs. Even if you don&rsquo;t keep the rule, the audit is interesting. The first time I ran the homoglyph rule across a backlog at a previous client I found three URLs with mixed scripts. None of them had been exploited. All of them had been merged by competent engineers.<\/p>\n<p>If you build a lot of LLM-driven tooling, the case for taking this seriously is in the <a href=\"https:\/\/arxiv.org\/abs\/2509.02372\" rel=\"nofollow noopener\" target=\"_blank\">Scam2Prompt numbers<\/a> and your own gut. If you&rsquo;re a one-person side project shop, the 30-second human scan plus a Semgrep run probably gets you 80% of the way there.<\/p>\n<p>I write up this kind of stuff alongside the rest of my engineering work on <a href=\"https:\/\/abrarqasim.com\/work\" rel=\"noopener\">my portfolio site<\/a>. If you&rsquo;ve found a review layer that catches things mine doesn&rsquo;t, I&rsquo;d love to hear about it.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I almost shipped a phishing URL inside an AI-generated PR. Here are the AI code review tools and checks I now run before any AI-written diff merges.<\/p>\n","protected":false},"author":2,"featured_media":239,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"rank_math_title":"","rank_math_description":"I almost shipped a phishing URL inside an AI-generated PR. Here are the AI code review tools and checks I now run before any AI-written diff merges.","rank_math_focus_keyword":"ai code review tools","rank_math_canonical_url":"","rank_math_robots":"","footnotes":""},"categories":[4,184],"tags":[28,279,277,98,154,278,280],"class_list":["post-240","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","category-developer-tools","tag-ai","tag-ai-coding-assistants","tag-code-review","tag-developer-tools","tag-security","tag-semgrep","tag-static-analysis"],"_links":{"self":[{"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/posts\/240","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/comments?post=240"}],"version-history":[{"count":0,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/posts\/240\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/media\/239"}],"wp:attachment":[{"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/media?parent=240"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/categories?post=240"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/abrarqasim.com\/blog\/wp-json\/wp\/v2\/tags?post=240"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}