There is now a recognizable genre of blog post in which a developer explains, more in sorrow than in anger, why they're leaving GitHub. This one is also that. Mitchell Hashimoto wrote his last week.[1] Andrew Kelley moved Zig off in November.[2][3] Drew DeVault has been writing variations of the argument since at least 2020.[4][5] Bradley Kuhn and the Software Freedom Conservancy have been calling for the move since 2022.[6] nicole tietz-sokolskaya did the individual-developer version that same year.[7] As Lord.io put it earlier this year, "if you subscribe to many programming blogs, chances are you've come across a post describing someone's move off GitHub."[8] I am late to a chorus, not starting one. The only reason to add another voice is that the chorus is the point.
What I do have to add is what came after.
I'd been quietly relying on CodeRabbit on my personal GitHub repos for a couple of weeks, after colleagues who'd been using it for months on private work pointed me to it. It is, plainly, an excellent product — visibly better and substantially cheaper than the homegrown Claude/OpenAI-reviewer-in-CI rig I'd been operating, where the cloud-API bill was creeping into territory no individual ought to be paying out of pocket for hobby code. CodeRabbit's pricing is fair. I'd gotten to the point of quietly trusting it.
CodeRabbit doesn't run on Forgejo. Nothing comparable does. The major commercial AI reviewers — CodeRabbit, Greptile, Graphite's Graphite Chat (formerly Diamond), Cursor's Bugbot, and Copilot's review feature — collectively support every other significant forge: GitHub, plus various combinations of GitHub Enterprise Server, GitLab.com, self-managed GitLab, Azure DevOps, and Bitbucket. CodeRabbit's documentation lists seven supported platforms,[9] which appears to be the most coverage of any of them. None runs on Gitea or Forgejo. The closest open-source alternative, Qodo's PR-Agent, has a Gitea provider and may in principle work against Forgejo (which is a hard fork of Gitea), though I haven't tested it. PR-Agent's self-description is that "each tool uses a single LLM call (~30 seconds, low cost)," with no bundled linter pipeline, no sandboxed tool execution, no persistent learnings store, and no separate verification pass. That's a perfectly reasonable design for a project that values simplicity. It isn't the shape of the systems I missed.
So I typed something like "what would it take to reimplement CodeRabbit for Forgejo" into Claude Code, expecting to come back to a list of reasons why this was a multi-quarter team-of-engineers project I had no business attempting alone. Then I went to bed. Twelve hours later I woke up to a working PR reviewer running against this very repository, posting inline comments on its own commits.
This post covers four things: why I'm leaving GitHub, in conversation with the writers above; how a language model reconstructed CodeRabbit's architecture from its team's own public engineering writing; what came out of that, which is auto_review; and the techniques CodeRabbit's team got right, which I had to put together to make any of it work and which generalize well beyond code review. The GitHub argument comes first. If you only want the architectural lessons, jump to "What Generalizes."
Why I'm Going
The structural argument has been made better, and earlier, by other people. Cory Doctorow named the pattern: "enshittification," the arc by which platforms first treat users well, then abuse users to please business customers, then abuse business customers to please shareholders.[10] He has since explicitly licensed the term beyond digital platforms.[11] pablotron applied it directly to GitHub on April 30: "Degrading the quality of a platform to maximize short-term shareholder profit is textbook enshittification."[12] Jonas Hietala named it on April 28.[13] Jared Norman put the consequence into the cleanest one-line statement of what GitHub feels like now, in his August 2025 piece "The 'Git' 'Hub' Part Is No Longer the Product."[14] The hosting-and-collaboration layer is no longer what GitHub is optimizing. Copilot is.
I won't relitigate that read; the writers above did it well. What I'll do is note the public-record evidence I read and acted on, because the timing is part of why I'm writing now.
The decline shows up early. A June 2020 analysis by StatusGator,[15] comparing the twenty-four months before Microsoft's June 2018 acquisition announcement to the twenty-four months after, found a 41% increase in published status-page incidents and a 97% increase in incident-minutes — 12,074 minutes of downtime in the post-acquisition window versus 6,110 in the pre-acquisition window. The trend has not reversed. According to a reconstructed monthly-uptime tracker maintained by Marek Šuppa,[16] cited by The Register,[17] GitHub's monthly uptime dropped below 90% in 2025 and has continued to slide; April 2026 is reported below 85%. On April 28, GitHub's CTO Vlad Fedorov posted a public statement[18] acknowledging that recent incidents "are not acceptable" and apologizing for the impact, while attributing the underlying load increase to "agentic development workflows" that have "accelerated sharply" since late 2025. That explanation isn't wrong, but it isn't a defense either. "Our customers are using our product more" describes GitHub's job, not an excuse for failing at it. And this is Microsoft, not a seed-funded startup encountering scale for the first time. The institutional memory and operational playbooks needed to anticipate a load curve like this are supposed to be there.
The reliability story is, in some ways, the least of it. Consider the week immediately before I started writing this:
- April 23: a merge-queue regression silently reverted code in 2,092 pull requests across 230 repositories.[19][18] Squash-merge groups containing more than one PR produced incorrect three-way merges; previously merged commits were retroactively un-merged into the resulting commit. The bug landed because a feature flag's gating was incomplete. It went undetected by GitHub's own monitoring because the symptom wasn't unavailability but data-integrity corruption, and was only found when customers started filing tickets. The mental contract you have with your version-control host — that a merge means what the merge says — broke for those teams, retroactively, without warning.
- April 27 through May 1: GitHub's Elasticsearch subsystem fell over. Per GitHub's own April 28 statement,[18] the cluster "became overloaded (likely due to a botnet attack) and stopped returning search results." The user-visible recovery,[20] spanning roughly 62 hours from late on April 28 until 04:15 UTC on May 1, required GitHub to reindex pull-request data and repair missing and stale search records. For nearly three days, any UI surface that depended on PR search returned incomplete or empty results. The data wasn't lost, just unfindable, which on a forge is functionally the same thing for triage, review, and CI.
- April 29: the TanStack supply-chain incident became public.[21] According to Socket's research, the holder of the unscoped tanstack name on npm (a registry GitHub has owned since its 2020 acquisition) used it to publish four malicious versions (2.0.4 through 2.0.7) in a 27-minute window, each running a postinstall script that exfiltrated .env, .env.local, .env.production, and similar files to a Svix-hosted endpoint operated by the attacker. Socket's reporting includes a public statement from Tanner Linsley, maintainer of the legitimate @tanstack/* packages, that the squatter had previously demanded $10,000 from him and that TanStack had "repeatedly tried, unsuccessfully, to get @npmjs to address the situation." The package shipped, developers installed it, secrets walked out the front door.
You can read those three things as bad luck in a single bad week. I read them as a pattern. A registry that won't enforce its own brand-impersonation rules until something explodes. A forge whose merge primitives can corrupt without alerting. A search system that can fall over for half a week. Each is a single failure. Together they describe an institution whose ratio of things shipping to things being maintained has visibly slipped, on the watch of an owner whose other recent flagships, Windows and Xbox, are getting the same kind of public eulogy from longtime users.[22] The simplest explanation is the older one, the one Doctorow and Norman and pablotron already named: large companies acquire beloved products, optimize them for the acquirer's quarterly story rather than the user's daily one, and the slope only goes one way.
I am, for what it's worth, GitHub user ID 1654. I signed up on February 28, 2008, while the site was still in beta and roughly six weeks before its public launch in April 2008.[23] That isn't a credential, it's a tenure. Hashimoto opened his post by identifying as user 1299; mine is a few hundred numbers later. He framed his post as a breakup. I don't have anything to add to that framing that he didn't say better.[1][24] What I will say is that the four-digit-user-ID people are very much saying this on the record, by name — Hashimoto, Kelley, DeVault, Kuhn, tietz-sokolskaya, Reece, Peters, Hietala, pablotron, Quirk[25][26][27] — and one more voice doesn't shift the conversation, but the conversation is the mechanism. So: one more voice.
What finally moved me to act was nothing dramatic. The bill no longer matched what I was getting. The platform had grown bloated in a way I no longer wanted to navigate. And I'd run out of patience for the uptime and security failures, which had stopped looking like incidents and started looking like a posture. I'd already started the migration in the days before the merge-queue regression and the Elasticsearch outage; that week made it easier to keep going.
So I stood up Forgejo on a NixOS box of mine. Forgejo is a community-led fork of Gitea, created in late 2022 after Gitea's then-lead maintainer transferred the project's trademarks and operations into a for-profit corporation and began moving toward an open-core model.[28] The non-profit Codeberg e.V. is now the de jure lead maintainer. The politics of the project line up with mine: federation-friendly, sustainable governance, no VC pressure, and a structural answer to "what happens if the maintainer pivots to enterprise SaaS." The fork itself is the answer. The setup, declarative and reproducible from a flake, is small enough I won't describe it here. My git lives at git.johnwilger.com now. The migration is in progress, not finished.
The catch was that I'd just made my entire toolchain unrecognizable to every closed-source AI service I'd been using on top of GitHub. The things that had been quietly making me better at my job didn't follow me out the door.
The Question I Expected To Lose
My initial prompt was lazy. I asked something to the effect of: is it feasible to build a CodeRabbit-equivalent for Forgejo, and if so, what would the architecture look like? I was expecting the model to push back on scope. Instead, it asked permission to dig, and then it dug.
What it surfaced over the next hour was the extent of CodeRabbit's own public engineering writing. Their founders and engineers had, by the time I asked, given a lot of public interviews about exactly how their system worked. There was a Software Engineering Daily interview with Harjot Gill[29] walking through the multi-model pipeline architecture, the reasoning trail, and the context-window strategy. There was a Google Cloud case study[30] detailing the use of Cloud Run for sandboxed execution: Cloud Tasks for queuing, Jailkit for second-layer process isolation inside each microVM, dedicated service identities, "20+ linters and security scanners." That same post — titled "50% faster merge and 50% fewer bugs" — was also where the headline outcome numbers came from. There was a LanceDB case study[31] on the use of embedded vector storage to query "tens of thousands of tables" of PR, issue, code-dependency, and "tribal learnings" data with sub-second latency. There was an OpenAI customer story (titled "Shipping code faster with o3, o4-mini, and GPT-4.1") naming the specific model mix CodeRabbit was running on.[32] And there was Kudelski Security's August 2025 write-up[33] of a vulnerability they had responsibly disclosed to CodeRabbit in January 2025, presented at Black Hat USA, in which the technical detail inadvertently confirmed dozens of CodeRabbit internals — particularly that at the time of disclosure, Rubocop was running outside the sandbox, and the .rubocop.yml require: - ./ext.rb extension directive was the vector the researchers used to load arbitrary Ruby and, by way of an environment variable containing CodeRabbit's GitHub-App private key, demonstrate read and write access to approximately one million repositories. CodeRabbit's public response[34] reports that Rubocop was disabled within an hour, credentials rotated within three, the tool relocated into the sandbox within twelve, and that the post-incident investigation found no evidence customer data was accessed.
In other words, you could triangulate the system from public engineering writing. Not perfectly, not down to the last prompt, but down to the architectural decisions that mattered: the multi-stage pipeline rather than a pure ReAct loop, the cheap-tier triage model, the strict-JSON-schema review output with a self-heal validator, the cheap-tier verifier that drops unfounded findings, the per-repo "learnings" memory store, the bundled linter fan-out, the inline-comment posting flow against the host's review API. The model assembled a feasibility study — phased build plan, cost estimates, threat model, file-by-file create order, explicit out-of-scope list. I read it skeptical, then less so once I'd checked the references it had pulled. They were real. The synthesis was sharp. The plan was something I could critique on its own terms, push back on, sharpen.
The phased plan estimated something like eighteen to twenty-four weeks of focused work to a CodeRabbit-equivalent, with a useful MVP at five weeks. That's roughly what a careful senior engineer would produce. It was the right answer to the wrong question. The right question turned out to be: how fast can we put a thin end-to-end thread together and find out if any of this actually works?
Twelve Hours
I started the workspace at 7:51 PM on April 30. The first commit lays down the eleven-crate Rust workspace, the Dockerfile, the docker-compose template, the Forgejo Actions CI workflow, and the feasibility study itself, with a green cargo check.
By 7:55 PM, the Forgejo client was real: get_pr_diff, list_changed_files, create_review, post_commit_status, with wiremock-backed integration tests covering the API surface. By 8:00 PM, the LLM router was wired against an OpenAI-compatible provider, with the tier-routing abstraction in place so a deployment can mix a local Ollama embedding model with a hosted reasoning model behind one config. By 8:06 PM, the strict-JSON-schema reviewer was emitting validated review comments through a self-heal loop, with the test commit explicitly recording that fifteen failing tests had landed before any of the implementation. By 8:18 PM, the orchestrator was wired through the gateway and a webhook actually triggered a review pipeline.
From git init to a webhook-triggered review pipeline in roughly twenty-seven minutes. The next several hours added the workspace clone, the linter fan-out, language-specific routing, the verification pass, and a lot more tests. By the next morning, auto_review was reviewing its own pull requests, and it has been ever since. As of this writing it has reviewed PRs #32 through #38 of its own repository, posting inline comments and structured findings against its own diffs.
The cost of getting from "is this even possible" to "I am looking at it work" collapsed to under a day. The feasibility study was the first artifact, not the last. The plan stayed real — phased, defensible, still being worked through — but I no longer had to wait until I'd finished it to find out whether the foundational claims held. I could ship the proof and the plan in the same evening, and the plan stopped being speculation and became a punch list.
What's Actually Running
The system is alpha. Consider this a status update, not a marketing page. As of today it does roughly what the feasibility study said it would:
- A gateway binary takes Forgejo webhooks, HMAC-verifies them with the shared secret, and enqueues jobs.
- An orchestrator runs each PR through a deterministic pipeline: shallow clone, LLM-driven triage that skips lockfile-only PRs and routes trivial files away from the reasoning tier, fan-out across 45 bundled linters (each one sandboxed via Podman with --network=none --read-only --cap-drop=ALL --security-opt=no-new-privileges and a wall-clock budget — the Kudelski lesson encoded as a hard architectural rule), tree-sitter symbol extraction, embedding-based retrieval over a persistent learnings store, a strict-JSON-schema review prompt with self-heal retry, and a cheap-tier verifier that drops findings the LLM can't ground in cited code.
- An @auto_review chat handler accepts natural-language commands on inline review threads — re-review, remember <text>, forget <id>, autofix, docstring, tests, plus free-form questions — with a polling fallback for the gap where Forgejo (inheriting from Gitea) doesn't yet ship a pull_request_review_comment webhook. The open feature request is Gitea issue #26023.[35] Without the webhook, the bot polls.
- A bench CLI subcommand replays PR fixtures through the review path against a chosen model and scores precision and recall against labeled ground truth, so model upgrades and prompt changes can be measured against a regression corpus instead of vibes.
- The whole thing builds and tests through one nix flake check, and CI runs the same derivations bit-for-bit.
There are around a thousand tests. The pipeline is reviewing its own PRs. The threat model and ADRs are in docs/.
What Generalizes
This section is a case study, not a discovery log. Most of the techniques below are how I'd build any production agentic system, and most of them weren't original to the CodeRabbit team either. What that team did was put the combination together at the scale of a real product, against the trade-offs I would have made myself, and write about it in enough detail that the architecture is reproducible from public sources. I'd recommend reading their engineering writing directly rather than this summary; the list below is what struck me as worth highlighting from the reimplementation, with credit to them throughout. The one part of the architecture that actually changed how I think about agentic systems is the section on context orchestration further down. If you only have time for one item, that's the one.
1. A multi-stage pipeline beats a pure ReAct loop, by a lot. The naïve agentic shape is "give the LLM a set of tools, prompt it with a goal, let it loop until it claims it's done." It's also the shape that runs five minutes per request, costs ten times what it should, returns answers you can't reproduce, and falls over the moment one of its turns hallucinates a tool call. CodeRabbit's writing is firm on this. Production agentic systems should be deterministic pipelines with agentic spans embedded inside them, not agents with deterministic helpers hanging off them. In auto_review, the orchestrator does the routing, the cloning, the linter fan-out, the context curation, the dedup, and the result-posting in straight Rust. The LLM is invoked only at the points where actual judgement is required: triage, summarization, review, verification. Each invocation has a defined input shape, a defined output shape, and a defined retry policy. The "agent" lives inside one or two stages of the pipeline, not draped over the whole thing. This makes the system cheaper, faster, debuggable, and rerunnable. It also lets you mix model tiers (see below), which a pure ReAct loop structurally can't do, because the same loop is doing all the work.
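To make the shape concrete, here is a minimal sketch in the project's own Rust. The types and stage functions (run_linters, triage_says_review, and so on) are hypothetical stand-ins, not auto_review's actual API; the point is where the LLM calls sit relative to the deterministic spine.

```rust
struct Diff;        // the parsed PR diff
struct LintReport;  // aggregated output of the sandboxed linter fan-out
struct Finding;     // one candidate review comment

trait Llm {
    // One bounded call: defined input shape, defined output shape. No open loop.
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

fn review_pipeline(llm: &dyn Llm, diff: &Diff) -> Result<Vec<Finding>, String> {
    // Deterministic stage: plain Rust, reproducible, cheap to rerun.
    let lint = run_linters(diff)?;

    // Agentic span #1: triage. The model renders one judgement, nothing more.
    if !triage_says_review(llm, diff)? {
        return Ok(vec![]);
    }

    // Agentic span #2: the review itself, against a strict output schema (item 3).
    let candidates = generate_findings(llm, diff, &lint)?;

    // Deterministic again: dedup and ordering stay out of the model's hands.
    Ok(dedup(candidates))
}

// Stubs so the sketch stands alone; each is a real pipeline stage in practice.
fn run_linters(_: &Diff) -> Result<LintReport, String> { Ok(LintReport) }
fn triage_says_review(_: &dyn Llm, _: &Diff) -> Result<bool, String> { Ok(true) }
fn generate_findings(_: &dyn Llm, _: &Diff, _: &LintReport) -> Result<Vec<Finding>, String> { Ok(vec![]) }
fn dedup(c: Vec<Finding>) -> Vec<Finding> { c }
```

Every failure in this shape has a stage name attached to it, which is most of what "debuggable" means in practice.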
2. Two-tier model routing buys back a substantial fraction of inference cost. Most of what looks like "agent work" isn't actually hard. It's classification, summarization, routing, format conversion. None of it needs a reasoning model. All of it is well within the capability of a small fast model that costs an order of magnitude less per token. The CodeRabbit team writes about this directly; their case studies argue that triaging trivial cases away from the reasoning tier is where most of the cost reduction comes from. In auto_review, the cheap-tier triage classifier looks at every PR and decides which files are trivial (lockfile updates, formatting-only changes, generated code) versus which warrant the reasoning tier. Trivial files are reviewed by the cheap model or skipped; the rest get the expensive treatment. The same pattern recurs at verification: a cheap-tier model passes over each finding the reasoning tier produced and asks, "does the cited code actually do what the finding claims?" The cheap model's errors are bounded and asymmetric — it drops false positives, which is what you want. The reasoning tier is no longer being asked to spend tokens on cases beneath its dignity. A cheap classifier deciding whether to escalate beats a single expensive model handling everything.
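A sketch of what that routing decision can look like, with illustrative skip heuristics rather than auto_review's actual rules:

```rust
// Hypothetical per-file routing: a deterministic fast path skips files no
// human would review by hand, and a cheap-tier verdict decides what escalates.
enum Route {
    Skip,            // spend zero tokens
    CheapReview,     // small fast model
    ReasoningReview, // expensive tier, only where judgement is needed
}

fn route_file(path: &str, cheap_model_says_trivial: bool) -> Route {
    // Deterministic guard: lockfiles and minified bundles never reach a model.
    if path.ends_with(".lock") || path.ends_with(".min.js") {
        return Route::Skip;
    }
    // Otherwise the cheap classifier's verdict picks the tier.
    if cheap_model_says_trivial {
        Route::CheapReview
    } else {
        Route::ReasoningReview
    }
}
```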
3. Strict-JSON-schema output with a bounded self-heal loop is the only way to talk to a model in production. "JSON mode" alone isn't enough. You want a real schema validator on the way out (pydantic, serde, whatever your language gives you), and you want the validation error fed back to the model on retry. In auto_review, every LLM call that produces structured output is wrapped in a self-heal pattern: validate; on failure, feed the failure message back as the next user turn; retry up to N times; then fail loudly. Without the loop, structured-output parse errors will dominate your error budget. With it, they're rare. The bounded retry is non-negotiable; you don't want unbounded retry budgets in a paid-per-token system. The loud failure when the budget is exhausted is the right behavior. Better to surface a real error than ship corrupted state. The general principle: the LLM is an external system at a trust boundary, and trust boundaries get parsers, validators, and retries.
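A generic version of the loop, assuming serde and serde_json for validation; the llm closure stands in for whatever issues the chat-completion call:

```rust
use serde::de::DeserializeOwned;

/// Bounded self-heal: validate, feed the parse error back, retry, fail loudly.
fn self_heal<T: DeserializeOwned>(
    llm: impl Fn(&str) -> String,
    prompt: &str,
    max_retries: usize,
) -> Result<T, String> {
    let mut next = prompt.to_string();
    for _ in 0..=max_retries {
        let raw = llm(&next);
        match serde_json::from_str::<T>(&raw) {
            Ok(parsed) => return Ok(parsed),
            // The validator's error message becomes the next user turn.
            Err(e) => {
                next = format!(
                    "{prompt}\n\nYour previous output failed validation: {e}.\n\
                     Return ONLY valid JSON matching the schema."
                )
            }
        }
    }
    // Budget exhausted: surface a real error rather than ship corrupted state.
    Err(format!("invalid structured output after {max_retries} retries"))
}
```

Because it's generic over the output type, the same wrapper serves every structured call in the pipeline.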
4. Verification belongs in a separate pass, run by a different (cheaper) model. The verifier-as-separate-pass design is the part of the CodeRabbit architecture that's most counterintuitive on first read, particularly for anyone coming from a single-model mindset. Most people start with the intuition that verification is part of the generation step — the same model that produces the answer should also check it. That turns out to be wrong in two ways. The model that produced the answer is biased toward defending it. And generation is the expensive step while verification is the cheap one, so you want to do them with different tools. In auto_review, the reasoning tier produces a list of candidate findings against a strict schema. The cheap tier then takes each finding individually, in isolation, and is asked: "given the cited file, line range, and snippet, does the claim in the finding actually hold?" Findings that fail this check get dropped. Hallucinated criticisms — the dominant failure mode of LLM code review — disappear before they're posted. In any agentic system, separate the generative model from the validating model, and make the validator as adversarial as you can afford. Skepticism is cheap; generation is expensive. Front-load the skepticism.
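A sketch of the pass, with illustrative field names; the real prompt and grounding logic are more involved:

```rust
// Hypothetical finding shape: each carries the evidence it claims to rest on.
struct Finding {
    file: String,
    line: u32,
    claim: String,
    cited_snippet: String,
}

fn verify_findings(
    cheap_llm: impl Fn(&str) -> String,
    candidates: Vec<Finding>,
) -> Vec<Finding> {
    candidates
        .into_iter()
        .filter(|f| {
            let prompt = format!(
                "File {} line {}:\n{}\n\nClaim: {}\n\n\
                 Does the cited code actually support this claim? Answer YES or NO.",
                f.file, f.line, f.cited_snippet, f.claim
            );
            // Asymmetric on purpose: anything the verifier can't ground is
            // dropped, trading a little recall for a lot of precision.
            cheap_llm(&prompt).trim().eq_ignore_ascii_case("yes")
        })
        .collect()
}
```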
5. Context orchestration is the real engineering problem. This is the lesson I'd flag if you only have time for one. The CodeRabbit team's writing on how they decide what data to put in front of the model — from which sources, in which order, at which stage — is the most coherent statement of this I've found in the public agentic-systems literature. "Just embed the codebase, pull top-K, and prompt the model" is the demo version of retrieval-augmented generation, and it's maybe a third of what a production system actually needs. The LanceDB case study[31] makes the point clearly: the prompt that goes to the reasoning model isn't a chunk of vector-retrieved text, it's a curated assembly from structurally different sources. auto_review's context-curation stage reflects this directly. The prompt is built from tree-sitter symbol queries (better than embedding retrieval for "where is this function defined and where is it used"), embedding retrieval over a persistent learnings store (capturing team-specific judgement no public model has ever seen), a co-change graph (surfacing files that historically move together), and the diff itself. Pure vector retrieval over the codebase as a flat blob is shallow and noisy; the production pattern is multi-modal retrieval converging on a curated context the generative model can use. Picking the right retrieval modality for each kind of question the model needs answered, then composing them into a single prompt that fits the context budget, is where the gap between a demo and a working product is widest.
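A sketch of the assembly step. The source names follow the list above, and the characters-divided-by-four token estimate is a placeholder for a real tokenizer:

```rust
// Budgeted, priority-ordered composition of structurally different sources.
struct Budget {
    tokens_left: usize,
}

impl Budget {
    fn admit<'a>(&mut self, section: &'a str) -> Option<&'a str> {
        let cost = section.len() / 4; // crude token proxy; use a tokenizer in practice
        if cost <= self.tokens_left {
            self.tokens_left -= cost;
            Some(section)
        } else {
            None // over budget: lower-priority sources are dropped whole, not truncated
        }
    }
}

fn assemble_context(
    diff: &str,        // always first: the thing under review
    symbols: &str,     // tree-sitter definitions/usages for changed symbols
    learnings: &str,   // similarity hits from the learnings store
    co_changes: &str,  // files that historically change together
    token_budget: usize,
) -> String {
    let mut budget = Budget { tokens_left: token_budget };
    [diff, symbols, learnings, co_changes]
        .into_iter()
        .filter_map(|s| budget.admit(s))
        .collect::<Vec<_>>()
        .join("\n\n---\n\n")
}
```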
6. Memory is a separate store from context, and it must be small. A "learnings" store — per-context, structured, persistently writable, queryable by similarity — is what makes the agent get better over time without retraining anything. In auto_review, the chat handler accepts remember <text> and forget <id> so a maintainer can teach the bot about local conventions, and the learnings are surfaced into the review prompt for future PRs in that repo. The total token count of the learnings store added to a given prompt is capped low, single-digit kilobytes, because more is worse, not better. Agents need memory, but memory isn't context window. Structured, named, scoped, queryable, deletable memory is a different problem from "stuff more of the chat into the prompt," and the two shouldn't be conflated. The systems that get this right are the ones where memory is a first-class object the user can audit and edit.
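A minimal sketch of such a store, with hypothetical names and an in-memory map standing in for real persistence:

```rust
use std::collections::BTreeMap;

const MAX_LEARNINGS_BYTES: usize = 4 * 1024; // single-digit kilobytes, by design

// Memory as a first-class object: named, scoped, auditable, deletable.
#[derive(Default)]
struct Learnings {
    next_id: u64,
    entries: BTreeMap<u64, String>,
}

impl Learnings {
    fn remember(&mut self, text: &str) -> u64 {
        self.next_id += 1;
        self.entries.insert(self.next_id, text.to_string());
        self.next_id // the id a maintainer can later `forget`
    }

    fn forget(&mut self, id: u64) -> bool {
        self.entries.remove(&id).is_some()
    }

    /// What actually reaches the prompt: relevance-ranked ids, hard-capped.
    fn for_prompt(&self, ranked_ids: &[u64]) -> String {
        let mut out = String::new();
        for id in ranked_ids {
            if let Some(text) = self.entries.get(id) {
                if out.len() + text.len() > MAX_LEARNINGS_BYTES {
                    break; // more is worse, not better
                }
                out.push_str(text);
                out.push('\n');
            }
        }
        out
    }
}
```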
7. Sandbox the tools or you have a remote-code-execution vulnerability. This is the Kudelski lesson, generalized. The moment your agent can issue tool calls that execute code (bash, eval, "run the linter on this PR's config file"), you're accepting attacker-controlled input through that surface, because any user who can submit a PR can put adversarial content into a file the tool will then process. The case study CodeRabbit and Kudelski Security have both publicly written up: at the time of the January 2025 disclosure, Rubocop was running outside CodeRabbit's sandbox, and Rubocop's documented require: - ./ext.rb extension directive in .rubocop.yml loaded arbitrary Ruby, which Kudelski's researchers used to extract CodeRabbit's GitHub-App private key and demonstrate write access to approximately a million repositories. CodeRabbit's response was fast and detailed and they've written it up publicly.[34] In auto_review, every linter invocation and every LLM-issued shell call goes through ar-sandbox, which in production runs Podman with --network=none --read-only --cap-drop=ALL --security-opt=no-new-privileges and a capped wall-clock budget. There's a less-safe DirectSandbox available for local development, but the gateway refuses to start with it unless an operator explicitly opts out at boot. Sandbox at the OS level, not at the prompt level. "We told the model not to do that" isn't a security control; the kernel is.
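Here is roughly what that invocation looks like as a process call, assuming podman's documented flags (including --timeout for the wall-clock budget); production code would also pin the image digest and cap captured output:

```rust
use std::process::{Command, Output};
use std::time::Duration;

// A sketch of a sandboxed linter run: the flags from the post, expressed
// as a single podman invocation. Names are illustrative.
fn run_linter_sandboxed(
    image: &str,
    checkout: &str,
    budget: Duration,
) -> std::io::Result<Output> {
    Command::new("podman")
        .arg("run")
        .arg("--rm")
        .arg("--network=none")                   // no egress, ever
        .arg("--read-only")                      // the container filesystem is immutable
        .arg("--cap-drop=ALL")                   // no Linux capabilities
        .arg("--security-opt=no-new-privileges") // no escalation via setuid binaries
        .arg(format!("--timeout={}", budget.as_secs())) // wall-clock kill switch
        .args(["-v", &format!("{checkout}:/src:ro")])   // PR contents, read-only
        .arg(image)
        .output() // blocks until the linter exits or the timeout kills it
}
```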
8. Test against fakes, not the real model. LLM tests against the live API are slow, flaky, expensive, and non-deterministic. LLM tests with HTTP-level mocks are brittle to vendor changes and miss most of the interesting failure modes. The pattern that works in auto_review is CannedProvider and ScriptedProvider: fakes implemented at the same trait the real providers implement, returning canned responses in the test's order, with assertions over the prompts the system would have sent. This is fast, deterministic, runs offline, and exercises the actual control flow. In an agentic system, the LLM is a dependency. Inject it. Fake it for tests. Don't test against the real one in CI.
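A sketch of the pattern; the trait and struct here are illustrative, not auto_review's actual CannedProvider signature:

```rust
use std::cell::RefCell;

// The same trait the real providers implement.
trait Provider {
    fn chat(&self, prompt: &str) -> String;
}

// A fake at the trait boundary: canned responses in the test's order,
// with every prompt recorded so the test can assert on what was sent.
struct CannedProvider {
    responses: RefCell<Vec<String>>, // consumed front-to-back; running out is a test bug
    seen_prompts: RefCell<Vec<String>>,
}

impl Provider for CannedProvider {
    fn chat(&self, prompt: &str) -> String {
        self.seen_prompts.borrow_mut().push(prompt.to_string());
        self.responses.borrow_mut().remove(0)
    }
}

#[test]
fn triage_prompt_mentions_the_changed_file() {
    let fake = CannedProvider {
        responses: RefCell::new(vec![r#"{"trivial": true}"#.into()]),
        seen_prompts: RefCell::new(vec![]),
    };
    let verdict = fake.chat("triage: Cargo.lock only");
    assert!(verdict.contains("true"));
    // Assert on the prompt the system would have sent, not on HTTP bytes.
    assert!(fake.seen_prompts.borrow()[0].contains("Cargo.lock"));
}
```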
9. Make the inference layer provider-agnostic from day one. The OpenAI chat-completion API is the de-facto standard, and the right abstraction is a small trait with a couple of methods (chat-completion, embedding) and an enum of model tiers (Reasoning, Cheap, Embedding) that maps to provider implementations. auto_review's ar-llm::Router does exactly this, and it pays for itself the first time someone wants to mix a local Ollama embedding model with a cloud reasoning model. The cost of the abstraction is a few hundred lines; the cost of not having it is a fork every time you switch providers.
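The abstraction is small enough to show nearly in full; this is a sketch of the shape rather than ar-llm::Router's actual surface:

```rust
enum Tier { Reasoning, Cheap, Embedding }

// The couple of methods the post describes: chat-completion and embedding.
trait Provider {
    fn chat(&self, prompt: &str) -> Result<String, String>;
    fn embed(&self, text: &str) -> Result<Vec<f32>, String>;
}

// Each tier maps to a provider implementation, and nothing downstream
// knows or cares which vendor sits behind which tier.
struct Router {
    reasoning: Box<dyn Provider>, // e.g. a hosted OpenAI-compatible endpoint
    cheap: Box<dyn Provider>,     // e.g. a smaller, faster hosted model
    embedding: Box<dyn Provider>, // e.g. a local Ollama model
}

impl Router {
    fn provider(&self, tier: Tier) -> &dyn Provider {
        match tier {
            Tier::Reasoning => self.reasoning.as_ref(),
            Tier::Cheap => self.cheap.as_ref(),
            Tier::Embedding => self.embedding.as_ref(),
        }
    }
}
```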
10. A regression bench is the only honest way to upgrade prompts or models. "It feels better" isn't a release criterion. auto_review ships a bench CLI subcommand that replays a fixture set of PRs through the full review path against a chosen model, scores precision and recall against labeled ground truth, and prints a report. When I bump a model snapshot, I run the bench. When I edit the review prompt, I run the bench. LLM behavior changes under upgrades and edits in non-obvious ways, and the only protection against silently regressing is to measure before and after. If your system doesn't have a bench, you're flying blind.
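The scoring at the heart of such a bench is plain set arithmetic; a minimal sketch, matching findings to labeled ground truth by (file, line):

```rust
/// Returns (precision, recall) for one fixture. Empty sets score 1.0 by
/// convention so a PR with no true findings and no emitted findings passes.
fn score(emitted: &[(String, u32)], truth: &[(String, u32)]) -> (f64, f64) {
    let tp = emitted.iter().filter(|f| truth.contains(f)).count() as f64;
    let precision = if emitted.is_empty() { 1.0 } else { tp / emitted.len() as f64 };
    let recall = if truth.is_empty() { 1.0 } else { tp / truth.len() as f64 };
    (precision, recall)
}
```

Averaged over the fixture corpus, those two numbers are the before/after comparison that replaces "it feels better."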
None of these techniques is novel by itself, and the team most responsible for putting them together in production already wrote about them. What the case study makes concrete, by example, is what each piece prevents in isolation: pull the verifier and you get hallucinated criticisms, pull the schema validator and you get parse-error 500s, pull the sandbox and you get an RCE, pull the bench and you get silent regressions. They interlock. The interlock is the architecture. If you're building anything agentic and ambitious, this is the shape I'd start with, regardless of whether your problem is code review or something else entirely.
Sovereignty as a First-Class Feature
The piece of this that the closed-source incumbents structurally can't match — not because they're worse companies, but because they're shipping a different product — is self-hostability.
auto_review is a single binary you run next to your Forgejo instance. The LLM router is provider-agnostic. Anything that speaks the OpenAI chat-completion protocol works: Ollama, vLLM, OpenRouter, Together, Groq, hosted OpenAI / Anthropic, your own local server with a 32B-class model and no internet. There's an explicit profile in the feasibility study and a verification target in the threat model for the all-local case: a full review pipeline completing on a workstation with Ollama and no egress. No code, no diffs, no embeddings, no learnings ever leave your machine.
You can also mix and match. Local embeddings (cheap, fast on consumer GPUs) plus a cloud reasoning tier for the heavy lift. A BYOK setup where your reviewer hits your OpenAI account against your prompts inside your sandbox. Whatever your data-residency and budget posture demand.
This isn't a virtue claim about hosted services, which I'm using and recommending. It's that the two products have structurally different shapes. CodeRabbit is a hosted service because that's the right shape for what they're shipping; that's why it's frictionless to adopt and why their team can iterate on the model and prompt mix without your involvement. auto_review happens to be self-hosted because that's the shape Forgejo users already have infrastructure for, and because some of the people who'd want it are running local-only inference for entirely defensible reasons. Neither is "better." They cover different cases.
What I'm Releasing, and Why
auto_review is pre-0.1.0. Alpha software with rough edges. The threat model is written down but not yet fully exercised against an adversarial PR corpus. The deployment story works on my hardware and probably not yet on yours without some hand-holding. Use it at your own risk. I'm working this week and next to push it to a usable, secure, default-deployable first release: finishing the production-polish items in the threat model, running a precision-recall benchmark against a labeled corpus of historical PRs, hardening the operator path, settling the license. The repository is open at git.johnwilger.com/jwilger/auto_review and the issues are public. If you want to wait for the 0.1.0 tag, that's reasonable.
auto_review will be open source under a permissive license. Once it tags 0.1.0, if you're already on Forgejo (or Gitea, which uses mostly the same API), you'll be able to nix run or docker compose up your own instance next to your forge and have it working on real PRs the same afternoon. If you're not on Forgejo, this is one more reason to consider it.
I'm not pretending this is a CodeRabbit replacement yet. CodeRabbit has years of accumulated craft, an enormous corpus of real-world tuning, and a team that has thought about every part of this longer and more carefully than I have. If you're on a platform CodeRabbit supports, use CodeRabbit. I do, where I can. What I'm claiming is narrower: the architecture is reproducible from public engineering writing, the core pieces work, the gap between "what I have" and "what they have" is closable by ordinary work over time, and I'm going to close it in the open with help from anyone who wants to pitch in.
Tying Off The Threads
I'm leaving GitHub because the platform stopped serving me, the public record reads like a textbook case of the pattern Doctorow named, and the writers I cited at the top of this post made the structural argument better than I could. The April merge-queue regression that retroactively un-merged two thousand pull requests, the days-long Elasticsearch outage, the malicious npm brand-squat, the slipping uptime, the public apology from the CTO — none of that is going to spontaneously reverse. The owner has thirty years of experience operating things at scale and has chosen not to use it here. I'm done waiting for that to change, and I'm joining a chorus that has been singing this for years.
The architectural moat around closed-source developer tools, on the other hand, turns out to be shallower than it used to be. The knowledge required to build the next CodeRabbit is, by and large, public: case studies, customer stories, security disclosures, conference talks, podcasts. Vendors talk about their systems because their marketing requires them to. A capable model with web access can read all of it at once and triangulate the architectural decisions that mattered, sharply enough to draft a real plan a single engineer can critique on its own terms. The remaining moat is execution, taste, and care, and that's where teams like CodeRabbit's are going to keep being good at this.
What I think small focused efforts can do is fill the gaps the commercial market doesn't cover — Forgejo, Gitea, anyone running a local-only inference profile — without trying to compete on the same axis. auto_review will live next to your forge. The architecture is the sovereignty. An OpenAI-compatible inference abstraction means your prompts, your diffs, your embeddings, and your accumulated learnings don't have to leave your machine if you don't want them to. The gates between you and that workflow are no longer technical; they're configuration.
The techniques that made any of this work — pipelines over loops, two-tier model routing, schema validation with bounded retry, separate cheap-tier verification, multi-source context orchestration, sandboxed tools, fakes for tests, regression benches — generalize. They're how I'd build the next agentic system, whatever it does. The CodeRabbit team's public engineering writing is the most coherent existing statement of this architecture I've come across, which is why this article cites from it as much as it does. I think a lot of next-generation agentic systems are going to converge on roughly this shape, whether we coordinate on it or not, because the alternatives turn out to be fragile in production. If this essay shortens that convergence by one project for one team, that's worth the writing.
The feasibility study said half a year. The first end-to-end thread took an evening. I don't want to overgeneralize from one project, but I don't want to underclaim either. If you've been waiting to find out whether the tool you wish existed is buildable by you, the answer is, more often than it used to be, yes.
The opinions, recommendations, and analysis in this article are my own and do not represent the official positions, views, or recommendations of my employer, Artium AI. Any errors are mine. Any praise is also mine, and I will gracefully accept it.
References
1. Hashimoto, Mitchell. "Ghostty Is Leaving GitHub." mitchellh.com, April 28, 2026. https://mitchellh.com/writing/ghostty-leaving-github.
2. Kelley, Andrew. "Migrating from GitHub to Codeberg." Zig Software Foundation, November 26, 2025. https://ziglang.org/news/migrating-from-github-to-codeberg/.
3. Claburn, Thomas. "Zig quits GitHub, says Microsoft's AI obsession has ruined the service." The Register, December 2, 2025. https://www.theregister.com/2025/12/02/zig_quits_github_microsoft_ai_obsession/.
4. DeVault, Drew. "Embrace, extend, and finally extinguish — Microsoft plays their hand." drewdevault.com, August 27, 2020. https://drewdevault.com/blog/Microsoft-plays-their-hand/.
5. DeVault, Drew. "GitHub Copilot and open source laundering." drewdevault.com, June 23, 2022. https://drewdevault.com/blog/Copilot-GPL-washing/.
6. Kuhn, Bradley M., and Denver Gingerich. "Give Up GitHub: The Time Has Come!" Software Freedom Conservancy, June 30, 2022. https://sfconservancy.org/blog/2022/jun/30/give-up-github-launch/.
7. tietz-sokolskaya, nicole. "I'm moving my projects off GitHub." ntietz.com, November 16, 2022. https://ntietz.com/blog/moving-off-github/.
8. Lord, Lord.io. "A Programmer's Guide to Leaving GitHub." lord.io, January 23, 2026. https://lord.io/leaving-github/.
9. CodeRabbit. "Repository Configuration: Overview." CodeRabbit Documentation. Accessed May 2, 2026. https://docs.coderabbit.ai/platforms/overview.
10. Doctorow, Cory. "Tiktok's enshittification." Pluralistic, January 21, 2023. https://pluralistic.net/2023/01/21/potemkin-ai/. (The term first appeared in Doctorow's "Social Quitting," a Medium post dated November 15, 2022; the Pluralistic essay is the canonical formulation.)
11. Doctorow, Cory. "The enshittification multiverse." Pluralistic, April 27, 2026. https://pluralistic.net/2026/04/27/analogs-and-analogies/.
12. pablotron. "GitHub Enshittification." pablotron.org, April 30, 2026. https://pablotron.org/2026/04/30/github-enshittification/.
13. Hietala, Jonas. "From GitHub to Codeberg/Forgejo." jonashietala.se, April 28, 2026. https://www.jonashietala.se/blog/2026/04/28/from_github_to_codebergforgejo/.
14. Norman, Jared. "The 'Git' 'Hub' Part Is No Longer the Product." jardo.dev, August 11, 2025. https://jardo.dev/the-git-hub-part-is-no-longer-the-product.
15. Bartlett, Colin. "Has GitHub Been Down More Since Its Acquisition by Microsoft?" StatusGator Blog, June 4, 2020 (updated). https://statusgator.com/blog/has-github-been-down-more-since-its-acquisition-by-microsoft/.
16. Šuppa, Marek. "The Missing GitHub Status Page." Accessed May 2, 2026. https://mrshu.github.io/github-statuses/.
17. Speed, Richard. "GitHub says sorry and vows to do better as uptime slips and devs complain." The Register, April 29, 2026. https://www.theregister.com/2026/04/29/github_says_sorry_and_says/.
18. Fedorov, Vlad. "An update on GitHub availability." The GitHub Blog, April 28, 2026. https://github.blog/news-insights/company-news/an-update-on-github-availability/.
19. GitHub Community. "[2026-04-23] Incident Thread #193645." Discussion, April 23, 2026. https://github.com/orgs/community/discussions/193645.
20. IsDown. "GitHub: Incomplete pull request results in repositories." Incident record, resolved May 1, 2026. https://isdown.app/status/github/incidents/577936-incomplete-pull-request-results-in-repositories.
21. Socket Research Team. "Malicious npm Package Brand-Squats TanStack to Exfiltrate Environment Variables." Socket Blog, April 29, 2026. https://socket.dev/blog/tanstack-brandsquat-compromise.
22. Windows Central. "'GitHub is failing me, every single day, and it is personal': After Xbox and Windows, now GITHUB is in crisis — Microsoft, what are you doing?" Windows Central, 2026. https://www.windowscentral.com/microsoft/github-is-failing-me-every-single-day-and-it-is-personal-after-xbox-and-windows-now-github-is-in-crisis-microsoft-what-are-you-doing.
23. "GitHub: Founding." Wikipedia, accessed May 2, 2026. https://en.wikipedia.org/wiki/GitHub#Founding.
24. Sharwood, Simon. "HashiCorp co-founder Mitchell Hashimoto says GitHub 'no longer a place for serious work.'" The Register, April 29, 2026. https://www.theregister.com/2026/04/29/mitchell_hashimoto_ghostty_quitting_github/.
25. Quirk, Kev. "Thoughts on Leaving GitHub." kevquirk.com, May 1, 2026. https://kevquirk.com/thoughts-on-leaving-github/.
26. Reece, Manton. "Not leaving GitHub yet." manton.org, May 1, 2026. https://www.manton.org/2026/05/01/not-leaving-github-yet.html.
27. Peters, Keith ("BIT-101"). "Migrating away from Github." bit-101.com, September 7, 2025. https://bit-101.com/blog/posts/2025-09-07/github/.
28. "Forgejo." Wikipedia, accessed May 2, 2026. https://en.wikipedia.org/wiki/Forgejo.
29. Ball, Kevin (host), and Harjot Gill (guest). "CodeRabbit and RAG for Code Review with Harjot Gill." Software Engineering Daily, June 24, 2025. https://softwareengineeringdaily.com/2025/06/24/coderabbit-and-rag-for-codereview-with-harjot-gill/. (Sponsored episode.)
30. Giannini, Steren, and Harjot Gill. "50% faster merge and 50% fewer bugs: How CodeRabbit built its AI code review agent with Google Cloud Run." Google Cloud Blog, April 22, 2025. https://cloud.google.com/blog/products/ai-machine-learning/how-coderabbit-built-its-ai-code-review-agent-with-google-cloud-run.
31. Zhu, Qian. "Case Study: How CodeRabbit Leverages LanceDB for AI-Powered Code Reviews." LanceDB Blog, September 3, 2025. https://www.lancedb.com/blog/case-study-coderabbit.
32. OpenAI. "Shipping code faster with o3, o4-mini, and GPT-4.1." OpenAI Stories, May 22, 2025. https://openai.com/index/coderabbit/.
33. Amiet, Nils. "How We Exploited CodeRabbit: From a Simple PR to RCE and Write Access on 1M Repositories." Kudelski Security Research Center, August 19, 2025. https://research.kudelskisecurity.com/2025/08/19/how-we-exploited-coderabbit-from-a-simple-pr-to-rce-and-write-access-on-1m-repositories/.
34. Gill, Harjot. "Our response to the January 2025 Kudelski Security vulnerability disclosure: Action & continuous improvement." CodeRabbit Blog, August 19, 2025. https://www.coderabbit.ai/blog/our-response-to-the-january-2025-kudelski-security-vulnerability-disclosure-action-and-continuous-improvement.
35. go-gitea/gitea Issue #26023: "Support Pull Request Review Code Comment Webhooks." Filed July 20, 2023. https://github.com/go-gitea/gitea/issues/26023. (Open feature request — Forgejo, as a Gitea fork, inherits this gap.)