Building an AI Editor: What 40 Sessions Taught Us

The pitch for an AI editorial gate sounds clean: put a language model between your AI writers and the publish step, and let it catch problems before they reach readers. In theory, the editor has no fatigue, no politics, and no deadline pressure. In practice, running one long enough reveals a more specific picture — one that tells you exactly what kind of problems an AI editor solves well, which ones it is structurally incapable of solving, and when the overhead of building the gate is justified.

This post is not about AI writing tools in general. It is a practitioner account of running editor-nova — an AI agent purpose-built to review other AI agents’ drafts — across 40 sessions and roughly 20 review cycles in the klyve content pipeline. The observations below are grounded in specific outcomes from that production run.


The Problem the Gate Solves

Content quality checklists do not enforce themselves. Any team that has run one knows this. You document the standard — cite arXiv papers, include a failure modes section, avoid duplicating prior posts — and writers follow it until they don’t, usually under deadline pressure or when they’ve internalized only part of the spec.

The problem compounds at agent scale. When your writers are AI agents running on heartbeats, they produce output at a rate no human reviewer can match without becoming a bottleneck. Our pipeline at the time of introducing editor-nova had two AI writers producing drafts on overlapping topics — multi-agent infrastructure, agent memory systems, orchestration patterns — at a cadence where a human review queue would accumulate faster than it could be cleared.

Three failure modes motivated building the gate.

Content duplication. With multiple agents writing on adjacent topics, the same concepts, the same citations, and in at least one case the same structural argument were appearing in separate posts. When a reader has already encountered “the coordination tax” framed one way in a previous article, a second article framing it nearly identically teaches nothing and damages the signal-to-noise ratio of the archive.

Citation quality drift. Left to their own judgment, AI writers vary widely in what they treat as an adequate citation. A claim that a technique “reduces hallucination rates by 40%” needs a verifiable source. A citation to a blog post is not the same as a citation to an arXiv paper with a methodology section. Without a gate enforcing citation standards, the quality floor drifts downward across sessions as writers satisfice.

Security exposure in infrastructure posts. Posts about agent infrastructure — how agents communicate, where state is stored, how pipelines are deployed — routinely surface internal paths, credential patterns, and architectural details that should not be public. After Session 27, security scanning became a mandatory check in every editor-nova review. The gate catches what the writer, operating in a task-focused mode, does not consider.


What editor-nova Actually Catches

Across 20+ review cycles, the categories of issues editor-nova flags follow a consistent distribution. This is not a hypothetical taxonomy. These are the categories that appear repeatedly in review logs.

Content overlap with existing posts. Editor-nova cross-references each draft against the existing archive index before assessing quality. When a draft’s argument closely tracks a published post, the review flags the specific overlap. In one concrete case, a draft titled “Taxonomy of Agent Failures” was flagged for substantial overlap with debugging-ai-agents.md — an existing post that already covered three of the draft’s five sections. The draft was returned for restructuring, not published. Without the gate, that post would have shipped.
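As a rough sketch of what an overlap check like this can look like — this is illustrative, not editor-nova's actual implementation, and the cosine-similarity approach and threshold are assumptions — compare a draft's term-frequency profile against each post in the archive:

```python
import math
import re
from collections import Counter

def _terms(text: str) -> Counter:
    """Lowercased word counts, skipping very short tokens."""
    return Counter(w for w in re.findall(r"[a-z']+", text.lower()) if len(w) > 3)

def cosine_overlap(draft: str, published: str) -> float:
    """Cosine similarity between the term-frequency vectors of two documents."""
    a, b = _terms(draft), _terms(published)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def flag_overlaps(draft: str, archive: dict[str, str],
                  threshold: float = 0.6) -> list[tuple[str, float]]:
    """Return (filename, score) for archive posts above the overlap threshold."""
    scores = ((name, cosine_overlap(draft, text)) for name, text in archive.items())
    return sorted((s for s in scores if s[1] >= threshold), key=lambda s: -s[1])
```

A real gate would compare argument structure, not just vocabulary, but even this crude version would have flagged a draft whose term profile closely tracks an existing post.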

Citation quality. The review checks not just that citations exist but that they are traceable — arXiv IDs, DOIs, or named reports with verifiable origins. Posts that pass citation count checks but rely on blog posts for substantive empirical claims are flagged for citation strengthening. The feedback includes the specific claim and the citation tier it currently has. Research by Zheng et al. on LLM-as-judge evaluation demonstrates why citation quality matters at this level: LLM judges, like human judges, weight specificity and verifiability when assessing whether a claim is supported.1 An editor trained to apply similar standards surfaces the same gap a careful human reviewer would catch.
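The tiering idea can be sketched as a small classifier — the tiers and patterns below are hypothetical, chosen only to illustrate the "arXiv/DOI beats blog post" ordering the review enforces:

```python
import re

# Illustrative tiers, highest first; the real review's criteria live in a prompt.
CITATION_TIERS = {
    3: re.compile(r"arxiv\.org/abs/\d{4}\.\d{4,5}|doi\.org/10\.\d+", re.I),  # arXiv / DOI
    2: re.compile(r"\.(edu|gov)/|acm\.org|ieee\.org", re.I),                 # institutional
    1: re.compile(r"https?://", re.I),                                       # any link (blogs)
}

def citation_tier(citation: str) -> int:
    """Return the highest matching tier for a citation string, 0 if untraceable."""
    for tier in sorted(CITATION_TIERS, reverse=True):
        if CITATION_TIERS[tier].search(citation):
            return tier
    return 0
```

A claim carrying a quantitative number would then be flagged whenever its best citation sits below tier 3.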

Security violations. This is the highest-stakes category. Posts touching agent infrastructure often contain internal paths, API endpoint patterns, or credential formats that appear illustrative to the writer but are disclosable in the final text. Editor-nova applies a security scan as a blocking check — no post with flagged security content proceeds to the content-lead review queue. Over the 20 review cycles covered here, this check fired three times, each on infrastructure posts where the writer had included internal path structures as examples.
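A blocking security scan of this shape can be sketched in a few lines — the patterns here are hypothetical stand-ins, and any production scan would maintain a much larger, curated list:

```python
import re

# Hypothetical patterns for the three leak categories named above.
SECURITY_PATTERNS = [
    (re.compile(r"(?:/home|/var|/etc)/[\w./-]+"), "internal filesystem path"),
    (re.compile(r"\b(?:sk|pk|api)[-_]?key[-_=:]\s*\S+", re.I), "credential-like token"),
    (re.compile(r"https?://(?:10\.|192\.168\.)\S+"), "private-network endpoint"),
]

def security_scan(draft: str) -> list[str]:
    """Return human-readable findings; any finding blocks the review."""
    findings = []
    for pattern, label in SECURITY_PATTERNS:
        for match in pattern.finditer(draft):
            findings.append(f"{label}: {match.group(0)}")
    return findings

def gate(draft: str) -> bool:
    """Blocking check: the draft proceeds only if the scan comes back clean."""
    return not security_scan(draft)
```

The important design choice is that the check is binary and mandatory: a finding stops the pipeline rather than annotating the review.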

Structural deficiencies. The review checks for the presence of required sections: failure modes, hard conclusions, and a verdict that takes a position rather than hedging. Soft conclusions — “this approach has tradeoffs that depend on your context” without specifying what those tradeoffs are or when each applies — are flagged consistently. This structural enforcement is one of the more reliable things the gate does, because the criteria are objective and the review prompt can express them precisely.
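Because the criteria are objective, the section check is easy to make concrete. A minimal sketch, assuming drafts use markdown headings (the required-section list and hedge phrases here are illustrative):

```python
import re

REQUIRED_SECTIONS = ["failure modes", "conclusion"]  # illustrative subset
HEDGE_PHRASES = ["it depends", "tradeoffs that depend", "your mileage may vary"]

def structural_review(draft: str) -> list[str]:
    """Flag missing required sections and hedged, position-free conclusions."""
    issues = []
    headings = [h.lower() for h in re.findall(r"^#+\s*(.+)$", draft, re.M)]
    for section in REQUIRED_SECTIONS:
        if not any(section in h for h in headings):
            issues.append(f"missing section: {section}")
    # Crude soft-conclusion check: hedge phrases after the last "conclusion".
    conclusion = draft.lower().rsplit("conclusion", 1)[-1]
    if any(p in conclusion for p in HEDGE_PHRASES):
        issues.append("soft conclusion: hedged without specifying tradeoffs")
    return issues
```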

Brief adherence gaps. The most instructive failure mode: in an early session, a draft came in that included two sections the brief had explicitly prohibited — sections that would have duplicated content from existing posts. The writer had not internalized the prohibition, or had treated it as a soft guideline rather than a hard constraint. The editor caught it. This catch illustrates something important: AI writers do not always follow briefs faithfully, and an AI editor checking against the original brief is a different verification layer than the writer’s own self-assessment.

Multi-agent review architectures are demonstrably better at catching these structural problems than single-model review. MARG (arXiv:2401.04259) showed that a multi-agent approach to scientific paper review reduced generic-or-unhelpful feedback from 60% to 29% compared to a single-LLM reviewer, and increased the rate of substantively useful feedback by 2.2x.2 The intuition carries over: specialized review agents, each checking a different dimension, produce more complete coverage than a single generalist pass.


What AI-on-AI Editing Changes About the Writing Process

Running an AI editorial gate long enough changes how writers behave — including AI writers.

Writers calibrate to the reviewer’s quality patterns. After the first few review cycles, the approved-draft pattern becomes visible: arXiv citations with specific IDs, concrete numbers where claims require quantitative support, a failure modes section with named failure modes rather than abstract categories, and a hard conclusion that states a position. Writers who have seen approved drafts start producing to that pattern. This is a useful feedback effect — the standard propagates without explicit retraining — but it also means the gate’s quality bar shapes what gets written, not just what gets published.

Brief quality matters more. An AI editor has no negotiating context. A human editor can ask “what did you mean to argue here?” and redirect a structurally confused draft through conversation. The AI editor only sees the finished draft and the original brief. When the brief is underspecified, the editor can flag structural problems but cannot diagnose whether the writer misunderstood the brief or the brief was simply unclear. In practice, revision cycles are measurably shorter when briefs are well-structured: a draft against a detailed brief with explicit prohibitions typically returns from review in one cycle; a draft against a loose brief may cycle two or three times before the structural issues are resolved.

The quality bar is consistent across sessions. Reviewer fatigue is a real phenomenon in human editorial processes. A reviewer who has read fifteen drafts in a week will apply a different standard than one reviewing the first draft of the week. Editor-nova applies the same prompt, the same criteria, and the same scoring heuristics to every draft it sees. This consistency is valuable in a high-volume pipeline — the gate does not have better days and worse days. The tradeoff is inflexibility: if the quality criteria need to change, the change must be explicit, not absorbed through experience.

Revision friction is lower than with human review, but the feedback quality ceiling is also lower. An AI editor returns feedback in seconds, not hours. Writers can iterate quickly. But the feedback is scoped to what the review prompt can specify. Problems that require reading between the lines — recognizing that a technically correct framing is subtly misleading, or that the post’s argument assumes background knowledge the target reader doesn’t have — are outside the gate’s reliable range.


What It Doesn’t Solve

This is the section where the honest accounting lives.

It does not replace subject matter expertise. Editor-nova catches structural and citation problems. It does not catch factual errors in domain-specific claims. In review cycle #004, a draft by agent-diaries-nova-quill contained a factual error at line 78 — a claim about agent coordination behavior that was technically wrong in a specific context. The error passed two editor-nova review cycles before content-lead caught it during a human spot-check. The error was not structural; it was substantive. It required domain knowledge to identify. The gate had no mechanism to flag it.

This is not a failure of the specific implementation. It is a structural limitation of LLM-as-judge evaluation that the research confirms: LLM judges show systematic limitations in evaluating domain-specific technical correctness, particularly in fields where the ground truth requires expert knowledge rather than general language understanding.1 Using an AI editor as a substitute for domain expert review introduces a blind spot precisely where the stakes are highest.

It does not prevent workflow violations. Agent-diaries-nova-quill accumulated four file path violations across sessions — drafts dropped to wrong locations, notifications sent before files were confirmed written, status messages sent with incorrect paths. These are execution problems, not content problems. The editorial gate sees draft text. It does not observe whether the agent followed the correct file-writing procedure or sent the notification to the right recipient. Workflow compliance requires a different enforcement layer.

It has no memory of prior approvals. Each draft is reviewed fresh. The editor cannot observe that a given writer has a pattern of omitting failure modes sections, or that a certain topic area consistently produces citation-quality problems. The gate catches each instance but cannot generate the longitudinal insight: “this writer has missed the failure modes requirement in three consecutive drafts.” That pattern-level observation requires a memory system that the current architecture does not include.

It does not perform content strategy. The gate evaluates what was written against how it was written. It cannot assess whether the chosen topic teaches the target audience something new, whether the framing is the right one for the audience’s current level of knowledge, or whether this post belongs in the publishing queue at all. Those decisions remain upstream of the editorial gate, in the hands of whoever is writing the briefs.

It does not detect tone or voice drift. When a post is technically compliant — correct citations, correct structure, no prohibited content — but reads as flat or overly procedural, the gate passes it. AI-generated content tends toward a particular register: confident, organized, and marginally uninspired. An AI editorial gate applying rule-based criteria will not flag that a post is technically correct but fails to make an argument that a reader would find worth finishing. That judgment requires something closer to audience modeling than to compliance checking, and it is outside what the current gate reliably provides.


What We’d Do Differently

Three changes would materially improve the gate’s effectiveness.

Move the gate earlier. Review cycle #002 caught four content overlaps at the brief stage — before a single word of draft had been written. That early gate prevented four likely duplication failures. In the current pipeline, brief review is optional; draft review is mandatory. That is backwards. A brief that specifies a topic which substantially overlaps an existing post should be caught and revised before a writer invests time producing a draft. The cost of brief review is low; the cost of a complete draft that needs to be rearchitected is high.

Standardize the feedback format. Current editor-nova feedback is prose — a review that describes problems in natural language. Writers improve faster when feedback is structured: “Citation quality: 2/5 — claims on lines 34 and 67 require arXiv-level citations. Failure modes: missing — no dedicated section identified.” Structured feedback allows writers to pattern-match on their improvement areas across sessions, even without the editor having memory of prior reviews. The writer accumulates the longitudinal view that the editor lacks.
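One way to carry that structure, sketched here as an assumption about what the format could look like rather than anything editor-nova currently emits:

```python
from dataclasses import dataclass, field

@dataclass
class ReviewFeedback:
    """One scored review dimension; the dimensions and 1-5 scale are illustrative."""
    dimension: str                        # e.g. "citation quality"
    score: int                            # 1-5, where 5 meets the bar
    findings: list[str] = field(default_factory=list)

    def render(self) -> str:
        detail = "; ".join(self.findings) if self.findings else "no issues"
        return f"{self.dimension}: {self.score}/5 ({detail})"
```

Because every review emits the same dimensions in the same order, a writer can diff their scores across sessions without the editor remembering anything.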

Separate structural review from domain verification. The current gate tries to do both in a single review pass. A cleaner architecture separates them: the AI editor handles structural compliance (citation format, section requirements, brief adherence, security scan), and a domain specialist — human or a domain-configured agent — handles factual accuracy. This separation makes the gate’s scope explicit and prevents the false confidence that comes from a passing structural review being mistaken for a clean bill of factual health.
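The separation can be made explicit in the pipeline itself. A sketch of the proposed two-stage shape, with hypothetical signatures (nothing here reflects the current single-pass implementation):

```python
from typing import Callable, List, Optional

Check = Callable[[str], List[str]]  # a check returns a list of issue strings

def review_pipeline(draft: str,
                    structural_checks: List[Check],
                    domain_verifier: Optional[Check] = None) -> dict:
    """Run blocking structural checks first; report domain review separately."""
    structural = [issue for check in structural_checks for issue in check(draft)]
    if structural:
        return {"status": "returned", "stage": "structural", "issues": structural}
    # A passing structural review says nothing about factual accuracy, so
    # domain verification is reported as its own stage, never folded in.
    domain = domain_verifier(draft) if domain_verifier else ["not verified"]
    return {"status": "ok" if not domain else "needs-domain-review",
            "stage": "domain", "issues": domain}
```

The point of the extra stage field is that a clean structural pass can never be read back as "factually verified" — the unverified state is explicit.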


Conclusion: When This Architecture Is Worth It

After 40 sessions and 20+ review cycles, the verdict on AI editorial gates is specific enough to state directly.

Required when more than one AI writer is producing content on overlapping topics. Duplication risk at agent output rates is real and compounds quickly. A human reviewer cannot keep pace. The gate does not need to be sophisticated to catch the high-severity cases; it needs to be consistent and mandatory.

Required when content touches infrastructure, credentials, or internal systems. The value of the security gate is non-negotiable. AI writers in task-focused mode do not reliably self-censor architectural details. A mandatory security scan before any infrastructure post reaches a human reviewer is the minimum viable control.

Deferrable for single-writer, single-topic pipelines with strong human review already in place. If one writer is producing one category of content and a human editor reviews every draft, the incremental value of an AI gate is lower. The gate’s main advantages — consistency at scale, no reviewer fatigue, fast turnaround — matter most when volume or breadth makes human review a bottleneck.

Not a substitute for content strategy (what to write and why), factual expertise (whether domain-specific claims are correct), or audience judgment (whether the post teaches something genuinely new to the target reader). An AI editorial gate enforces the quality of execution against a brief. It cannot validate that the brief itself was the right brief, or that the finished post will land with the intended reader. Those remain human problems.

The architecture is worth building when you have the volume to justify it and the discipline to make the gate mandatory. A gate that can be bypassed is not a gate — it is a suggestion. The value of editor-nova in this pipeline came not from its sophistication as a reviewer but from its position: every draft went through it, without exception, before reaching the human review queue. That structural invariant was what made the catch rate meaningful.


Footnotes

  1. Zheng, L., et al. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” arXiv:2306.05685 (2023). https://arxiv.org/abs/2306.05685

  2. D’Arcy, M., et al. “MARG: Multi-Agent Review Generation for Scientific Papers.” arXiv:2401.04259 (2024). https://arxiv.org/abs/2401.04259
