Pushing operational rules into Astro Content Collections with Zod

Contents

Why a README isn’t enough
The schema, in one file
Coupling two fields with .refine()
.refine vs .superRefine
Typing structured data inside frontmatter
Where Zod can’t reach: body parity
What I gave up automating
The schema / lint / review split
Wrap-up
FAQ
When do you reach for .superRefine over .refine?
Why keep howto and faq inside the frontmatter instead of separate files?
What happens to existing posts when I change the schema?
Why isn’t the body-to-JSON-LD parity check wired into astro build?

Written rules get forgotten.

When you run a blog alone, the writer (you) breaks the writer’s (your) own rules sooner or later. Forgetting affiliate: true on a category: reviews post, and shipping it without the ad disclosure, is the kind of mistake a README checklist doesn’t catch.

Aulvem leans on Astro Content Collections and Zod to turn machine-verifiable rules into build failures. What follows is where I draw the line between “let the schema catch it”, “let a lint script catch it”, and “leave it to review”.

This post is a follow-up to How this blog is built, which sets up the stack overview.

Why a README isn’t enough

A rule that lives only in a README has roughly zero enforcement.

The editor, reviewer, and writer are the same person, so there’s no second pair of eyes
Weeks later, the README isn’t going to be re-read line by line
Past me I trust; future me will plainly forget

I started with a checklist under docs/ and assumed that would be enough. By the third post I’d already missed an item. With no reviewer in the loop, the only way to force a rule is to put the check outside the writing flow — in the build. That’s why Aulvem leans on schema enforcement.

The schema, in one file

The Content Collections schema sits in src/content.config.ts with two collections (blog and services) declared together. Keeping them in one file means shared shapes (pubDate, updatedDate, heroImage) can be moved into local helpers without crossing file boundaries.

The blog schema looks like this (abridged):

import { defineCollection, z } from "astro:content";
import { glob } from "astro/loaders";

const blog = defineCollection({
  loader: glob({ pattern: "**/[^_]*.{md,mdx}", base: "./src/content/blog" }),
  schema: z
    .object({
      title: z.string(),
      description: z.string(),
      summary: z.string().optional(),
      pubDate: z.coerce.date(),
      updatedDate: z.coerce.date().optional(),
      category: z.enum(["build", "reviews"]),
      tags: z.array(z.string()).default([]),
      draft: z.boolean().default(false),
      affiliate: z.boolean().default(false),
      heroImage: z.string().optional(),
      heroAlt: z.string().optional(),
      howto: z.object({ /* ... */ }).optional(),
      faq: z.array(z.object({
        question: z.string(),
        answer: z.string(),
      })).optional(),
    })
    .refine((data) => (data.category === "reviews") === data.affiliate, {
      message: "affiliate must be true iff category is 'reviews'",
      path: ["affiliate"],
    }),
});

z.enum pins the category to two values, z.coerce.date accepts any value the JavaScript Date constructor can parse (which covers the date strings the YAML normally emits), and .optional() and .default() make explicit which fields the writer can leave blank. Plain tools cover most of the surface. The interesting work is the .refine() at the bottom, which the next section covers.
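To make the date handling concrete: Zod documents z.coerce.date() as passing the input through new Date(input) and then validating the result as a date. The sketch below mimics that step in plain TypeScript with no Zod import, so it shows the behaviour rather than Zod's actual code. coerceDate is a hypothetical helper name, not part of Zod or Astro.

```typescript
// Hypothetical stand-in for the coercion step behind z.coerce.date():
// wrap the input in new Date(...) and reject anything that is not a valid date.
function coerceDate(input: string | number | Date): Date {
  const date = new Date(input);
  if (Number.isNaN(date.getTime())) {
    throw new Error(`Invalid date: ${String(input)}`);
  }
  return date;
}

// A date-only ISO string, as usually written in frontmatter (parsed as UTC).
const fromString = coerceDate("2026-05-01");

// A Date object, which some YAML parsers produce for unquoted dates.
const fromDate = coerceDate(new Date(Date.UTC(2026, 4, 1)));

// Input the Date constructor cannot parse is rejected instead of passed through.
let rejected = false;
try {
  coerceDate("not a date");
} catch {
  rejected = true;
}
```

The practical upshot is that a typo in pubDate fails at build time instead of producing an Invalid Date that leaks into templates.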

Coupling two fields with `.refine()`

When two fields must move together, I tack a .refine() onto the end of the schema. The case in Aulvem: a category: reviews post is, by affiliate-network rules, advertising, and has to carry the disclosure banner. Aulvem injects the banner (and adds rel="sponsored noopener noreferrer") only when affiliate: true. So the failure mode is publishing a review post with affiliate left at its default of false: disclosure missing, sponsored rel missing, affiliate-network (ASP) terms broken.

Here’s the one-liner that catches it:

.refine((data) => (data.category === "reviews") === data.affiliate, {
  message: "affiliate must be true iff category is 'reviews'",
  path: ["affiliate"],
})

Comparing the two sides with === means that changing only one of them is a build failure. Reading the predicate as “these two are always equal” is clearer than spelling out the negated XOR by hand; it’s the same logic, easier to scan three months later.
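A quick way to convince yourself the predicate covers both directions is to enumerate all four combinations. This is a minimal sketch in plain TypeScript: categoryMatchesAffiliate is a hypothetical name for the same expression used in the .refine() callback, lifted out so it can run without Zod.

```typescript
type Category = "build" | "reviews";

// Same expression as the .refine() callback, as a standalone function.
function categoryMatchesAffiliate(category: Category, affiliate: boolean): boolean {
  return (category === "reviews") === affiliate;
}

// Enumerate every combination and record whether the schema would accept it.
const truthTable: { category: Category; affiliate: boolean; passes: boolean }[] = [];
for (const category of ["build", "reviews"] as const) {
  for (const affiliate of [false, true]) {
    truthTable.push({
      category,
      affiliate,
      passes: categoryMatchesAffiliate(category, affiliate),
    });
  }
}

for (const row of truthTable) {
  console.log(`${row.category} + affiliate: ${row.affiliate} -> ${row.passes ? "passes" : "fails"}`);
}
// build + affiliate: false -> passes
// build + affiliate: true -> fails
// reviews + affiliate: false -> fails
// reviews + affiliate: true -> passes
```

Both mismatches fail, so the rule also rejects a non-review post that carries affiliate: true by accident.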

The build error reads like this:

[ContentEntryInvalidError] Content config error in `blog → 2026-05-...`:
affiliate must be true iff category is 'reviews'
  at affiliate

The message string ends up in the output verbatim, so it’s worth writing it as instructions for future-you.

`.refine` vs `.superRefine`

Once you need more than one independent constraint on the same object, or per-field error messages, .superRefine() is the easier shape:

.superRefine((data, ctx) => {
  if (data.category === "reviews" && !data.affiliate) {
    ctx.addIssue({
      code: z.ZodIssueCode.custom,
      message: "reviews posts must set affiliate: true",
      path: ["affiliate"],
    });
  }
  if (data.draft && data.updatedDate) {
    ctx.addIssue({
      code: z.ZodIssueCode.custom,
      message: "draft posts should not carry updatedDate",
      path: ["updatedDate"],
    });
  }
})

For a single relationship between two fields, .refine() is still the lighter option.

Typing structured data inside frontmatter

I keep the data for the HowTo and FAQPage JSON-LD blocks in frontmatter rather than parsing the body. The reasons:

Parsing MDX headings to reconstruct structured data is brittle — a heading rename quietly breaks JSON-LD
Frontmatter is what Zod validates, so the shape is enforced for free
The JSON-LD generator can trust frontmatter without re-reading MDX

The schema looks like:

howto: z.object({
  name: z.string().optional(),
  description: z.string().optional(),
  totalTime: z.string().optional(),
  steps: z.array(z.object({
    name: z.string(),
    text: z.string(),
    image: z.string().optional(),
  })),
}).optional(),
faq: z.array(z.object({
  question: z.string(),
  answer: z.string(),
})).optional(),

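Anything malformed (a howto with no steps key, a faq entry missing answer) fails the build. One gap as written: an empty steps: [] still passes, because z.array() accepts empty arrays unless you add .min(1).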
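The post doesn't show Aulvem's JSON-LD generator, so here is only a sketch of what "trusting frontmatter" can look like. buildFaqJsonLd and FaqEntry are hypothetical names; the output shape follows the schema.org FAQPage type (Question entries with an acceptedAnswer).

```typescript
// Shape of one validated frontmatter entry (mirrors the faq schema above).
interface FaqEntry {
  question: string;
  answer: string;
}

// Hypothetical generator: maps validated frontmatter straight to a
// schema.org FAQPage object, without reading the MDX body.
function buildFaqJsonLd(faq: FaqEntry[]) {
  return {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    mainEntity: faq.map((entry) => ({
      "@type": "Question",
      name: entry.question,
      acceptedAnswer: { "@type": "Answer", text: entry.answer },
    })),
  };
}

const faqJsonLd = buildFaqJsonLd([
  {
    question: "When do you reach for .superRefine over .refine?",
    answer: "When one object needs more than one independent constraint.",
  },
]);

// Serialised form for a <script type="application/ld+json"> tag.
const faqJsonLdText = JSON.stringify(faqJsonLd);
```

Because the schema has already guaranteed that question and answer are strings, the generator needs no defensive checks of its own.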

Where Zod can’t reach: body parity

Zod only inspects frontmatter. The body MDX is outside its scope.

Google’s structured-data guidelines treat JSON-LD with no visible counterpart in the page body as a mismatch, and a page that breaks them can lose its rich-result eligibility. A post with FAQ questions only in frontmatter, never echoed in the body, passes the schema and silently disqualifies itself.

The fix is a separate layer. Aulvem runs a validate-schema-match.mjs script that grep-checks the body for each faq[].question and howto.steps[].name:

if (Array.isArray(data?.faq)) {
  for (const [i, item] of data.faq.entries()) {
    if (!item?.question) continue;
    if (!bodyNorm.includes(normalise(item.question))) {
      mismatches.push(
        `faq[${i}].question not found in body: "${item.question}"`,
      );
    }
  }
}

It normalises whitespace and case, then checks for substring presence. If something is missing, the script exits with code 1 and the change is blocked at the pre-commit / CI layer.
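The excerpt above leaves out normalise and the surrounding setup. A self-contained sketch of the same idea follows. The normalisation rules (lower-casing, collapsing whitespace runs) are an assumption based on the description, and findMismatches and ParityInput are hypothetical names rather than the contents of validate-schema-match.mjs.

```typescript
// Assumed normalisation: lower-case, collapse whitespace runs, trim the ends.
function normalise(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

interface ParityInput {
  body: string;
  faq?: { question: string }[];
  howtoSteps?: { name: string }[];
}

// Returns one message per frontmatter string that never appears in the body.
function findMismatches({ body, faq = [], howtoSteps = [] }: ParityInput): string[] {
  const bodyNorm = normalise(body);
  const mismatches: string[] = [];
  faq.forEach((item, i) => {
    if (!bodyNorm.includes(normalise(item.question))) {
      mismatches.push(`faq[${i}].question not found in body: "${item.question}"`);
    }
  });
  howtoSteps.forEach((step, i) => {
    if (!bodyNorm.includes(normalise(step.name))) {
      mismatches.push(`howto.steps[${i}].name not found in body: "${step.name}"`);
    }
  });
  return mismatches;
}

const sampleMismatches = findMismatches({
  body: "## FAQ\n\nWhen do you reach for   .superRefine over .refine?\n\nWhen one object needs more.",
  faq: [
    { question: "When do you reach for .superRefine over .refine?" },
    { question: "Why keep howto and faq inside the frontmatter?" },
  ],
});

// A non-empty list is what a wrapper script would turn into exit code 1.
console.log(sampleMismatches);
```

Here the first question matches despite the extra spaces in the body, and the second is reported because the body never repeats it.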

What it can’t catch: meaning. If the question matches but the answer underneath got swapped during editing, this layer says it’s fine.

What I gave up automating

Things neither the schema nor the grep validator can see:

Whether the FAQ answer is factually correct
How strong the disclosure language is (affiliate networks differ on what counts)
Whether the closing paragraph still reads like an AI draft
Whether the conclusion lines up with the article’s claim

These need a human read-through. Encoding them as schema rules would cost more than the manual review it replaces. I keep them in the pre-publish checklist and push only the mechanically catchable patterns into automation (lint-banned-phrases.mjs for AI-tells).
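For a sense of how small the mechanically catchable part is, here is a sketch of a banned-phrase check. The phrase list and the names findBannedPhrases and PhraseHit are hypothetical; the post doesn't show what lint-banned-phrases.mjs actually bans or how it reports hits.

```typescript
// Hypothetical phrase list, chosen only for illustration.
const BANNED_PHRASES = ["in conclusion", "delve into", "game-changer"];

interface PhraseHit {
  phrase: string;
  line: number; // 1-based line number within the body
}

// Case-insensitive substring scan, one pass per line of the post body.
function findBannedPhrases(body: string, banned: string[] = BANNED_PHRASES): PhraseHit[] {
  const hits: PhraseHit[] = [];
  body.split("\n").forEach((text, index) => {
    const lower = text.toLowerCase();
    for (const phrase of banned) {
      if (lower.includes(phrase)) {
        hits.push({ phrase, line: index + 1 });
      }
    }
  });
  return hits;
}

const lintHits = findBannedPhrases(
  "Zod handles the types.\nIn conclusion, schemas are a game-changer.",
);

for (const hit of lintHits) {
  console.log(`line ${hit.line}: banned phrase "${hit.phrase}"`);
}
```

A check like this only finds known strings. Whether a paragraph still reads like a draft stays with the human pass.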

The point isn’t “automate everything I can”. It’s deciding where build failures buy you more than they cost. With one person on the keyboard, automation time is finite, and that ratio matters.

The schema / lint / review split

Splitting rules across three layers makes “which layer should hold this?” answerable:

| Layer | When it fires | What it catches | What it misses |
| --- | --- | --- | --- |
| Zod schema | `astro build` | types, enums, required/optional, field-to-field relationships | meaning, body-to-frontmatter parity |
| Lint script | pre-commit, CI | banned phrases, body/frontmatter substring parity | meaning |
| Review | pre-publish checklist | meaning, disclosure strength, AI-tell judgment | not automatable |

Rule of thumb: if a higher layer can catch it, don’t push it down. Writing field-level constraints into a lint script means the build doesn’t fail and you have to remember to run the script. Pushing simple substring checks into Zod, conversely, makes the schema type definition harder to read.

Wrap-up

Structural constraints a machine can check go in the schema. Substring presence goes in lint. Meaning stays with review. Once that split is in place, every new operational rule can be sized against “which layer holds this?” instead of “where do I update the README?”.

The full Aulvem stack overview is in How this blog is built, if you want the bigger picture.

The same “frontmatter as the single source of truth” idea also drives the sitemap lastmod setup — written up separately in Reading sitemap lastmod from MDX frontmatter.

FAQ

When do you reach for `.superRefine` over `.refine`?

When one object needs more than one independent constraint, or when each issue needs its own message and path. For a single relationship between two fields, .refine() is enough. Aulvem only ties category and affiliate together, so it stays on .refine().

Why keep howto and faq inside the frontmatter instead of separate files?

Frontmatter gives me one source of truth that’s typed. Splitting them across files means the body, JSON-LD, and metadata each have their own home and have to be kept in sync by hand. Frontmatter as the source, JSON-LD generated from it, and the body echoing the same strings is the configuration with the fewest seams.

What happens to existing posts when I change the schema?

If the change adds a required field, the build breaks: every existing post needs the field before astro build will pass again. That’s intentional. A schema change forces you to check how far it reaches across every existing post.

Why isn’t the body-to-JSON-LD parity check wired into `astro build`?

To keep build time small. The parity check is a separate grep-based script run in pre-commit and CI. Folding it into the build itself would slow astro dev startup, which would hurt the writing loop.