How this blog is built — Aulvem on Astro 5 and Content Collections


Contents

Stack overview — Astro 5 + MDX + Content Collections + R2
Definition: what Content Collections is
Full stack — the main packages
Three pieces I wouldn’t drop
1. Enforce rules through the schema
2. Keep structured data and body text in sync
3. Read sitemap lastmod from frontmatter yourself
How the three pieces connect
Operational flow lives in a single source of truth
FAQ
Why Astro 5?
Why no headless CMS (Sanity, Contentful, etc.)?
How is full-text search implemented?
Does i18n use Astro’s built-in support as-is?
Wrap-up

This post is the starting point for documenting how Aulvem is built — not as a stack tour, but as a map of what we enforce through the schema and what we deliberately leave out. The stack itself is just Astro + MDX + Content Collections — nothing special there. What I want to write about is the design work on top of that: how operational rules a writer tends to forget get pushed into build-time checks.

This post is the overview hub. The individual decisions (schema enforcement, structured-data parity, custom sitemap lastmod) are split into follow-up posts.

Stack overview — Astro 5 + MDX + Content Collections + R2

Aulvem runs on Astro 5 (static output) + MDX + Content Collections + Tailwind 3 + Pagefind + Cloudflare R2. There’s no dynamic server — the site is just pre-built HTML served from Cloudflare Pages.

I narrowed the selection criteria to three points:

the build output is fully static
post metadata can be locked down by a schema
the core has a small dependency tree

Astro 5 met all three. The other features — partial hydration, UI framework integration, and so on — aren’t being used right now.

Whether Astro is the right pick depends on the requirements. If you want to layer on dynamic auth, comments, or subscriptions, something like Next.js is a better fit.

Definition: what Content Collections is

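Content Collections is Astro’s built-in API for grouping content files into named collections and validating each entry’s frontmatter against a Zod schema. Aulvem defines two collections: blog (posts) and services (product pages). The schema lives in a single file and is validated at build time for both collections.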
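
As a rough illustration of what a single schema file for two collections can look like, here is a sketch that assumes the Astro 5 content layer with the `glob` loader. The file path, the directory names, and every field other than `category`, `affiliate`, `pubDate`, and `updatedDate` are assumptions made for the example, not Aulvem’s actual schema.

```typescript
// src/content.config.ts (hypothetical path and fields; a sketch, not the real schema)
import { defineCollection, z } from "astro:content";
import { glob } from "astro/loaders";

// Posts: every .mdx file under the (assumed) blog directory.
const blog = defineCollection({
  loader: glob({ pattern: "**/*.mdx", base: "./src/content/blog" }),
  schema: z.object({
    title: z.string(),
    pubDate: z.coerce.date(),
    updatedDate: z.coerce.date().optional(),
    category: z.string(),
    affiliate: z.boolean().default(false),
  }),
});

// Product pages: a second collection validated by the same file.
const services = defineCollection({
  loader: glob({ pattern: "**/*.mdx", base: "./src/content/services" }),
  schema: z.object({
    title: z.string(),
    description: z.string(),
  }),
});

export const collections = { blog, services };
```

Any entry whose frontmatter does not match its schema stops the build with a validation error, which is the property the rest of this post leans on.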

Full stack — the main packages

Runtime dependencies are kept to eight packages.

Package	Use	Notes
`astro` ^5	SSG core	static output only
`@astrojs/mdx` ^4	MDX support	all posts are `.mdx`
`@astrojs/sitemap` ^3	sitemap generation	`lastmod` injected manually
`@astrojs/rss` ^4	RSS generation	full `content:encoded`
`@astrojs/tailwind` ^6	Tailwind integration	`applyBaseStyles: false`
`rehype-external-links` ^3	rel attributes on external links	`noopener noreferrer`
`rehype-mermaid` ^3	Build-time mermaid → SVG	`inline-svg` strategy
`tailwindcss` ^3.4	styling	holding off on v4

On the dev side: pagefind (full-text search), sharp (local image processing), playwright (build-time SVG rendering for mermaid), typescript, and @types/node. No React, no Vue, no Vite plugins.
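
The notes column in the table maps fairly directly onto config options. The following is a sketch of how those options could be wired together, not a copy of Aulvem’s config: the `site` URL is a placeholder, and the plugin order is an assumption.

```typescript
// astro.config.ts (a sketch; the site URL is a placeholder)
import { defineConfig } from "astro/config";
import mdx from "@astrojs/mdx";
import sitemap from "@astrojs/sitemap";
import tailwind from "@astrojs/tailwind";
import rehypeExternalLinks from "rehype-external-links";
import rehypeMermaid from "rehype-mermaid";

export default defineConfig({
  site: "https://example.com", // placeholder, required by the sitemap integration
  output: "static",
  integrations: [
    mdx(), // inherits the markdown options below by default
    sitemap(),
    tailwind({ applyBaseStyles: false }),
  ],
  markdown: {
    rehypePlugins: [
      // Add rel="noopener noreferrer" to external links.
      [rehypeExternalLinks, { rel: ["noopener", "noreferrer"] }],
      // Render mermaid code blocks to inline SVG at build time.
      [rehypeMermaid, { strategy: "inline-svg" }],
    ],
  },
});
```

Depending on the Astro version, the syntax highlighter may also need to be told to leave mermaid code blocks alone so that `rehype-mermaid` still sees them; that part is omitted here.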

The first rule is: don’t add a dependency in the hope that it’ll be useful later. Unused dependencies cost build time and add security-alert noise, so before adding one I check whether I can state its use case in one sentence.

Three pieces I wouldn’t drop

1. Enforce rules through the schema

Operational rules don’t live in the README — they live in the frontmatter schema. I wanted the build to fail when a writer (including me) forgets, instead of leaning on memory.

For example, posts with category: reviews have to be flagged as advertising under affiliate-network rules, so they must carry affiliate: true. Zod’s .refine() enforces that category === "reviews" and affiliate === true always hold together: a review without the flag fails, and so does the flag on a non-review post. Changing only one side breaks the build, so memory isn’t the load-bearing piece.
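
The equivalence itself is a one-line predicate. The sketch below keeps it dependency-free so it can be read on its own; the `PostMeta` type and the function name are hypothetical, and the comment shows roughly how it would be handed to Zod’s `.refine()`.

```typescript
// Hypothetical shape: only `category` and `affiliate` come from the post.
interface PostMeta {
  category: string;
  affiliate: boolean;
}

// True only when "is a review" and "is flagged as affiliate" agree.
function categoryMatchesAffiliate(meta: PostMeta): boolean {
  return (meta.category === "reviews") === meta.affiliate;
}

// With Zod this would be attached to the object schema, roughly:
//   z.object({ /* fields */ }).refine(categoryMatchesAffiliate, {
//     message: "category 'reviews' and affiliate: true must go together",
//   });
```

Comparing the two booleans for equality, rather than checking one direction only, is what makes the rule an equivalence: it rejects both an unflagged review and a flagged non-review.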

The same idea applies to the howto / faq structured data, but that belongs in a follow-up post.

2. Keep structured data and body text in sync

The JSON-LD (howto / faq) content is generated from the frontmatter, but the same content must also appear in the body. Google’s structured data guidelines require markup to reflect content that is visible on the page, so elaborate JSON-LD with no corresponding body text is a violation, and getting caught can cost the page its rich-result eligibility.
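
To make "generated from the frontmatter" concrete, here is a minimal sketch of turning FAQ entries into a schema.org FAQPage object. The `FaqItem` shape and both function names are assumptions for illustration; only the schema.org vocabulary is standard.

```typescript
// Hypothetical frontmatter shape for FAQ entries.
interface FaqItem {
  question: string;
  answer: string;
}

// Build a schema.org FAQPage object from frontmatter FAQ items.
function buildFaqJsonLd(faq: FaqItem[]): Record<string, unknown> {
  return {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    mainEntity: faq.map((item) => ({
      "@type": "Question",
      name: item.question,
      acceptedAnswer: { "@type": "Answer", text: item.answer },
    })),
  };
}

// Serialize for a <script type="application/ld+json"> tag.
// Escaping "<" keeps an answer containing "</script>" from closing the tag early.
function toJsonLdScriptBody(data: Record<string, unknown>): string {
  return JSON.stringify(data).replace(/</g, "\\u003c");
}
```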

Aulvem’s rule is to write the same content on both sides — frontmatter and body — so a one-sided update can’t slip through. That alone closes off the failure mode where you fill in the frontmatter FAQ and forget the body.
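
One way a parity rule like this could be checked mechanically is to confirm that every FAQ question in the frontmatter also appears somewhere in the body. The sketch below is an assumption about how that might look, not the check Aulvem runs; the `FaqEntry` type and the function names are made up for the example, and it only compares questions, not answers.

```typescript
// Hypothetical frontmatter shape for FAQ entries.
interface FaqEntry {
  question: string;
  answer: string;
}

// Collapse whitespace and case so line wrapping in MDX does not cause false alarms.
function normalize(text: string): string {
  return text.replace(/\s+/g, " ").trim().toLowerCase();
}

// Return the frontmatter FAQ questions that never appear in the body.
function findMissingFaqQuestions(faq: FaqEntry[], body: string): string[] {
  const haystack = normalize(body);
  return faq
    .map((entry) => entry.question)
    .filter((question) => !haystack.includes(normalize(question)));
}

// A build script could call this per post and let the thrown error stop the build.
function assertFaqParity(slug: string, faq: FaqEntry[], body: string): void {
  const missing = findMissingFaqQuestions(faq, body);
  if (missing.length > 0) {
    throw new Error(`${slug}: FAQ questions missing from body: ${missing.join("; ")}`);
  }
}
```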

3. Read sitemap lastmod from frontmatter yourself

Astro’s official sitemap integration doesn’t read updatedDate from MDX frontmatter. So Aulvem walks every MDX file with custom logic and pipes updatedDate ?? pubDate into the sitemap as lastmod.

Without that, the integration’s only built-in choice is a single lastmod value applied to every URL, which in practice means the build time. That tells search engines that “every post was updated on every build”, which is pure noise. Google and Bing both document lastmod as a signal their crawlers read, and it is presumably a freshness input for AI search as well (citation needed), so it’s not a part to be sloppy on.

At the same time, paginated pages that return noindex, follow are dropped from the sitemap. Submitting a noindex URL through the sitemap is a contradictory signal, so both have to be handled together.
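
Both halves of this, filling in lastmod and dropping noindex pages, fit in one per-entry function. The sketch below is dependency-free and built on assumptions: the `SitemapEntry` type is a minimal stand-in for what a sitemap hook receives, the `/page/N/` URL pattern for paginated pages is a guess, and the lookup map is assumed to have been built beforehand by walking the MDX files.

```typescript
// Minimal stand-in for the entry shape passed to a sitemap hook.
interface SitemapEntry {
  url: string;
  lastmod?: string;
}

// Frontmatter dates as described in the post.
interface PostDates {
  pubDate: Date;
  updatedDate?: Date;
}

// lastmod is updatedDate when present, otherwise pubDate.
function pickLastmod(dates: PostDates): string {
  return (dates.updatedDate ?? dates.pubDate).toISOString();
}

// Assumption: paginated listing pages look like /blog/page/2/.
const PAGINATED = /\/page\/\d+\/?$/;

// Returns the entry with lastmod filled in, or undefined to drop it.
function serializeEntry(
  entry: SitemapEntry,
  lastmodByUrl: Map<string, string>,
): SitemapEntry | undefined {
  if (PAGINATED.test(entry.url)) return undefined; // noindex pages stay out
  const lastmod = lastmodByUrl.get(entry.url);
  return lastmod ? { ...entry, lastmod } : entry;
}
```

With `@astrojs/sitemap`, a function of this shape could be passed through the `serialize` option, which is called once per entry and drops the entry when it returns `undefined`. URLs with no frontmatter date are left without a lastmod rather than given a made-up one.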

How the three pieces connect

flowchart LR
  FM[frontmatter] --> Zod
  FM --> Sitemap
  FM --> Sync
  Body[MDX body] --> Sync

  Zod["Zod schema check<br/>category ⇔ affiliate"] --> Build
  Sitemap["sitemap lastmod injection<br/>reads updatedDate"] --> Build
  Sync["frontmatter ↔ body<br/>parity"] --> Build

  Build[Astro build]

  Build --> HTML[HTML + JSON-LD + SVG]
  Build --> SM[sitemap.xml]

Two inputs — frontmatter and body — fan out through three build-time pieces and converge at Astro build. Frontmatter runs Zod schema validation and sitemap lastmod injection, while both frontmatter and body feed the parity check that makes sure the same content lives on both sides.

All results are unified by Astro build into the HTML, JSON-LD, and sitemap.xml. If any single piece fails, the build fails — so polishing one side alone won’t get you to publish.

Operational flow lives in a single source of truth

Operational flows (adding a post, adding a product, retiring a post) are anchored in one doc as the single source of truth. The scaffolding and integrity-check scripts are wired in from there, and the rule is to follow the same steps every time. I think this removes most of the variance where the approach drifts from one run to the next. What’s left is the warmth of the prose, which automation shouldn’t try to normalize anyway.

There’s also a rule: don’t bump updatedDate just because you added a preface note; bump it only on substantive revision. Search engines and AI search read dateModified as a freshness signal, so bumping it loosely risks devaluing that signal across the whole site. The criteria are fixed up front.

FAQ

Why Astro 5?

I narrowed it down to four constraints: Markdown-centric, fully static output, schema-typed frontmatter, and a small core dependency tree. Astro 5 was the best fit. Content Collections is built in and MDX is an official integration, so getting typed content needs no third-party plugins.

Why no headless CMS (Sanity, Contentful, etc.)?

With one editor publishing once or twice a week, MDX + git is faster than a CMS admin UI. A CMS adds the cost of draft locations, schema migrations, and API auth from day one. I’d revisit this when readership crosses ~10k, or when more editors come in.

How is full-text search implemented?

Pagefind runs after astro build and produces a static binary search index. The front end pulls it through a tens-of-KB WASM loader. No serverless function, no external API.
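
On the build side this usually amounts to running `pagefind --site dist` after `astro build` (assuming Astro’s default `dist` output directory). On the front end, a query helper might look like the sketch below. The `PagefindLike` interface is a hand-written subset of the Pagefind JS API so the example stands alone, and `searchTop` is a hypothetical name.

```typescript
// Minimal types for the parts of the Pagefind JS API used here.
interface PagefindResultData {
  url: string;
  excerpt: string;
  meta: Record<string, string>;
}

interface PagefindLike {
  search(query: string): Promise<{
    results: { data(): Promise<PagefindResultData> }[];
  }>;
}

// Run a query and load only the first `limit` result fragments.
async function searchTop(
  pagefind: PagefindLike,
  query: string,
  limit = 5,
): Promise<PagefindResultData[]> {
  const { results } = await pagefind.search(query);
  return Promise.all(results.slice(0, limit).map((result) => result.data()));
}

// In the browser the module would be loaded lazily from the build output:
//   const pagefind = await import("/pagefind/pagefind.js");
//   const hits = await searchTop(pagefind, "content collections");
```

Each result’s `data()` call fetches a separate fragment, so slicing before loading keeps the number of requests per query small.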

Does i18n use Astro’s built-in support as-is?

Routing uses Astro 5’s i18n (defaultLocale: en / prefixDefaultLocale: false) untouched. hreflang and sitemap lastmod, on the other hand, are read out of the MDX frontmatter through a custom helper. The details belong in a follow-up post.
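
For reference, the routing half of that answer corresponds to a config block along these lines. This is a sketch: the post only names the default locale, so the second locale in the list is a made-up example.

```typescript
// astro.config.ts excerpt (a sketch; the locale list is an assumption)
import { defineConfig } from "astro/config";

export default defineConfig({
  i18n: {
    locales: ["en", "ja"], // hypothetical; the post only names the default locale
    defaultLocale: "en",
    routing: {
      // Default-locale pages live at /, other locales at /<locale>/.
      prefixDefaultLocale: false,
    },
  },
});
```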

Wrap-up

That’s the overview. Building and running this, what mattered most wasn’t which stack to pick — it was deciding how tightly to lock the frontmatter down with the schema.