Reading sitemap lastmod from MDX frontmatter — customising Astro's sitemap integration

Contents

Why lastmod matters
Walking MDX at config evaluation time
Feeding the map into serialize
Filtering out paginated noindex pages
How the pieces connect
Things I almost got wrong
When this should move back into the official integration
Wrap-up
FAQ
Why doesn’t @astrojs/sitemap read updatedDate on its own?
Isn’t reading frontmatter through fs.readdir + regex unsafe compared to the Content Collections API?
Is it really wrong to ship noindex URLs in the sitemap?
Does putting a real date in lastmod actually move AI search citations?

@astrojs/sitemap won’t read updatedDate out of MDX frontmatter.

The integration's only built-in control is a single lastmod option that stamps one date on every URL. Point it at the build time and the sitemap tells search engines and AI search that every post was updated on every build. The freshness hint stops being a hint and becomes misleading.

Aulvem solves this inside astro.config.mjs by walking the blog folder, building a path-to-date map with a lightweight regex, and feeding it into the serialize hook of @astrojs/sitemap. Paginated noindex pages are dropped from the sitemap in the same pass via filter. This is the second follow-up to How this blog is built.

Why `lastmod` matters

lastmod is the last-modified date in the sitemap protocol, and search engines use it as a re-crawl priority hint. AI search engines — ChatGPT Search, Perplexity, Claude Search — read it as a freshness anchor for citation eligibility.

Two specific reasons to get it right:

Crawl priority: recently updated URLs should get re-fetched sooner
AI citation: “this article was updated N days ago” feeds into the LLM’s decision to cite the page as current information

A bogus lastmod works the other way. Google's sitemap documentation says lastmod is used only when it is consistently and verifiably accurate. Painting build time onto every URL teaches Google that this site's lastmod is noise, which is worse than not setting one at all. Pair the implementation with a strict updatedDate policy (see content rules A-2-1) and the signal stays meaningful.

Walking MDX at config evaluation time

The lastmod map is built once, inside astro.config.mjs, by reading every .mdx / .md under src/content/blog/{en,ja}/.

import { readdir, readFile } from "node:fs/promises";
import { join } from "node:path";

async function buildBlogLastmodMap() {
  const map = new Map();
  for (const lang of ["en", "ja"]) {
    const dir = join(process.cwd(), "src", "content", "blog", lang);
    let files = [];
    try {
      files = await readdir(dir);
    } catch {
      continue;
    }
    for (const file of files) {
      if (!file.endsWith(".mdx") && !file.endsWith(".md")) continue;
      const slug = file.replace(/\.(mdx|md)$/, "");
      const raw = await readFile(join(dir, file), "utf8");
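      // Frontmatter sits between the opening and closing --- fences (LF or CRLF).
      const fm = /^---\r?\n([\s\S]*?)\r?\n---/.exec(raw);
      if (!fm) continue;
      const front = fm[1];
      // Drafts must never reach the sitemap.
      if (/^draft:\s*true/m.test(front)) continue;
      // Prefer updatedDate; fall back to pubDate.
      const updated = /^updatedDate:\s*(\S+)/m.exec(front);
      const pub = /^pubDate:\s*(\S+)/m.exec(front);
      const dateStr = (updated && updated[1]) || (pub && pub[1]);
      if (!dateStr) continue;
      const d = new Date(dateStr);
      // Skip values that Date cannot parse rather than emitting a bad lastmod.
      if (Number.isNaN(d.getTime())) continue;
      // Key by the public URL path so serialize can look it up directly.
      const path = lang === "ja" ? `/ja/blog/${slug}/` : `/blog/${slug}/`;
      map.set(path, d.toISOString());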
    }
  }
  return map;
}

const blogLastmod = await buildBlogLastmodMap();

The reason getCollection("blog") isn’t used here: astro.config.mjs evaluates before the content loader is initialised. The Content Collections API simply isn’t available when the config is being read.

Frontmatter is read with a light regex. There’s no need to fully parse MDX — only updatedDate and pubDate matter for the lastmod map, and adding a YAML parser as a dependency for two fields isn’t worth it. Type-safety is sacrificed at this layer, but the Zod schema validates every post during the actual build, so the safety net moves up one step rather than disappearing.

Skipping draft: true posts here is important. If they leak through, draft URLs end up in the sitemap, which is the opposite of what draft: true is supposed to express.
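
The per-file part of that walk can also be pulled into a pure function, which makes it easy to exercise without touching the filesystem. The following is a minimal sketch, not the code Aulvem ships: extractLastmod is a hypothetical name, and the quote stripping and CRLF handling are assumptions about frontmatter that might show up in a repo. Stripping quotes first means new Date() only ever sees a plain ISO date, so the result does not depend on how the engine treats non-ISO input.

```javascript
// Hypothetical helper: returns an ISO lastmod string for one raw
// MDX/Markdown file, or null when the file should be skipped.
function extractLastmod(raw) {
  // Accept both LF and CRLF line endings around the frontmatter fences.
  const fm = /^---\r?\n([\s\S]*?)\r?\n---/.exec(raw);
  if (!fm) return null;
  const front = fm[1];
  // Drafts never belong in the sitemap.
  if (/^draft:\s*true\s*$/m.test(front)) return null;
  const pick = (key) => {
    const m = new RegExp(`^${key}:\\s*(\\S+)`, "m").exec(front);
    // Strip optional surrounding single or double quotes (assumption:
    // some posts may quote their dates).
    return m ? m[1].replace(/^["']|["']$/g, "") : null;
  };
  // Prefer updatedDate; fall back to pubDate.
  const dateStr = pick("updatedDate") || pick("pubDate");
  if (!dateStr) return null;
  const d = new Date(dateStr);
  return Number.isNaN(d.getTime()) ? null : d.toISOString();
}

// Sample input, made up for illustration.
const sample = '---\ntitle: Demo\npubDate: 2026-01-02\nupdatedDate: "2026-05-25"\n---\nBody';
console.log(extractLastmod(sample));
```

Inside buildBlogLastmodMap, the loop body would then reduce to reading the file, calling the helper, and setting the map entry when the result is not null.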

Feeding the map into `serialize`

The serialize option on @astrojs/sitemap is a hook that rewrites each emitted URL entry. The map gets looked up by URL path and the result is dropped into item.lastmod.

sitemap({
  i18n: {
    defaultLocale: "en",
    locales: { en: "en", ja: "ja" },
  },
  serialize(item) {
    const url = new URL(item.url);
    // Strip the /ja/ prefix so /ja/blog/foo/ and /blog/foo/ both
    // hit the same changefreq / priority branch. lastmod lookup
    // still uses the original url.pathname so the map keys match.
    const pathname = url.pathname.replace(/^\/ja\//, "/").replace(/^\/ja$/, "/");
    if (pathname === "/") {
      item.changefreq = "daily";
      item.priority = 1.0;
    } else if (pathname === "/blog/") {
      item.changefreq = "daily";
      item.priority = 0.9;
    } else if (pathname.startsWith("/blog/")) {
      item.changefreq = "monthly";
      item.priority = 0.7;
      const lastmod = blogLastmod.get(url.pathname);
      if (lastmod) item.lastmod = lastmod;
    } else if (pathname.startsWith("/products/")) {
      item.changefreq = "monthly";
      item.priority = 0.8;
    } else {
      item.changefreq = "monthly";
      item.priority = 0.5;
    }
    return item;
  },
});

changefreq and priority go in the same hook so each path category gets a consistent value. Google's documentation says it ignores both priority and changefreq; other crawlers may still read them, so I keep the values consistent rather than dropping them entirely.

Stripping /ja/ for the branch decision but not for the lookup matters: the lookup needs to keep the language prefix because the map was built with /ja/blog/foo/ and /blog/foo/ as distinct keys.
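
To make the two-path distinction concrete, here is a small sketch that separates the normalised path (used for branching) from the original path (used for the lookup). resolveEntry, sampleLastmod, the dates, and the example.com origin are all made up for illustration; only the two replace calls mirror the config above.

```javascript
// Illustrative stand-in for the map built at config time (sample data).
const sampleLastmod = new Map([
  ["/blog/foo/", "2026-05-25T00:00:00.000Z"],
  ["/ja/blog/foo/", "2026-05-20T00:00:00.000Z"],
]);

// Hypothetical helper: returns the branch path and the lastmod for a URL.
function resolveEntry(urlString, lastmodMap) {
  const { pathname } = new URL(urlString);
  // Normalised path: used only to pick the changefreq / priority branch.
  const branchPath = pathname.replace(/^\/ja\//, "/").replace(/^\/ja$/, "/");
  // Original path: used for the lookup, so /ja/ posts keep their own date.
  const lastmod = branchPath.startsWith("/blog/")
    ? lastmodMap.get(pathname)
    : undefined;
  return { branchPath, lastmod };
}

console.log(resolveEntry("https://example.com/blog/foo/", sampleLastmod));
console.log(resolveEntry("https://example.com/ja/blog/foo/", sampleLastmod));
```

Both calls land on the same branch path, /blog/foo/, but each returns the date stored under its own key. Looking up with the normalised path instead would hand the Japanese post the English post's date.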

Filtering out paginated `noindex` pages

Page 2 and beyond of the category listings (/blog/build/2/, /blog/reviews/3/, …) ship a <meta name="robots" content="noindex, follow">. Only page 1 should be indexed; later pages are crawlable but not index-eligible.

Leave those URLs in the sitemap and the signal collides. “Listed in sitemap” reads as “please index this”, which contradicts noindex. Google and Bing both treat that conflict as a quality smell, and there’s no upside to keeping them in.

sitemap({
  filter: (page) => {
    if (page.endsWith("/404/") || page.endsWith("/404")) return false;
    // Paginated category pages (/blog/build/2/, /blog/reviews/3/ ...)
    // are noindex, follow — drop them from the sitemap to avoid
    // contradicting the meta robots tag.
    if (/\/blog\/(build|reviews)\/\d+\/?$/.test(new URL(page).pathname)) return false;
    return true;
  },
  // ...
});

If you set noindex on paginated pages, dropping them from the sitemap is the matching half of the change. Doing one without the other leaves a half-fixed signal.
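
A quick way to sanity-check the pattern is to lift the predicate out of the config and run it over a few sample URLs. This is a sketch under assumptions: keepInSitemap is a hypothetical name, example.com is a placeholder origin, and the post slug is invented; the category names and the regex are the ones from the filter above.

```javascript
// Same logic as the filter option, lifted out so it can be run directly.
function keepInSitemap(page) {
  // Drop the 404 page in either trailing-slash form.
  if (page.endsWith("/404/") || page.endsWith("/404")) return false;
  const { pathname } = new URL(page);
  // Page 2+ of a category listing is noindex, so it stays out.
  return !/\/blog\/(build|reviews)\/\d+\/?$/.test(pathname);
}

// Sample URLs, made up for illustration.
const samples = [
  "https://example.com/blog/build/",
  "https://example.com/blog/build/2/",
  "https://example.com/blog/reviews/3",
  "https://example.com/blog/astro-sitemap-lastmod/",
  "https://example.com/404/",
];
for (const url of samples) console.log(keepInSitemap(url), url);
```

Page 1 of each category and ordinary post URLs pass, while numbered category pages and the 404 page are dropped. If a new category is added, the alternation in the regex needs the new name as well, otherwise its paginated pages slip back into the sitemap.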

How the pieces connect

flowchart LR
  MDX[src/content/blog/&lt;lang&gt;/&lt;slug&gt;.mdx]
  Map[blogLastmod Map<br/>path → ISO date]
  Pages[@astrojs/sitemap<br/>URL list]
  Filter[filter:<br/>drop noindex]
  Serialize[serialize:<br/>lastmod / changefreq / priority]
  SM[sitemap.xml]

  MDX -->|fs.readdir + regex| Map
  Map -->|lookup| Serialize
  Pages --> Filter
  Filter --> Serialize
  Serialize --> SM

Two pipelines run inside astro.config.mjs. One walks MDX before the build starts and produces the lastmod map. The other is the standard Astro routing, whose URL list is filtered down and then enriched per-entry in serialize. Both converge into the final sitemap.xml.

Things I almost got wrong

A short list of pitfalls picked up while writing this:

Top-level await: works because astro.config.mjs is evaluated as ESM. Older repo layouts that still use astro.config.cjs won’t accept top-level await, and would need either a .mjs rename or an async wrapper inside defineConfig
draft: true filtering: skipping drafts in the map builder is necessary. Otherwise draft URLs land in the sitemap, which contradicts the whole point of marking them as drafts
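Regex tightness: /^updatedDate:\s*(\S+)/m reads unquoted YAML like updatedDate: 2026-05-25. A quoted value is captured with its quotes ("2026-05-25"), which is outside the ISO format the language spec guarantees, so whether new Date() accepts it depends on the engine's fallback parser; strip the quotes first if your frontmatter uses them. Multi-line or list-style date values are not handled at all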
Per-language folder merge: en and ja are walked separately and merged into a single map. Keys are kept distinct (/blog/<slug>/ vs /ja/blog/<slug>/) so the lookup inside serialize resolves correctly
updatedDate discipline: implementation alone isn’t enough — bumping updatedDate for trivial edits will still poison the signal. The matching policy lives in content rules A-2-1

When this should move back into the official integration

Honestly, this code is the shape of a @astrojs/sitemap feature request. Coupling Content Collections to the sitemap helper is a need other Astro users will have, and there’s room to generalize it as an option.

For now it lives inside Aulvem’s config at ~30 lines, which is a size I’m comfortable maintaining. If the official integration ever ships a Content Collections bridge, the bespoke code drops away. Keeping it isolated to astro.config.mjs makes that removal cheap.

Wrap-up

lastmod is a freshness signal that matters in both classic SEO and AI search, so feeding it real values from updatedDate is worth the 30 lines of bespoke code. Drop paginated noindex pages from the sitemap at the same time — handling them separately leaves the signal half-fixed.

The first follow-up in this series is on using the Zod schema to enforce operational rules at build time → Pushing operational rules into Astro Content Collections with Zod.

FAQ

Why doesn’t `@astrojs/sitemap` read `updatedDate` on its own?

The integration takes the list of built URLs and emits a sitemap.xml — it doesn’t reach into Content Collection frontmatter. Pulling MDX shape into a generic sitemap integration would mean coupling it to collection names, schemas, and loader types, which doesn’t sit well as a default responsibility.

Isn’t reading frontmatter through `fs.readdir` + regex unsafe compared to the Content Collections API?

It’s not type-safe. But astro.config.mjs evaluates before the content loader is initialised, so the API isn’t available at this point. The only fields the lastmod map needs are updatedDate and pubDate, so a light regex pass keeps the dependency surface flat. The real type check runs at build time through the Zod schema, so there are two nets, not one.

Is it really wrong to ship `noindex` URLs in the sitemap?

It’s a contradictory signal. A sitemap is a list of URLs you want indexed, so a noindex entry collides with that intent. Google and Bing both treat the conflict as a quality smell that can drag the site-wide signal down.

Does putting a real date in `lastmod` actually move AI search citations?

I haven’t measured the lift directly. What the docs do say — Google, Bing, and the AI search vendors — is that lastmod is a freshness hint and that bogus values erode trust. Treating it as insurance against losing existing trust feels more honest than claiming a citation uplift.
