Generating llms.txt and llms-full.txt in Astro for a Bilingual Site

Contents

What llms.txt actually is
The minimal Astro version
Driving it from Content Collections
Splitting by language
Featured posts narrow to docLang
llms-full.txt — full index with language cross-references
How it sits next to robots.txt
Pitfalls and operating rules
Closing
FAQ
What’s the difference between llms.txt and llms-full.txt?
Will publishing llms.txt get my site read by AI and bring traffic?
Should a multilingual site split llms.txt per language?
If I already have robots.txt, do I need llms.txt too?

llms.txt is a proposed entry point for letting AI search and LLMs read a site. It sits in the same spot as robots.txt and sitemap.xml — a plain-text file at the site root. The spec isn’t settled, and no major LLM vendor officially promises to consume it. I still generate one, because the cost of doing so is small.

This post is the implementation record for generating llms.txt and llms-full.txt in Astro, split into English and Japanese. I’m not going to argue whether it works: I have no measurable traffic to point at and nothing to assert. This is only about what gets generated and how. It continues the thread of the earlier posts on enforcing rules with a Zod schema and on reciprocal hreflang: treat frontmatter and Content Collections as the single source of truth.

What llms.txt actually is

llms.txt is a Markdown index for LLMs, placed at the site root. Where sitemap.xml is a machine-readable list of URLs, llms.txt describes — in natural language, with one-line notes — what the site is and where to start reading.

The llmstxt.org spec sets a loose shape: an H1 site name and a blockquote summary at the top, then H2 sections of annotated links below. Each link follows - [Title](URL): one-line note.
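To make that shape concrete, here is a minimal sketch of a renderer for it. The type names, the function name and the example.com URL are placeholders for illustration, not the site's real code.

```typescript
// Hypothetical types for illustration; not taken from the site's real code.
interface LlmsLink {
  title: string;
  url: string;
  note: string;
}

interface LlmsSection {
  heading: string;
  links: LlmsLink[];
}

// Render the loose llmstxt.org shape: H1, blockquote summary, H2 link sections.
function renderSkeleton(
  siteName: string,
  summary: string,
  sections: LlmsSection[],
): string {
  const lines: string[] = [`# ${siteName}`, "", `> ${summary}`, ""];
  for (const section of sections) {
    lines.push(`## ${section.heading}`, "");
    for (const link of section.links) {
      lines.push(`- [${link.title}](${link.url}): ${link.note}`);
    }
    lines.push("");
  }
  return lines.join("\n");
}

// Placeholder data, only to show the output shape.
const sample = renderSkeleton("Example Site", "A placeholder summary.", [
  {
    heading: "Posts",
    links: [
      { title: "Hello", url: "https://example.com/blog/hello/", note: "First post." },
    ],
  },
]);
console.log(sample);
```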

The spec also names a second convention, /llms-full.txt. That one isn’t an index but an expansion of each page’s body, so an LLM can grasp the content from a single file without following links. Aulvem ships both.

	llms.txt	llms-full.txt
Role	Light index (links + notes)	Full-text-ish index (excerpts + headings + tags)
Posts	Capped at seven featured	Every post
Language	Split per language (EN / JA)	One file, EN + JA mixed, each entry language-tagged
Size	Small	Large

The minimal Astro version

Astro’s file-based API routes let you return text by dropping a .txt.ts file under src/pages/. Return a Response from a GET handler with a text/plain content type.

// src/pages/llms.txt.ts
import type { APIContext } from "astro";
import { renderLlmsTxt } from "../lib/llmsTxt";

export async function GET(_context: APIContext) {
  const body = await renderLlmsTxt({ docLang: "en" });
  return new Response(body, {
    status: 200,
    headers: {
      "Content-Type": "text/plain; charset=utf-8",
      "Cache-Control": "public, max-age=3600",
    },
  });
}

The .txt.ts extension is what matters: it builds to the URL /llms.txt. I keep the generation logic in a separate file (src/lib/llmsTxt.ts) and leave the route as a thin entry point, so a per-language endpoint can reuse the same logic.

The stack is short.

Package	Use
`astro` (API routes + `astro:content`)	Route definitions and Content Collections access
`typescript`	Typing the renderer

Driving it from Content Collections

The post list inside llms.txt isn’t written by hand. It’s built on the fly from Content Collections, with getCollection reading the blog collection (posts) and the services collection (products). A hand-kept list goes stale: you add a post, forget the index, and llms.txt drifts from the actual content.

// src/lib/llmsTxt.ts (excerpt)
import { getCollection } from "astro:content";

export async function renderLlmsTxt(opts: LlmsTxtOptions): Promise<string> {
  const services = await getCollection("services");
  // drop drafts, the same exclusion sitemap and RSS use
  const blog = await getCollection("blog", ({ data }) => !data.draft);

  // posts: newest first
  blog.sort((a, b) => b.data.pubDate.getTime() - a.data.pubDate.getTime());
  // services (products) are ordered by status: active → preparing → archived (elided here)
  // ...assemble sections and return join("\n")
}

The detail to get right is the ({ data }) => !data.draft filter that drops draft: true. Skip it and a half-written draft lands in llms.txt, advertising a URL you haven’t published. I keep the same exclusion that sitemap and RSS use.

For each post’s note, I reuse the frontmatter summary (falling back to description). Every post already carries a summary written as the short answer I want quoted by AI, so the llms.txt note and the post’s own TLDR come from one place. Frontmatter stays the single source here too.
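The fallback can be sketched as a small pure function. The PostData shape and both function names are assumptions for illustration; only the summary and description fields come from the post.

```typescript
// Hypothetical frontmatter shape; only summary and description are named in the post.
interface PostData {
  title: string;
  summary?: string;
  description?: string;
}

// Prefer summary, fall back to description, and collapse the result to one line.
function noteFor(data: PostData): string {
  const raw = data.summary?.trim() || data.description?.trim() || "";
  return raw.replace(/\s+/g, " ");
}

// Build one annotated link line; omit the colon when there is no note at all.
function linkLine(data: PostData, url: string): string {
  const note = noteFor(data);
  return note
    ? `- [${data.title}](${url}): ${note}`
    : `- [${data.title}](${url})`;
}
```

Collapsing whitespace keeps a multi-line frontmatter summary from breaking the one-line-per-link shape.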

Splitting by language

This is the part that matters for a multilingual site. Aulvem drives the renderer on two axes:

filterLang: which language’s posts and products to list (ja lists only Japanese entries)
docLang: which language the document’s own text — headings, summary, usage terms — is written in

Separating those two lets one renderer emit three endpoints.

// src/pages/llms.txt.ts        → English headings, posts from all languages
renderLlmsTxt({ docLang: "en" });

// src/pages/ja/llms.txt.ts     → Japanese headings, Japanese posts only
renderLlmsTxt({ filterLang: "ja", docLang: "ja" });

The English route /llms.txt leaves filterLang unset on purpose. I treat the English version as the whole-site entry point, so it can surface posts in either language. The Japanese version /ja/llms.txt closes to the Japanese surface with filterLang: "ja".

flowchart LR
  CC[Content Collections<br/>blog + services]
  R[renderLlmsTxt<br/>filterLang / docLang]
  EN["/llms.txt<br/>docLang=en"]
  JA["/ja/llms.txt<br/>filterLang=ja, docLang=ja"]

  CC -->|getCollection| R
  R --> EN
  R --> JA
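A minimal sketch of the two axes as types and a filter. The shape of LlmsTxtOptions, the "ja/" id prefix rule and the heading strings are assumptions for illustration; the real renderer may derive the language differently.

```typescript
type Lang = "en" | "ja";

// Assumed shape: filterLang is optional, and unset means "all languages".
interface LlmsTxtOptions {
  filterLang?: Lang;
  docLang: Lang;
}

// Assumption: Japanese entries have ids under "ja/"; everything else is English.
function entryLang(id: string): Lang {
  return id.startsWith("ja/") ? "ja" : "en";
}

// Axis 1: which entries to list.
function filterByLang<T extends { id: string }>(
  entries: T[],
  opts: LlmsTxtOptions,
): T[] {
  if (!opts.filterLang) return entries;
  return entries.filter((e) => entryLang(e.id) === opts.filterLang);
}

// Axis 2: which language the document's own text uses (placeholder headings).
const HEADINGS: Record<Lang, { posts: string }> = {
  en: { posts: "Posts" },
  ja: { posts: "記事" },
};
```

Because the two options never read each other, the English route can list Japanese posts under English headings without any special case.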

Featured posts narrow to docLang

One design call sits inside this. The English /llms.txt can surface posts from both languages, but the Featured posts section alone is narrowed to docLang.

// featured posts only in the doc language
const featuredSource = filteredBlog.filter(
  (p) => entryLangLocal(p.id) === opts.docLang,
);

Listing both halves of a translation pair in the featured slots spends two slots on one piece of content and halves the unique signal in a limited list. So the featured slots align to the document language, and the cross-language full set lives in llms-full.txt. A bounded list (seven featured posts) narrows by language; a full dump with loose size limits carries both. That’s the split.
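Narrowing and capping together might look like the sketch below. It assumes featured posts are simply the newest posts in the document language and that the language is the id prefix; the post does not say how the featured set is chosen, so treat both as hypothetical.

```typescript
type Lang = "en" | "ja";

interface Post {
  id: string;
  pubDate: Date;
}

// The cap of seven comes from the post; the selection rule below is an assumption.
const FEATURED_LIMIT = 7;

// Assumption: Japanese entries have ids under "ja/".
function langOf(id: string): Lang {
  return id.startsWith("ja/") ? "ja" : "en";
}

// Narrow to the document language first, then take the newest posts up to the cap.
function pickFeatured(posts: Post[], docLang: Lang): Post[] {
  return posts
    .filter((p) => langOf(p.id) === docLang) // filter() copies, so sort() below is safe
    .sort((a, b) => b.pubDate.getTime() - a.pubDate.getTime())
    .slice(0, FEATURED_LIMIT);
}
```

With a translation pair in the input, only the half in the document language can occupy a slot.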

llms-full.txt — full index with language cross-references

llms-full.txt is a separate route (src/pages/llms-full.txt.ts) that expands every post’s excerpt, headings and tags into one file. The MDX body runs through stripMdx to drop the syntax and is then clipped to 500 characters.
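The strip-then-clip step can be approximated as follows. This is a rough regex stand-in written for illustration, not the site's stripMdx; a real implementation would more likely walk the MDX syntax tree. The function names are hypothetical.

```typescript
// A rough stand-in, not the site's stripMdx: removes common MDX/Markdown syntax.
function stripMdxSketch(source: string): string {
  return source
    .replace(/^(import|export)\s.*$/gm, "") // ESM import/export lines
    .replace(/`{3}[\s\S]*?`{3}/g, "") // fenced code blocks
    .replace(/<[^>]+>/g, "") // JSX and HTML tags
    .replace(/!\[([^\]]*)\]\([^)]*\)/g, "$1") // images: keep alt text
    .replace(/\[([^\]]*)\]\([^)]*\)/g, "$1") // links: keep link text
    .replace(/^#{1,6}\s+/gm, "") // heading markers
    .replace(/[*`]/g, "") // emphasis and inline code marks
    .replace(/\s+/g, " ") // collapse whitespace
    .trim();
}

// Clip by code points rather than UTF-16 units, so a character is never cut in half.
function clipSketch(text: string, max = 500): string {
  const chars = Array.from(text);
  return chars.length <= max ? text : chars.slice(0, max).join("");
}
```

Counting code points matters little for plain Japanese text, but it avoids splitting emoji and other characters outside the basic plane.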

Each entry gets a language tag ((ja) / (en)), and if a translation pair exists, a Lang-Alt line cross-references the other language’s URL.

// add the paired language version as Lang-Alt
const alt = altByBase.get(slug);
if (alt) {
  const otherLang = lang === "ja" ? "en" : "ja";
  const otherPath = alt[otherLang];
  if (otherPath) {
    lines.push(`Lang-Alt (${otherLang}): ${SITE}${otherPath}`);
  }
}

This is the same idea as hreflang. The hreflang post covers emitting reciprocal alternate links in HTML, only for posts that exist in both languages; llms-full.txt makes the same statement to an LLM: “this post has a paired version over here.” The aim is to keep the English and Japanese halves of one post from being counted as two unrelated documents.
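The lookup that the Lang-Alt snippet reads from could be built like this. The id layout ("en/slug", "ja/slug") and the URL paths are assumptions for illustration; only the name altByBase and its per-language lookup come from the snippet above.

```typescript
type Lang = "en" | "ja";
type AltPaths = Partial<Record<Lang, string>>;

function isLang(value: string | undefined): value is Lang {
  return value === "en" || value === "ja";
}

// Assumptions: ids look like "en/my-post" and "ja/my-post"; English posts are
// served at /blog/<slug>/ and Japanese ones at /ja/blog/<slug>/.
function buildAltByBase(ids: string[]): Map<string, AltPaths> {
  const altByBase = new Map<string, AltPaths>();
  for (const id of ids) {
    const [lang, ...rest] = id.split("/");
    if (!isLang(lang)) continue; // skip ids without a known language prefix
    const slug = rest.join("/");
    const path = lang === "ja" ? `/ja/blog/${slug}/` : `/blog/${slug}/`;
    const entry = altByBase.get(slug) ?? {};
    entry[lang] = path;
    altByBase.set(slug, entry);
  }
  return altByBase;
}
```

A post with no translation still gets a map entry, but the lookup for the other language returns undefined, so no Lang-Alt line is written for it.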

How it sits next to robots.txt

llms.txt doesn’t replace robots.txt. robots.txt states whether a crawler may fetch; llms.txt states what the site contains and where the good parts are. The final permission to train or quote stays with robots.txt.

Aulvem puts a usage-and-citation section inside llms.txt: quoting is welcome, link back to the source URL, training is permitted for now. The section ends by pointing back to robots.txt as the authoritative bot policy — the machine-readable authority lives there, and llms.txt is a note, not an override. I’d keep that framing.

Leave it vague and you get a mismatch: llms.txt says “no training” while robots.txt allows everything. The two files’ stances have to be kept aligned by hand.

Pitfalls and operating rules

A few things I nearly tripped on:

Apply the draft exclusion across all three routes: put the getCollection filter in the shared renderer so llms.txt, ja/llms.txt and llms-full.txt all pass the same exclusion. If one route reads the collection unfiltered, drafts leak
Don’t conflate filterLang and docLang: “which posts to list” and “what language to write in” are different axes. The English version leaves filterLang unset to surface Japanese posts too — don’t read that later as a bug and add a filter
Featured language narrowing: only Featured narrows by docLang. Drop that and translation pairs fill the slots
Set Cache-Control: all three use max-age=3600. Generation is cheap, but there’s no need to rebuild on every request
Keep usage terms and robots.txt in sync: change the stance and edit both. Don’t let llms.txt stand alone
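One way to make the first rule hard to skip is a single shared predicate. This is a sketch under assumptions: the name isPublished and the DraftAware shape are hypothetical, and it relies on draft being an optional boolean in the frontmatter.

```typescript
// Minimal shape needed by the predicate; real entries carry more fields.
interface DraftAware {
  data: { draft?: boolean };
}

// One predicate shared by every route, so no endpoint can forget the exclusion.
// It is meant to be passed as the filter argument of getCollection.
function isPublished(entry: DraftAware): boolean {
  return entry.data.draft !== true;
}

// Placeholder entries, only to show the behaviour.
const entries: DraftAware[] = [
  { data: { draft: true } },
  { data: { draft: false } },
  { data: {} },
];
const published = entries.filter(isPublished);
console.log(published.length);
```

An entry with no draft field counts as published, which matches the !data.draft filter used in the renderer.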

Closing

llms.txt, generated from Content Collections as an Astro API route, drops the manual upkeep every time you add a post. For multiple languages, splitting on filterLang (which posts) and docLang (the wording) lets one renderer emit the English, Japanese and full-text versions.

Whether it does anything is a separate question I’m leaving alone. Whether and how llms.txt feeds AI-search traffic is neither settled by convention nor measured yet. Aulvem’s stance is to generate it because it’s cheap, and to keep the authority in robots.txt.

Treating frontmatter as the single source of truth is the thread that runs from the Aulvem blog architecture post onward, if you want the wider context.

FAQ

What’s the difference between llms.txt and llms-full.txt?

llms.txt is a light index — annotated links that say what the site is and where to start. llms-full.txt carries each post’s excerpt, headings and tags as a closer-to-full-text dump. Aulvem treats the first as a curated list capped at seven featured posts, and the second as a full excerpt dump of every post.

Will publishing llms.txt get my site read by AI and bring traffic?

I can’t claim that. llms.txt is a proposed convention, and no major LLM vendor officially guarantees it consumes the file. robots.txt and sitemap.xml are still the authoritative bot signals. Aulvem keeps llms.txt because it’s cheap to generate, not because there’s measurable traffic to point to.

Should a multilingual site split llms.txt per language?

Aulvem serves a Japanese version (/ja/llms.txt) separate from the English root (/llms.txt). I want the headings and usage terms in the reader’s language, and the featured posts narrowed to that language. The full-text llms-full.txt stays as a single bilingual file, kept as a separate track from the two light indexes.

If I already have robots.txt, do I need llms.txt too?

They do different jobs. robots.txt states what crawlers may fetch; llms.txt describes what the site contains and where to read. The authoritative permission for training and citation lives in robots.txt — llms.txt is a guide on top of it. Aulvem’s usage section even points back to robots.txt as the final word.