Context Crawler: AI-Powered Website and Repository Analysis for Smarter Translations
How Better i18n Builds AI Context From Your Product
Most localization workflows start with a manual step: someone tags strings, exports files, and hopes nothing was missed. Better i18n takes a different approach. The Context Crawler combines two powerful analysis modes — website crawling and repository scanning — to automatically build rich context that makes AI translations dramatically more accurate.
The result is a workflow where your AI translator understands your brand, your product vocabulary, and your technical stack before it translates a single string.
Website Analysis: Firecrawl-Powered Brand Discovery
The Context Crawler's website analysis mode uses the Firecrawl API to systematically crawl and extract content from any URL you provide. This is not simple HTML scraping — Firecrawl handles JavaScript-rendered pages, SPAs, and dynamic content to capture what users actually see.
How Website Analysis Works
- Provide a URL — Enter your product's marketing site, documentation, or any web property. The crawler accepts any publicly accessible URL.
- Firecrawl extraction — The Firecrawl API renders pages, follows internal links, and extracts structured content including headings, body text, navigation labels, CTAs, and metadata.
- Terminology detection — AI analysis identifies repeated terms, branded language, product names, feature names, and domain-specific vocabulary across all crawled pages.
- Candidate proposal — Detected terms are proposed as glossary candidates with suggested definitions, context notes, and frequency data. Each candidate includes the source URLs where it was found.
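To make the terminology-detection step concrete, here is a minimal sketch of how repeated branded phrases could be scored across crawled pages by frequency. This is not Better i18n's actual algorithm — the names (CrawledPage, GlossaryCandidate, proposeCandidates) and the capitalized-phrase heuristic are purely illustrative assumptions.

```typescript
// Illustrative only: score capitalized phrases from crawled pages as
// glossary candidates by frequency, tracking which URLs they came from.
interface CrawledPage {
  url: string;
  text: string;
}

interface GlossaryCandidate {
  term: string;
  frequency: number;
  sourceUrls: string[];
}

function proposeCandidates(pages: CrawledPage[], minFrequency = 2): GlossaryCandidate[] {
  const seen = new Map<string, { frequency: number; sourceUrls: Set<string> }>();
  // Capitalized words and multi-word phrases are a crude proxy for branded terms.
  const pattern = /\b[A-Z][a-zA-Z]+(?: [A-Z][a-zA-Z]+)*\b/g;
  for (const page of pages) {
    for (const match of page.text.match(pattern) ?? []) {
      const entry = seen.get(match) ?? { frequency: 0, sourceUrls: new Set<string>() };
      entry.frequency += 1;
      entry.sourceUrls.add(page.url);
      seen.set(match, entry);
    }
  }
  return [...seen.entries()]
    .filter(([, e]) => e.frequency >= minFrequency)
    .sort((a, b) => b[1].frequency - a[1].frequency)
    .map(([term, e]) => ({ term, frequency: e.frequency, sourceUrls: [...e.sourceUrls] }));
}
```

A real pipeline would add AI-driven filtering and definition drafting on top of raw frequency, but the shape of the output — term, frequency, source URLs — matches the candidate data described above.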
What Website Analysis Captures
- Brand terminology — Product names, feature names, pricing tier labels, and coined terms that appear consistently across your site
- Navigation patterns — Menu labels, breadcrumb text, and CTA language that define your product's vocabulary
- Domain vocabulary — Industry-specific terms your product uses that require precise, consistent translation
- Tone and voice signals — Whether your content is formal, casual, technical, or marketing-oriented — context that helps AI translate with appropriate register
Use Cases for Website Analysis
- New project setup — Crawl your marketing site before starting localization to bootstrap a glossary and context profile in minutes rather than weeks
- Competitor analysis — Crawl competitor sites in target markets to understand how they translate similar concepts, informing your own glossary decisions
- Content audit — Periodically re-crawl to detect new terminology that has emerged since your last glossary update
Repository Analysis: Framework and Terminology Detection
The Context Crawler's repository analysis mode connects to your GitHub repository and performs deep analysis of your codebase to extract translation-relevant context.
How Repository Analysis Works
- Connect a GitHub repo — Provide the repository URL. The crawler accesses public repos directly and private repos through your GitHub integration.
- Framework detection — The analyzer identifies your tech stack: React, Next.js, Vue, Angular, Svelte, Flutter, React Native, and others. This determines which i18n patterns to look for and how to parse your code.
- i18n pattern recognition — Based on the detected framework, the analyzer finds existing translation function calls (t(), useTranslations(), $t(), tr(), etc.) and maps how your project structures its translations.
- Terminology extraction — The analyzer identifies hardcoded strings, component names, route labels, and other user-facing text that represents your product vocabulary.
- Context profile generation — All findings are compiled into a context profile that includes detected frameworks, i18n library usage, namespace structure, and terminology candidates.
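The compiled findings can be pictured as a single structured object. The real Better i18n profile format is not documented here; the field names below are illustrative assumptions about what such a profile might contain, based on the steps above.

```typescript
// Hypothetical shape of a generated context profile. Field names are
// illustrative, not Better i18n's actual schema.
interface ContextProfile {
  frameworks: string[];            // detected tech stack
  i18nLibraries: string[];         // translation libraries in use
  namespaces: string[];            // how translation keys are organized
  terminologyCandidates: string[]; // terms proposed for the glossary
}

const exampleProfile: ContextProfile = {
  frameworks: ["Next.js", "React"],
  i18nLibraries: ["next-intl"],
  namespaces: ["hero", "welcome", "settings"],
  terminologyCandidates: ["Context Crawler", "Glossary Management"],
};
```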
What Repository Analysis Detects
- Framework and i18n library — Automatic detection of your stack so AI translations use the correct formatting conventions (ICU MessageFormat, i18next interpolation, Flutter ARB, etc.)
- Namespace structure — How your project organizes translation keys, so new translations follow the same patterns
- Existing terminology — Terms already in use across your codebase, helping identify what should be added to the glossary
- String patterns — Common patterns in your user-facing strings (date formats, number formats, pluralization approaches) that inform translation rules
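Framework detection can be sketched as a dependency lookup: inspect package.json and map known packages to frameworks. This is a simplified assumption about how such a detector could work, not the analyzer's actual implementation — a real one would also examine config files and import patterns.

```typescript
// Sketch: detect frameworks from package.json dependencies.
// The marker list and its ordering are illustrative assumptions.
type PackageJson = {
  dependencies?: Record<string, string>;
  devDependencies?: Record<string, string>;
};

const FRAMEWORK_MARKERS: Array<[pkg: string, framework: string]> = [
  ["next", "Next.js"],
  ["@angular/core", "Angular"],
  ["svelte", "Svelte"],
  ["vue", "Vue"],
  ["react-native", "React Native"],
  ["react", "React"], // listed last: Next.js and React Native also depend on react
];

function detectFrameworks(pkg: PackageJson): string[] {
  const deps = { ...pkg.dependencies, ...pkg.devDependencies };
  return FRAMEWORK_MARKERS.filter(([name]) => name in deps).map(([, fw]) => fw);
}
```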
AST-Based Key Detection
Beyond context building, the CLI's scan command parses your source code at the syntax tree level, not with fragile regex matching. It understands React JSX, useTranslations, and getTranslations patterns natively. This means it can distinguish between a user-facing string like <h1>Welcome back</h1> and a technical string like className="flex items-center" without false positives.
What it detects:
- Hardcoded JSX text — content like <h1>Hello</h1> that should be wrapped in t() calls
- Hardcoded JSX attributes — user-visible attributes like <img alt="Company logo" />
- Toast and notification strings — calls like toast.error("Something went wrong")
- Locale-based ternary logic — patterns like locale === 'en' ? 'Hi' : 'Hola' that indicate manual locale handling
- String variables — variables containing text that appears to be user-facing
What it intelligently ignores:
- Tailwind and CSS class names
- URLs, file paths, and image sources
- HTML entities (&amp;, &quot;)
- Technical constants in SCREAMING_CASE
- Numbers and non-textual values
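The ignore rules above can be illustrated with a simplified classifier. The real scanner makes these decisions at the AST level; this regex-based filter is only a sketch of the heuristics, and every pattern in it is an assumption for illustration.

```typescript
// Illustrative sketch of the "ignore" heuristics: return true when a
// string literal looks technical (skip it), false when it may be
// user-facing text worth reporting.
function looksTechnical(value: string): boolean {
  const trimmed = value.trim();
  if (trimmed === "") return true;
  if (/^[A-Z0-9_]+$/.test(trimmed)) return true;                    // SCREAMING_CASE constants
  if (/^&[a-zA-Z#0-9]+;$/.test(trimmed)) return true;               // HTML entities like &amp;
  if (/^(https?:\/\/|\/|\.\/|\.\.\/)/.test(trimmed)) return true;   // URLs and file paths
  if (/\.(png|jpe?g|svg|webp|css|js|ts)$/.test(trimmed)) return true; // asset references
  if (/^[-a-z0-9:/[\]%.]+(\s+[-a-z0-9:/[\]%.]+)+$/.test(trimmed)) return true; // Tailwind-style class lists
  if (/^[\d.,\s%-]+$/.test(trimmed)) return true;                   // numbers and non-textual values
  return false;
}
```

Under these rules, className="flex items-center" is skipped while <h1>Welcome back</h1> is reported — the same distinction described in the AST section above, approximated here in a few lines.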
better-i18n scan # Scan current directory
better-i18n scan --dir ./src # Scan specific directory
better-i18n scan --staged # Only staged files (for pre-commit hooks)
better-i18n scan --ci # Exit code 1 if issues found
better-i18n scan --format json # JSON output for tooling
better-i18n scan --verbose # Detailed output with scan audit
Namespace Resolution
The scanner uses lexical scope tracking to resolve namespaces automatically for both client and server components. This is important because it means detected keys include their full namespace path, not just the key name.
For client components using hooks:
const t = useTranslations('hero');
return <h1>{t('title')}</h1>; // Detected as: hero.title
For server components using async functions:
const t = await getTranslations('welcome');
return <h1>{t('title')}</h1>; // Detected as: welcome.title
The scanner also handles the object form (getTranslations({ locale, namespace: 'settings' })) and root-scoped translators where no namespace is provided. Dynamic namespaces using variables or template literals are reported in --verbose mode and excluded from metrics to avoid false positives.
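The joining rule behind these examples is simple, even though resolving which namespace a given t belongs to is the hard part. A toy helper (resolveKey is an illustrative name, not part of the CLI) shows how the qualified key is formed once the namespace is known:

```typescript
// Toy illustration: combine a translator's namespace with the key passed
// to t() to produce the fully qualified key. Root-scoped translators
// (no namespace) use the key as-is. The real scanner resolves the
// namespace via lexical scope tracking in the AST.
function resolveKey(namespace: string | undefined, key: string): string {
  return namespace ? `${namespace}.${key}` : key;
}
```

So a t created with useTranslations('hero') turns t('title') into hero.title, and a root-scoped translator reports title unchanged.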
From Analysis to Glossary: The Full Pipeline
The Context Crawler's website and repository analysis modes feed directly into the Glossary Management system:
- Crawl — Website analysis via Firecrawl extracts brand terminology; repo analysis detects framework context and existing terms.
- Propose — Detected terms are proposed as glossary candidates with draft status.
- Review — Your team reviews proposed terms, editing definitions and translations as needed.
- Approve — Approved terms are immediately enforced in AI translation and the review editor.
- Sync — Approved terms can be synced to DeepL for provider-level enforcement.
This pipeline means you can go from "we have no glossary" to "our AI translations enforce 200 brand terms" in a single afternoon.
Sync: Comparing Local vs Cloud
The sync command bridges the gap between what your code uses and what exists in Better i18n. It scans your codebase the same way scan does, then queries the Better i18n API to compare key sets.
The output is a clear comparison report:
- Missing in Remote — keys referenced in your code but not yet uploaded to Better i18n. These are strings your users might see untranslated.
- Unused in Code — keys that exist in Better i18n but are no longer referenced anywhere in your source. These are candidates for cleanup.
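Conceptually the comparison is two plain set differences. The sketch below mirrors the report labels above; the function and field names are illustrative, not the CLI's internals.

```typescript
// Sketch of the sync comparison: keys in code but not remote, and keys
// remote but never referenced in code.
function compareKeys(localKeys: string[], remoteKeys: string[]) {
  const local = new Set(localKeys);
  const remote = new Set(remoteKeys);
  return {
    missingInRemote: localKeys.filter((k) => !remote.has(k)), // code uses it, cloud lacks it
    unusedInCode: remoteKeys.filter((k) => !local.has(k)),    // cloud has it, code never uses it
  };
}
```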
better-i18n sync # Grouped tree output
better-i18n sync --summary # High-level coverage metrics only
better-i18n sync --format json # JSON output for CI automation
better-i18n sync -d ./src # Scan specific directory
The tree output groups keys by namespace, making it easy to see which parts of your app have gaps. The --verbose flag provides a deep audit log including invariant checks, scoping summaries, and specific key probes.
Coverage Metrics
The sync command provides percentage-based coverage metrics:
- Local to Remote coverage: What percentage of keys used in your code exist in Better i18n
- Remote usage: What percentage of keys in Better i18n are actually used in your code
These numbers give your team a clear picture of translation health at any point in time. You can view them in the terminal output or extract them from the JSON output for custom dashboards and reporting.
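Both metrics reduce to the size of the intersection over the size of each key set. A minimal sketch, assuming simple set intersection and whole-percent rounding (both illustrative choices, not the CLI's exact arithmetic):

```typescript
// Sketch of the two coverage percentages from the shared key count.
function coverage(localKeys: string[], remoteKeys: string[]) {
  const local = new Set(localKeys);
  const remote = new Set(remoteKeys);
  const shared = [...local].filter((k) => remote.has(k)).length;
  const pct = (n: number, d: number) => (d === 0 ? 100 : Math.round((n / d) * 100));
  return {
    localToRemote: pct(shared, local.size), // % of keys in code that exist remotely
    remoteUsage: pct(shared, remote.size),  // % of remote keys used in code
  };
}
```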
CI Integration
Both scan and sync are designed to run in automated pipelines. Use --ci with scan to fail builds when hardcoded strings are detected, and pipe sync output through jq to gate deploys on missing key counts.
# GitHub Actions example
name: i18n Check
on: [push, pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx @better-i18n/cli scan --ci
      - run: |
          npx @better-i18n/cli sync --format json \
            | jq -e '.comparison.missingCount == 0' > /dev/null || exit 1
For pre-commit hooks, scan only staged files to keep feedback fast:
npx husky init
echo "npx @better-i18n/cli scan --staged --ci" > .husky/pre-commit
What This Does Not Do
To set clear expectations:
- No visual context capture — the CLI works at the code level, not the rendered UI. There are no screenshots or visual previews of where strings appear.
- No real-time monitoring — scan and sync are run on-demand or in CI pipelines; they are not background watchers or file system observers
- No stale translation detection — the sync command shows missing and unused keys, but does not detect whether an existing translation is outdated relative to a source text change
Getting Started
Install the CLI and run your first scan in under a minute:
npm install -g @better-i18n/cli
better-i18n scan --dir ./src
Then connect to your Better i18n project and compare against the cloud:
better-i18n sync
To start website analysis, visit the AI Context section in your project dashboard, enter a URL, and let Firecrawl extract your brand terminology. For repository analysis, connect your GitHub repo and let the analyzer detect your framework and existing terminology.
See the full CLI documentation for configuration options, detection rules, and advanced usage.
Ready to automate your translation context and terminology discovery? Create your account and connect your first project — or learn how the Glossary Management system enforces the terms the crawler discovers.