Context Crawler: AI-Powered Website and Repository Analysis for Smarter Translations
How Better i18n Builds AI Context From Your Product
Most localization workflows start with a manual step: someone tags strings, exports files, and hopes nothing was missed. Better i18n takes a different approach. The Context Crawler combines two powerful analysis modes — website crawling and repository scanning — to automatically build rich context that makes AI translations dramatically more accurate.
The result is a workflow where your AI translator understands your brand, your product vocabulary, and your technical stack before it translates a single string.
Website Analysis: Firecrawl-Powered Brand Discovery
The Context Crawler's website analysis mode uses the Firecrawl API to systematically crawl and extract content from any URL you provide. This is not simple HTML scraping — Firecrawl handles JavaScript-rendered pages, SPAs, and dynamic content to capture what users actually see.
How Website Analysis Works
- Provide a URL — Enter your product's marketing site, documentation, or any web property. The crawler accepts any publicly accessible URL.
- Firecrawl extraction — The Firecrawl API renders pages, follows internal links, and extracts structured content including headings, body text, navigation labels, CTAs, and metadata.
- Terminology detection — AI analysis identifies repeated terms, branded language, product names, feature names, and domain-specific vocabulary across all crawled pages.
- Candidate proposal — Detected terms are proposed as glossary candidates with suggested definitions, context notes, and frequency data. Each candidate includes the source URLs where it was found.
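To make the terminology-detection step concrete, here is a minimal sketch of how repeated branded phrases could be scored across crawled pages by frequency. This is not Better i18n's actual algorithm — the names (CrawledPage, GlossaryCandidate, proposeCandidates) and the capitalized-phrase heuristic are purely illustrative assumptions.

```typescript
// Illustrative only: score capitalized phrases from crawled pages as
// glossary candidates by frequency, tracking which URLs they came from.
interface CrawledPage {
  url: string;
  text: string;
}

interface GlossaryCandidate {
  term: string;
  frequency: number;
  sourceUrls: string[];
}

function proposeCandidates(pages: CrawledPage[], minFrequency = 2): GlossaryCandidate[] {
  const seen = new Map<string, { frequency: number; sourceUrls: Set<string> }>();
  // Capitalized words and multi-word phrases are a crude proxy for branded terms.
  const pattern = /\b[A-Z][a-zA-Z]+(?: [A-Z][a-zA-Z]+)*\b/g;
  for (const page of pages) {
    for (const match of page.text.match(pattern) ?? []) {
      const entry = seen.get(match) ?? { frequency: 0, sourceUrls: new Set<string>() };
      entry.frequency += 1;
      entry.sourceUrls.add(page.url);
      seen.set(match, entry);
    }
  }
  return [...seen.entries()]
    .filter(([, e]) => e.frequency >= minFrequency)
    .sort((a, b) => b[1].frequency - a[1].frequency)
    .map(([term, e]) => ({ term, frequency: e.frequency, sourceUrls: [...e.sourceUrls] }));
}
```

A real pipeline would add AI-driven filtering and definition drafting on top of raw frequency, but the shape of the output — term, frequency, source URLs — matches the candidate data described above.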
What Website Analysis Captures
- Brand terminology — Product names, feature names, pricing tier labels, and coined terms that appear consistently across your site
- Navigation patterns — Menu labels, breadcrumb text, and CTA language that define your product's vocabulary
- Domain vocabulary — Industry-specific terms your product uses that require precise, consistent translation
- Tone and voice signals — Whether your content is formal, casual, technical, or marketing-oriented — context that helps AI translate with appropriate register
Use Cases for Website Analysis
- New project setup — Crawl your marketing site before starting localization to bootstrap a glossary and context profile in minutes rather than weeks
- Competitor analysis — Crawl competitor sites in target markets to understand how they translate similar concepts, informing your own glossary decisions
- Content audit — Periodically re-crawl to detect new terminology that has emerged since your last glossary update
Repository Analysis: Framework and Terminology Detection
The Context Crawler's repository analysis mode connects to your GitHub repository and performs deep analysis of your codebase to extract translation-relevant context.
How Repository Analysis Works
- Connect a GitHub repo — Provide the repository URL. The crawler accesses public repos directly and private repos through your GitHub integration.
- Framework detection — The analyzer identifies your tech stack: React, Next.js, Vue, Angular, Svelte, Flutter, React Native, and others. This determines which i18n patterns to look for and how to parse your code.
- i18n pattern recognition — Based on the detected framework, the analyzer finds existing translation function calls (t(), useTranslations(), $t(), tr(), etc.) and maps how your project structures its translations.
- Terminology extraction — The analyzer identifies hardcoded strings, component names, route labels, and other user-facing text that represents your product vocabulary.
- Context profile generation — All findings are compiled into a context profile that includes detected frameworks, i18n library usage, namespace structure, and terminology candidates.
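The compiled findings can be pictured as a single structured object. The real Better i18n profile format is not documented here; the field names below are illustrative assumptions about what such a profile might contain, based on the steps above.

```typescript
// Hypothetical shape of a generated context profile. Field names are
// illustrative, not Better i18n's actual schema.
interface ContextProfile {
  frameworks: string[];            // detected tech stack
  i18nLibraries: string[];         // translation libraries in use
  namespaces: string[];            // how translation keys are organized
  terminologyCandidates: string[]; // terms proposed for the glossary
}

const exampleProfile: ContextProfile = {
  frameworks: ["Next.js", "React"],
  i18nLibraries: ["next-intl"],
  namespaces: ["hero", "welcome", "settings"],
  terminologyCandidates: ["Context Crawler", "Glossary Management"],
};
```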
What Repository Analysis Detects
- Framework and i18n library — Automatic detection of your stack so AI translations use the correct formatting conventions (ICU MessageFormat, i18next interpolation, Flutter ARB, etc.)
- Namespace structure — How your project organizes translation keys, so new translations follow the same patterns
- Existing terminology — Terms already in use across your codebase, helping identify what should be added to the glossary
- String patterns — Common patterns in your user-facing strings (date formats, number formats, pluralization approaches) that inform translation rules
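Framework detection can be sketched as a dependency lookup: inspect package.json and map known packages to frameworks. This is a simplified assumption about how such a detector could work, not the analyzer's actual implementation — a real one would also examine config files and import patterns.

```typescript
// Sketch: detect frameworks from package.json dependencies.
// The marker list and its ordering are illustrative assumptions.
type PackageJson = {
  dependencies?: Record<string, string>;
  devDependencies?: Record<string, string>;
};

const FRAMEWORK_MARKERS: Array<[pkg: string, framework: string]> = [
  ["next", "Next.js"],
  ["@angular/core", "Angular"],
  ["svelte", "Svelte"],
  ["vue", "Vue"],
  ["react-native", "React Native"],
  ["react", "React"], // listed last: Next.js and React Native also depend on react
];

function detectFrameworks(pkg: PackageJson): string[] {
  const deps = { ...pkg.dependencies, ...pkg.devDependencies };
  return FRAMEWORK_MARKERS.filter(([name]) => name in deps).map(([, fw]) => fw);
}
```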
AST-Based Key Detection
Beyond context building, the CLI's scan command parses your source code at the syntax tree level, not with fragile regex matching. It understands React JSX, useTranslations, and getTranslations patterns natively. This means it can distinguish between a user-facing string like <h1>Welcome back</h1> and a technical string like className="flex items-center" without false positives.
What it detects:
- Hardcoded JSX text — content like <h1>Hello</h1> that should be wrapped in t() calls
- Hardcoded JSX attributes — user-visible attributes like <img alt="Company logo" />
- Toast and notification strings — calls like toast.error("Something went wrong")
- Locale-based ternary logic — patterns like locale === 'en' ? 'Hi' : 'Hola' that indicate manual locale handling
- String variables — variables containing text that appears to be user-facing
What it intelligently ignores:
- Tailwind and CSS class names
- URLs, file paths, and image sources
- HTML entities (&amp;, &quot;)
- Technical constants in SCREAMING_CASE
- Numbers and non-textual values
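The ignore rules above can be illustrated with a simplified classifier. The real scanner makes these decisions at the AST level; this regex-based filter is only a sketch of the heuristics, and every pattern in it is an assumption for illustration.

```typescript
// Illustrative sketch of the "ignore" heuristics: return true when a
// string literal looks technical (skip it), false when it may be
// user-facing text worth reporting.
function looksTechnical(value: string): boolean {
  const trimmed = value.trim();
  if (trimmed === "") return true;
  if (/^[A-Z0-9_]+$/.test(trimmed)) return true;                    // SCREAMING_CASE constants
  if (/^&[a-zA-Z#0-9]+;$/.test(trimmed)) return true;               // HTML entities like &amp;
  if (/^(https?:\/\/|\/|\.\/|\.\.\/)/.test(trimmed)) return true;   // URLs and file paths
  if (/\.(png|jpe?g|svg|webp|css|js|ts)$/.test(trimmed)) return true; // asset references
  if (/^[-a-z0-9:/[\]%.]+(\s+[-a-z0-9:/[\]%.]+)+$/.test(trimmed)) return true; // Tailwind-style class lists
  if (/^[\d.,\s%-]+$/.test(trimmed)) return true;                   // numbers and non-textual values
  return false;
}
```

Under these rules, className="flex items-center" is skipped while <h1>Welcome back</h1> is reported — the same distinction described in the AST section above, approximated here in a few lines.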
better-i18n scan # Scan current directory
better-i18n scan --dir ./src # Scan specific directory
better-i18n scan --staged # Only staged files (for pre-commit hooks)
better-i18n scan --ci # Exit code 1 if issues found
better-i18n scan --format json # JSON output for tooling
better-i18n scan --verbose # Detailed output with scan audit
Namespace Resolution
The scanner uses lexical scope tracking to resolve namespaces automatically for both client and server components. This is important because it means detected keys include their full namespace path, not just the key name.
For client components using hooks:
const t = useTranslations('hero');
return <h1>{t('title')}</h1>; // Detected as: hero.title
For server components using async functions:
const t = await getTranslations('welcome');
return <h1>{t('title')}</h1>; // Detected as: welcome.title
The scanner also handles the object form (getTranslations({ locale, namespace: 'settings' })) and root-scoped translators where no namespace is provided. Dynamic namespaces using variables or template literals are reported in --verbose mode and excluded from metrics to avoid false positives.
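The joining rule behind these examples is simple, even though resolving which namespace a given t belongs to is the hard part. A toy helper (resolveKey is an illustrative name, not part of the CLI) shows how the qualified key is formed once the namespace is known:

```typescript
// Toy illustration: combine a translator's namespace with the key passed
// to t() to produce the fully qualified key. Root-scoped translators
// (no namespace) use the key as-is. The real scanner resolves the
// namespace via lexical scope tracking in the AST.
function resolveKey(namespace: string | undefined, key: string): string {
  return namespace ? `${namespace}.${key}` : key;
}
```

So a t created with useTranslations('hero') turns t('title') into hero.title, and a root-scoped translator reports title unchanged.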
From Analysis to Glossary: The Full Pipeline
The Context Crawler's website and repository analysis modes feed directly into the Glossary Management system:
- Crawl — Website analysis via Firecrawl extracts brand terminology; repo analysis detects framework context and existing terms.
- Propose — Detected terms are proposed as glossary candidates with draft status.
- Review — Your team reviews proposed terms, editing definitions and translations as needed.
- Approve — Approved terms are immediately enforced in AI translation and the review editor.
- Sync — Approved terms can be synced to DeepL for provider-level enforcement.
This pipeline means you can go from "we have no glossary" to "our AI translations enforce 200 brand terms" in a single afternoon.
Sync: Comparing Local vs Cloud
The sync command bridges the gap between what your code uses and what exists in Better i18n. It scans your codebase the same way scan does, then queries the Better i18n API to compare key sets.
The output is a clear comparison report:
- Missing in Remote — keys referenced in your code but not yet uploaded to Better i18n. These are strings your users might see untranslated.
- Unused in Code — keys that exist in Better i18n but are no longer referenced anywhere in your source. These are candidates for cleanup.
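Conceptually the comparison is two plain set differences. The sketch below mirrors the report labels above; the function and field names are illustrative, not the CLI's internals.

```typescript
// Sketch of the sync comparison: keys in code but not remote, and keys
// remote but never referenced in code.
function compareKeys(localKeys: string[], remoteKeys: string[]) {
  const local = new Set(localKeys);
  const remote = new Set(remoteKeys);
  return {
    missingInRemote: localKeys.filter((k) => !remote.has(k)), // code uses it, cloud lacks it
    unusedInCode: remoteKeys.filter((k) => !local.has(k)),    // cloud has it, code never uses it
  };
}
```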
better-i18n sync # Grouped tree output
better-i18n sync --summary # High-level coverage metrics only
better-i18n sync --format json # JSON output for CI automation
better-i18n sync -d ./src # Scan specific directory
The tree output groups keys by namespace, making it easy to see which parts of your app have gaps. The --verbose flag provides a deep audit log including invariant checks, scoping summaries, and specific key probes.
Coverage Metrics
The sync command provides percentage-based coverage metrics:
- Local to Remote coverage: What percentage of keys used in your code exist in Better i18n
- Remote usage: What percentage of keys in Better i18n are actually used in your code
These numbers give your team a clear picture of translation health at any point in time. You can view them in the terminal output or extract them from the JSON output for custom dashboards and reporting.
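Both metrics reduce to the size of the intersection over the size of each key set. A minimal sketch, assuming simple set intersection and whole-percent rounding (both illustrative choices, not the CLI's exact arithmetic):

```typescript
// Sketch of the two coverage percentages from the shared key count.
function coverage(localKeys: string[], remoteKeys: string[]) {
  const local = new Set(localKeys);
  const remote = new Set(remoteKeys);
  const shared = [...local].filter((k) => remote.has(k)).length;
  const pct = (n: number, d: number) => (d === 0 ? 100 : Math.round((n / d) * 100));
  return {
    localToRemote: pct(shared, local.size), // % of keys in code that exist remotely
    remoteUsage: pct(shared, remote.size),  // % of remote keys used in code
  };
}
```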
CI Integration
Both scan and sync are designed to run in automated pipelines. Use --ci with scan to fail builds when hardcoded strings are detected, and pipe sync output through jq to gate deploys on missing key counts.
# GitHub Actions example
name: i18n Check
on: [push, pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx @better-i18n/cli scan --ci
      - run: |
          npx @better-i18n/cli sync --format json \
            | jq -e '.comparison.missingCount == 0' > /dev/null || exit 1
For pre-commit hooks, scan only staged files to keep feedback fast:
npx husky init
echo "npx @better-i18n/cli scan --staged --ci" > .husky/pre-commit
What This Does Not Do
To set clear expectations:
- No visual context capture — the CLI works at the code level, not the rendered UI. There are no screenshots or visual previews of where strings appear.
- No real-time monitoring — scan and sync are run on-demand or in CI pipelines; they are not background watchers or file system observers
- No stale translation detection — the sync command shows missing and unused keys, but does not detect whether an existing translation is outdated relative to a source text change
Getting Started
Install the CLI and run your first scan in under a minute:
npm install -g @better-i18n/cli
better-i18n scan --dir ./src
Then connect to your Better i18n project and compare against the cloud:
better-i18n sync
To start website analysis, visit the AI Context section in your project dashboard, enter a URL, and let Firecrawl extract your brand terminology. For repository analysis, connect your GitHub repo and let the analyzer detect your framework and existing terminology.
See the full CLI documentation for configuration options, detection rules, and advanced usage.
Ready to automate your translation context and terminology discovery? Create your account and connect your first project — or learn how the Glossary Management system enforces the terms the crawler discovers.