Engineering

Inside the Translation Sync Engine: How We Built a Reliable Async Pipeline for Localization

Eray Gündoğmuş

Translation management sounds simple until you try to keep three systems in sync: a Git repository where developers write code, a database where translators work, and a CDN where your app fetches translations at runtime. Change a key in one system, and the other two need to know about it — reliably, quickly, and without losing data.

This is the problem we built the sync engine to solve. In this post, we will walk through the architecture, the message types, the conflict detection system, and the reliability guarantees that make it all work.


The Problem with Synchronous Translation Workflows

Early in Better i18n's development, translation syncs were synchronous. A developer would push code, our webhook handler would process the changes inline, update the database, regenerate CDN files, and return a response. It worked — until it did not.

The failure modes were predictable:

  • Timeouts. A repository with 5,000 keys takes time to diff. GitHub webhooks have a 10-second timeout. Syncs would silently fail on large projects.
  • Partial updates. If the CDN upload failed after the database was updated, translations would be out of sync. Users would see stale content until someone manually triggered a re-sync.
  • No visibility. When a sync failed, there was no record of what happened. Debugging required reading server logs and correlating timestamps.

We needed an architecture that decoupled the trigger from the work, provided automatic retries, and gave full visibility into every operation.


Enter Cloudflare Queues

We chose Cloudflare Queues as the backbone of the sync engine. Queues provide durable, ordered message delivery with at-least-once semantics — exactly what we needed.

The architecture is straightforward:

GitHub Webhook → API Handler → Queue (enqueue message) → Worker (process message)
                                                              ↓
                                                     Activity Log + Database + CDN

The API handler does minimal work: validate the webhook, enqueue a REPO_PUSH_SYNC message, and return a 200. The actual processing happens asynchronously in the queue consumer — a Cloudflare Worker that picks up messages and executes them.
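
To make the shape of this concrete, here is a minimal sketch of such a thin webhook handler. The REPO_PUSH_SYNC message type comes from the post; the Queue interface, payload shape, and function names are illustrative assumptions, not Better i18n's actual code.

```typescript
// The message the handler enqueues. Fields are assumed for illustration.
interface SyncMessage {
  type: "REPO_PUSH_SYNC";
  projectId: string;
  commits: string[];
}

// Abstract queue interface standing in for a Cloudflare Queues producer binding.
interface Queue<T> {
  send(message: T): Promise<void>;
}

// Validate the webhook, enqueue a message, and return immediately.
// The expensive diffing happens later in the queue consumer.
async function handlePushWebhook(
  payload: { projectId?: string; commits?: string[] },
  queue: Queue<SyncMessage>
): Promise<{ status: number }> {
  if (!payload.projectId) {
    return { status: 400 }; // reject malformed webhooks up front
  }
  await queue.send({
    type: "REPO_PUSH_SYNC",
    projectId: payload.projectId,
    commits: payload.commits ?? [],
  });
  return { status: 200 }; // fast response, well inside GitHub's timeout
}
```

The handler does no I/O beyond the single enqueue, which is what keeps webhook responses fast regardless of repository size.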

This separation has three immediate benefits:

  1. Webhook responses are fast. No more timeouts, even for massive repositories.
  2. Failures are retried automatically. If the worker crashes or an API call fails, the message is redelivered with exponential backoff.
  3. Operations are observable. Every message produces a structured activity log.

10 Message Types, One Consumer

The sync engine processes 10 distinct message types, each with its own handler:

Sync operations:

  • SYNC_START — Full or incremental GitHub sync. Fetches files, compares keys, updates the database, and optionally generates a pull request with new translations.
  • REPO_PUSH_SYNC — Optimized path for push webhook events. Only processes files that changed in the push, making incremental syncs near-instant.

CDN operations:

  • CDN_SETUP — Creates the initial manifest and empty language files when a project connects its CDN.
  • CDN_UPLOAD — Writes a single JSON translation file to R2 storage.
  • CDN_MERGE — Merges new translations into an existing CDN file. This is critical for partial publishes — you want to add new translations without removing unchanged ones.
  • CDN_CLEANUP — Deletes all R2 files for a project. Used during project deletion or when a user wants to start fresh.
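
The CDN_MERGE semantics can be sketched as a recursive merge: incoming translations are layered over the existing file, so keys untouched by a partial publish survive. This is a hypothetical implementation of the behavior described above, not the engine's actual code.

```typescript
// A translation file: leaves are strings, nested objects are namespaces.
type TranslationTree = { [key: string]: string | TranslationTree };

// Merge incoming translations into the existing file. New or updated
// leaves win; namespaces are merged recursively; untouched keys survive.
function mergeTranslations(
  existing: TranslationTree,
  incoming: TranslationTree
): TranslationTree {
  const result: TranslationTree = { ...existing };
  for (const [key, value] of Object.entries(incoming)) {
    const current = result[key];
    if (typeof value === "object" && typeof current === "object") {
      result[key] = mergeTranslations(current, value); // merge namespaces
    } else {
      result[key] = value; // new or updated leaf replaces the old value
    }
  }
  return result;
}
```

A plain overwrite would drop every key absent from the partial publish; the recursive merge is what makes partial publishes safe.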

AI operations:

  • AI_CONTEXT_ANALYSIS — Uses Firecrawl to scrape the project's website, then feeds the content to Gemini to build a translation context model. This context helps machine translation understand industry-specific terminology.
  • REPO_ANALYSIS — Scans the GitHub repository to detect the framework (React, Next.js, Flutter, etc.), extract existing translations, and build a terminology glossary.

Publishing:

  • PUBLISH_BATCH — The final step in the translation workflow. Takes approved translations and pushes them to both the CDN (for immediate availability) and GitHub (for version control). This is an atomic operation — if either write fails, the entire publish is retried.

Glossary:

  • GLOSSARY_SYNC — Synchronizes terminology glossaries with DeepL. When you define that "workspace" should always translate to "espace de travail" in French, this message ensures DeepL's glossary is updated so all future machine translations are consistent.

Each message type is isolated. A failure in CDN_UPLOAD does not block SYNC_START. A slow AI_CONTEXT_ANALYSIS does not delay PUBLISH_BATCH. This isolation is key to the engine's reliability.
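
The isolation described above can be sketched as a per-message dispatch loop, assuming a batch API loosely modeled on Cloudflare Queues' consumer interface. Each message succeeds or fails on its own; in the real runtime the success branch would ack the message and the failure branch would leave it for redelivery.

```typescript
// One handler per message type; bodies are stubs for illustration.
type Handler = (payload: unknown) => Promise<void>;

interface QueueMessage {
  type: string;
  payload: unknown;
}

// Process a batch, isolating failures: a throw in one handler is
// caught per-message, so later messages in the batch still run.
async function consumeBatch(
  batch: QueueMessage[],
  handlers: Record<string, Handler>
): Promise<Array<{ type: string; ok: boolean }>> {
  const results: Array<{ type: string; ok: boolean }> = [];
  for (const msg of batch) {
    try {
      const handler = handlers[msg.type];
      if (!handler) throw new Error(`No handler for ${msg.type}`);
      await handler(msg.payload);
      results.push({ type: msg.type, ok: true }); // would ack here
    } catch {
      results.push({ type: msg.type, ok: false }); // would retry here
    }
  }
  return results;
}
```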


The Job System

Messages are low-level. Jobs are the high-level workflows that users and the system interact with. The sync engine supports 12 job types:

Job Type           Trigger                       Messages Produced
initial_import     Project setup                 SYNC_START, CDN_SETUP
incremental_sync   Push webhook                  REPO_PUSH_SYNC, CDN_MERGE
full_sync          Manual trigger                SYNC_START, CDN_UPLOAD (per language)
source_sync        Source language change        SYNC_START
bulk_translate     Batch translation request     Multiple CDN_UPLOAD
publish            Single language publish       PUBLISH_BATCH, CDN_UPLOAD
batch_publish      Multi-language publish        Multiple PUBLISH_BATCH
cdn_upload         Direct CDN write              CDN_UPLOAD
cdn_merge          Partial CDN update            CDN_MERGE
cdn_setup          CDN initialization            CDN_SETUP
cdn_cleanup        Project cleanup               CDN_CLEANUP
glossary_sync      Glossary update               GLOSSARY_SYNC

A single job can produce multiple messages. For example, a full_sync job on a project with 8 languages will produce 1 SYNC_START message followed by 8 CDN_UPLOAD messages — one for each language file. The job tracks the aggregate status across all its messages.
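
The fan-out for a full_sync can be sketched as a pure message builder, one SYNC_START followed by one CDN_UPLOAD per language. The message shapes are illustrative assumptions.

```typescript
// Messages a full_sync job fans out into.
type JobMessage =
  | { type: "SYNC_START"; projectId: string }
  | { type: "CDN_UPLOAD"; projectId: string; language: string };

// Build the ordered message list: sync first, then one upload per language.
function buildFullSyncMessages(
  projectId: string,
  languages: string[]
): JobMessage[] {
  return [
    { type: "SYNC_START", projectId },
    ...languages.map(
      (language): JobMessage => ({ type: "CDN_UPLOAD", projectId, language })
    ),
  ];
}
```

For an 8-language project this yields 9 messages, matching the example above; the job's aggregate status would be derived from the outcome of each one.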


45+ Activity Actions: Structured Observability

Every message handler logs structured activity actions as it progresses. These are not free-text log lines — they are typed, structured events that power both the debugging experience and the real-time UI.

A typical SYNC_START flow produces this activity trail:

SYNC_STARTED
  → FETCH_FILES (fetching translation files from GitHub)
  → FILES_FETCHED (12 files found)
  → COMPARE_KEYS (diffing against database)
  → KEYS_ADDED (47 new keys)
  → KEYS_REMOVED (3 deprecated keys)
  → KEYS_UPDATED (12 modified values)
  → UPDATE_DATABASE (persisting changes)
  → PR_GENERATION_STARTED (creating translation PR)
  → PR_CREATED (PR #142 opened)
  → SYNC_COMPLETED (duration: 3.2s)

With 45+ distinct action types, you get granular visibility into every operation. When something fails, the last recorded action tells you exactly where the pipeline stopped and what data was already processed.
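
A minimal sketch of what "typed, structured events" could look like, using a few action names from the trail above. The field names and the log class are assumptions for illustration.

```typescript
// A small slice of the action union; the real engine has 45+ variants.
type ActivityAction =
  | { action: "SYNC_STARTED" }
  | { action: "FILES_FETCHED"; count: number }
  | { action: "KEYS_ADDED"; count: number }
  | { action: "SYNC_COMPLETED"; durationMs: number };

// An append-only trail of timestamped actions for one operation.
class ActivityLog {
  readonly events: Array<ActivityAction & { at: number }> = [];

  record(event: ActivityAction): void {
    this.events.push({ ...event, at: Date.now() }); // timestamp every step
  }

  // On failure, the last recorded action shows where the pipeline stopped.
  last(): ActivityAction | undefined {
    return this.events[this.events.length - 1];
  }
}
```

Because every variant is typed, the dashboard can render each action with its own fields (counts, durations) instead of parsing free text.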

These activity actions also power the sync history UI. Your team can see every sync that has ever run, what it did, how long it took, and whether it succeeded — without touching server logs.


Conflict Detection and Resolution

Conflicts are the hardest problem in any sync system. Two people edit the same translation key — one in the codebase, one in the translation UI. Who wins?

Our answer: nobody wins automatically. The sync engine detects conflicts and surfaces them for human resolution.

Detection

During COMPARE_KEYS, the engine checks each incoming key against the database. If a key has been modified in both the repository and the database since the last successful sync, it is marked as a conflict. The engine stores both values along with their modification timestamps.
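
The three-way comparison can be sketched as follows, with the last synced value as the common ancestor. Field and variant names are hypothetical.

```typescript
// The three values tracked per key during COMPARE_KEYS.
interface KeyState {
  sourceValue: string; // from the repository
  dbValue: string;     // from the translation UI
  lastSynced: string;  // common ancestor from the last successful sync
}

type Outcome =
  | { kind: "unchanged" }
  | { kind: "take-source" }
  | { kind: "take-db" }
  | { kind: "conflict"; sourceValue: string; dbValue: string };

function compareKey(k: KeyState): Outcome {
  // If both sides already agree, there is nothing to resolve,
  // even if both diverged from the ancestor.
  if (k.sourceValue === k.dbValue) {
    return { kind: "unchanged" };
  }
  const sourceChanged = k.sourceValue !== k.lastSynced;
  const dbChanged = k.dbValue !== k.lastSynced;
  if (sourceChanged && dbChanged) {
    // Both sides edited since the last sync: surface for human review.
    return { kind: "conflict", sourceValue: k.sourceValue, dbValue: k.dbValue };
  }
  if (sourceChanged) return { kind: "take-source" };
  if (dbChanged) return { kind: "take-db" };
  return { kind: "unchanged" };
}
```

Only the both-changed case becomes a conflict; a change on one side alone is applied automatically, which is what keeps routine syncs hands-off.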

Resolution

Conflicts appear in the dashboard with full context:

  • The source value (from the repository)
  • The database value (from the translation UI)
  • The last synced value (the common ancestor)
  • Timestamps for each modification

Users can resolve conflicts one by one or in bulk, choosing to keep the source value, keep the database value, or write a manual merge. Every resolution is logged as an activity action.

This approach prevents the most common data loss scenario in translation workflows: a developer's code push silently overwriting a translator's carefully reviewed work.


Reliability Guarantees

The sync engine is designed around four reliability principles:

At-least-once delivery. Cloudflare Queues guarantees every message is delivered at least once. Messages survive worker restarts, deployments, and infrastructure failures.

Idempotent handlers. Since messages can be delivered more than once, every handler is idempotent. Re-processing a CDN_UPLOAD with the same content produces the same result. Re-processing a SYNC_START compares against the current database state, so duplicate syncs are effectively no-ops.
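
One way to make an upload handler idempotent is to hash the content and skip the write when the stored object already matches. This is a sketch under that assumption; the ObjectStore interface is a stand-in for R2, not its real API.

```typescript
import { createHash } from "node:crypto";

// Minimal stand-in for an object store that remembers a content hash.
interface ObjectStore {
  getEtag(key: string): Promise<string | undefined>;
  put(key: string, body: string, etag: string): Promise<void>;
}

// Write only if the content actually changed. A redelivered message
// with identical content becomes a no-op, making retries safe.
async function uploadIfChanged(
  store: ObjectStore,
  key: string,
  body: string
): Promise<"written" | "skipped"> {
  const etag = createHash("sha256").update(body).digest("hex");
  if ((await store.getEtag(key)) === etag) {
    return "skipped"; // duplicate delivery, same content: no-op
  }
  await store.put(key, body, etag);
  return "written";
}
```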

Ordered processing. Messages for the same project are processed in order. A CDN_MERGE always runs after the SYNC_START that produced it. This prevents race conditions where a CDN file is updated before the database reflects the new keys.

Automatic retries with backoff. Failed messages are retried with exponential backoff. Transient errors — API rate limits, network blips, temporary R2 unavailability — resolve themselves without human intervention. Permanent errors (invalid data, missing permissions) are logged and surfaced in the dashboard.
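
For intuition, an exponential backoff schedule with a cap and jitter might look like the following. The constants and the half-jitter strategy are illustrative assumptions, not Cloudflare Queues' actual retry parameters.

```typescript
// Delay before retry N: doubles each attempt, capped, with half-jitter
// so many failed messages do not retry in lockstep.
function backoffMs(attempt: number, baseMs = 1_000, capMs = 60_000): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return exp / 2 + Math.random() * (exp / 2); // random in [exp/2, exp)
}
```

Transient failures resolve within a few attempts at short delays, while persistent ones back off toward the cap instead of hammering a struggling dependency.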


What This Means for Your Team

The sync engine runs in the background. You connect your GitHub repo, and syncs just work. Push code, and your translations are updated within seconds. Approve translations, and they are published to your CDN and committed to your repo atomically.

When something goes wrong — and in distributed systems, something always goes wrong — the engine retries, logs, and surfaces the issue. No silent failures. No inconsistent state. No lost translations.

That is the promise of async processing done right: your team focuses on translations, and the infrastructure handles the rest.