The Agent Stack #022 — Monday Build
NVIDIA just dropped Nemotron OCR v2, and it’s not the model that matters—it’s how they built it. They generated millions of synthetic text images to train a multilingual OCR system that beats commercial APIs. Here’s how to steal their playbook.
The Synthetic Data Factory
Traditional OCR training requires massive datasets of real documents. Expensive, slow, and you’re stuck with whatever languages and fonts exist in your training set. NVIDIA flipped this: generate infinite training data instead.
Their pipeline creates synthetic images with realistic distortions, lighting, and backgrounds. The genius bit? They can control exactly what text appears, so they know the ground truth labels. No tedious manual annotation.
Here’s the core architecture you can build today:
import os
import random

import cv2
import numpy as np
from PIL import Image, ImageDraw, ImageFont


class SyntheticOCRGenerator:
    def __init__(self, fonts_dir, backgrounds_dir):
        self.font_paths = self._load_font_paths(fonts_dir)
        self.backgrounds = self._load_backgrounds(backgrounds_dir)

    def _load_font_paths(self, fonts_dir):
        # Collect font files; fonts are loaded per-sample so the size can vary
        return [os.path.join(fonts_dir, f) for f in os.listdir(fonts_dir)
                if f.lower().endswith(('.ttf', '.otf'))]

    def _load_backgrounds(self, backgrounds_dir):
        return [os.path.join(backgrounds_dir, f) for f in os.listdir(backgrounds_dir)]

    def generate_sample(self, text):
        # Create base image
        img = Image.new('RGB', (512, 256), 'white')
        draw = ImageDraw.Draw(img)

        # Random font and size (the size must be applied when the font is loaded)
        font_size = random.randint(16, 48)
        font = ImageFont.truetype(random.choice(self.font_paths), font_size)

        # Render text -- we chose it, so the ground-truth label is free
        draw.text((20, 50), text, font=font, fill='black')

        # Add realistic distortions
        img_array = np.array(img)
        img_array = self._add_perspective(img_array)
        img_array = self._add_noise(img_array)
        img_array = self._add_lighting(img_array)
        return Image.fromarray(img_array), text

    def _add_perspective(self, img):
        h, w = img.shape[:2]
        # Jitter the four corners for a random perspective warp;
        # cv2.getPerspectiveTransform requires float32 inputs
        src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
        dst = (src + np.random.normal(0, 10, src.shape)).astype(np.float32)
        M = cv2.getPerspectiveTransform(src, dst)
        return cv2.warpPerspective(img, M, (w, h), borderValue=(255, 255, 255))

    def _add_noise(self, img):
        # Additive Gaussian sensor noise
        noisy = img.astype(np.float32) + np.random.normal(0, 8, img.shape)
        return np.clip(noisy, 0, 255).astype(np.uint8)

    def _add_lighting(self, img):
        # Random global brightness/contrast shift
        gain = np.random.uniform(0.7, 1.3)
        bias = np.random.uniform(-20, 20)
        return np.clip(img.astype(np.float32) * gain + bias, 0, 255).astype(np.uint8)
The key insight: generate thousands of variations of the same text with different distortions. Your model learns to recognise characters under any conditions.
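A minimal numpy-only sketch of that fan-out idea (the jitter ranges here are illustrative, not NVIDIA's): one clean render becomes many labelled variants, and the label never changes.

```python
import numpy as np

def make_variants(clean_img, n_variants=8, rng=None):
    """Fan one ground-truth render out into noisy training variants."""
    if rng is None:
        rng = np.random.default_rng()
    variants = []
    for _ in range(n_variants):
        img = clean_img.astype(np.float32)
        img = img * rng.uniform(0.7, 1.3) + rng.uniform(-20, 20)  # lighting shift
        img = img + rng.normal(0, 8, img.shape)                   # sensor noise
        variants.append(np.clip(img, 0, 255).astype(np.uint8))
    return variants

# Same label for every variant -- the text never changed, only the pixels
clean = np.full((256, 512), 255, dtype=np.uint8)
batch = [(v, "invoice no. 1042") for v in make_variants(clean)]
```

Every distortion you add here is one less failure mode at inference time.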
Training Loop That Scales
NVIDIA used hundreds of languages and scripts. You don’t need that complexity. Start with your target languages and expand:
def create_training_batch(generator, vocab, batch_size=32):
    batch_images = []
    batch_labels = []
    for _ in range(batch_size):
        # Generate random text from vocabulary
        text = random_text_from_vocab(vocab)
        image, label = generator.generate_sample(text)
        batch_images.append(preprocess_image(image))
        batch_labels.append(encode_text(label))
    return np.array(batch_images), np.array(batch_labels)

# Training loop
for epoch in range(num_epochs):
    for step in range(steps_per_epoch):
        images, labels = create_training_batch(generator, vocab)
        loss = model.train_step(images, labels)
The magic happens in random_text_from_vocab(). Mix real words with synthetic combinations. Include domain-specific terms if you’re targeting invoices or forms.
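One possible shape for that function (the mixing ratio and alphabet are illustrative assumptions, not from NVIDIA's pipeline):

```python
import random

def random_text_from_vocab(vocab, max_words=5, synthetic_ratio=0.3):
    """Mix real vocabulary words with random character strings so the
    model learns glyphs, not just a dictionary."""
    alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
    words = []
    for _ in range(random.randint(1, max_words)):
        if random.random() < synthetic_ratio:
            # Synthetic token: forces character-level learning
            words.append("".join(random.choices(alphabet, k=random.randint(2, 10))))
        else:
            # Real token: keeps the distribution close to actual documents
            words.append(random.choice(vocab))
    return " ".join(words)
```

For invoices, seed `vocab` with amounts, dates, and line-item terms; for forms, field labels and names.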
Why This Matters Now
Commercial OCR APIs cost roughly £0.001-0.01 per page. Process 100,000 pages monthly and you’re spending £100-£1,000 in fees, every month, forever. A custom model trained this way costs roughly £200 to train and runs for pennies.
More importantly, you control the data. No vendor lock-in, no privacy concerns, no rate limits. Train it on your specific document types and watch accuracy jump 20-30% over generic solutions.
NVIDIA’s full model handles 100+ languages. But their synthetic generation technique works just as well for English-only systems. The ROI calculation is simple: training cost vs API fees over 12 months.
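Using the document's own numbers (a one-off training cost versus per-page API fees), the break-even sketch looks like this; the self-hosting cost per page is an illustrative assumption:

```python
def breakeven_months(train_cost, pages_per_month, api_cost_per_page,
                     self_host_cost_per_page=0.0001):
    """Months until a custom model pays for itself vs a per-page API."""
    monthly_saving = pages_per_month * (api_cost_per_page - self_host_cost_per_page)
    return train_cost / monthly_saving

# £200 training run, 100k pages/month
high = breakeven_months(200, 100_000, 0.01)   # top-end API pricing
low = breakeven_months(200, 100_000, 0.001)   # bottom-end API pricing
```

Even at the cheap end of API pricing, the model pays for itself within a quarter at that volume.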
Quick Hits
• RAM shortage: Memory prices up 40% this quarter, making inference optimisation critical. Quantise your models to 4-bit before deployment.
• Cursor raises £1.5B: Code generation tools are the new search engines. Build internal tools that use their API patterns—cursor-style autocomplete for domain-specific tasks.
• OpenAI kills Sora: Video generation isn’t dead, but consumer AI tools are getting culled. Focus on B2B applications where ROI is measurable.
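The 4-bit advice in the first bullet can be sketched with plain numpy. Real deployments would use a library such as bitsandbytes or a GPTQ-style quantiser; this only shows the memory/precision trade, with made-up weight statistics:

```python
import numpy as np

def quantize_4bit(weights):
    """Symmetric per-tensor 4-bit quantisation: 15 levels, one float scale."""
    scale = np.abs(weights).max() / 7  # map the largest weight to +/-7
    q = np.clip(np.round(weights / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(0, 0.02, size=(256, 256)).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)
```

The stored tensor shrinks from 32 bits to 4 bits per weight (plus one scale), at the cost of a reconstruction error bounded by half the scale.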
One Thing to Try
Download the Nemotron OCR model from HuggingFace and benchmark it against your current OCR solution. Even if you don’t rebuild the training pipeline, you might find a free upgrade that saves thousands monthly in API costs.
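For the benchmark itself, character error rate (CER) is the standard OCR metric. A dependency-free version you can run over both engines' outputs (the sample strings below are made up):

```python
def cer(reference, hypothesis):
    """Character error rate: Levenshtein edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # Classic dynamic-programming edit distance, one row at a time
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

# Compare each engine's output against hand-checked ground truth
score = cer("Total: £1,042.00", "Total: £l,O42.00")
```

Run it over a few hundred of your own pages; the engine with the lower mean CER on *your* documents wins, whatever the leaderboards say.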
Building beats buying when you control the training data.