The Agent Stack #022 — Monday Build
NVIDIA just dropped Nemotron OCR v2, and it’s not the model that matters—it’s how they built it. They generated millions of synthetic text images to train a multilingual OCR system that beats commercial APIs. Here’s how to steal their playbook.
The Synthetic Data Factory
Traditional OCR training requires massive datasets of real documents. Expensive, slow, and you’re stuck with whatever languages and fonts exist in your training set. NVIDIA flipped this: generate infinite training data instead.
Their pipeline creates synthetic images with realistic distortions, lighting, and backgrounds. The genius bit? They can control exactly what text appears, so they know the ground truth labels. No tedious manual annotation.
Here’s the core architecture you can build today:
import os
import random

import cv2
import numpy as np
from PIL import Image, ImageDraw, ImageFont


class SyntheticOCRGenerator:
    def __init__(self, fonts_dir, backgrounds_dir):
        self.font_paths = self._load_font_paths(fonts_dir)
        self.backgrounds = self._load_backgrounds(backgrounds_dir)

    def _load_font_paths(self, fonts_dir):
        # Collect font files; fonts are loaded per-sample so the size can vary
        return [os.path.join(fonts_dir, f) for f in os.listdir(fonts_dir)
                if f.lower().endswith(('.ttf', '.otf'))]

    def _load_backgrounds(self, backgrounds_dir):
        return [os.path.join(backgrounds_dir, f) for f in os.listdir(backgrounds_dir)]

    def generate_sample(self, text):
        # Create base image
        img = Image.new('RGB', (512, 256), 'white')
        draw = ImageDraw.Draw(img)

        # Random font and size (the size must be applied when the font is loaded)
        font_size = random.randint(16, 48)
        font = ImageFont.truetype(random.choice(self.font_paths), font_size)

        # Render text -- we chose it, so the ground-truth label is free
        draw.text((20, 50), text, font=font, fill='black')

        # Add realistic distortions
        img_array = np.array(img)
        img_array = self._add_perspective(img_array)
        img_array = self._add_noise(img_array)
        img_array = self._add_lighting(img_array)
        return Image.fromarray(img_array), text

    def _add_perspective(self, img):
        h, w = img.shape[:2]
        # Jitter the four corners for a random perspective warp;
        # cv2.getPerspectiveTransform requires float32 inputs
        src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
        dst = (src + np.random.normal(0, 10, src.shape)).astype(np.float32)
        M = cv2.getPerspectiveTransform(src, dst)
        return cv2.warpPerspective(img, M, (w, h), borderValue=(255, 255, 255))

    def _add_noise(self, img):
        # Additive Gaussian sensor noise
        noisy = img.astype(np.float32) + np.random.normal(0, 8, img.shape)
        return np.clip(noisy, 0, 255).astype(np.uint8)

    def _add_lighting(self, img):
        # Random global brightness/contrast shift
        gain = np.random.uniform(0.7, 1.3)
        bias = np.random.uniform(-20, 20)
        return np.clip(img.astype(np.float32) * gain + bias, 0, 255).astype(np.uint8)
The key insight: generate thousands of variations of the same text with different distortions. Your model learns to recognise characters under any conditions.
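A minimal numpy-only sketch of that fan-out idea (the jitter ranges here are illustrative, not NVIDIA's): one clean render becomes many labelled variants, and the label never changes.

```python
import numpy as np

def make_variants(clean_img, n_variants=8, rng=None):
    """Fan one ground-truth render out into noisy training variants."""
    if rng is None:
        rng = np.random.default_rng()
    variants = []
    for _ in range(n_variants):
        img = clean_img.astype(np.float32)
        img = img * rng.uniform(0.7, 1.3) + rng.uniform(-20, 20)  # lighting shift
        img = img + rng.normal(0, 8, img.shape)                   # sensor noise
        variants.append(np.clip(img, 0, 255).astype(np.uint8))
    return variants

# Same label for every variant -- the text never changed, only the pixels
clean = np.full((256, 512), 255, dtype=np.uint8)
batch = [(v, "invoice no. 1042") for v in make_variants(clean)]
```

Every distortion you add here is one less failure mode at inference time.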
Training Loop That Scales
NVIDIA used hundreds of languages and scripts. You don’t need that complexity. Start with your target languages and expand:
def create_training_batch(generator, vocab, batch_size=32):
    batch_images = []
    batch_labels = []
    for _ in range(batch_size):
        # Generate random text from vocabulary
        text = random_text_from_vocab(vocab)
        image, label = generator.generate_sample(text)
        batch_images.append(preprocess_image(image))
        batch_labels.append(encode_text(label))
    return np.array(batch_images), np.array(batch_labels)

# Training loop
for epoch in range(num_epochs):
    for step in range(steps_per_epoch):
        images, labels = create_training_batch(generator, vocab)
        loss = model.train_step(images, labels)
The magic happens in random_text_from_vocab(). Mix real words with synthetic combinations. Include domain-specific terms if you’re targeting invoices or forms.
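One possible shape for that function (the mixing ratio and alphabet are illustrative assumptions, not from NVIDIA's pipeline):

```python
import random

def random_text_from_vocab(vocab, max_words=5, synthetic_ratio=0.3):
    """Mix real vocabulary words with random character strings so the
    model learns glyphs, not just a dictionary."""
    alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
    words = []
    for _ in range(random.randint(1, max_words)):
        if random.random() < synthetic_ratio:
            # Synthetic token: forces character-level learning
            words.append("".join(random.choices(alphabet, k=random.randint(2, 10))))
        else:
            # Real token: keeps the distribution close to actual documents
            words.append(random.choice(vocab))
    return " ".join(words)
```

For invoices, seed `vocab` with amounts, dates, and line-item terms; for forms, field labels and names.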
Why This Matters Now
Commercial OCR APIs cost roughly £0.001-0.01 per page. Process 100,000 pages monthly and you’re spending £100-£1,000 in fees, every month, forever. A custom model trained this way costs roughly £200 to train and runs for pennies.
More importantly, you control the data. No vendor lock-in, no privacy concerns, no rate limits. Train it on your specific document types and watch accuracy jump 20-30% over generic solutions.
NVIDIA’s full model handles 100+ languages. But their synthetic generation technique works just as well for English-only systems. The ROI calculation is simple: training cost vs API fees over 12 months.
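Using the document's own numbers (a one-off training cost versus per-page API fees), the break-even sketch looks like this; the self-hosting cost per page is an illustrative assumption:

```python
def breakeven_months(train_cost, pages_per_month, api_cost_per_page,
                     self_host_cost_per_page=0.0001):
    """Months until a custom model pays for itself vs a per-page API."""
    monthly_saving = pages_per_month * (api_cost_per_page - self_host_cost_per_page)
    return train_cost / monthly_saving

# £200 training run, 100k pages/month
high = breakeven_months(200, 100_000, 0.01)   # top-end API pricing
low = breakeven_months(200, 100_000, 0.001)   # bottom-end API pricing
```

Even at the cheap end of API pricing, the model pays for itself within a quarter at that volume.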
Quick Hits
• RAM shortage: Memory prices up 40% this quarter, making inference optimisation critical. Quantise your models to 4-bit before deployment.
• Cursor raises £1.5B: Code generation tools are the new search engines. Build internal tools that use their API patterns—cursor-style autocomplete for domain-specific tasks.
• OpenAI kills Sora: Video generation isn’t dead, but consumer AI tools are getting culled. Focus on B2B applications where ROI is measurable.
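The 4-bit advice in the first bullet can be sketched with plain numpy. Real deployments would use a library such as bitsandbytes or a GPTQ-style quantiser; this only shows the memory/precision trade, with made-up weight statistics:

```python
import numpy as np

def quantize_4bit(weights):
    """Symmetric per-tensor 4-bit quantisation: 15 levels, one float scale."""
    scale = np.abs(weights).max() / 7  # map the largest weight to +/-7
    q = np.clip(np.round(weights / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(0, 0.02, size=(256, 256)).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)
```

The stored tensor shrinks from 32 bits to 4 bits per weight (plus one scale), at the cost of a reconstruction error bounded by half the scale.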
One Thing to Try
Download the Nemotron OCR model from HuggingFace and benchmark it against your current OCR solution. Even if you don’t rebuild the training pipeline, you might find a free upgrade that saves thousands monthly in API costs.
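For the benchmark itself, character error rate (CER) is the standard OCR metric. A dependency-free version you can run over both engines' outputs (the sample strings below are made up):

```python
def cer(reference, hypothesis):
    """Character error rate: Levenshtein edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # Classic dynamic-programming edit distance, one row at a time
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

# Compare each engine's output against hand-checked ground truth
score = cer("Total: £1,042.00", "Total: £l,O42.00")
```

Run it over a few hundred of your own pages; the engine with the lower mean CER on *your* documents wins, whatever the leaderboards say.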
Building beats buying when you control the training data.