Interfaze: A new model architecture built for high accuracy at scale

tl;dr: Interfaze is a new model architecture that outperforms models like Gemini-3-Flash, Claude-Sonnet-4.6, GPT-5.4-Mini, and Grok-4.3 across 9 head-to-head benchmarks in OCR, vision, STT, and structured output.

Humans are inefficient at computer-level tasks. We make mistakes, but we're great at decision-making and understanding nuance.

Imagine telling a human to read a 50-page PDF, map every word to another document with its XY position, and translate the whole thing into Chinese. You'd get tons of mistakes, pay a lot to keep that human on payroll, and wait a long time for the result.

Transformer models are similar. They're amazing at nuance and human-level tasks, and they make mistakes like a human, but that's also what keeps them creative.

We've been using the wrong models for the wrong tasks.

CNNs/DNNs have existed since the late '80s, from LeNet through ResNet, and more recently CRNN-CTC.

These are deep neural network architectures that are task-specific for things like OCR, translation, or GUI detection. The way they consume and see data is trained to be task specific, which makes them up to 100x more accurate at their specific task. They also produce useful metadata like bounding boxes and confidence scores, letting developers build predictable workflows they can rely on.

So why do so many of us still go for transformers/LLMs for deterministic tasks?

DNNs are not flexible. They're only as good as their training data, and they aren't great at human-level nuance.

They might be cheap to serve but expensive to maintain and retrain for new tasks. Take a passport: a CNN can extract the date of birth with bounding boxes and a confidence score, but it can't calculate the person's age.

Interfaze is a new model architecture that merges the specialization of DNN/CNN models with omni-transformers, giving you the best of both worlds.

That means high accuracy and low cost on deterministic tasks:

While Pro-tier models like Claude Opus 4.7 and GPT 5.5 are the best generalist models on the market today for coding and complex reasoning, they aren't commonly used for high-volume tasks like OCR or translation because of their high cost and slow response times.

Interfaze is benchmarked against models in similar pricing tiers with similar feature sets: models optimized to squeeze the most performance out at the fastest speed while keeping cost low at scale.

Today, most people reach for one of two model categories for deterministic developer tasks: specialized task-specific models, or generalist flash/mini LLMs.

↓ = lower is better (word error rate). — = not scored (model has no native audio input). All other rows: higher is better.

Each model is compared head-to-head across nine benchmarks: OCRBench V2, olmOCR, RefCOCO, VoxPopuli-Cleaned-AA, SOB Value, Spider-2.0-Lite, GPQA Diamond, MMMLU, and MMMU-Pro.

View the full leaderboard →

Interfaze leads in almost every benchmark, against both specialized models in each category and the generalist flash/mini models.

Our goal isn't to replace LLMs. It's to specialize in deterministic tasks. The benchmarks focus on categories like OCR, object detection, and structured output, with a few general benchmarks like GPQA Diamond to show the level of problem-solving and understanding you'd expect from any transformer model.

Interfaze is priced in a similar range as Gemini-3-Flash, at $1.50 per million input tokens and $3.50 per million output tokens.
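At the quoted rates, per-request cost is easy to sanity-check. A minimal sketch, where the token counts are hypothetical and only the two per-million rates come from this post:

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     in_rate: float = 1.50, out_rate: float = 3.50) -> float:
    """Cost in USD at the quoted rates (dollars per million tokens)."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Hypothetical sizes for a long PDF: ~60k input tokens, ~20k output tokens.
cost = request_cost_usd(60_000, 20_000)  # 0.09 + 0.07 = 0.16 USD
```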

Our number one use case from users has been OCR for images and complex, long PDFs.

Interfaze outperforms OCR providers like Chandra OCR and Reducto, and generalist models like Gemini-3-Flash and GPT-5.4-Mini.

It isn't just the task-specific CNN encoder doing a good job. It's the ability to lean on object detection for figures and graphics, or on the transformer's translation layers, all in a shared vector space.

Most LLMs today are great at following a JSON schema, but pretty bad at filling it with accurate values. No public benchmark measures the accuracy of those values, so we released SOB (the Structured Output Benchmark) last week.

TL;DR: SOB gives the model the correct answer in its context, then asks it to generate a JSON output with data it already has. We measure who is the most accurate, with the fewest mistakes and hallucinations, across text, image, and audio modalities (all normalized to text).
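The scoring idea behind SOB can be sketched as field-level exact-match accuracy. The field names, values, and scoring rule below are illustrative assumptions, not the published SOB methodology:

```python
def field_accuracy(expected: dict, predicted: dict) -> float:
    """Fraction of schema fields whose predicted value exactly matches
    the ground truth the model already had in its context."""
    keys = expected.keys()
    correct = sum(1 for k in keys if predicted.get(k) == expected[k])
    return correct / len(keys)

# Hypothetical example: the model was given all three values in context,
# but hallucinated one of them in its JSON output.
truth = {"name": "Ada Lovelace", "born": 1815, "field": "mathematics"}
output = {"name": "Ada Lovelace", "born": 1816, "field": "mathematics"}
field_accuracy(truth, output)  # 2/3
```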

Compared against the same flash/mini set used throughout this post. See the full SOB leaderboard for all 28 models, including frontier Pro-tier models like Gemini-3.1-Pro, GPT-5.5, and Claude-Opus-4.7.

There's still huge room to improve structured output without raising cost or compute. Follow us on X or LinkedIn to track our research journey.

Interfaze has great multilingual performance across a wide range of languages.

On VoxPopuli-Cleaned-AA, Interfaze comes in second on word error rate.

Interfaze transcribes 209 seconds of audio per second of compute, ~1.5× faster than Deepgram Nova-3, ~8× faster than Scribe v2, and over 11× faster than Gemini-3-Flash.

Interfaze speaks the Chat Completions API standard, so any AI SDK that supports OpenAI works out of the box: just point it at https://api.interfaze.ai/v1. Grab your API key from the Interfaze dashboard and drop it in.
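A minimal sketch of targeting the endpoint with only the standard library. The model ID is a placeholder (check the dashboard for real IDs); only the base URL and the Chat Completions shape come from this post:

```python
import json
import os
import urllib.request

BASE_URL = "https://api.interfaze.ai/v1"  # from the post

def build_chat_request(messages, model="interfaze-latest"):
    """Build a Chat Completions request against the Interfaze endpoint.
    'interfaze-latest' is a placeholder model id."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ.get('INTERFAZE_API_KEY', '')}",
        },
    )

req = build_chat_request([{"role": "user", "content": "Hello"}])
# urllib.request.urlopen(req) would send it. Equivalently, any
# OpenAI-compatible SDK pointed at BASE_URL works out of the box.
```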

The same interfaze client is reused in every example below.

A magazine page with dense multi-column text and three illustrations. Interfaze runs OCR and object detection on the same image in one request, returning the full text plus pixel coordinates for every figure, all under your schema.

object carries the schema response: full page text plus a graphic_objects array with a description and pixel coordinates for each illustration. precontext carries the raw OCR (per-line and per-word bounding boxes, confidence scores) on the same response.
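In code, consuming those two fields looks something like the sketch below. The exact key names inside `object` and `precontext` are hypothetical; only the `object`/`precontext` split, bounding boxes, and confidence scores are described above:

```python
# Hypothetical response shaped like the fields described above.
response = {
    "object": {
        "page_text": "Full page text ...",
        "graphic_objects": [
            {"description": "lighthouse illustration", "box": [40, 120, 310, 420]},
        ],
    },
    "precontext": {
        "lines": [
            {"text": "Full page text ...", "box": [32, 18, 980, 54], "confidence": 0.997},
        ],
    },
}

# Schema-typed result: text plus detected figures with pixel coordinates.
figures = response["object"]["graphic_objects"]
# Raw OCR metadata: flag any low-confidence lines for review.
low_conf = [l for l in response["precontext"]["lines"] if l["confidence"] < 0.9]
```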

With our hybrid architecture, you can activate parts of the model to run a specific task without using the full weights.

It's faster and cheaper, with some tradeoffs: you get a fixed structured output that's deterministic and consistent on every run, and you can only run one task per request.

Using the <task> tag in the system prompt, you control which part of the model activates. Below, we run pure OCR on a handwritten poem.

The response is the raw task result with name and result, ready to consume directly.
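A request/response sketch of that flow. The payload follows the Chat Completions standard; the model ID, image URL, and exact task-tag value are placeholders:

```python
# The <task> tag in the system prompt routes the request to one submodel.
payload = {
    "model": "interfaze-latest",  # placeholder model id
    "messages": [
        {"role": "system", "content": "<task>ocr</task>"},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/poem.jpg"}},  # placeholder
        ]},
    ],
}

def parse_task_result(resp):
    """Pull the raw task result: its name and result payload."""
    return resp["name"], resp["result"]

# Hypothetical raw task response.
name, text = parse_task_result({"name": "ocr", "result": "Lines of the poem ..."})
```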

Interfaze comes with its own built-in web index, assembled from scraping multiple SERP indexes plus our own crawler.

object returns the enriched profile typed exactly to the schema, while precontext includes the raw web search results Interfaze pulled in to ground the answer.
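Separating the typed profile from its grounding sources then looks like this sketch, where every field name and value is hypothetical apart from the `object`/`precontext` split described above:

```python
# Hypothetical enrichment response.
response = {
    "object": {"company": "Acme Corp", "founded": 1998, "ceo": "Jane Doe"},
    "precontext": [
        {"url": "https://example.com/acme-about",
         "snippet": "Acme Corp, founded in 1998 ..."},
    ],
}

profile = response["object"]                              # schema-typed answer
sources = [hit["url"] for hit in response["precontext"]]  # grounding URLs
```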

The clip below is 1 hour 35 minutes of a podcast episode. Interfaze transcribes it in ~50 seconds with per-chunk timestamps.
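Those timings work out to roughly a 114x real-time factor (the 50-second figure above is approximate):

```python
audio_seconds = 95 * 60          # 1 h 35 min of podcast audio
wall_seconds = 50                # approximate transcription time
realtime_factor = audio_seconds / wall_seconds  # 5700 / 50 = 114x real time
```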

The response is the raw task result as shown below.

We're excited to keep experimenting, growing and discovering new research that makes deterministic AI more efficient and accessible to all developers!

Get started for free and try it on your own documents, images and prompts. We're excited to see what you build!