We collaborated with NVIDIA to make LLM training ~25% faster, and in this blog/guide we'll break down exactly how we did it. These optimizations cause no loss in accuracy and stack on top of Unsloth's existing 2-5x speedups! The new algorithms are enabled automatically on RTX laptops, data center GPUs, and DGX Spark machines, so just update Unsloth to get the latest improvements. By working with NVIDIA, we show how:
Instead of padding all of them to the same length and wasting compute on padding tokens, we concatenate them into one longer packed sequence:
The model still needs to know where each original sequence starts and ends. So, alongside the packed tokens, we carry sequence metadata such as the per-sequence lengths, the cumulative sequence offsets (cu_seqlens), and the maximum sequence length (max_seqlen).
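For instance, the cumulative-offset form of that metadata can be built once per packed batch. A minimal PyTorch sketch (variable names follow the flash-attention-style cu_seqlens convention):

```python
import torch

# Three sequences of lengths 3, 5, and 2 packed into one 10-token row.
lengths = torch.tensor([3, 5, 2])

# Cumulative offsets: sequence i spans [cu_seqlens[i], cu_seqlens[i + 1]).
cu_seqlens = torch.zeros(len(lengths) + 1, dtype=torch.int32)
cu_seqlens[1:] = torch.cumsum(lengths, dim=0)

max_seqlen = int(lengths.max())          # longest original sequence

print(cu_seqlens.tolist(), max_seqlen)   # [0, 3, 8, 10] 5
```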
This is the key point: for a fixed packed batch, that metadata is the same for every layer.
If we write the boundary information for a packed batch as:
B = { lengths, cu_seqlens, max_seqlen, mask structure }
then every transformer layer in that forward pass consumes the same B.
If the model has L layers, rebuilding or re-synchronizing B once per layer is not new work; it is the same information being reconstructed again and again:
build B + build B + ⋯ + build B (L times)
The overhead here is not primarily extra FLOPs. Some of these paths can force device-to-host synchronization, effectively creating a GPU-CPU sync point. Once that happens inside a per-layer path, the overhead recurs at every layer.
That is what the packed-sequence caching change reduces. Instead of repeatedly reconstructing packed sequence info, SDPA packed masks, and xFormers block masks, it caches the reusable metadata and the attention-side structures derived from it, per device, for the current packed batch. Those cached structures are then reused across layers.
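A toy sketch of the idea (the cache key and function names are illustrative, not Unsloth's actual internals): build B at most once per packed batch and device, then let every layer read the cached copy:

```python
# Illustrative cache; real code would also handle eviction and batch identity.
_metadata_cache = {}

def get_packed_metadata(batch_key, device, build_fn):
    """Return B for this (packed batch, device), building it at most once."""
    key = (batch_key, device)
    if key not in _metadata_cache:
        _metadata_cache[key] = build_fn()   # the expensive part: masks, cu_seqlens, ...
    return _metadata_cache[key]

build_calls = []
def build_B():
    build_calls.append(1)                   # count how often we really rebuild
    return {"cu_seqlens": [0, 3, 8, 10], "max_seqlen": 5}

for _layer in range(24):                    # every layer consumes the same B
    meta = get_packed_metadata(batch_key=(3, 5, 2), device="cuda:0", build_fn=build_B)

print(len(build_calls))                     # built once, reused by the other 23 layers
```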
Packed training already improves utilization by eliminating padding waste. But if the metadata path keeps forcing synchronization, some of that gain is lost to overhead that has nothing to do with the model's actual learning.
Caching helps because it removes repeated coordination work from the hot path. The forward pass benefits the most, since that is where the same packed metadata and masks are prepared and consumed repeatedly across many layers. Backward also improves, but the effect looks smaller: the absolute time saved is similar, yet the backward pass, especially with gradient checkpointing, takes longer, so the relative gain is smaller.
Now that we know the measured gain, we can ask a simpler question: does that scale make sense?
If we assume each layer is roughly similar, we can model the packed-attention path as:

T_naive ≈ L × (t_build(B) + t_attn)

where t_build(B) is the per-layer cost of rebuilding the packed metadata and masks, and t_attn is the attention compute itself. With caching, that repeated overhead is paid once for the batch instead of once per layer:

T_cached ≈ t_build(B) + L × t_attn
For the packed SDPA path, our microbenchmark on NVIDIA Blackwell GPUs showed that the low-level, host-visible metadata calls were real but small, at about 0.2 ms each. The dominant repeated cost was the packed SDPA mask-construction path itself, which measured about 13.7 ms for a synthetic packed batch with 2048 total packed tokens.
For the SDPA backend, a better mental model is:
small stream fence + mask rebuild ≈ mask rebuild
That lets us do a cleaner consistency check. If one packed-mask rebuild costs m milliseconds, then under a uniform-layer model, caching should save roughly

(L − 1) × m

milliseconds per forward pass.
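As a quick sanity check in code, using the measured rebuild cost m ≈ 13.7 ms and hypothetical layer counts (the layer counts are illustrative assumptions, not the benchmarked models):

```python
m_ms = 13.7                      # measured packed-SDPA mask rebuild cost, in ms

# Under the uniform-layer model, caching turns L rebuilds into 1.
saved_16 = (16 - 1) * m_ms       # hypothetical 16-layer model
saved_32 = (32 - 1) * m_ms       # hypothetical 32-layer model

print(round(saved_16, 1), round(saved_32, 1))   # 205.5 424.7
```

Savings of a few hundred milliseconds per step are exactly the scale at which a per-layer rebuild becomes visible in end-to-end timings.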
Smaller packed-sequence runs showed the same pattern:
Those percentages are relative to full training step time, so they still include work outside the packed-attention path, such as embeddings, the MLP, the LM head, the loss, and framework overhead. This estimate is intentionally only about the packed-attention side of the block, not the whole transformer layer. It is there only to check that the measured gains are in the right range for the packed SDPA path.
Activation checkpointing is a standard technique for training large models. The idea is to save memory by not keeping every intermediate activation alive through the backward pass. In exchange, we pay for some extra work during backward.
That trade-off is usually worth it, especially for larger models.
But it raises another systems question: if an activation has been offloaded, how does it get back to the GPU for backward?
In Unsloth's smart checkpointing path, activations can be staged in pinned CPU memory and copied back when needed. That saves VRAM, but it can introduce a bottleneck:
That is a serialization pattern. If one buffer is reused for both copy and compute, the copy stream and the compute stream keep taking turns.
Let T_copy be the activation reload time and T_compute be the backward compute time for the current layer.
With a single buffer, this part of the step is roughly limited by:

T_copy + T_compute
That is the serialized case. We pay for both almost entirely, one after the other.
A cleaner way to handle this is to use two buffers.
While the backward pass is running on buffer A, the copy stream can preload the next activation into buffer B. Then the roles swap. That creates pipeline overlap, though not perfect overlap.
Double buffering does not reduce the amount of math. It hides copy latency behind useful compute.
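A simplified PyTorch sketch of the two-buffer reload loop (function and variable names are ours, not Unsloth's; production code must also pin the host tensors and guard allocator reuse across streams, e.g. with record_stream):

```python
import torch

def reload_with_double_buffer(host_chunks):
    """Consume host-resident activations, preloading chunk i+1 on a copy stream
    while chunk i is used by compute. A CPU-only fallback keeps the sketch
    runnable anywhere; doubling each chunk stands in for backward compute."""
    if not torch.cuda.is_available():
        return [chunk * 2 for chunk in host_chunks]

    copy_stream = torch.cuda.Stream()
    buffers = [None, None]
    with torch.cuda.stream(copy_stream):                 # preload the first chunk
        buffers[0] = host_chunks[0].to("cuda", non_blocking=True)

    results = []
    for i in range(len(host_chunks)):
        torch.cuda.current_stream().wait_stream(copy_stream)   # buffer i is ready
        if i + 1 < len(host_chunks):
            # Don't overwrite a buffer a previous layer may still be reading.
            copy_stream.wait_stream(torch.cuda.current_stream())
            with torch.cuda.stream(copy_stream):         # overlap the next H2D copy
                buffers[(i + 1) % 2] = host_chunks[i + 1].to("cuda", non_blocking=True)
        results.append(buffers[i % 2] * 2)               # stand-in for backward compute
    return results
```

While the compute stream works on buffer A, the copy stream fills buffer B, and the roles swap each iteration; the two wait_stream calls are the only synchronization points.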
This kind of optimization tends to get stronger once the model is large enough that backward compute is substantial, but not so dominant that all copy overhead disappears into noise. For larger models, higher hidden dimensions mean more data movement, so hiding that movement has a larger impact. Larger models also tend to have more layers, which creates more opportunities to hide copies behind computation.
That is why larger dense models are a good fit for this improvement. The GPU has enough real work going on that the copy can overlap with it, and the extra VRAM needed for the second buffer stays modest.
The implementation also keeps practical guardrails in place:
On the larger dense-model runs, benchmarked with NVIDIA B200 Blackwell GPUs:
In these runs, final losses were effectively unchanged.
The speedup is consistent across larger dense models, and the extra VRAM cost stays relatively small.
Once we know the measured gain, the natural follow-up is: does the scale make sense?
If we assume there are L checkpointed layers and each layer is roughly similar, the fully serialized cost is:

T_serial ≈ L × (T_copy + T_compute)
This also scales with batch size, sequence length, and other factors that affect data movement and computation. We omit those terms for brevity.
With two buffers, the first layer still has to wait for its activation to arrive, and the last layer still has to finish computing. So a better approximation is:

T_overlap ≈ T_copy + (L − 1) × max(T_copy, T_compute) + T_compute
This is the useful reading of the result:
If the overlap is good, the per-layer cost in the middle gets much closer to

max(T_copy, T_compute)

than to the serialized T_copy + T_compute.
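Plugging illustrative numbers into that pipeline picture (all values here are assumptions for intuition, not measurements from our runs):

```python
# Hypothetical per-layer times in milliseconds and a hypothetical layer count.
T_copy, T_compute, L = 5.0, 12.0, 32

serialized = L * (T_copy + T_compute)                            # single buffer
pipelined = T_copy + (L - 1) * max(T_copy, T_compute) + T_compute

print(serialized, pipelined)   # 544.0 389.0
```

When T_copy < T_compute, as here, the copies vanish almost entirely behind compute, and the pipelined total approaches pure compute time.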
From the measured larger-model results, the saved time per training step is roughly:
These host buffers are pinned allocations, so the relevant bandwidth is measured pinned-memory host-to-device bandwidth, not pageable-memory bandwidth. On our NVIDIA B200 Blackwell-based system, that bandwidth was about 55.7 GB/s, with 64 GB/s as a useful PCIe ceiling for comparison.
If we use the extra buffer size as a rough proxy for one activation reload, then reload time ≈ buffer size ÷ 55.7 GB/s, and each reload is naturally on the order of only a few milliseconds.
To explain the observed saved time per step, we would need to hide roughly a few dozen such reloads:
Hiding one such reload across a few dozen checkpointed layers lands in the few-hundred-millisecond range of saved step time, which is exactly the scale we observed.
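The arithmetic behind that range, using the measured ~55.7 GB/s pinned bandwidth from above (the buffer size and layer count are illustrative assumptions, since both depend on the model):

```python
bandwidth_gb_s = 55.7      # measured pinned host-to-device bandwidth
buffer_gb = 0.25           # assumed size of one activation buffer (illustrative)
num_layers = 40            # assumed number of checkpointed layers (illustrative)

reload_ms = buffer_gb / bandwidth_gb_s * 1000.0   # one reload: a few ms
hidden_ms = num_layers * reload_ms                # hidden per step if fully overlapped

print(round(reload_ms, 1), round(hidden_ms))   # 4.5 180
```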
Again, that saved time is part of the full end-to-end training step. It is not supposed to explain embeddings, the LM head, the loss, optimizer work, or every other non-checkpointed part of the step. The point is only that the communication we can hide is large enough to plausibly account for the measured step-time gains.
The third change is more specialized, but it shows the same pattern in MoE routing.
In the PyTorch-based GPT-OSS MoE path we examined, one expensive part of routing is figuring out which tokens go to which expert. A naive implementation can do something like:
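A minimal sketch of that naive per-expert pattern (the names and the stand-in expert compute are illustrative):

```python
import torch

def route_naive(tokens, expert_ids, num_experts):
    """One data-dependent torch.where per expert: num_experts dynamic queries."""
    outputs = torch.empty_like(tokens)
    for e in range(num_experts):
        idx = torch.where(expert_ids == e)[0]   # output size depends on routing
        outputs[idx] = tokens[idx] * (e + 1)    # stand-in for the expert MLP
    return outputs
```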
At first glance, this looks harmless. But torch.where is data-dependent here: the number of tokens routed to each expert changes from batch to batch. This can introduce CPU-GPU synchronization or related runtime overhead because output sizes depend on the routing pattern. If this happens once per expert, the number of dynamic queries scales with num_experts.
The better approach is to group everything once:
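A hedged sketch of the grouped version (illustrative, not the actual GPT-OSS kernels): sort tokens by expert once, then walk contiguous slices using precomputed offsets:

```python
import torch

def route_grouped(tokens, expert_ids, num_experts):
    """Group once, then reuse offsets: one sort replaces per-expert torch.where."""
    _, order = torch.sort(expert_ids, stable=True)      # group tokens by expert
    counts = torch.bincount(expert_ids, minlength=num_experts)
    offsets = torch.cumsum(counts, dim=0).tolist()      # one host transfer, reused below

    sorted_tokens = tokens[order]
    out_sorted = torch.empty_like(sorted_tokens)
    start = 0
    for e in range(num_experts):
        end = offsets[e]                                # static slice bounds, no where()
        out_sorted[start:end] = sorted_tokens[start:end] * (e + 1)  # stand-in expert MLP
        start = end

    outputs = torch.empty_like(tokens)
    outputs[order] = out_sorted                         # scatter back to original order
    return outputs
```

Each expert now reads a contiguous slice of the sorted buffer, so no per-expert call produces a data-dependent output shape.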
Mathematically, the gain is not that we changed the routing logic. We changed how often we asked the runtime to answer a dynamic indexing question.
Instead of one dynamic query per expert, we do a single grouping pass up front and move much closer to:

1 dynamic query per batch, rather than num_experts
This is the same theme in a more specialized setting: group once, then reuse offsets instead of repeatedly asking for dynamic token lists.
Note that these optimizations apply to any MoE using the native_torch backend.
For this GPT-OSS-specific routing improvement:
Even though these three optimizations live in different parts of the stack, they are solving the same problem.
The key optimization opportunities were in the glue code around the main kernels:
This is also why the improvements compose conceptually. As the main kernels get faster, overhead that used to be invisible starts becoming a meaningful fraction of the total step time.
There is a useful engineering lesson here. Once the math kernels are optimized, "faster" often means one of two things: