AI Jun 11, 2026 5 min read 7 views

HuggingFace Blog Shows How to Fuse MLP Layers in PyTorch for Up to 30% Speed Gains


The Latest PyTorch Profiling Guide from HuggingFace

According to a HuggingFace blog post published this week, developers can speed up the MLP blocks of transformer models by up to 30% at inference time by fusing each block's layers into a single optimized kernel. The post, titled "Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP," walks step by step through identifying performance bottlenecks in multi-layer perceptron blocks and then eliminating them with PyTorch's torch.compile and custom Triton kernels.

The guide builds on HuggingFace's earlier profiling series, which taught developers how to use PyTorch Profiler and Chrome trace visualization. Part 2 goes further by showing how to spot the familiar pattern of repeated nn.Linear operations followed by activation functions, a structure found in the feed-forward network (FFN) of nearly every modern transformer.

What the Blog Demonstrates

The HuggingFace team walks through profiling a standard BERT-style MLP block containing two linear layers with a GELU activation in between. Using PyTorch's built-in profiler, they show how each individual kernel launch creates overhead from memory reads, writes, and scheduler synchronization. The solution they propose is a fused MLP kernel that combines both linear operations and the activation into a single GPU kernel, dramatically reducing launch overhead.
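
The profiling step described above can be sketched roughly as follows. This is a minimal illustration, not the blog's own script: the MLP class, the BERT-base-like sizes (768 and 3072), the input shape, and the "mlp_forward" label are all assumptions chosen for the example, and it falls back to the CPU when no GPU is present.

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile, record_function


class MLP(nn.Module):
    """BERT-style feed-forward block: Linear -> GELU -> Linear (assumed sizes)."""

    def __init__(self, d_model: int = 768, d_ff: int = 3072):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))


def profile_mlp(model: nn.Module, x: torch.Tensor, steps: int = 5):
    """Profile a few forward passes and return the profiler object."""
    activities = [ProfilerActivity.CPU]
    if x.is_cuda:
        activities.append(ProfilerActivity.CUDA)
    with torch.no_grad():
        model(x)  # warm-up pass so one-time setup cost is not profiled
        with profile(activities=activities) as prof:
            for _ in range(steps):
                with record_function("mlp_forward"):  # label visible in the trace
                    model(x)
    return prof


device = "cuda" if torch.cuda.is_available() else "cpu"
model = MLP().to(device).eval()
x = torch.randn(8, 64, 768, device=device)  # hypothetical batch and sequence length
prof = profile_mlp(model, x)

# Each nn.Linear and the GELU show up as separate operator rows.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```

Calling prof.export_chrome_trace("trace.json") afterwards writes a file that can be opened in a Chrome trace viewer, which is the visualization the earlier part of the series relies on.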

Benchmark results shared in the post indicate that the fused kernel achieves:

  • Up to 30% reduction in wall-clock time for MLP forward passes on an NVIDIA A100 GPU
  • Near-zero additional memory consumption compared to the unfused version
  • Compatibility with mixed-precision training and inference (fp16/bf16)
  • Seamless integration via PyTorch's torch.compile backend with custom Triton kernels

Why This Matters for AI Developers and Businesses

The results matter because feed-forward blocks dominate transformer compute: in a model like GPT-3 or LLaMA, they account for over 60% of total FLOPs, roughly two-thirds. If runtime tracks FLOPs, a 30% cut in MLP time trims end-to-end inference time by up to about 20%, which means faster responses and lower operating costs for businesses deploying AI at scale.

For developers working on latency-sensitive applications, such as real-time chatbots, code completion tools, or speech processing systems, the fused MLP approach could trim a 50ms response to roughly 40ms if the MLP accounts for two-thirds of the runtime, or to 35ms in the limiting case where the MLP is the entire cost. A saving of 10 to 15ms might not sound large, but in high-throughput environments handling thousands of requests per second, the cumulative effect on server load and electricity bills is substantial.
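
A back-of-the-envelope check on these latency figures, in the spirit of Amdahl's law: only the MLP portion of a request gets faster, so the end-to-end saving depends on how much of the runtime the MLP represents. The function name and the three example shares below are assumptions for illustration, and the sketch treats runtime share as a given input rather than something measured.

```python
def end_to_end_latency(total_ms: float, mlp_share: float, mlp_time_cut: float) -> float:
    """Latency after shrinking only the MLP portion of the runtime.

    total_ms:      original end-to-end latency
    mlp_share:     fraction of that latency spent in MLP blocks (0..1)
    mlp_time_cut:  fractional reduction in MLP time (0.30 for "30% faster")
    """
    mlp_ms = total_ms * mlp_share
    other_ms = total_ms - mlp_ms  # attention, norms, sampling, etc. are unchanged
    return other_ms + mlp_ms * (1.0 - mlp_time_cut)


for share in (1.0, 2 / 3, 0.5):
    new_ms = end_to_end_latency(50.0, share, 0.30)
    print(f"MLP share {share:.2f}: 50.0 ms -> {new_ms:.1f} ms")
# MLP share 1.00: 50.0 ms -> 35.0 ms
# MLP share 0.67: 50.0 ms -> 40.0 ms
# MLP share 0.50: 50.0 ms -> 42.5 ms
```

The 35ms figure is therefore an upper bound on the benefit; measuring the actual MLP share of a workload with a profiler gives a more realistic estimate.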

Technical Breakdown: How Fusing Works

The core insight is straightforward: instead of launching separate kernels for each linear layer and activation function, the fused kernel executes the entire MLP forward pass in one go. It reads the input once, computes the first linear transformation, applies GELU, computes the second linear transformation, and writes the output, all inside a single Triton kernel. Intermediate activations never make a round trip through GPU global memory, and the extra kernel launches of a multi-kernel pipeline disappear.
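
To get a feel for what fusion saves, the sketch below estimates the global-memory traffic spent on intermediate activations in an unfused pipeline. It is an idealized model, not a measurement from the blog: it assumes three separate kernels (linear, GELU, linear), assumes a perfectly fused kernel keeps every intermediate on-chip, and uses hypothetical shapes (batch 32, sequence length 128, hidden width 3072, fp16).

```python
def intermediate_traffic_bytes(tokens: int, d_ff: int, bytes_per_elem: int, fused: bool) -> int:
    """Bytes moved to and from global memory for intermediate activations only.

    Weights, the input, and the final output are excluded because both
    variants have to touch them.
    """
    if fused:
        # Idealized assumption: intermediates stay in registers or shared memory.
        return 0
    hidden_elems = tokens * d_ff
    # Unfused: linear1 writes h, GELU reads h and writes a, linear2 reads a.
    return 4 * hidden_elems * bytes_per_elem


tokens = 32 * 128  # hypothetical batch of 32 sequences, 128 tokens each
unfused = intermediate_traffic_bytes(tokens, d_ff=3072, bytes_per_elem=2, fused=False)
fused = intermediate_traffic_bytes(tokens, d_ff=3072, bytes_per_elem=2, fused=True)

print(f"{(unfused - fused) / 2**20:.1f} MiB of intermediate traffic avoided per forward pass")
# 96.0 MiB of intermediate traffic avoided per forward pass
```

Real kernels rarely reach this ideal, since tiling constraints can force partial results back to memory, so the number is best read as an upper bound on the traffic fusion can remove.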

HuggingFace also shows how to use PyTorch's torch.compile with the inductor backend to automatically detect and fuse these patterns without manual kernel writing. For developers who prefer more control, they provide a Triton kernel template that can be adapted to any MLP architecture.
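
The torch.compile route could look something like the sketch below. It is an assumption-laden illustration rather than the blog's code: build_mlp and compile_and_check are hypothetical helper names, the layer sizes are BERT-base-like defaults, and how much actually gets fused depends on the PyTorch version, the backend, and the hardware.

```python
import torch
import torch.nn as nn


def build_mlp(d_model: int = 768, d_ff: int = 3072) -> nn.Module:
    """Unfused reference block: Linear -> GELU -> Linear (assumed sizes)."""
    return nn.Sequential(
        nn.Linear(d_model, d_ff),
        nn.GELU(),
        nn.Linear(d_ff, d_model),
    ).eval()


def compile_and_check(model: nn.Module, x: torch.Tensor,
                      backend: str = "inductor", atol: float = 1e-4):
    """Compile the model and confirm it still matches eager-mode output.

    Returns the compiled module and a bool saying whether outputs agree.
    """
    compiled = torch.compile(model, backend=backend)
    with torch.no_grad():
        expected = model(x)
        actual = compiled(x)  # the first call triggers the actual compilation
    return compiled, torch.allclose(expected, actual, atol=atol)
```

A typical call would be compiled, ok = compile_and_check(build_mlp(), torch.randn(8, 768)). The first invocation is slow because compilation happens there, so any timing comparison should exclude it. Checking numerical agreement before benchmarking is a cheap guard against a fast but wrong kernel.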

Practical Implications for Production Deployments

For AI engineering teams, this means they no longer need to choose between writing complex CUDA code and leaving performance on the table. The fused approach works with existing HuggingFace transformer models and requires only minor changes to the model definition. The blog includes code snippets showing how to replace a standard MLP module with a fused version that is drop-in compatible with the transformers library.
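
A drop-in swap of this kind might be wired up as follows. This is a hedged sketch only: MLP, Block, and replace_modules are hypothetical names, the toy model stands in for a real transformers model, and torch.compile is used here as a stand-in for the blog's fused class, which is not reproduced.

```python
import torch
import torch.nn as nn


class MLP(nn.Module):
    """Stand-in for a model's feed-forward block (hypothetical name)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))


class Block(nn.Module):
    """Toy transformer layer without attention, just to host an MLP."""

    def __init__(self, d_model: int = 64, d_ff: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mlp = MLP(d_model, d_ff)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mlp(self.norm(x))


def replace_modules(root: nn.Module, target_cls: type, factory) -> int:
    """Recursively swap every target_cls child for factory(child); return the count."""
    replaced = 0
    for name, child in list(root.named_children()):
        if isinstance(child, target_cls):
            setattr(root, name, factory(child))
            replaced += 1
        else:
            replaced += replace_modules(child, target_cls, factory)
    return replaced


model = nn.Sequential(Block(), Block()).eval()
# Assumption: torch.compile stands in for the blog's fused MLP class.
count = replace_modules(model, MLP, torch.compile)
print(f"replaced {count} MLP blocks")
```

Passing the replacement as a factory keeps the traversal independent of any particular fused implementation, so the same helper could wrap a hand-written Triton module instead. With a real transformers model, target_cls would be that architecture's own MLP class.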

However, the post also warns that fusing is not a silver bullet. For small batch sizes or very short sequences, kernel launch overhead is minimal anyway, so the benefits shrink. Developers should profile their specific workloads before committing to fused kernels. The HuggingFace team recommends always verifying speedups with the PyTorch Profiler after making changes.
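
One way to act on this advice is to time the baseline and the candidate across several batch sizes before adopting anything. The sketch below uses PyTorch's torch.utils.benchmark module; the helper names, the default run time, and the idea of sweeping only over batch size are assumptions for illustration, not the blog's benchmarking script.

```python
import torch
import torch.nn as nn
from torch.utils import benchmark


def median_latency_ms(fn, x: torch.Tensor, min_run_time: float = 0.2) -> float:
    """Median time of fn(x) in milliseconds; Timer handles warm-up and CUDA sync."""
    timer = benchmark.Timer(stmt="fn(x)", globals={"fn": fn, "x": x})
    return timer.blocked_autorange(min_run_time=min_run_time).median * 1e3


def sweep(baseline, candidate, batch_sizes, d_model: int,
          device: str = "cpu", min_run_time: float = 0.2):
    """Return (batch, baseline_ms, candidate_ms, speedup) for each batch size."""
    rows = []
    with torch.no_grad():
        for batch in batch_sizes:
            x = torch.randn(batch, d_model, device=device)
            candidate(x)  # untimed call so any lazy compilation happens first
            base_ms = median_latency_ms(baseline, x, min_run_time)
            cand_ms = median_latency_ms(candidate, x, min_run_time)
            rows.append((batch, base_ms, cand_ms, base_ms / cand_ms))
    return rows
```

For example, with mlp as the eager module and fused = torch.compile(mlp), calling sweep(mlp, fused, [1, 8, 64, 512], d_model=768) yields one row per batch size. A speedup close to 1.0 at the small end is the signal that fusion is not paying off for that workload.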

Broader Context: The Race for Inference Efficiency

This blog post arrives at a time when the entire AI industry is focused on reducing inference costs. Companies like OpenAI, Anthropic, and Google DeepMind are all investing heavily in kernel fusion, quantization, and sparse attention to run models faster on existing hardware. By publishing these techniques openly, HuggingFace continues its mission to democratize access to state-of-the-art optimization tools.

For startups and mid-size AI companies without dedicated infrastructure teams, the fused MLP kernel offers a relatively low-effort way to cut costs without sacrificing accuracy. The blog also serves as a reminder that significant performance gains are often available through careful profiling and targeted optimization, not just by buying more GPUs.

What Developers Should Do Next

The HuggingFace team encourages developers to run the profiling and benchmarking scripts provided in the blog on their own models and hardware. They note that optimal kernel fusion strategies vary by GPU architecture (e.g., A100 vs. H100 vs. RTX 4090), so local benchmarking is essential. They also promise a follow-up post covering fusion for attention mechanisms, which could unlock even larger speedups for models like GPT-4 and LLaMA-2.
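
Because results differ by GPU architecture, it helps to record the hardware alongside every benchmark run. The snippet below is a small sketch of that bookkeeping; describe_environment and the dictionary keys are hypothetical names, and it reports only what PyTorch itself exposes.

```python
import platform

import torch


def describe_environment() -> dict:
    """Collect the details needed to compare benchmark results across machines."""
    info = {
        "torch_version": torch.__version__,
        "python": platform.python_version(),
        "device": "cpu",
    }
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        info.update(
            device="cuda",
            gpu_name=torch.cuda.get_device_name(0),
            # For reference: A100 reports 8.0, RTX 4090 reports 8.9, H100 reports 9.0.
            compute_capability=f"{major}.{minor}",
        )
    return info


print(describe_environment())
```

Storing this dictionary next to the timing numbers makes it clear which results came from which card when comparing runs later.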

For now, the fused MLP technique is available as a self-contained class in the blog's repository, ready for developers to copy directly into their projects. Given the reported performance benefits and the small integration effort, it could become a common component in PyTorch-based production pipelines.

Source: HuggingFace Blog. This article was produced with AI assistance and reviewed for accuracy.

About James Whitfield

James Whitfield is a senior software engineer with 8 years of experience building developer tools, CLI applications, and IDE extensions. He has contributed to open source projects including VS Code extensions and GitHub Actions workflows. He currently covers AI developer tools, coding assistants, and platform engineering for AI Herald.
