IBM and HuggingFace Introduce ScarfBench to Measure AI Agent Performance in Java Framework Migration
IBM Research and HuggingFace have released ScarfBench, a specialized benchmarking suite designed to evaluate how well AI agents handle the complex, real-world task of migrating enterprise Java applications from one framework to another. Announced via HuggingFace's official blog, ScarfBench directly addresses a critical gap in AI agent evaluation: the absence of standardized, enterprise-grade benchmarks for legacy code modernization.
What ScarfBench Measures
ScarfBench covers two migration paths across three major enterprise Java frameworks: Jakarta EE (formerly Java EE) to Quarkus, and Spring Boot to Quarkus. The benchmark includes 250 migration tasks sourced from real-world enterprise applications. Each task requires an AI agent to understand the source code, identify framework-specific API calls, annotations, and configuration patterns, and produce a functionally equivalent codebase on the target framework.
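To make "framework-specific annotations" concrete, here is a minimal sketch of a naive annotation lookup for the Spring Boot to Quarkus direction. The class `AnnotationMapping` and its methods are hypothetical and are not part of ScarfBench. The table lists only a few commonly cited equivalents, and it deliberately ignores imports, annotation attributes, and build files, which is exactly where real migrations get hard.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a few commonly cited Spring-to-Quarkus annotation
// equivalents. This table is NOT part of ScarfBench and is far from complete.
class AnnotationMapping {

    // Hypothetical helper: returns the lookup table in insertion order.
    static Map<String, String> springToQuarkus() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("@Autowired", "@Inject");          // injection point (CDI)
        m.put("@Service", "@ApplicationScoped"); // bean stereotype becomes a CDI scope
        m.put("@RestController", "@Path");       // REST resources use Jakarta REST
        m.put("@GetMapping", "@GET");            // usually paired with @Path
        m.put("@Value", "@ConfigProperty");      // MicroProfile Config
        return m;
    }

    // Naively substitutes annotation names in the source text. Real migrations
    // also need import, attribute, and build-file changes that this ignores.
    static String rewrite(String source) {
        String out = source;
        for (Map.Entry<String, String> e : springToQuarkus().entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        String spring = "@Service\nclass Greeter {\n    @Autowired Clock clock;\n}";
        System.out.println(rewrite(spring));
    }
}
```

A text substitution like this handles only the simplest cases. Differences in attributes, lifecycle, and configuration are what a benchmark of this kind is meant to probe.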
The benchmark goes beyond simple code translation. It evaluates agents on four key dimensions: correctness of the migrated code, adherence to target framework conventions, handling of dependency injection and lifecycle management, and test generation to verify functional equivalence. Agents are scored by their pass rate against a hidden test suite. The best-performing agents achieved a 72% pass rate on Jakarta EE to Quarkus migrations but only 39% on Spring Boot to Quarkus tasks.
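The per-path pass rate described above is simple arithmetic: tasks passed divided by tasks attempted, grouped by migration path. The sketch below shows one way to compute it. The `PassRates` and `Outcome` names are hypothetical, and the values in `main` are toy data, not ScarfBench results.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PassRates {

    // One task outcome: its migration path and whether the migrated project
    // passed its tests. Field names are hypothetical.
    static final class Outcome {
        final String path;
        final boolean passed;

        Outcome(String path, boolean passed) {
            this.path = path;
            this.passed = passed;
        }
    }

    // Returns the pass rate (0.0 to 1.0) for each migration path.
    static Map<String, Double> byPath(List<Outcome> outcomes) {
        Map<String, int[]> counts = new LinkedHashMap<>(); // [passed, total]
        for (Outcome o : outcomes) {
            int[] c = counts.computeIfAbsent(o.path, k -> new int[2]);
            if (o.passed) {
                c[0]++;
            }
            c[1]++;
        }
        Map<String, Double> rates = new LinkedHashMap<>();
        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            rates.put(e.getKey(), (double) e.getValue()[0] / e.getValue()[1]);
        }
        return rates;
    }

    public static void main(String[] args) {
        // Toy data for illustration only.
        List<Outcome> toy = List.of(
            new Outcome("jakarta-to-quarkus", true),
            new Outcome("jakarta-to-quarkus", false),
            new Outcome("spring-to-quarkus", false));
        System.out.println(byPath(toy));
    }
}
```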
Why This Matters for AI Developers
According to the HuggingFace announcement, the gap in agent performance between the two migration paths points to a key weakness: current AI agents struggle most with framework-specific idioms and patterns that are poorly represented in their training data. Spring Boot is widely covered in documentation and forums, yet migrating from it to Quarkus, a less documented but increasingly popular choice for cloud-native Java applications, still posed significant challenges for agents.
For AI developers building code generation tools, ScarfBench provides a much-needed reality check. The benchmark files include complete project structures, build scripts, and test cases, which makes evaluations reproducible across different agent architectures. Early results reported in the blog post indicate that chain-of-thought prompting and retrieval-augmented generation (RAG) with framework-specific documentation yield the best results, but no single approach dominates across all migration types.
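As a rough illustration of the retrieval step in RAG, the sketch below ranks documentation snippets by how many distinct words they share with a query and then prepends the best matches to the task prompt. This is an assumption-laden toy: the `DocRetriever` class is hypothetical, and production systems typically use embedding-based search rather than the word overlap shown here. The article does not describe how the evaluated agents implement retrieval.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical lexical retriever; not the method used by any ScarfBench agent.
class DocRetriever {

    // Lowercases the text and splits it on non-alphanumeric characters.
    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>(
            Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")));
        out.remove("");
        return out;
    }

    // Scores a snippet by the number of distinct tokens it shares with the query.
    static int overlap(String query, String snippet) {
        Set<String> shared = tokens(query);
        shared.retainAll(tokens(snippet));
        return shared.size();
    }

    // Returns up to k snippets with the highest overlap, best first.
    static List<String> topK(String query, List<String> snippets, int k) {
        List<String> sorted = new ArrayList<>(snippets);
        sorted.sort(Comparator.comparingInt((String s) -> overlap(query, s)).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    // Prepends the retrieved documentation to the task description.
    static String buildPrompt(String task, List<String> docs) {
        return "Reference documentation:\n" + String.join("\n", docs)
            + "\n\nTask:\n" + task;
    }
}
```

An agent would call `topK` with the migration task as the query and a collection of target-framework documentation as the snippets, then pass the result of `buildPrompt` to the model.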
Implications for Enterprise Software Modernization
For businesses running legacy Java applications, the ability to automatically migrate from older frameworks like Jakarta EE to modern, cloud-optimized ones like Quarkus could dramatically reduce the time and cost of modernization. Traditionally, such migrations are labor-intensive and require deep expertise in both the source and target frameworks. An AI agent capable of automating even a portion of this work would be a significant productivity multiplier.
However, the sub-40% pass rate on Spring Boot to Quarkus migrations signals that AI agents are not yet ready for unsupervised, production-grade refactoring. Enterprises should view these tools as assistants that flag migration issues, suggest code transformations, and generate test cases for developers, rather than as replacements for human judgment.
Technical Architecture of ScarfBench
The benchmark implementation draws on HuggingFace's ecosystem of datasets, models, and evaluation tools. Each migration task is packaged as a structured dataset with source code, specification files, and a hidden test suite. Agents interact with the benchmark through a standardized API that allows them to read files, execute commands, and submit changes. This design enables fair comparison across different AI architectures, including both API-based models and locally hosted open-source models.
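The article does not give the API's actual shape, so the following is only a sketch of what a read/execute/submit interface could look like. `AgentEnvironment`, `InMemoryEnvironment`, and every method name here are hypothetical. The in-memory implementation records commands instead of running them, so the example works without a real project or shell.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface covering the three capabilities the article describes.
// Names and signatures are assumptions, not the ScarfBench API.
interface AgentEnvironment {
    String readFile(String path);
    String runCommand(String command);
    void submitChange(String path, String newContent);
}

// In-memory stand-in so the sketch runs without a real project or shell.
class InMemoryEnvironment implements AgentEnvironment {
    private final Map<String, String> files = new HashMap<>();
    private final List<String> commandLog = new ArrayList<>();

    InMemoryEnvironment(Map<String, String> initialFiles) {
        files.putAll(initialFiles);
    }

    @Override
    public String readFile(String path) {
        String content = files.get(path);
        if (content == null) {
            throw new IllegalArgumentException("No such file: " + path);
        }
        return content;
    }

    @Override
    public String runCommand(String command) {
        // A real harness would execute this in a sandbox; here it is only logged.
        commandLog.add(command);
        return "(not executed) " + command;
    }

    @Override
    public void submitChange(String path, String newContent) {
        files.put(path, newContent);
    }

    // Exposes the recorded commands for inspection.
    List<String> commands() {
        return commandLog;
    }
}
```

Keeping the agent behind a narrow interface like this is what lets the same tasks be run against different models: the harness controls file access and command execution, and only the agent logic changes.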
The choice to focus on Java is strategic. Java remains the dominant language for enterprise backend systems, with an estimated 9 million developers worldwide. Framework migration is one of the most common and painful modernization tasks, making it an ideal test case for AI agent capabilities.
Key Takeaways for Developers
- ScarfBench is available now on HuggingFace under an Apache 2.0 license, making it free for commercial and research use.
- The benchmark currently supports two migration paths: Jakarta EE to Quarkus and Spring Boot to Quarkus.
- Best-performing agents use RAG with framework documentation and chain-of-thought prompting.
- Pass rates for the best-performing agents range from 39% (Spring Boot → Quarkus) to 72% (Jakarta EE → Quarkus), indicating significant room for improvement.
- The benchmark suite includes 250 tasks with test cases, enabling reproducible evaluation.
Looking Ahead
ScarfBench is a welcome addition to the growing ecosystem of AI agent benchmarks. Unlike simpler code generation tasks (e.g., writing sorting algorithms or API endpoints), framework migration demands deep understanding of both source and target ecosystems, including dependency management, configuration, and build tooling. As AI models continue to improve in reasoning and code understanding, benchmarks like ScarfBench provide the necessary evaluation rigor to ensure progress is meaningful in real-world contexts.
For organizations evaluating AI agents for software modernization, ScarfBench offers a practical starting point. The benchmark's reproducibility and domain-specificity may also inspire similar benchmarks in other enterprise languages and frameworks, such as .NET migration or legacy COBOL modernization, where the need is equally pressing.
Source: HuggingFace. This article was produced with AI assistance and reviewed for accuracy.