HuggingFace and ServiceNow Unveil MosaicLeaks: A New Benchmark for AI Agent Privacy
In a groundbreaking study published on the HuggingFace blog, researchers from ServiceNow and academic collaborators have unveiled MosaicLeaks—a new evaluation framework designed to test whether AI-powered research agents can keep secrets. The benchmark, which simulates real-world scenarios where agents access sensitive data, reveals that state-of-the-art models, including GPT-4o and Gemini 1.5 Pro, inadvertently leak private information up to 40% of the time. This finding sends a clear warning to developers deploying autonomous agents in enterprise environments where confidentiality is paramount.
What MosaicLeaks Tests and How It Works
The MosaicLeaks framework presents agents with a series of tasks that require them to process and summarize internal documents, emails, or databases—some of which contain explicitly marked confidential information. The agent is then asked to generate a public-facing report. The benchmark measures whether the agent inadvertently includes sensitive details, such as employee salaries, trade secrets, or unreleased product specs, in the final output. According to the HuggingFace blog post, the researchers used 500 test cases across three categories: enterprise communications, proprietary research, and personal identifiable information (PII).
The results are sobering. GPT-4o leaked secrets in 38% of test cases, while Gemini 1.5 Pro fared slightly better at 32%. Smaller open-source models like Llama 3.1 70B exhibited leak rates exceeding 45%. The researchers also discovered that simply prompting the agent with 'do not share confidential information' reduced leaks by only 5–10 percentage points, indicating that current alignment techniques are insufficient for this complex challenge.
Why This Matters for AI Developers and Enterprise Adoption
The MosaicLeaks research arrives at a critical time. Gartner predicts that by 2027, 60% of enterprises will deploy AI agents for internal knowledge retrieval and report generation. Yet, as these agents gain access to corporate wikis, HR databases, and financial records, the risk of accidental data exposure escalates. For developers, this means that building a 'research agent' is not just a matter of integrating a large language model (LLM) with a retrieval-augmented generation (RAG) pipeline—it requires a security-first design approach.
The implications extend beyond simple leaks. For example, an agent tasked with summarizing a quarterly review might inadvertently include a colleagues performance score that was marked as confidential. In regulated industries like healthcare or finance, such leaks could lead to compliance violations under GDPR, HIPAA, or SOC 2. The MosaicLeaks study provides a baseline for developers to test their own agents, but the researchers caution that the benchmark is not exhaustive.
Attack Vectors: Why Agents Leak Secrets
The HuggingFace blog post identifies three primary failure modes. First, 'contextual bleed-through' occurs when the agent does not properly separate public and private context windows. If a model processes a confidential email immediately before generating a public summary, it may drift into including sensitive specifics. Second, 'instruction forgetting' happens when longer conversations cause the model to drop earlier privacy constraints. Third, 'implicit inference' is the most insidious: the agent may not explicitly mention a trade secret, but its language—such as referring to 'the new quantum computing chip'—can still reveal its existence.
To combat these, the researchers propose several mitigation strategies. One is the use of constitutional AI-style rules that are reinforced at every turn of the dialogue. Another is the implementation of a 'privacy filter' layer that runs output through a secondary classifier trained to detect sensitive patterns. However, these have not yet been tested at scale, and the MosaicLeaks benchmark is open-source for the community to contribute improvements.
What Developers Should Do Now
For teams building research agents, MosaicLeaks serves as both a warning and a tool. Developers should integrate this benchmark into their CI/CD pipelines to test any agent that handles mixed-access data. The framework is available on GitHub as a Python package with configurable test suites for different industries. Additionally, the study recommends a tiered access control model: the agent should only receive documents with a minimum privacy classification, and any output should be audited by a differential privacy mechanism before release.
The open-source nature of MosaicLeaks is a deliberate choice by ServiceNow and HuggingFace. By making the test cases and evaluation scripts public, they invite the AI community to build upon the work. According to the blog, future versions will expand to include multi-agent scenarios where one agent might leak secrets to another via chain-of-thought reasoning.
The Bottom Line for AI Security
The MosaicLeaks study is a pivotal moment for AI security research. It moves the conversation from whether models can generate accurate answers to whether they can be trusted with sensitive information. For businesses, the answer is, not yet. But with targeted benchmarks like this, the path to secure agent deployment becomes clearer. Developers who ignore these findings risk more than just a bug—they risk data breach headlines of their own.
Source: HuggingFace Blog. This article was produced with AI assistance and reviewed for accuracy. Editorial standards.