Beyond the PDF: Why Documents Are Failing AI
For decades, the humble document—PDF, Word file, webpage—has been the bedrock of information sharing. Yet as AI systems demand ever more structured, fluid, and tractable data, the document paradigm is showing its age. A new specification, the MMM Data Model (arXiv:2607.00032), proposes a radical alternative: a normative framework for a decentralised knowledge commons that treats knowledge as a network of interconnected, versioned statements rather than static pages. According to the authors, this shift could unlock new levels of interoperability for both human and machine agents.
What Is the MMM Data Model?
Published on arXiv, the MMM Data Model (short for “Minimal, Modular, and Mutable”) defines a formal data structure for representing knowledge in a way that is both machine-readable and human-friendly. The key innovations are threefold:
- Minimal: The core schema is tiny—just a few entity types (Assertion, Agent, Context, and Bundle)—to lower the barrier for adoption.
- Modular: Any piece of knowledge can be composed, decomposed, and repurposed across different contexts without duplication.
- Mutable: Statements are versioned and time-stamped, allowing updates without breaking references to earlier versions—a critical feature for live knowledge bases.
The model abandons the “document” as the primary unit. Instead, it uses atomic assertions (subject-predicate-object triples with metadata) bundled into graphs. These bundles can be cryptographically signed, stored on distributed ledgers, or shared via peer-to-peer networks, making them naturally suited for decentralised applications.
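To make the assertion-and-bundle idea concrete, here is a minimal Python sketch. It is not taken from the specification: the field names, the `Assertion` class, and the `bundle_digest` helper are illustrative assumptions. It shows one way a bundle of triples with metadata could be given a stable fingerprint that a signature scheme might then sign.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Assertion:
    """One subject-predicate-object statement plus metadata (hypothetical fields)."""
    subject: str
    predicate: str
    obj: str
    agent: str      # who made the statement
    context: str    # where or under what conditions it holds
    created: str    # ISO 8601 timestamp


def bundle_digest(assertions):
    """Return a SHA-256 hex digest over a canonical JSON encoding of the bundle."""
    # Sort the records so the digest does not depend on assertion order.
    records = sorted(
        (asdict(a) for a in assertions),
        key=lambda r: json.dumps(r, sort_keys=True),
    )
    payload = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Example data; all identifiers are made up for illustration.
first = Assertion("ex:aspirin", "ex:treats", "ex:headache",
                  "ex:alice", "ex:trial-1", "2026-01-01T00:00:00+00:00")
second = Assertion("ex:aspirin", "ex:hasForm", "ex:tablet",
                   "ex:bob", "ex:catalog", "2026-01-02T00:00:00+00:00")
digest = bundle_digest([first, second])
```

A digest alone is not a signature. In a real deployment the digest would be signed with an agent's private key, and the canonical encoding would follow whatever rules the specification defines.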
Why This Matters for AI Developers
Current AI pipelines handle unstructured documents poorly. A typical RAG system must chunk, embed, and retrieve documents, often losing context and provenance along the way. The MMM model offers a cleaner alternative: each assertion is a first-class citizen with its own URI, creation timestamp, and source attribution. For developers building knowledge-grounded LLM applications, this means:
- Fine-grained provenance: Every fact can be traced back to its originator, which makes generated answers easier to check against their sources and enables audit trails.
- Effortless updates: When a fact changes (e.g., a product price or a scientific result), only the relevant assertion needs updating, not an entire document corpus.
- Decentralised collaboration: Because bundles are self-contained and verifiable, multiple agents (humans or AIs) can contribute to the same knowledge commons without a central server.
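The update and provenance behaviour described in this list can be sketched as follows. This is an assumption-laden illustration, not the paper's versioning rules: `VersionedAssertion`, `AssertionStore`, and the `supersedes` field are hypothetical names. A new version points back at the one it replaces, so older references keep resolving.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class VersionedAssertion:
    uri: str                          # identifier of this specific version
    subject: str
    predicate: str
    obj: str
    source: str                       # originator of the fact
    created: str                      # ISO 8601 timestamp
    supersedes: Optional[str] = None  # URI of the version this one replaces


class AssertionStore:
    """In-memory store that never overwrites an earlier version."""

    def __init__(self) -> None:
        self._by_uri: Dict[str, VersionedAssertion] = {}
        self._latest: Dict[Tuple[str, str], str] = {}

    def add(self, assertion: VersionedAssertion) -> None:
        if assertion.uri in self._by_uri:
            raise ValueError(f"version already exists: {assertion.uri}")
        self._by_uri[assertion.uri] = assertion
        # The most recently added version becomes the current one.
        self._latest[(assertion.subject, assertion.predicate)] = assertion.uri

    def get(self, uri: str) -> VersionedAssertion:
        return self._by_uri[uri]

    def latest(self, subject: str, predicate: str) -> VersionedAssertion:
        return self._by_uri[self._latest[(subject, predicate)]]

    def history(self, uri: str) -> List[VersionedAssertion]:
        """Walk the supersedes chain from the given version back to the first."""
        chain: List[VersionedAssertion] = []
        current: Optional[str] = uri
        while current is not None:
            assertion = self._by_uri[current]
            chain.append(assertion)
            current = assertion.supersedes
        return chain


# Example: a product price changes, and only one assertion is added.
store = AssertionStore()
store.add(VersionedAssertion("ex:price/v1", "ex:widget", "ex:price", "10.00",
                             "ex:catalog-team", "2026-01-01T00:00:00+00:00"))
store.add(VersionedAssertion("ex:price/v2", "ex:widget", "ex:price", "12.00",
                             "ex:catalog-team", "2026-02-01T00:00:00+00:00",
                             supersedes="ex:price/v1"))
```

Here `store.latest("ex:widget", "ex:price")` returns the second version, while `store.get("ex:price/v1")` still resolves, which is the property that keeps existing references intact.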
The model also explicitly addresses knowledge interoperability, a problem that has long frustrated developers trying to merge datasets from different sources. By standardising the “how” of knowledge packaging, MMM lets developers focus on the “what” of knowledge content.
Implications for Business and Data Strategy
For business professionals, the MMM model signals a move toward treating knowledge as an asset that can be composed, traded, and audited with the same rigour as financial data. Companies that currently hoard information in siloed document repositories could begin to expose their knowledge as MMM bundles, enabling:
- Faster integration of M&A data from multiple legacy systems.
- Automated compliance reporting, where each assertion carries its own regulatory context.
- Insights from combining internal datasets with public knowledge commons (e.g., scientific literature, regulatory filings) without replicating monolithic documents.
The model’s emphasis on decentralisation also aligns with growing interest in data sovereignty. Organisations can maintain control over their assertions while still contributing to a broader commons—a balance that proprietary platforms have struggled to achieve.
Challenges and Open Questions
No real-world standard succeeds on technical merit alone. The MMM model must compete with established formats and vocabularies such as JSON-LD, Schema.org, and the W3C’s Web Annotation Data Model. Its minimalism is a strength, but also a risk: too few constructs may force implementers to reinvent wheels for richer semantics. Moreover, achieving “widespread contribution” (the paper’s stated goal) requires tooling (editors, query engines, and indexing services) that doesn’t yet exist outside the research community.
The authors acknowledge these hurdles, positioning MMM as a “normative specification” intended to guide, not mandate, implementation. Early experiments will likely target scientific publishing (where versioning and attribution are critical) and decentralised AI agent networks (where trustlessness matters).
What Developers Can Do Right Now
The full paper (arXiv:2607.00032) includes a formal specification and worked examples. For developers interested in early experimentation:
- Read the specification to understand the assertion graph structure and versioning rules.
- Try converting a small RAG dataset into MMM bundles using the provided JSON-LD context.
- Evaluate how the model’s provenance features could simplify your current knowledge‑base auditing.
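As a starting point for the conversion step, the sketch below wraps each RAG chunk as one assertion-like record and collects the records into a bundle. Everything here is an assumption made for illustration: the chunk keys (`text`, `source_url`, `retrieved_at`, `author`), the `ex:states` predicate, the record layout, and the placeholder context URL, which stands in for whatever context the specification actually provides.

```python
import json
import uuid

# Placeholder only: substitute the context published with the specification.
PLACEHOLDER_CONTEXT = "https://example.org/mmm-context.jsonld"


def chunk_to_assertion(chunk, index):
    """Wrap one RAG chunk as a single assertion-like record (hypothetical layout)."""
    source = chunk["source_url"]
    # uuid5 is deterministic, so re-running the conversion yields the same IDs.
    assertion_id = uuid.uuid5(uuid.NAMESPACE_URL, f"{source}#{index}")
    return {
        "id": f"urn:uuid:{assertion_id}",
        "subject": source,
        "predicate": "ex:states",
        "object": chunk["text"],
        "agent": chunk.get("author", "unknown"),
        "created": chunk["retrieved_at"],
    }


def chunks_to_bundle(chunks):
    """Collect the converted chunks under a single bundle dictionary."""
    return {
        "@context": PLACEHOLDER_CONTEXT,
        "assertions": [chunk_to_assertion(c, i) for i, c in enumerate(chunks)],
    }


# Example input in a made-up chunk format.
chunks = [
    {"text": "Widgets ship in boxes of ten.",
     "source_url": "https://example.org/docs/widgets",
     "retrieved_at": "2026-01-01T00:00:00+00:00",
     "author": "ex:docs-team"},
    {"text": "Returns are accepted within 30 days.",
     "source_url": "https://example.org/docs/returns",
     "retrieved_at": "2026-01-01T00:00:00+00:00"},
]
bundle = chunks_to_bundle(chunks)
print(json.dumps(bundle, indent=2))
```

Treating a whole chunk as one statement is the coarsest possible mapping. A fuller conversion would split each chunk into finer-grained triples, which is where the per-assertion provenance described above would start to pay off.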
The MMM Data Model won’t replace documents overnight, but it offers a concrete path toward a commons where AI agents and humans share knowledge as natively interconnected, verifiable statements. That future may not be here yet, but the blueprint is now public.
Source: Arxiv AI. This article was produced with AI assistance and reviewed for accuracy.