Building a scalable document management system: Lessons from separating metadata and content
Wednesday, November 19, 2025, 10:09, by InfoWorld
When I was tasked with modernizing our enterprise document management system, I knew the stakes were high. We were managing millions of documents, and our legacy architecture was buckling under the load. Query times were unpredictable, scaling was a constant battle and costs were spiraling out of control.
The breakthrough came when I stopped thinking about documents as monolithic entities and started viewing them as two distinct workloads: high-frequency metadata queries and low-frequency content operations. This architectural shift transformed our system from a performance bottleneck into a horizontally scalable, cost-efficient platform that consistently delivers sub-300ms query times even under heavy load.

The core insight: Not all data is created equal

The fundamental problem with traditional document management systems is that they treat metadata and content as a single unit. When a user searches for “all employment contracts for customer X,” the system must wade through both the searchable attributes and the heavyweight file content, even though the search only needs the metadata.

I realized that these two types of data have completely different performance characteristics. Metadata operations are classic Online Transaction Processing (OLTP) workloads: frequent, small, latency-sensitive transactions. Content operations are the opposite: infrequent, large, bandwidth-intensive transfers that can tolerate higher latency.

By separating these workloads, I could optimize each independently. Metadata went into a high-performance NoSQL database — choosing from options like Amazon DynamoDB, Google Cloud Firestore or Azure Cosmos DB based on your cloud provider — configured in on-demand mode for automatic scaling. Document content lives in commodity cloud object storage such as Amazon S3, Google Cloud Storage or Azure Blob Storage. Each document gets a globally unique identifier (unique document ID) that serves as the link between its metadata record and its file in object storage.

The performance impact was immediate. Our metadata queries dropped from seconds to ~200 milliseconds at the median, with 95th percentile latency under 300ms. More importantly, these numbers stayed consistent even as we scaled to handle over 4,000 requests per minute with zero errors.

Making the API layer do the heavy lifting

The separation of metadata and content only works if you enforce it architecturally, which is where an API-first design becomes critical. I created distinct REST endpoints for metadata operations versus content operations. When a client queries “show me all documents for member 12345,” that request hits the metadata API, retrieves the results from the fast NoSQL database and returns them without ever touching object storage.

Here’s how I structured the API:

Metadata APIs:
GET /api/v1/members/{member-id}/document-metadata: Retrieve all document metadata for a specific member
GET /api/v1/members/{member-id}/document-metadata/{unique-document-id}: Retrieve metadata for a single document
PATCH /api/v1/members/{member-id}/document-metadata/{unique-document-id}: Update metadata for a single document

Document APIs:
POST /api/v1/members/{member-id}/documents: Upload a new document
GET /api/v1/members/{member-id}/documents/{unique-document-id}: Retrieve a document
DELETE /api/v1/members/{member-id}/documents/{unique-document-id}: Delete a document

The content is only fetched when explicitly requested by a unique document ID. This explicit separation prevents the system from accidentally coupling the two workloads. It also creates a stable contract that allows the frontend and backend to evolve independently. Beyond performance, the API layer became our enforcement point for security, rate limiting and request validation.
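To make the split concrete, here is a minimal sketch of the two request paths, assuming a Python API tier built with FastAPI and using Amazon DynamoDB and S3 as stand-ins for the NoSQL and object stores (the article deliberately leaves the provider open). The table, index and bucket names are illustrative, and returning a presigned URL for downloads is one common option rather than something the article specifies.

# Sketch only: assumes FastAPI, boto3, and DynamoDB/S3 as the concrete stores.
# Path parameters use underscores so they are valid Python identifiers.
import boto3
from boto3.dynamodb.conditions import Key
from fastapi import FastAPI

app = FastAPI()
metadata_table = boto3.resource("dynamodb").Table("document-metadata")  # placeholder table name
s3 = boto3.client("s3")
DOCUMENT_BUCKET = "document-content"  # placeholder bucket name

@app.get("/api/v1/members/{member_id}/document-metadata")
def list_document_metadata(member_id: str):
    # Metadata path: served entirely from the NoSQL store via a secondary
    # index keyed on member_id; object storage is never touched.
    response = metadata_table.query(
        IndexName="member_id-index",  # placeholder index name
        KeyConditionExpression=Key("member_id").eq(member_id),
    )
    return response["Items"]

@app.get("/api/v1/members/{member_id}/documents/{unique_document_id}")
def get_document(member_id: str, unique_document_id: str):
    # Content path: only invoked for a specific unique document ID; hands back
    # a short-lived presigned URL instead of streaming bytes through the API tier.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": DOCUMENT_BUCKET, "Key": f"{member_id}/{unique_document_id}"},
        ExpiresIn=300,
    )
    return {"download_url": url}

Whatever the stack, the property worth preserving is that the metadata route never touches object storage, while the content route does nothing more than hand back a reference to a single object.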
We implemented OpenID Connect Authorization Code Flow with Proof Key for Code Exchange (PKCE) for the user-facing single-page application (SPA) and OAuth 2.0 Client Credentials Flow for machine-to-machine (M2M) communication, keeping the architecture cloud-agnostic by relying on standards rather than proprietary identity services. Authorization is enforced using Role-Based Access Control (RBAC), with the API layer serving as the policy enforcement point that validates a user’s roles and permissions before granting access to the underlying data stores.

All communication uses TLS 1.2 or higher. Both the metadata datastore and object store have server-side encryption enabled. By relying on standardized protocols and universally available features rather than provider-specific key management services, the architecture remains portable across cloud platforms.

The data modeling decisions that matter

Getting the NoSQL data model right was crucial. Instead of storing verbose strings like “Legal” or “Employment Contract” in every metadata record, I used numeric identifiers (e.g., 101 for “Legal,” 10110 for “Employment Contract”) that reference a separate, cached category table. This normalization reduced storage costs and made updates trivial. Want to rename a category or add multi-language support? Update the reference table once instead of millions of document records.

Here’s what a typical metadata record looks like:

{
  "unique_document_id": "aGVsbG93b3JsZA==",
  "member_id": "123456",
  "file_name": "employment_contract.pdf",
  "document_category_id": 101,
  "document_subcategory_id": 10110,
  "document_extension": ".pdf",
  "document_size_in_bytes": 245678,
  "date_added": "2025-09-20T12:11:01Z",
  "date_updated": "2025-09-21T15:22:00Z",
  "created_by_user_id": "u-01",
  "updated_by_user_id": "u-02",
  "notes": "Signed by both parties"
}

For query patterns, I leveraged secondary indexes aggressively. While the primary table uses the unique document ID as its key, a secondary index organized by member ID and document category enables efficient queries like “retrieve all documents of a certain category for a given member” without expensive table scans.

The schema-on-read model of NoSQL proved invaluable for evolution. When we needed to add a new optional metadata field, there was no risky ALTER TABLE statement or downtime. New documents simply started including the attribute, while existing documents continued working without it. This agility allowed us to respond to new requirements in hours instead of weeks.

Building in disaster recovery and data resiliency

A comprehensive disaster recovery strategy was essential for business continuity. I incorporated resiliency at both the metadata and content layers.

For the metadata store, I enabled Point-in-Time Recovery (PITR), a feature available across major managed NoSQL services. PITR continuously backs up data, allowing the database to be restored to any specific second within a retention window (typically 7 to 35 days). This protects against logical data corruption from accidental writes or deletions.

For document content, I implemented object versioning and cross-region replication. Versioning preserves previous versions of files, protecting against accidental overwrites or deletions. Cross-region replication automatically copies objects to storage in a different geographical region, ensuring data availability even during a regional outage.
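As a rough illustration of how these resiliency features are enabled, the following sketch assumes DynamoDB and S3 via boto3; the table, bucket and IAM role names are placeholders, and the other providers named above expose equivalent settings through their own APIs.

# Sketch only: enables PITR on the metadata table plus versioning and
# cross-region replication on the content bucket (placeholder names throughout).
import boto3

dynamodb = boto3.client("dynamodb")
s3 = boto3.client("s3")

# Continuous backups with Point-in-Time Recovery on the metadata table.
dynamodb.update_continuous_backups(
    TableName="document-metadata",
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)

# Object versioning protects content against accidental overwrites and deletes.
s3.put_bucket_versioning(
    Bucket="document-content",
    VersioningConfiguration={"Status": "Enabled"},
)

# Cross-region replication copies new objects to a bucket in another region;
# the destination bucket must also have versioning enabled.
s3.put_bucket_replication(
    Bucket="document-content",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/replication-role",  # placeholder role
        "Rules": [
            {
                "ID": "replicate-all",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::document-content-replica"},
            }
        ],
    },
)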
This multi-layered approach to disaster recovery — combining PITR for metadata with versioning and replication for content — provides robust protection against a wide range of failure scenarios.

Managing the long-term lifecycle

One critical design decision was how to handle deletions. Rather than flagging records as deleted in the active table, I implemented an “archive-on-delete” pattern. When a document is deleted, its metadata record moves entirely to a separate archive table, while the file transitions to low-cost archival storage tiers like Amazon S3 Glacier, Azure Blob Storage Archive or Google Cloud Storage Archive (see the sketch after the results table below).

This approach keeps the active metadata table lean and fast by containing only active records. It also drives significant cost savings — archival storage costs a fraction of standard storage, and we’re not paying for hot database capacity to store cold data. All major cloud providers support these archival tiers with similar pricing models, so the pattern remains cloud-agnostic.

The trade-offs worth making

No architecture is perfect, and this one embraces eventual consistency as a deliberate trade-off for horizontal scalability. In practice, this means there’s occasionally a sub-second window where a newly written document might not immediately appear in a query if it hits a different database replica. For our use case — a document management system where humans are the primary users — this delay is imperceptible. For the rare programmatic workflows requiring immediate read-after-write consistency, NoSQL services typically offer strongly consistent reads on demand, though at slightly higher latency and cost.

This architecture isn’t ideal for every scenario. If you need complex, ad-hoc transactional queries across many metadata attributes, a traditional relational database might be better. But for high-volume content repositories, digital asset management and customer-facing applications with read-heavy workloads, the scalability and cost benefits are transformative.

The results: Performance and economics

The performance testing validated the architectural decisions. Under sustained load exceeding 4,000 requests per minute, our metadata retrieval API (returning 214 documents per member) maintained excellent performance:

Table 1. Performance metrics under sustained load

Metric                           Observed Value        Notes
Throughput                       4,000 requests/min    Sustained throughput under auto-scaled load
Median Latency (p50)             ~200 ms               Metadata retrieval for 214 documents per member
95th Percentile Latency (p95)    < 300 ms
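The archive-on-delete flow referenced above can be sketched as follows, again assuming DynamoDB and S3; table and bucket names are hypothetical, error handling is omitted, and a bucket lifecycle rule could replace the explicit storage-class copy.

# Sketch only: moves the metadata record to an archive table and rewrites the
# object into an archival storage class (placeholder names throughout).
import boto3

dynamodb = boto3.resource("dynamodb")
active_table = dynamodb.Table("document-metadata")
archive_table = dynamodb.Table("document-metadata-archive")
s3 = boto3.client("s3")
DOCUMENT_BUCKET = "document-content"

def archive_on_delete(member_id: str, unique_document_id: str) -> None:
    # Delete from the active table and capture the old item in the same call,
    # then write that item to the archive table so the active table stays lean.
    deleted = active_table.delete_item(
        Key={"unique_document_id": unique_document_id},
        ReturnValues="ALL_OLD",
    ).get("Attributes")
    if deleted:
        archive_table.put_item(Item=deleted)

    # Copy the object onto itself with an archival storage class; changing the
    # storage class is what makes an in-place copy valid.
    key = f"{member_id}/{unique_document_id}"
    s3.copy_object(
        Bucket=DOCUMENT_BUCKET,
        Key=key,
        CopySource={"Bucket": DOCUMENT_BUCKET, "Key": key},
        StorageClass="GLACIER",
    )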