Using Cosmos DB in Microsoft Fabric
Thursday, August 28, 2025, 11:00, by InfoWorld
After announcing support for Cosmos DB inside the Microsoft Fabric data platform at Build 2025, Microsoft recently opened a public preview of the new service. This release gives Cosmos DB a new role in the wider scope of Microsoft’s data estate, adding it to Fabric’s roster of operational data sources.
Services like Microsoft Fabric have changed how we work with operational data, going beyond the sparse relational model of the data warehouse to encapsulate untransformed data in data lakes and support structured, unstructured, and relational data alongside NoSQL. With Fabric, if it can store your data, you can use it to analyze your data. Microsoft is positioning the tool to provide the at-scale grounding needed to support enterprise AI applications, using Fabric’s data agents as a platform for delivering that data from multiple sources. Operational data is key to delivering value from AI, and the more you have, the easier it is to tune models and provide the sources needed to deliver accurate answers.

Cosmos DB meets Fabric

Adding Cosmos DB to Fabric brings the last of Microsoft’s big data sources inside its data engineering platform, where it’s possible to use Fabric’s data agents to build queries that work across multiple tables in different formats. By putting Cosmos DB inside Fabric, you can use its different APIs to manage complex data types and still query them through familiar data science tools, including Python-based interactive notebooks.

It’s important to note that this is not embedding the standalone PostgreSQL-based Cosmos DB variant DocumentDB inside Fabric (although Microsoft could easily have done so). Rather, it’s a new way to use the full service, one that lets you take advantage of its scalability and high-availability features while still supporting Fabric’s lakehouse analytics tools. The intent is to add support for semistructured data to Fabric so you can analyze the data from your existing applications that use Cosmos DB, as well as from future code that takes advantage of its capabilities.

Working with vector indexes

Fabric-hosted Cosmos DB stores can take advantage of features like its built-in vector indexing tools.
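Vector search in Cosmos DB is configured per container. As a hedged sketch of what that configuration might look like for the NoSQL API (the property path "/embedding", the 1,536-dimension size, and the cosine distance function are illustrative assumptions, not values from the article):

```python
# Hedged sketch: a per-container vector configuration for a Cosmos DB
# NoSQL container. All concrete values here are illustrative assumptions.

# Vector embedding policy: tells Cosmos DB where embeddings live in each
# document, their element type, size, and how similarity is measured.
vector_embedding_policy = {
    "vectorEmbeddings": [
        {
            "path": "/embedding",          # assumed document property
            "dataType": "float32",
            "distanceFunction": "cosine",  # the underlying distance function
            "dimensions": 1536,            # within DiskANN's 1,000-4,096 range
        }
    ]
}

# Indexing policy: choose the index type per vector path. "flat" suits
# small stores with closely related content; "diskANN" targets large
# stores that need fast indexing and retrieval.
indexing_policy = {
    "vectorIndexes": [
        {"path": "/embedding", "type": "diskANN"}
    ]
}
```

Both policies would be supplied when the container is created, for example via the portal or an SDK.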
These indexing tools add vector representations of your documents alongside the data, simplifying the management of both elements and making searches more efficient, as data and vectors are stored together. You can choose which indexing technique your database uses, from flat (suitable for small stores with closely related content) to Microsoft Research’s DiskANN, which is optimized for large stores with lots of data that needs to be indexed and retrieved quickly.

You can use the same query tools to search vector indexes as the rest of your data, giving you the option to search by similarity in your data or by exact match. This approach is similar to how large-scale search engines work and will help find and rank results from large semistructured data sets, for example, finding relevant reviews on an e-commerce site. Fabric requires a vector policy for each Cosmos DB container, which defines the size, dimensionality, and underlying distance function used to search for similar vectors. Search technologies like DiskANN require high dimensionality, with at least 1,000 dimensions (and a maximum of 4,096).

Querying Cosmos DB in Fabric

When you query data stored in Cosmos DB through Fabric’s OneLake, you’re working with a mirrored copy of your Cosmos DB data. As you store data, it’s copied across in the Delta Parquet format used by Fabric, allowing you to use any of the supported query tools, including Power BI Desktop for ad hoc analysis. Queries can be made across all your operational data, not just Cosmos DB, treating it as a unified whole while applications still take advantage of Cosmos DB’s feature set. This also lets you use other Fabric features with your Cosmos DB data, for example, quickly adding embeddings and a vector index so the data can be used as grounding for an AI application based on retrieval-augmented generation (RAG).
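Retrieval for such a RAG workload is ultimately a similarity query against the vector index. A minimal sketch using the NoSQL API’s VectorDistance function (the property names c.embedding and c.review, and the commented SDK call, are assumptions for illustration):

```python
# Hedged sketch: ranking documents by vector similarity with the
# Cosmos DB NoSQL query language. Property names are illustrative.
similarity_query = """
SELECT TOP 5 c.id, c.review,
       VectorDistance(c.embedding, @queryVector) AS score
FROM c
ORDER BY VectorDistance(c.embedding, @queryVector)
"""

# With the azure-cosmos Python SDK, this would run roughly as:
#   items = container.query_items(
#       query=similarity_query,
#       parameters=[{"name": "@queryVector", "value": query_vector}],
#   )
# where query_vector is the embedding of the user's question.
```

The top-scoring documents can then be passed to a model as grounding context.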
You create and manage your Cosmos DB databases from inside the Fabric portal. Start by creating a workspace for your data, then add a Cosmos DB database. This can be a new database inside Fabric or one that mirrors data from existing Cosmos DB sources; the second option is useful for adding analytics to existing at-scale applications. If you’re creating a new database, simply add a new container and start populating it with data.

Even though you’re likely to be using Cosmos DB as a document database, holding semistructured data in JSON, you can use the Fabric portal to query it with SQL. The same tools let you query across all Fabric sources, so you can build analyses that span SQL and NoSQL data, or multiple Cosmos DB instances. Fabric allows you to name database endpoints, which direct queries and allow joins across databases, including NoSQL stores like Cosmos DB.

Using Cosmos DB in lakehouses

One of Fabric’s key features is the ability to group large amounts of data stored in a data lake in a lakehouse. This is perhaps best thought of as a way to apply data warehouse principles to the mix of different stores in Fabric, with a lakehouse acting as an intermediate format between the two ways of working with big data. A lakehouse gives you a single SQL endpoint for its data, using Fabric’s Delta tables and automatically extracting table structures into its abstracted view of the underlying data.

Fabric’s portal provides tools to connect a lakehouse to your Cosmos DB data, using its data engineering feature either to create a new lakehouse that holds a Delta-format mirrored Cosmos DB store or to add your Cosmos DB tables to an existing one. You can then use the built-in notebooks with Python and Spark to build SQL queries and display the results. As Cosmos DB lets you build code inside your database and use it to process data, you can use Fabric’s Git integration to link your database to Azure DevOps or GitHub.
Code and other artifacts can be quickly shared between repositories, allowing quick setup of new environments. You can also take advantage of both platforms’ continuous integration/continuous delivery (CI/CD) capabilities to build a deployment pipeline that ensures code is tested and ready.

A different billing model

If you’re familiar with Cosmos DB’s request unit (RU) model for metering usage and paying for your databases, be aware that using Cosmos DB in Fabric switches you to Fabric’s own capacity unit (CU) model, which works on a per-hour basis. Microsoft provides a basic conversion table to help you understand the associated costs; for now, 100 RU/s is equivalent to 0.067 CU/hr. You’ll need to consider how this affects your Fabric usage budget, especially if you allow Cosmos DB instances to autoscale. You can set limits on how your databases scale, but for the most control (and to set higher default scaling options), you’ll need to deploy and manage your databases through the Fabric SDK.

Microsoft has done a lot of work to bring Cosmos DB and Fabric together. Both are key components of its cloud data platform, and the result is an important upgrade to both, especially with the mirroring capabilities in lakehouses for large-scale, complex analytics. Microsoft is bringing its big data engineering tools to its databases, an essential move that should make everyone’s lives a little easier.
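As a back-of-the-envelope illustration of the RU-to-CU conversion mentioned above (a sketch based only on the stated 100 RU/s to 0.067 CU/hr figure; real Fabric bills depend on your capacity and autoscale settings):

```python
# Hedged sketch of the stated conversion: 100 RU/s ~ 0.067 CU/hr.
CU_PER_HOUR_PER_100_RU_S = 0.067

def ru_to_cu_per_hour(ru_per_second: float) -> float:
    """Approximate Fabric capacity units per hour for a given RU/s rate."""
    return ru_per_second / 100 * CU_PER_HOUR_PER_100_RU_S

# A database autoscaling to 10,000 RU/s would draw roughly 6.7 CU/hr,
# which is why autoscale limits matter for a Fabric capacity budget.
print(round(ru_to_cu_per_hour(10_000), 2))  # → 6.7
```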
https://www.infoworld.com/article/4046864/using-cosmos-db-in-fabric.html