An Introduction and Tutorial for Azure Cosmos DB


Key Takeaways

  • Cosmos DB provides a useful adjunct to a relational database system, and its Core SQL API is an easy way into the technologies of JSON, document databases, NoSQL and big data
  • Cosmos DB can turn out to be very expensive if used incorrectly
  • Cosmos DB uses auto-indexing and it is important to review the default indexing policy
  • Regular Cosmos DB performance tuning should be considered in your development cycle

An Introduction to Azure Cosmos DB

Azure Cosmos DB is a globally distributed, JSON-based database delivered as a ‘Platform as a Service’ (PaaS) in Microsoft Azure. It allows users to build and distribute their applications across Azure data centres automatically, without any need for prior configuration. Because it is part of Azure, it is available in all regions where Azure is available, and it replicates its data across the various data centres in the network. It offers several different interfaces, the most intriguing one being SQL-based.

Why use it? I like it because it can extract tabular data very rapidly from JSON files using SQL that has extensions for dealing with hierarchical information. It is a tool I’ve wanted for some time now, a way of pulling the tabular nuggets of data you need from large quantities of JSON.

Benefits of Azure Cosmos DB

If your organisation has to manage, process and query a great deal of important but short-lived information that is created sporadically and must be reported speedily, then a service like Cosmos DB is ideal. This is because Cosmos DB provides high availability, globally distributed access to data with very low latency, and compute over the data, and so offers a more cost-effective solution for ephemeral information than is possible with an RDBMS. However, I typically design database systems that handle both highly transactional data processing as well as semi-structured, graph and network data. Consequently, for me, Cosmos DB isn’t a rival for an RDBMS so much as an accessory.

I suspect that Cosmos DB was initially created to serve the many organisations that are obliged to deal with demand-driven activity that increases and diminishes in a manner that is outside the organisation’s direct control, such as ticket sales, football clubs or agricultural dealers. Semi-structured and unstructured data must be stored and processed reliably. Traditional transactional processing systems based on relational databases continue to be ideal for the core business processes, but are best used at a reasonably constant rate of transactions. Sudden extreme peaks of demand are generally dreaded by IT professionals, and certainly by me. However, no healthy enterprise can ignore such things as purchase orders, payments, enquiries and complaints. The system you use must be able to expand and contract easily in response to the scale of demand yet offer extremely low latency. Most importantly, it must be of rock-solid reliability. In addition, organisations are increasingly committed to eliciting more accurate and finer-scale marketing information, responding to ‘sentiment analysis’ from clicks to tweets. This generates simple conclusions from large quantities of ephemeral data, which is not the stuff of relational databases.

Cosmos DB, with its SQL API, provides a useful adjunct to a relational database system, and the SQL it offers is an easy way into the technologies of JSON, document databases, NoSQL and big data.

Disadvantages of Azure Cosmos DB – Pricing, Anyone?

There are certain well-known issues with Cosmos DB.

  • Cosmos DB can turn out to be very expensive if you use it for data storage that is better handled by plain Azure Table Storage or Blob Storage. It’s not for unstructured data that you merely wish to store, but for data that you wish to query and update.
  • Because of the charging model, it is important to reduce the resources used for queries, but it is frustratingly difficult, without additional Cosmos DB tooling, to determine the Request Units (RUs) that an individual query consumes (see the sketch after this list).
  • Cosmos DB uses auto-indexing, and it is important to get the indexing policy right; this isn’t intuitive.
  • Migration of data between SQL Server and Azure Cosmos DB is curiously difficult. There is an open-source application, the Azure Cosmos DB Data Migration Tool, provided to do this ‘for small migrations’. It isn’t a high-performance tool: I had to stop a simple table migration after seventeen hours. Azure Data Factory is difficult to set up for small jobs.
  • To save costs, Azure Cosmos DB databases are best developed on a free plan or an emulator and then transferred.
  • The ODBC driver is fine for simple queries only.

I have found the process of setting up a Cosmos DB database to be tricky with the instructions given. Some things, such as deleting a database or container, are made deliberately awkward (Doctor Google had to be consulted), but the people who designed the portal don’t need to try to be awkward. They have a natural flair for it. There are many useful guides on the internet to walk you through the Cosmos DB GUI, and I benefited greatly from using them, but GUIs were originally invented to provide an intuitive interface that didn’t need an additional roadmap.

An introduction to Cosmos DB data architecture

If you are using Azure Cosmos DB alongside a relational database, it makes sense to use SQL for both. By default, Cosmos DB uses a SQL dialect with extensions that enable it to explore hierarchies. Cosmos DB uses generic terms for storage: ‘container’ and ‘item’. A container can be a collection, graph, or table. An item can be a document, edge/vertex, or row: basically, items comprise the data content inside a container. If you opt for the SQL ‘experience’, the Cosmos database hierarchy can be described as follows.

A database is just a logical group of ‘containers’, each of which holds ‘items’ partitioned by a partition key.
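
To make the container/item model concrete, here is a minimal sketch using the JavaScript SDK (introduced later in this article) that creates a database, a container within it, and an item within that. The names, the '/category' partition key and the excluded index path are purely illustrative; the indexing policy is included only to show where the default policy (index everything) can be overridden:

const { CosmosClient } = require("@azure/cosmos");

async function createHierarchy() {
    const client = new CosmosClient({
        endpoint: "https://<youraccount>.documents.azure.com:443/",
        key: "<yourkey>"
    });

    // a database is a logical group of containers
    const { database } = await client.databases.createIfNotExists({ id: "mydb" });

    // a container holds items, partitioned by a partition key;
    // the indexing policy shown trims the default 'index every path' behaviour
    const { container } = await database.containers.createIfNotExists({
        id: "mycontainer",
        partitionKey: { paths: ["/category"] },
        indexingPolicy: {
            indexingMode: "consistent",
            includedPaths: [{ path: "/*" }],
            excludedPaths: [{ path: "/largePayload/*" }]
        }
    });

    // an item is simply a JSON document with an id and the partition key property
    await container.items.create({
        id: "1",
        category: "books",
        title: "Treasure Island"
    });
}

createHierarchy().catch(console.error);
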

Setting up an Azure Cosmos DB account

Assuming you’ve already got an Azure subscription, you’ll need to set up an Azure Cosmos DB account and a resource group. Setting up the resource group is pretty straightforward. I’ve been using Azure for many years and so had the prerequisites set up. In fact, I had a Cosmos DB account and database set up once, but I’d deleted it because it was costing me dollars, unlike MongoDB, which I used instead. To be fair, Microsoft have since increased the free allowance from 5 GB to 25 GB of storage, and from 400 request units per second (RU/s) to 1,000 RU/s, every month, per subscription, but in doing the research for this article I managed to somehow blow my MSDN allowance.

Creating the Azure Cosmos DB account should have been reasonably straightforward, but it failed in my case because the region I chose was full, for some reason. I deleted that, and tried again specifying a different region, but the old one remained in half-life like a zombie. This meant I couldn’t create another one because I’d have exceeded my allocation. I eventually managed to kill the zombie.

An account is pretty simple to set up once you’re aware that it is related to a resource group, a subscription, a region, and a default ‘experience’ (SQL in my case!).

With an account and a database, we can create one or more containers. A container is, evidently, ‘a schema-agnostic container of items’, which they say is ‘the unit of scalability both for provisioned throughput and storage that can scale elastically’. Basically, it is equivalent to a database schema or a MongoDB collection, but the documentation avoids saying as much.

Importing data with the Azure Cosmos DB migration tool

To get data into Cosmos DB, you’ll probably need the data migration tool. Dtui.exe is the GUI tool and is easier to use, whereas the dt.exe command-line tool is more useful for regular inserts. Once you have an import that works, the dtui.exe tool has a facility for displaying the parameters it passes to dt.exe, which you can then use for scripting.

With the data migration tool, you can import from SQL Server, MongoDB, MongoDB export files, JSON files, CSV files, Azure Tables, DynamoDB, HBase, or Amazon DocumentDB. My advice is to get a transfer working in the dtui.exe GUI tool and then script it in PowerShell using the dt.exe command-line tool (be warned, though, that in some cases this doesn’t work). I found that the query that you provide for many operations needs to be kept simple. Puzzlingly, performance drops rapidly for more complex queries, even when they work fine in a Node.js interface.

On top of that, you should be aware that the tool doesn’t handle RU throttling as you’d expect: every request that results in a 429 error is simply dropped and forgotten. This results in a partial data import with no way of telling what has been uploaded and what hasn’t. I can only assume that Microsoft expects you to scale up your Cosmos DB throughput here instead of optimising your data import. Even Microsoft doesn’t recommend the tool for moving large quantities of data about.
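
If you end up scripting small imports yourself with the Node.js SDK instead, you can at least deal with throttling explicitly. The sketch below retries a create when Cosmos DB returns a 429; the assumption that the error object carries the HTTP status in err.code and a suggested delay in err.retryAfterInMs reflects the JavaScript SDK's error shape, and the one-second fallback delay is arbitrary:

// insert one document, retrying when Cosmos DB throttles the request (HTTP 429)
async function insertWithRetry(container, doc, maxAttempts = 5) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await container.items.create(doc);
        } catch (err) {
            // give up on anything that isn't throttling, or once we run out of attempts
            if (err.code !== 429 || attempt === maxAttempts) throw err;
            // wait for the interval the service suggests, or one second if it isn't supplied
            const waitMs = err.retryAfterInMs || 1000;
            await new Promise(resolve => setTimeout(resolve, waitMs));
        }
    }
}
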

Accessing Cosmos DB using PowerShell libraries

There are plentiful PowerShell tools for setting up databases and containers, and for creating queries. You are likely to need PowerShell sooner or later so it is just as well to get used to it from the start.

To get started in PowerShell 7:

  • If you don’t yet have them, install the Azure PowerShell modules (Install-Module -Name Az -Force -Verbose). This provides all the basic Azure operations, and the Az.CosmosDB module provides everything you need to automate Cosmos DB management tasks.
  • Install Azure CLI on Windows (required for Bicep, the following step)
  • Install Bicep CLI on Windows (useful for deploying new accounts)
  • Sign into Azure (Connect-AzAccount )
  • Set your subscription context (Set-AzContext -SubscriptionName "My Azure Subscription")
  • Install the Cosmos DB community module (Install-Module -Name CosmosDB -Verbose)
  • Import the Cosmos DB module (Import-Module -Name CosmosDB)

Probably the best place to start with PowerShell for the management of a Cosmos DB database is with  Manage Azure Cosmos DB Core (SQL) API resources using PowerShell

Before doing too much with Cosmos DB on a Windows platform, it is worth reading the best practice document.

The CosmosDB library is supposedly able to make queries to a Cosmos DB database via the Get-CosmosDbDocument cmdlet but, although I was able to get most of the cmdlets to work, this one repeatedly gave me errors.

SQL query in Cosmos DB using ODBC

The management of a working database can’t be done without a way of developing SQL queries that return the results that you need in a timely manner.  Not only do you need the SQL statements and queries, but you also need the means to run them from your application or software tool. 

To start to try out SQL queries in Azure Cosmos DB, there is no better available tool than Alpaqa Studio. This does everything you need, and enables you to try out the subset of SQL-92 expressions that Cosmos DB supports, and see what is returned. I was agreeably surprised at both the SQL compliance of a document-oriented database and the performance. This tool will allow you to develop the queries that you would subsequently want to use from your scripts or reporting applications to get results from Cosmos DB that can be consumed in your application or saved to disk.

Alpaqa Studio for Cosmos DB displaying results of a SQL query

ODBC gives immediate access to any ODBC-compliant tools such as Excel, SSIS (SQL Server Integration Services) , Tableau, Alteryx and QlikSense. It also allows other analytics, BI, and data integration tools to  analyse, move, transform, and create visualizations of Azure Cosmos DB data.

Most important for SQL Server users is to set up Cosmos DB as a linked server so as to allow data flow at server level. Unfortunately, the ODBC driver for Cosmos DB supplied by Microsoft (actually the old one for Azure DocumentDB) is slow and buggy. I became interested in the fact that my longer and more complex queries, which worked fine in Alpaqa Studio, crashed ODBC so badly that it forced the PowerShell ISE to restart. I discovered after long experiment with a PowerShell script that it would execute a query reasonably well until I added an ORDER BY clause, at which point it would cause PowerShell to restart.

Even when it was working, it was slow. I tried to specify the schema that my database was using, but it failed to detect it.

Basically, it could manage a small Cosmos DB database, but nothing of any size. I can execute the same query in Alpaqa Studio with 15 ms response times, and in Node.js in 20 ms, compared with 10-second response times from this ODBC connector.

I’m prepared to admit that I could have made a grievous mistake in installing this software, but I came to believe that this ODBC isn’t useful, and doesn’t do justice to a database system that performs well and reliably.

To get up and running, I installed CData’s Cosmos DB ODBC driver. This worked a lot better. It isn’t fast, but at least it is reliable. The problem with it is that it is so expensive that they don’t even tell you the price.

Cosmos DB and Node.js

Probably the most satisfactory approach to accessing Cosmos DB and downloading results is to use either the Java SDK or the Node.js SDK. I used the latter because I use PowerShell where possible. Microsoft provides a great start in Build a Node.js console app with the JavaScript SDK, which contains an easily-modified sample app. Once an app is set up, it can be accessed from PowerShell very easily. This approach allows you to download a result and save it to a JSON file or process it in PowerShell. I find that any queries I can develop and optimise in Alpaqa Studio can then be executed with this interface in production with the same performance. This means that, finally, I have a reliable way of querying a Cosmos DB database, updating it, deleting records and creating new records.

Here is the very simple main app, based on Microsoft’s sample, but with parameters added so that you can provide the SQL and get a JSON export of the results. I’ve just shown the Select stage, but the others are easily added.

// Import Configuration
const CosmosClient = require("@azure/cosmos").CosmosClient;
const config = require("./config");
const dbContext = require("./data/databaseContext");
// The main function
async function main() {
    // get the command-line arguments; the first, if present, is the SQL query
    const args = process.argv.slice(2);
    // now get the configuration
    const {
        endpoint,
        key,
        databaseId,
        containerId
    } = config;

    const client = new CosmosClient({
        endpoint,
        key
    });
    // get references to the database and container (dbContext.create below ensures they exist)
    const database = client.database(databaseId);
    const container = database.container(containerId);

    // ensure the database is setup and populated correctly
    await dbContext.create(client, databaseId, containerId);
    // query to return all items
    try {

        // use the query passed on the command line, or a sensible default
        const thequery = args[0] ? args[0] : "SELECT TOP 10 * FROM c";
        const {
            resources: items
        } = await container.items
            .query(thequery)
            .fetchAll();
        console.log(Object.values(items));
    } catch (err) {
        console.log(err.message);
    }
}
main();

And this can be used simply from PowerShell

$query = "SELECT TOP 10 c.Firstfield, c.SecondField, c.ThirdField FROM mycontainer c WHERE c.ThirdField LIKE '%teststring%' ORDER BY c.SecondField"
$JsonResult = node <MyPathTo>\app.js "`"$query`""
$JsonResult | ConvertFrom-Json

The last line returns an object that can be saved in a variety of forms, processed, sent to a reporting application, or stored in a relational database: whatever you want.
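
Because the query arrives as a raw string, it is also worth knowing that the JavaScript SDK accepts a parameterized query, which avoids splicing search values into the SQL by hand. This sketch reuses the same config file as the sample app above; the field names and the CONTAINS predicate are illustrative:

const { CosmosClient } = require("@azure/cosmos");
const config = require("./config");

async function search(needle) {
    const { endpoint, key, databaseId, containerId } = config;
    const container = new CosmosClient({ endpoint, key })
        .database(databaseId)
        .container(containerId);

    // the value is passed as a named parameter rather than embedded in the SQL string
    const querySpec = {
        query: "SELECT TOP 10 c.Firstfield, c.SecondField FROM c WHERE CONTAINS(c.ThirdField, @needle)",
        parameters: [{ name: "@needle", value: needle }]
    };

    const { resources } = await container.items.query(querySpec).fetchAll();
    console.log(JSON.stringify(resources, null, 2));
}

search("teststring").catch(console.error);
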

It’s similarly straightforward to get going with the Java SDK, once you start looking for another performance boost. Java SDK v4, combined with its Async API, is especially useful for high-throughput parallel processing of data. The examples provided in the documentation offer a good starting point.

Conclusions

Cosmos DB provides an excellent way of extending the capabilities of a database, particularly for the types of data that fit awkwardly into the relational model. I was surprised, when trying out the various technologies for providing the ‘glue’ between the relational and document databases, and for allowing reporting, analysis and editing, that there was a lot of ‘wet paint’, meaning that vital components such as ODBC and CLI utilities worked disappointingly.

Mercifully, Alpaqa Studio allowed me to create, debug and performance-tune queries that I could then use in PowerShell, via Node.js. It solves my problem, but without a good ODBC driver, I fear that many database professionals will shrug and try something else. ODBC/JDBC is still an important technology when working with many different types of databases.

 


Community comments

  • Thanks for the write up

    by Mark Brown,


    Thank you for the write up on Cosmos DB. Appreciate it. Just a few comments.

    The PowerShell cmdlets created by the Cosmos DB team can only do management operations on Cosmos DB resources. They cannot query or read data. The Get-CosmosDbDocument cmdlet you are referring to may be from this OSS PS library, github.com/PlagueHO/CosmosDB.

    Regarding SQL-92 while Cosmos DB may support a subset of it, it is definitely not fully compliant and could never be because it was designed to work with hierarchical JSON data.

    As you observed, ODBC is not really well suited for integrating Cosmos DB with analytics tools. The problem with connecting ODBC to Cosmos DB is analytical queries can consume large volumes of throughput impacting the performance of the operational workload. Cosmos DB is an OLTP database and is also resource governed. Its compute engine was not designed to power large, long running analytical queries. A preferred method for doing analytics is to integrate with analytical tools using Spark or Kafka. Azure Data Factory is also a good choice with its new ability to read from the Cosmos DB change feed (CDC). Another great option for analytics is Azure Synapse via Synapse Link. Synapse Link provides true HTAP capabilities for users, with near real-time ETL and the ability to run extremely large analytics using Spark or SQL with zero impact on the Cosmos DB operational store.

    Thanks again for the write up.
