BT
x Your opinion matters! Please fill in the InfoQ Survey about your reading habits!

Introduction to MongoDB for Java, PHP and Python Developers

Posted by Rick Hightower on May 22, 2012 |

Introduction to NoSQL Architecture with MongoDB for Java, PHP and Python Developers

This article covers using MongoDB as a way to get started with NoSQL. It presents an introduction to considering NoSQL, why MongoDB is a good NoSQL implementation to get started with, MongoDB shortcommings and tradeoffs, key MongoDB developer concepts, MongoDB architectural concepts (sharding, replica sets), using the console to learn MongoDB, and getting started with MongoDB for Python, Java and PHP developers. The article uses MongoDB, but many concepts introduced are common in other NoSQL solutions. The article should be useful for new developers, ops and DevOps who are new to NoSQL.

From no NoSQL to sure why not

The first time I heard of something that actually could be classified as NoSQL was from Warner Onstine, he is currently working on some CouchDB articles for InfoQ. Warner was going on and on about how great CouchDB was. This was before the term NoSQL was coined. I was skeptical, and had just been on a project that was converted from an XML Document Database back to Oracle due to issues with the XML Database implementation. I did the conversion. I did not pick the XML Database solution, or decide to convert it to Oracle. I was just the consultant guy on the project (circa 2005) who did the work after the guy who picked the XML Database moved on and the production issues started to happen.

This was my first document database. This bred skepticism and distrust of databases that were not established RDBMS (Oracle, MySQL, etc.). This incident did not create the skepticism. Let me explain.

First there were all of the Object Oriented Database (OODB) folks for years preaching how it was going to be the next big thing. It did not happen yet. I hear 2013 will be the year of the OODB just like it was going to be 1997. Then there were the XML Database people preaching something very similar, which did not seem to happen either at least at the pervasive scale that NoSQL is happening.

My take was, ignore this document oriented approach and NoSQL, see if it goes away. To be successful, it needs some community behind it, some clear use case wins, and some corporate muscle/marketing, and I will wait until then. Sure the big guys need something like Dynamo and BigTable, but it is a niche I assumed. Then there was BigTable, MapReduce, Google App Engine, Dynamo in the news with white papers. Then Hadoop, Cassandra, MongoDB, Membase, HBase, and the constant small but growing drum beat of change and innovation. Even skeptics have limits.

Then in 2009, Eric Evans coined the term NoSQL to describe the growing list of open-source distributed databases. Now there is this NoSQL movement-three years in and counting. Like Ajax, giving something a name seems to inspire its growth, or perhaps we don't name movements until there is already a ground swell. Either way having a name like NoSQL with a common vision is important to changing the world, and you can see the community, use case wins, and corporate marketing muscle behind NoSQL. It has gone beyond the buzz stage. Also in 2009 was the first project that I worked on that had mass scale out requirements that was using something that is classified as part of NoSQL.

2009 was when MongoDB was released from 10Gen, the NoSQL movement was in full swing. Somehow MongoDB managed to move to the front of the pack in terms of mindshare followed closely by Cassandra and others (see figure 1). MongoDB is listed as a top job trend on Indeed.com, #2 to be exact (behind HTML 5 and before iOS), which is fairly impressive given MongoDB was a relativly latecomer to the NoSQL party.

 

Figure 1: MongoDB leads the NoSQL pack

MongoDB takes early lead in NoSQL adoption race.

MongoDB is a distributed document-oriented, schema-less storage solution similar to CouchBase and CouchDB. MongoDB uses JSON-style documents to represent, query and modify data. Internally data is stored in BSON (binary JSON). MongoDB's closest cousins seem to be CouchDB/Couchbase. MongoDB supports many clients/languages, namely, Python, PHP, Java, Ruby, C++, etc. This article is going to introduce key MongoDB concepts and then show basic CRUD/Query examples in JavaScript (part of MongoDB console application), Java, PHP and Python.

Disclaimer: I have no ties with the MongoDB community and no vested interests in their success or failure. I am not an advocate. I merely started to write about MongoDB because they seem to be the most successful, seem to have the most momentum for now, and in many ways typify the very diverse NoSQL market. MongoDB success is largely due to having easy-to-use, familiar tools. I'd love to write about CouchDB, Cassandra, CouchBase, Redis, HBase or number of NoSQL solution if there was just more hours in the day or stronger coffee or if coffee somehow extended how much time I had. Redis seems truly fascinating.

MongoDB seems to have the right mix of features and ease-of-use, and has become a prototypical example of what a NoSQL solution should look like. MongoDB can be used as sort of base of knowledge to understand other solutions (compare/contrast). This article is not an endorsement. Other than this, if you want to get started with NoSQL, MongoDB is a great choice.

MongoDB a gateway drug to NoSQL

MongoDB combinations of features, simplicity, community, and documentation make it successful. The product itself has high availability, journaling (which is not always a given with NoSQL solutions), replication, auto-sharding, map reduce, and an aggregation framework (so you don't have to use map-reduce directly for simple aggregations). MongoDB can scale reads as well as writes.

NoSQL, in general, has been reported to be more agile than full RDBMS/ SQL due to problems with schema migration of SQL based systems. Having been on large RDBMS systems and witnessing the trouble and toil of doing SQL schema migrations, I can tell you that this is a real pain to deal with. RDBMS / SQL often require a lot of upfront design or a lot schema migration later. In this way, NoSQL is viewed to be more agile in that it allows the applications worry about differences in versioning instead of forcing schema migration and larger upfront designs. To the MongoDB crowd, it is said that MongoDB has dynamic schema not no schema (sort of like the dynamic language versus untyped language argument from Ruby, Python, etc. developers).

MongoDB does not seem to require a lot of ramp up time. Their early success may be attributed to the quality and ease-of-use of their client drivers, which was more of an afterthought for other NoSQL solutions ("Hey here is our REST or XYZ wire protocol, deal with it yourself"). Compared to other NoSQL solution it has been said that MongoDB is easier to get started. Also with MongoDB many DevOps things come cheaply or free. This is not that there are never any problems or one should not do capacity planning. MongoDB has become for many an easy on ramp for NoSQL, a gateway drug if you will.

MongoDB was built to be fast. Speed is a good reason to pick MongoDB. Raw speed shaped architecture of MongoDB. Data is stored in memory using memory mapped files. This means that the virtual memory manager, a very highly optimized system function of modern operating systems, does the paging/caching. MongoDB also pads areas around documents so that they can be modified in place, making updates less expensive. MongoDB uses a binary protocol instead of REST like some other implementations. Also, data is stored in a binary format instead of text (JSON, XML), which could speed writes and reads.

Another reason MongoDB may do well is because it is easy to scale out reads and writes with replica sets and autosharding. You might expect if MongoDB is so great that there would be a lot of big names using them, and there are like: MTV, Craigslist, Disney, Shutterfly, Foursqaure, bit.ly, The New York Times, Barclay’s, The Guardian, SAP, Forbes, National Archives UK, Intuit, github, LexisNexis and many more.

Caveats

MongoDB indexes may not be as flexible as Oracle/MySQL/Postgres or other even other NoSQL solutions. The order of index matters as it uses B-Trees. Realtime queries might not be as fast as Oracle/MySQL and other NoSQL solutions especially when dealing with array fields, and the query plan optimization work is not as far as long as more mature solutions. You can make sure MongoDB is using the indexes you setup quite easily with an explain function. Not to scare you, MongoDB is good enough if queries are simple and you do a little homework, and is always improving.

MongoDB does not have or integrate a full text search engine like many other NoSQL solutions do (many use Lucene under the covers for indexing), although it seems to support basic text search better than most traditional databases.

Every version of MongoDB seems to add more features and addresses shortcoming of the previous releases. MongoDB added journaling a while back so they can have single server durability; prior to this you really needed a replica or two to ensure some level of data safety. 10gen improved Mongo's replication and high availability with Replica Sets.

Another issue with current versions of MongoDB (2.0.5) is lack of concurrency due to MongoDB having a global read/write lock, which allows reads to happen concurrently while write operations happen one at a time. There are workarounds for this involving shards, and/or replica sets, but not always ideal, and does not fit the "it should just work mantra" of MongoDB. Recently at the MongoSF conference, Dwight Merriman, co-founder and CEO of 10gen, discussed the concurrency internals of MongoDB v2.2 (future release). Dwight described that MongoDB 2.2 did a major refactor to add database level concurrency, and will soon have collection level concurrency now that the hard part of the concurrency refactoring is done. Also keep in mind, writes are in RAM and eventually get synced to disk since MongoDB uses a memory mapped files. Writes are not as expensive as if you were always waiting to sync to disk. Speed can mitigate concurrency issues.

This is not to say that MongoDB will never have some shortcomings and engineering tradeoffs. Also, you can, will and sometimes should combine MongoDB with a relational database or a full text search like Solr/Lucene for some applications. For example, if you run into issue with effectively building indexes to speed some queries you might need to combine MongoDB with Memcached. None of this is completely foreign though, as it is not uncommon to pair RDBMS with Memcached or Lucene/Solr. When to use MongoDB and when to develop a hybrid solution is beyond the scope of this article. In fact, when to use a SQL/RDBMS or another NoSQL solution is beyond the scope of this article, but it would be nice to hear more discussion on this topic.

The price you pay for MongoDB, one of the youngest but perhaps best managed NoSQL solution, is lack of maturity. It does not have a code base going back three decades like RDBMS systems. It does not have tons and tons of third party management and development tools. There have been issues, there are issues and there will be issues, but MongoDB seems to work well for a broad class of applications, and is rapidly addressing many issues.

Also finding skilled developers, ops (admins, devops, etc.) that are familiar with MongoDB or other NoSQL solutions might be tough. Somehow MongoDB seems to be the most approachable or perhaps just best marketed. Having worked on projects that used large NoSQL deployments, few people on the team really understand the product (limitations, etc.), which leads to trouble. 

In short if you are kicking the tires of NoSQL, starting with MongoDB makes a lot of sense.

MongoDB concepts

MongoDB is document oriented but has many comparable concepts to traditional SQL/RDBMS solutions.

  1. Oracle: Schema, Tables, Rows, Columns
  2. MySQL: Database, Tables, Rows, Columns
  3. MongoDB: Database, Collections, Document, Fields
  4. MySQL/Oracle: Indexes
  5. MongoDB: Indexes
  6. MySQL/Oracle: Stored Procedures
  7. MongoDB: Stored JavaScript
  8. Oracle/MySQL: Database Schema
  9. MongoDB: Schema free!
  10. Oracle/MySQL: Foreign keys, and joins
  11. MongoDB: DBRefs, but mostly handled by client code
  12. Oracle/MySQL: Primary key
  13. MongoDB: ObjectID

If you have used MySQL or Oracle here is a good guide to similar processes in MongoDB:

Database Process Type Oracle MySQL MongoDB
Daemon/Server oracle mysqld mongod
Console Client sqlplus mysql mongo
Backup utility sqlplus mysqldump mongodump
Import utility sqlplus mysqlimport mongoimport

You can see a trend here. Where possible, Mongo tries to follow the terminology of MySQL. They do this with console commands as well. If you are used to using MySQL, where possible, MongoDB tries to make the transition a bit less painful.

 

SQL operations versus MongoDB operations

MongoDB queries are similar in concept to SQL queries and use a lot of the same terminology. There is no special language or syntax to execute MongoDB queries; you simply assemble a JSON object. The MongoDB site has a complete set of example queries done in both SQL and MongoDB JSON docs to highlight the conceptual similarities. What follows is several small listings to compare MongoDB operations to SQL.

 

Insert

SQL
INSERT INTO CONTACTS (NAME, PHONE_NUMBER) VALUES('RICK HIGHTOWER','520-555-1212')
MongoDB
db.contacts.insert({name:'RICK HIGHTOWER',phoneNumber:'520-555-1212'})

Selects

SQL
SELECT name, phone_number FROM contacts WHERE age=30 ORDER BY name DESC
MongoDB
db.contacts.find({age:30}, {name:1,phoneNumber:1}).sort({name:-1})
SQL
SELECT name, phone_number FROM contacts WHERE age>30 ORDER BY name DESC
MongoDB
db.contacts.find({age:{$gt:33}}, {name:1,phoneNumber:1}).sort({name:-1})

Creating indexes

SQL
CREATE INDEX contact_name_idx ON contact(name DESC)
MongoDB
db.contacts.ensureIndex({name:-1})

Updates

SQL
UPDATE contacts SET phoneNumber='415-555-1212' WHERE name='Rick Hightower'
MongoDB
db.contacts.update({name:'Rick Hightower'}, {$set:{phoneNumber:1}}, false, true)

 

Additional features of note

MongoDB has many useful features like Geo Indexing (How close am I to X?), distributed file storage, capped collection (older documents auto-deleted), aggregation framework (like SQL projections for distributed nodes without the complexities of MapReduce for basic operations on distributed nodes), load sharing for reads via replication, auto sharding for scaling writes, high availability, and your choice of durability (journaling) and/or data safety (make sure a copy exists on other servers).

Architecture Replica Sets, Autosharding

The model of MongoDB is such that you can start basic and use features as your growth/needs change without too much trouble or change in design. MongoDB uses replica sets to provide read scalability, and high availability. Autosharding is used to scale writes (and reads). Replica sets and autosharding go hand in hand if you need mass scale out. With MongoDB scaling out seems easier than traditional approaches as many things seem to come built-in and happen automatically. Less operation/administration and a lower TCO than other solutions seems likely. However you still need capacity planning (good guess), monitoring (test your guess), and the ability to determine your current needs (adjust your guess).

Replica Sets

The major advantages of replica sets are business continuity through high availability, data safety through data redundancy, and read scalability through load sharing (reads). Replica sets use a share nothing architecture. A fair bit of the brains of replica sets is in the client libraries. The client libraries are replica set aware. With replica sets, MongoDB language drivers know the current primary. Langauge driver is a library for a particular programing language, think JDBC driver or ODBC driver, but for MongoDB. All write operations go to the primary. If the primary is down, the drivers know how to get to the new primary (an elected new primary), this is auto failover for high availability. The data is replicated after writing. Drivers always write to the replica set's primary (called the master), the master then replicates to slaves. The primary is not fixed. The master/primary is nominated.

Typically you have at least three MongoDB instances in a replica set on different server machines (see figure 2). You can add more replicas of the primary if you like for read scalability, but you only need three for high availability failover. There is a way to sort of get down to two, but let's leave that out for this article. Except for this small tidbit, there are advantages of having three versus two in general. If you have two instances and one goes down, the remaining instance has 200% more load than before. If you have three instances and one goes down, the load for the remaining instances only go up by 50%. If you run your boxes at 50% capacity typically and you have an outage that means your boxes will run at 75% capacity until you get the remaining box repaired or replaced. If business continuity is your thing or important for your application, then having at least three instances in a replica set sounds like a good plan anyway (not all applications need it).

 

Figure 2: Replica Sets

MongoDB Replica Sets

 

In general, once replication is setup it just works. However, in your monitoring, which is easy to do in MongoDB, you want to see how fast data is getting replicated from the primary (master) to replicas (slaves). The slower the replication is, the dirtier your reads are. The replication is by default async (non-blocking). Slave data and primary data can be out of sync for however long it takes to do the replication. There are already whole books written just on making Mongo scalable, if you work at a Foursquare like company or a company where high availability is very important, and use Mongo, I suggest reading such a book.

By default replication is non-blocking/async. This might be acceptable for some data (category descriptions in an online store), but not other data (shopping cart's credit card transaction data). For important data, the client can block until data is replicated on all servers or written to the journal (journaling is optional). The client can force the master to sync to slaves before continuing. This sync blocking is slower. Async/non-blocking is faster and is often described as eventual consistency. Waiting for a master to sync is a form of data safety. There are several forms of data safety and options available to MongoDB from syncing to at least one other server to waiting for the data to be written to a journal (durability). Here is a list of some data safety options for MongoDB:

  1. Wait until write has happened on all replicas
  2. Wait until write is on two servers (primary and one other)
  3. Wait until write has occurred on majority of replicas
  4. Wait until write operation has been written to journal

(The above is not an exhaustive list of options.) The key word of each option above is wait. The more syncing and durability the more waiting, and harder it is to scale cost effectively.

Journaling: Is durability overvalued if RAM is the new Disk? Data Safety versus durability

It may seem strange to some that journaling was added as late as version 1.8 to MongoDB. Journaling is only now the default for 64 bit OS for MongoDB 2.0. Prior to that, you typically used replication to make sure write operations were copied to a replica before proceeding if the data was very important. The thought being that one server might go down, but two servers are very unlikely to go down at the same time. Unless somebody backs a truck over a high voltage utility poll causing all of your air conditioning equipment to stop working long enough for all of your servers to overheat at once, but that never happens (it happened to Rackspace and Amazon). And if you were worried about this, you would have replication across availability zones, but I digress.

At one point MongoDB did not have single server durability, now it does with addition of journaling. But, this is far from a moot point. The general thought from MongoDB community was and maybe still is that to achieve Web Scale, durability was thing of the past. After all memory is the new disk. If you could get the data on second server or two, then the chances of them all going down at once is very, very low. How often do servers go down these days? What are the chances of two servers going down at once? The general thought from MongoDB community was (is?) durability is overvalued and was just not Web Scale. Whether this is a valid point or not, there was much fun made about this at MongoDB's expense (rated R, Mature 17+).

As you recall MongoDB uses memory mapped file for its storage engine so it could be a while for the data in memory to get synced to disk by the operating system. Thus if you did have several machines go down at once (which should be very rare), complete recoverability would be impossible. There were workaround with tradeoffs, for example to get around this (now non issue) or minimize this issue, you could force MongoDB to do an fsync of the data in memory to the file system, but as you guessed even with a RAID level four and a really awesome server that can get slow quick. The moral of the story is MongoDB has journaling as well as many other options so you can decide what the best engineering tradeoff in data safety, raw speed and scalability. You get to pick. Choose wisely.

The reality is that no solution offers "complete" reliability, and if you are willing to allow for some loss (which you can with some data), you can get enormous improvements in speed and scale. Let's face it your virtual farm game data is just not as important as Wells Fargo's bank transactions. I know your mom will get upset when she loses the virtual tractor she bought for her virtual farm with her virtual money, but unless she pays real money she will likely get over it. I've lost a few posts on twitter over the years, and I have not sued once. If your servers have an uptime of 99 percent and you block/replicate to three servers than the probability of them all going down at once is (0.000001) so the probability of them all going down is 1 in 1,000,000. Of course uptime of modern operating systems (Linux) is much higher than this so one in 100,000,000 or more is possible with just three servers. Amazon EC2 offers discounts if they can't maintain an SLA of 99.95% (other cloud providers have even higher SLAs). If you were worried about geographic problems you could replicate to another availability zone or geographic area connected with a high-speed WAN. How much speed and reliability do you need? How much money do you have?

An article on when to use MongoDB journaling versus older recommendations will be a welcome addition. Generally it seems journaling is mostly a requirement for very sensitive financial data and single server solutions. Your results may vary, and don't trust my math, it has been a few years since I got a B+ in statistics, and I am no expert on SLA of modern commodity servers (the above was just spit balling).

If you have ever used a single non-clustered RDBMS system for a production system that relied on frequent backups and transaction log (journaling) for data safety, raise your hand. Ok, if you raised your hand, then you just may not need autosharding or replica sets. To start with MongoDB, just use a single server with journaling turned on. If you require speed, you can configure MongoDB journaling to batch writes to the journal (which is the default). This is a good model to start out with and probably very much like quite a few application you already worked on (assuming that most application don't need high availability). The difference is, of course, if later your application deemed to need high availability, read scalability, or write scalability, MongoDB has your covered. Also setting up high availability seems easier on MongoDB than other more established solutions.

 
Figure 3: Simple setup with journaling and single server ok for a lot of applications

 

Simple Non-Sharded, non replicated installation

 

If you can afford two other servers and your app reads more than it writes, you can get improved high availability and increased read scalability with replica sets. If your application is write intensive then you might need autosharding. The point is you don't have to be Facebook or Twitter to use MongoDB. You can even be working on a one-off dinky application. MongoDB scales down as well as up.

Autosharding

Replica sets are good for failover and speeding up reads, but to speed up writes, you need autosharding. According to a talk by Roger Bodamer on Scaling with MongoDB, 90% of projects do not need autosharding. Conversely almost all projects will benefit from replication and high availability provided by replica sets. Also once MongoDB improves its concurrency in version 2.2 and beyond, it may be the case that 97% of projects don't need autosharding. 

Sharding allows MongoDB to scale horizontally. Sharding is also called partitioning. You partition each of your servers a portion of the data to hold or the system does this for you. MongoDB can automatically change partitions for optimal data distribution and load balancing, and it allows you to elastically add new nodes (MongoDB instances). How to setup autosharding is beyond the scope of this introductory article. Autosharding can support automatic failover (along with replica sets). There is no single point of failure. Remember 90% of deployments don’t need sharding, but if you do need scalable writes (apps like Foursquare, Twitter, etc.), then autosharding was designed to work with minimal impact on your client code.

There are three main process actors for autosharding: mongod (database daemon), mongos, and the client driver library. Each mongod instance gets a shard. Mongod is the process that manages databases, and collections. Mongos is a router, it routes writes to the correct mongod instance for autosharding. Mongos also handles looking for which shards will have data for a query. To the client driver, mongos looks like a mongod process more or less (autosharding is transparent to the client drivers).

Figure 4: MongoDB Autosharding

Sharding with MongoDB

Autosharding increases write and read throughput, and helps with scale out. Replica sets are for high availability and read throughput. You can combine them as shown in figure 5.

Figure 5: MongoDB Autosharding plus Replica Sets for scalable reads, scalable writes, and high availability

MongoDB Autosharding for Scalable Reads, Scalable Writes and High Availability

You shard on an indexed field in a document. Mongos collaborates with config servers (mongod instances acting as config servers), which have the shard topology (where do the key ranges live). Shards are just normal mongod instances. Config servers hold meta-data about the cluster and are also mongodb instances. 

Shards are further broken down into 64 MB chunks called chunks. A chunk is 64 MB worth of documents for a collection. Config servers hold which shard the chunks live in. The autosharding happens by moving these chunks around and distributing them into individual shards. The mongos processes have a balancer routine that wakes up so often, it checks to see how many chunks a particular shard has. If a particular shard has too many chunks (nine more chunks than another shard), then mongos starts to move data from one shard to another to balance the data capacity amongst the shards. Once the data is moved then the config servers are updated in a two phase commit (updates to shard topology are only allowed if all three config servers are up).

The config servers contain a versioned shard topology and are the gatekeeper for autosharding balancing. This topology maps which shard has which keys. The config servers are like DNS server for shards. The mongos process uses config servers to find where shard keys live. Mongod instances are shards that can be replicated using replica sets for high availability. Mongos and config server processes do not need to be on their own server and can live on a primary box of a replica set for example. For sharding you need at least three config servers, and shard topologies cannot change unless all three are up at the same time. This ensures consistency of the shard topology. The full autosharding topology is show in figure 6. An excellent talk on the internals of MongoDB sharding was done by Kristina Chodorow, author of Scaling MongoDB, at OSCON 2011 if you would like to know more.

 

Figure 6: MongoDB Autosharding full topology for large deployment including Replica Sets, Mongos routers, Mongod Instance, and Config Servers 

MongoDB Autosharding full topology for large deployment including Replica Sets, mongos routers, mongod instance, client drivers and config servers

 

MapReduce

MongoDB has MapReduce capabilities for batch processing similar to Hadoop. Massive aggregation is possible through the divide and conquer nature of MapReduce. Before the Aggregation Framework MongoDB's MapReduced could be used instead to implement what you might do with SQL projections (Group/By SQL). MongoDB also added the aggregation framework, which negates the need for MapReduce for common aggregation cases. In MongoDB, Map and Reduce functions are written in JavaScript. These functions are executed on servers (mongod), this allows the code to be next to data that it is operating on (think stored procedures, but meant to execute on distributed nodes and then collected and filtered). The results can be copied to a results collections.

MongoDB also provides incremental MapReduce. This allows you to run MapReduce jobs over collections, and then later run a second job but only over new documents in the collection. You can use this to reduce work required by merging new data into existing results collection.

 

Aggregation Framework

The Aggregation Framework was added in MongoDB 2.1. It is similar to SQL group by. Before the Aggregation framework, you had to use MapReduce for things like SQL's group by. Using the Aggregation framework capabilities is easier than MapReduce. Let's cover a small set of aggregation functions and their SQL equivalents inspired by the Mongo docs as follows:.

Count

SQL
SELECT COUNT(*) FROM employees
MongoDB
db.users.employees([ 
{ $group: {_id:null, count:{$sum:1}} }
])

Sum salary where of each employee who are not retired by department

SQL
SELECT dept_name SUM(salary) FROM employees WHERE retired=false GROUP BY dept_name
MongoDB
db.orders.aggregate([ 
 { $match:{retired:false} }, 
 { $group:{_id:"$dept_name",total:{$sum:"$salary"}} } 
])

This is quite an improvement over using MapReduce for common projections like these. Mongo will take care of collecting and summing the results from multiple shards if need be.

Installing MongoDB

Let's mix in some code samples to try out along with the concepts.

To install MongoDB go to their download page, download and untar/unzip the download to ~/mongodb-platform-version/. Next you want to create the directory that will hold the data and create a mongodb.config file (/etc/mongodb/mongodb.config) that points to said directory as follows:

Listing: Installing MongoDB

$ sudo mkdir /etc/mongodb/data


$ cat /etc/mongodb/mongodb.config 
dbpath=/etc/mongodb/data

The /etc/mongodb/mongodb.config has one line dbpath=/etc/mongodb/data that tells mongo where to put the data. Next, you need to link mongodb to /usr/local/mongodb and then add it to the path environment variable as follows:

Listing: Setting up MongoDB on your path

$ sudo ln -s  ~/mongodb-platform-version/  /usr/local/mongodb
$ export PATH=$PATH:/usr/local/mongodb/bin

Run the server passing the configuration file that we created earlier.

Listing: Running the MongoDB server

$ mongod --config /etc/mongodb/mongodb.config

Mongo comes with a nice console application called mongo that let's you execute commands and JavaScript. JavaScript to Mongo is what PL/SQL is to Oracle's database. Let's fire up the console app, and poke around.

Firing up the mongos console application

$ mongo
MongoDB shell version: 2.0.4
connecting to: test
…
> db.version()
2.0.4
> 

One of the nice things about MongoDB is the self describing console. It is easy to see what commands a MongoDB database supports with the db.help() as follows:

Client: mongo db.help()

> db.help()
DB methods:
db.addUser(username, password[, readOnly=false])
db.auth(username, password)
db.cloneDatabase(fromhost)
db.commandHelp(name) returns the help for the command
db.copyDatabase(fromdb, todb, fromhost)
db.createCollection(name, { size : ..., capped : ..., max : ... } )
db.currentOp() displays the current operation in the db
db.dropDatabase()
db.eval(func, args) run code server-side
db.getCollection(cname) same as db['cname'] or db.cname
db.getCollectionNames()
db.getLastError() - just returns the err msg string
db.getLastErrorObj() - return full status object
db.getMongo() get the server connection object
db.getMongo().setSlaveOk() allow this connection to read from the nonmaster member of a replica pair
db.getName()
db.getPrevError()
db.getProfilingStatus() - returns if profiling is on and slow threshold 
db.getReplicationInfo()
db.getSiblingDB(name) get the db at the same server as this one
db.isMaster() check replica primary status
db.killOp(opid) kills the current operation in the db
db.listCommands() lists all the db commands
db.logout()
db.printCollectionStats()
db.printReplicationInfo()
db.printSlaveReplicationInfo()
db.printShardingStatus()
db.removeUser(username)
db.repairDatabase()
db.resetError()
db.runCommand(cmdObj) run a database command.  if cmdObj is a string, turns it into { cmdObj : 1 }
db.serverStatus()
db.setProfilingLevel(level,{slowms}) 0=off 1=slow 2=all
db.shutdownServer()
db.stats()
db.version() current version of the server
db.getMongo().setSlaveOk() allow queries on a replication slave server
db.fsyncLock() flush data to disk and lock server for backups
db.fsyncUnock() unlocks server following a db.fsyncLock() 

You can see some of the commands refer to concepts we discussed earlier. Now let's create a employee collection, and do some CRUD operations on it.

Create Employee Collection

 > use tutorial; 
switched to db tutorial 
> db.getCollectionNames(); [ ]
 > db.employees.insert({name:'Rick Hightower', gender:'m', gender:'m', phone:'520-555-1212', age:42}); 
Mon Apr 23 23:50:24 [FileAllocator] allocating new datafile /etc/mongodb/data/tutorial.ns, ...

The use command uses a database. If that database does not exist, it will be lazily created the first time we access it (write to it). The db object refers to the current database. The current database does not have any document collections to start with (this is why db.getCollections() returns an empty list). To create a document collection, just insert a new document. Collections like databases are lazily created when they are actually used. You can see that two collections are created when we inserted our first document into the employees collection as follows:

 

> db.getCollectionNames();
[ "employees", "system.indexes" ]

The first collection is our employees collection and the second collection is used to hold onto indexes we create.

To list all employees you just call the find method on the employees collection.

> db.employees.find()
{ "_id" : ObjectId("4f964d3000b5874e7a163895"), "name" : "Rick Hightower", 
    "gender" : "m", "phone" : "520-555-1212", "age" : 42 }

The above is the query syntax for MongoDB. There is not a separate SQL like language. You just execute JavaScript code, passing documents, which are just JavaScript associative arrays, err, I mean JavaScript objects. To find a particular employee, you do this:

> db.employees.find({name:"Bob"})

Bob quit so to find another employee, you would do this:

> db.employees.find({name:"Rick Hightower"})
{ "_id" : ObjectId("4f964d3000b5874e7a163895"), "name" : "Rick Hightower", "gender" : "m", "phone" : "520-555-1212", "age" : 42 }

The console application just prints out the document right to the screen. I don't feel 42. At least I am not 100 as shown by this query:

> db.employees.find({age:{$lt:100}})
{ "_id" : ObjectId("4f964d3000b5874e7a163895"), "name" : "Rick Hightower", "gender" : "m", "phone" : "520-555-1212", "age" : 42 }

Notice to get employees less than a 100, you pass a document with a subdocument, the key is the operator ($lt), and the value is the value (100). Mongo supports all of the operators you would expect like $lt for less than, $gt for greater than, etc. If you know JavaScript, it is easy to inspect fields of a document, as follows:

> db.employees.find({age:{$lt:100}})[0].name
Rick Hightower

If we were going to query, sort or shard on employees.name, then we would need to create an index as follows:

db.employees.ensureIndex({name:1}); //ascending index, descending would be -1

Indexing by default is a blocking operation, so if you are indexing a large collection, it could take several minutes and perhaps much longer. This is not something you want to do casually on a production system. There are options to build indexes as a background task, to setup a unique index, and complications around indexing on replica sets, and much more. If you are running queries that rely on certain indexes to be performant, you can check to see if an index exists with db.employees.getIndexes(). You can also see a list of indexes as follows:

> db.system.indexes.find()
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "tutorial.employees", "name" : "_id_" }

By default all documents get an object id. If you don't not give it an object an _id, it will be assigned one by the system (like a criminal suspects gets a lawyer). You can use that _id to look up an object as follows with findOne:

> db.employees.findOne({_id : ObjectId("4f964d3000b5874e7a163895")})
{ "_id" : ObjectId("4f964d3000b5874e7a163895"), "name" : "Rick Hightower", 
   "gender" : "m", "phone" : "520-555-1212", "age" : 42 }

Java and MongoDB

Pssst! Here is a dirtly little secret. Don't tell your Node.js friends or Ruby friends this. More Java developers use MongoDB than Ruby and Node.js. They just are not as loud about it. Using MongoDB with Java is very easy. 

The language driver for Java seems to be a straight port of something written with JavaScript in mind, and the usuability suffers a bit because Java does not have literals for maps/objects like JavaScript does. Thus an API written for a dynamic langauge does not quite fit Java. There can be a lot of useability improvement in the MongoDB Java langauge driver (hint, hint). There are alternatives to using just the straight MongoDB language driver, but I have not picked a clear winner (mjormmorphia, and Spring data MongoDB support). I'd love just some usuability improvements in the core driver without the typical Java annotation fetish, perhaps a nice Java DAO DSL (see section on criteria DSL if you follow the link). 

Setting up Java and MongoDB

Let's go ahead and get started then with Java and MongoDB.

Download latest mongo driver from github (https://github.com/mongodb/mongo-java-driver/downloads), then put it somewhere, and then add it to your classpath as follows:

$ mkdir tools/mongodb/lib
$ cp mongo-2.7.3.jar tools/mongodb/lib

Assuming you are using Eclipse, but if not by now you know how to translate these instructions to your IDE anyway. The short story is put the mongo jar file on your classpath. You can put the jar file anywhere, but I like to keep mine in ~/tools/.

If you are using Eclipse it is best to create a classpath variable so other projects can use the same variable and not go through the trouble. Create new Eclipse Java project in a new Workspace. Now right click your new project, open the project properties, go to the Java Build Path->Libraries->Add Variable->Configure Variable shown in figure 7.

Figure 7: Adding Mongo jar file as a classpath variable in Eclipse

Eclipse, Java, MongoD

For Eclipse from the "Project Properties->Java Build Path->Libraries", click "Add Variable", select "MONGO", click "Extend…", select the jar file you just downloaded.

Figure 8: Adding Mongo jar file to your project

Eclipse, MongoDB, Java

Once you have it all setup, working with Java and MongoDB is quite easy as shown in figure 9.

Figure 9 Using MongoDB from Eclipse

Java and MongoDB

The above is roughly equivalent to the console/JavaScript code that we were doing earlier. The BasicDBObject is a type of Map with some convenience methods added. The DBCursor is like a JDBC ResultSet. You execute queries with DBColleciton. There is no query syntax, just finder methods on the collection object. The output from the above is:

Out:
{ "_id" : { "$oid" : "4f964d3000b5874e7a163895"} , "name" : "Rick
Hightower" , "gender" : "m" , "phone" : "520-555-1212" ,
"age" : 42.0}
{ "_id" : { "$oid" : "4f984cce72320612f8f432bb"} , "name" : "Diana
Hightower" , "gender" : "f" , "phone" : "520-555-1212" ,
"age" : 30}

 

Once you create some documents, querying for them is quite simple as show in figure 10.

Figure 10: Using Java to query MongoDB 

Java and MongoDB

The output from figure 10 is as follows:

Rick?
{ "_id" : { "$oid" : "4f964d3000b5874e7a163895"} , "name" : "Rick
Hightower" , "gender" : "m" , "phone" : "520-555-1212" ,
"age" : 42.0}
Diana?
{ "_id" : { "$oid" : "4f984cae72329d0ecd8716c8"} , "name" : "Diana
Hightower" , "gender" : "f" , "phone" : "520-555-1212" ,
"age" : 30}

Diana by object id?
{ "_id" : { "$oid" : "4f984cce72320612f8f432bb"} , "name" : "Diana
Hightower" , "gender" : "f" , "phone" : "520-555-1212" ,
"age" : 30}

Just in case anybody wants to cut and paste any of the above, here it is again all in one go in the following listing.

Listing: Complete Java Listing

package com.mammatustech.mongo.tutorial;

import org.bson.types.ObjectId;

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.Mongo;
import com.mongodb.DB;

public class Mongo1Main {
	public static void main (String [] args) throws Exception {
		Mongo mongo = new Mongo();
		DB db = mongo.getDB("tutorial");
		DBCollection employees = db.getCollection("employees");
		employees.insert(new BasicDBObject().append("name", "Diana Hightower")
		  .append("gender", "f").append("phone", "520-555-1212").append("age", 30));
		DBCursor cursor = employees.find();
		while (cursor.hasNext()) {
			DBObject object = cursor.next();
			System.out.println(object);
		}
		
		//> db.employees.find({name:"Rick Hightower"})
		cursor=employees.find(new BasicDBObject().append("name", "Rick Hightower"));
		System.out.printf("Rick?\n%s\n", cursor.next());
		
		//> db.employees.find({age:{$lt:35}})	
		BasicDBObject query = new BasicDBObject();
	        query.put("age", new BasicDBObject("$lt", 35));
		cursor=employees.find(query);
		System.out.printf("Diana?\n%s\n", cursor.next());
		
		//> db.employees.findOne({_id : ObjectId("4f984cce72320612f8f432bb")})
		DBObject dbObject = employees.findOne(new BasicDBObject().append("_id", 
				new ObjectId("4f984cce72320612f8f432bb")));
		System.out.printf("Diana by object id?\n%s\n", dbObject);
		
		

	}
}

Please note that the above is completely missing any error checking, or resource cleanup. You will need do some of course (try/catch/finally, close connection, you know that sort of thing).

Python MongoDB Setup

Setting up Python and MongoDB are quite easy since Python has its own package manager.

To install mongodb lib for Python MAC OSX, you would do the following:

$ sudo env ARCHFLAGS='-arch i386 -arch x86_64'
$ python -m easy_install pymongo

To install Python MongoDB on Linux or Windows do the following:

$ easy_install pymongo

or

$ pip install pymongo

If you don't have easy_install on your Linux box you may have to do some sudo apt-get install python-setuptools or sudo yum install python-setuptools iterations, although it seems to be usually installed with most Linux distributions these days. If easy_install or pip is not installed on Windows, try reformatting your hard disk and installing a real OS, or if that is too inconvient go here.

Once you have it all setup, you will can create some code that is equivalent to the first console examples as shown in figure 11.

Figure 11: Python code listing part 1

Python, MongoDB, Pymongo

Python does have literals for maps so working with Python is much closer to the JavaScript/Console from earlier than Java is. Like Java there are libraries for Python that work with MongoDB (MongoEngine, MongoKit, and more). Even executing queries is very close to the JavaScript experience as shown in figure 12.

Figure 12: Python code listing part 2

Python, MongoDB, Pymongo

Here is the complete listing to make the cut and paste crowd (like me), happy.

Listing: Complete Python listing

import pymongo
from bson.objectid import ObjectId


connection = pymongo.Connection()

db = connection["tutorial"]
employees = db["employees"]

employees.insert({"name": "Lucas Hightower", 'gender':'m', 'phone':'520-555-1212', 'age':8})

cursor = db.employees.find()
for employee in db.employees.find():
    print employee


print employees.find({"name":"Rick Hightower"})[0]


cursor = employees.find({"age": {"$lt": 35}})
for employee in cursor:
     print "under 35: %s" % employee


diana = employees.find_one({"_id":ObjectId("4f984cce72320612f8f432bb")})
print "Diana %s" % diana

The output for the Python example is as follows:

{u'gender': u'm', u'age': 42.0, u'_id': ObjectId('4f964d3000b5874e7a163895'), u'name': u'Rick Hightower', u'phone':
u'520-555-1212'} 

{u'gender': u'f', u'age': 30, u'_id': ObjectId('4f984cae72329d0ecd8716c8'), u'name': u'Diana Hightower', u'phone':
u'520-555-1212'} 

{u'gender': u'm', u'age': 8, u'_id': ObjectId('4f9e111980cbd54eea000000'), u'name': u'Lucas Hightower', u'phone':
u'520-555-1212'} 

All of the above but in PHP

Node.js , Ruby, and Python in that order are the trend setter crowd in our industry circa 2012. Java is the corporate crowd, and PHP is the workhorse of the Internet. The "get it done" crowd. You can't have a decent NoSQL solution without having good PHP support.

To install MongoDB support with PHP use pecl as follows:

$ sudo pecl install mongo

Add the mongo.so module to php.ini.

extension=mongo.so

Then assuming you are running it on apache, restart as follows:

$ apachectl stop
$ apachectl start

Figure 13 shows our roughly equivalent code listing in PHP.

Figure 13 PHP code listing

PHP MongoDB

The output for figure 13 is as follows:

Output:
array ( '_id' => MongoId::__set_state(array( '$id' => '4f964d3000b5874e7a163895', )), 'name' => 'Rick Hightower', 
'gender' => 'm', 'phone' => '520-555-1212', 'age' => 42, )

array ( '_id' => MongoId::__set_state(array( '$id' => '4f984cae72329d0ecd8716c8', )), 'name' => 'Diana Hightower', 'gender' => ‘f', 
'phone' => '520-555-1212', 'age' => 30, )

array ( '_id' => MongoId::__set_state(array( '$id' => '4f9e170580cbd54f27000000', )), 'gender' => 'm', 'age' => 8, 'name' => 'Lucas Hightower', 
'phone' => '520-555-1212', )

The other half of the equation is in figure 14.

Figure 14 PHP code listing

PHP, MongoDB

The output for figure 14 is as follows:

Output
Rick? 
array ( '_id' => MongoId..., 'name' => 'Rick Hightower', 'gender' => 'm', 
'phone' => '520-555-1212', 'age' => 42, )
Diana? 
array ( '_id' => MongoId::..., 'name' => 'Diana Hightower', 'gender' => ‘f', 
'phone' => '520-555-1212', 'age' => 30, )
Diana by id? 
array ( '_id' => MongoId::..., 'name' => 'Diana Hightower', 'gender' => 'f', 
'phone' => '520-555-1212', 'age' => 30, )

Here is the complete PHP listing.

PHP complete listing

<!--?php

$m = new Mongo();
$db = $m->selectDB("tutorial");
$employees = $db->selectCollection("employees");
$cursor = $employees->find();

foreach ($cursor as $employee) {
  echo var_export ($employee, true) . "< br />";
}

$cursor=$employees->find( array( "name" => "Rick Hightower"));
echo "Rick? < br /> " . var_export($cursor->getNext(), true);

$cursor=$employees->find(array("age" => array('$lt' => 35)));
echo "Diana? < br /> " . var_export($cursor->getNext(), true);

$cursor=$employees->find(array("_id" => new MongoId("4f984cce72320612f8f432bb")));
echo "Diana by id? < br /> " . var_export($cursor->getNext(), true);
?>

If you like Object mapping to documents you should try the poorly named MongoLoid for PHP

.

Additional shell commands, learning about MongoDB via the console

One of the nice things that I appreciate about MongoDB, that is missing from some other NoSQL implementation, is getting around and finding information with the console application. They seem to closely mimic what can be done in MySQL so if you are familiar with MySQL, you will feel fairly at home in the mongo console.

To list all of the databases being managed by the mongod instance, you would do the following:

> show dbs
local	(empty)
tutorial	0.203125GB

To list the collections being managed by a database you could do this:

> show collections
employees
system.indexes

To show a list of users, you do the following:

> show users

To look at profiling and logging, you do this:

> show profile
db.system.profile is empty
Use db.setProfilingLevel(2) will enable profiling
..

> show logs
global

> show log global
Mon Apr 23 23:33:14 [initandlisten] MongoDB starting : 
   pid=11773 port=27017 dbpath=/etc/mongodb/data 64-bit…
…
Mon Apr 23 23:33:14 [initandlisten] options: 
   { config: "/etc/mongodb/mongodb.config", dbpath: "/etc/mongodb/data" }

Also the commands and JavaScript functions themselves have help associated with them. To see all of the operations that DBCollection supports you could do this:

 > db.employees.help()
DBCollection help
	
db.employees.find().help() - show DBCursor help
db.employees.count()
db.employees.dataSize()
db.employees.distinct( key ) - eg. db.employees.distinct( 'x' )
db.employees.drop() drop the collection
db.employees.dropIndex(name)
db.employees.dropIndexes()
db.employees.ensureIndex(keypattern[,options]) - options is an object with these possible fields: name, unique, dropDups
db.employees.reIndex()
db.employees.find([query],[fields]) - query is an optional query filter. fields is optional set of fields to return.
    e.g. db.employees.find( {x:77} , {name:1, x:1} )
db.employees.find(...).count()
db.employees.find(...).limit(n)
db.employees.find(...).skip(n)
db.employees.find(...).sort(...)
db.employees.findOne([query])
db.employees.findAndModify( { update : ... , remove : bool [, query: {}, sort: {}, 'new': false] } )
db.employees.getDB() get DB object associated with collection
db.employees.getIndexes()
db.employees.group( { key : ..., initial: ..., reduce : ...[, cond: ...] } )
db.employees.mapReduce( mapFunction , reduceFunction , {optional params=""} )
db.employees.remove(query)
db.employees.renameCollection( newName , {droptarget} ) renames the collection.
db.employees.runCommand( name , {options} ) runs a db command with the given name where the first param is the collection name
db.employees.save(obj)
db.employees.stats()
db.employees.storageSize() - includes free space allocated to this collection
db.employees.totalIndexSize() - size in bytes of all the indexes
db.employees.totalSize() - storage allocated for all data and indexes
db.employees.update(query, object[, upsert_bool, multi_bool])
db.employees.validate( {full} ) – SLOW
db.employees.getShardVersion() - only for use with sharding
db.employees.getShardDistribution() - prints statistics about data distribution in the cluster ...

Just to show a snake eatings its own tail, you can even get help about help as follows:

> help
	db.help()                    help on db methods
	db.mycoll.help()       help on collection methods
	rs.help()                     help on replica set methods
	help admin                administrative help
	help connect             connecting to a db help
	help keys                    key shortcuts
	help misc                    misc things to know
	help mr                       mapreduce
> help
…
show dbs                     show database names
show collections        show collections in current database
show users                  show users in current database
show profile                show most recent system.profile entries time>= 1ms

show logs                    show the accessible logger names
show log [name]        prints out the last segment of log in memory, 

The MongoDB console reminds me of a cross between the MySQL console and Python's console. Once you use the console, it is hard to imagine using a NoSQL solution that does not have a decent console (hint, hint).

Conclusion

The poorly named NoSQL movement seems to be here to stay. The need to have reliable storage that can be easily queried without the schema migration pain and scalability woes of traditional RDBMS is increasing. Developers want more agile systems without the pain of schema migration. MongoDB is a good first step for developers dipping their toe into the NoSQL solution space. It provides easy to use tools like its console and provides many language drivers. This article covered the basics of MongoDB architecture, caveats and how to program in MongoDB for Java, PHP, JavaScript and Python developers.

I hope you enjoyed reading this article 1/2 as much as I enjoyed writing it. --Rick Hightower

Resources

About the Author

Rick Hightower, CTO of Mammatus, has worked as a CTO, Director of Development and a Developer for the last 20 years. He has been involved with J2EE since its inception. He worked at an EJB container company in 1999, and did large scale Java web development since 2000 (JSP, Servlets, etc.). He has been working with Java since 1996 (Python since 1997), and writing code professionally since 1990. Rick was an early Spring enthusiast. Rick enjoys bouncing back and forth between C, PHP, Python, Groovy and Java development. He has written several programming and software development books as well as many articles and editorials for journals and magazines over the years. Lately he has been focusing on NoSQL, and Cloud Computing development.

 

 

Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Informative by brian dilley

I hadn't read about the Aggregation Framework yet - very interesting stuff!

Great Informative Post by omprakash vatlam

This is awesome Rick, Its one of the best article I have read apart from mongo db documentation. Its a great article for starters. I did spent quite a bit of time to learn about NoSQL and MongoDB. If this article was there, I could have easily chosen mongodb from huge list of NoSQL DBs and learned in a flash. Thanks for writing up a post. Keep up the great work.

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

2 Discuss

Educational Content

General Feedback
Bugs
Advertising
Editorial
InfoQ.com and all content copyright © 2006-2014 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.
Privacy policy
BT