You can count on indexes, can't you?
Well... Thank God It's Friday.

It’s always good to know the size of your tables if your database supports JOINs. A query planner uses this information to prepare query plans, as the performance of joins heavily relies on their order. If you have three tables, Person(id, name), Gender(id_person, gender) and Height(id_person, height), and you want a list of women taller than 220 cm, you may want to join Person with Height first and only then with Gender: very few people are that tall, so you discard most of the rows right away and join further only what remains. For this reason relational databases keep so-called “statistics”, and if you don’t mind (slightly) stale results you may be better off taking those values instead of running select count(*) on the table.

In MongoDB you can do the same.

db.collection.stats().count 

Gives you just that. Let’s now get a real number:

db.collection.count()

Just as fast! The magic behind this query is that it gets you the same stale number without even telling you so. You need to use explain to find the villain:

db.a.explain().count().queryPlanner.winningPlan
{
  "stage": "RECORD_STORE_FAST_COUNT"
}

MongoDB documentation states: “Avoid using the db.collection.count() method without a query predicate since without the query predicate, the method returns results based on the collection’s metadata, which may result in an approximate count.”.
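You can see it for yourself with the same explain trick, this time passing an empty query document (a sketch against the collection a from the previous example; expect the same fast-count stage):

db.a.explain().count({}).queryPlanner.winningPlan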

And yes: “{}” does not help. You will need the likes of:

db.a.explain('executionStats').count({'nonexistingfield':{$exists:false}})
db.a.aggregate([{$count:"aa"}])

to force the engine to really go through all the documents, which gets reflected in the query plan:

          "executionTimeMillis": 146,
          "totalKeysExamined": 0,
          "totalDocsExamined": 311000,
          "executionStages": {
            "stage": "COLLSCAN",
            "nReturned": 311000
            }

As you can see, it really took time to return the result as it went through all the documents.

What if you want to count only specific documents? An index is there to help: you can count the number of matching documents by traversing only the index, which may significantly reduce the amount of work. For our database of people generated with “faker profile” we define an index on the job field.
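A minimal sketch of that index, assuming default options and the collection a used throughout this post:

db.a.createIndex({"job":1})

Let’s have a look at some stats: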

db.a.count()
311000
db.a.distinct('job').length
639
db.a.distinct('sex').length
2

639 different jobs and only 2 different genders within 311000 documents.

Now we can count the documents:

db.a.count({"job":"Pharmacologist"})
493
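For the record, the plan excerpts in this and the following examples come from wrapping the same calls in explain, along the lines of the earlier example:

db.a.explain('executionStats').count({"job":"Pharmacologist"})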

      "stage": "COUNT",
      "inputStage": {
        "stage": "COUNT_SCAN",
        "keyPattern": {
          "job": 1
         }
...
    "executionTimeMillis": 1,
    "totalKeysExamined": 494,
    "totalDocsExamined": 0,
...
        "indexBounds": {
          "startKey": {
            "job": "Pharmacologist"
          },
          "startKeyInclusive": true,
          "endKey": {
            "job": "Pharmacologist"
          },
          "endKeyInclusive": true
...          
          }}

All good here. Only relevant index entries scanned and count taken in an astonishing 1 millisecond.

db.a.count({"job":{$ne: "Pharmacologist"}})
310507

      "stage": "COUNT",
      "inputStage": {
        "stage": "IXSCAN",
        "keyPattern": {
          "job": 1
        },
...
    "executionTimeMillis": 177,
    "totalKeysExamined": 310508,
    "totalDocsExamined": 0,
...
      "indexBounds": {
          "job": [
            "[MinKey, \"Pharmacologist\")",
            "(\"Pharmacologist\", MaxKey]"
          ]
...
}}

The stage name has changed but the process looks similar: we traversed the index up to the value we wanted to exclude and then from that value onwards.

Another field that we have in our data set is “blood_donor”, present in only 195593 of the 311000 documents and containing the blood type the person donates.
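The checks below start from a vanilla (non-sparse) index on this field; a minimal sketch, assuming default options:

db.a.createIndex({"blood_donor":1})

Let’s do some checks here: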

db.a.count({"blood_donor":"0-"})
24697

                "executionTimeMillis" : 19,
                "totalKeysExamined" : 24698,
                "totalDocsExamined" : 0,

All good here. No docs are scanned unnecessarily (if we have a proper index). Good execution time.

> db.a.count({"blood_donor":{$ne:"0-"}})
286303

                "executionTimeMillis" : 160,
                "totalKeysExamined" : 286304,
                "totalDocsExamined" : 0

Same.

db.a.count({"blood_donor":{$ne:null}})
195593
                "executionTimeMillis" : 151,
                "totalKeysExamined" : 195594,
                "totalDocsExamined" : 0,

Same. Bear with me.

db.a.count({"blood_donor":{$exists:true}})
195593
                
                "executionTimeMillis" : 455,
                "totalKeysExamined" : 311000,
                "totalDocsExamined" : 311000

And this is the “WAT” part. Why did we have to visit all the keys and not only the relevant ones, and why the hell did we have to touch any documents at all? Those are the questions to ask.

db.a.count({"blood_donor":{$exists:false}})
115407
                "executionTimeMillis" : 159,
                "totalKeysExamined" : 115407,
                "totalDocsExamined" : 115407,

WAT2. We visited only relevant keys but also touched the docs.

This was a vanilla index on the “blood_donor” field and its size was 1167360 bytes. We can create a sparse index to index only the documents where the field exists:

db.a.createIndex({"blood_donor":1},{"sparse":true})

Now the size is only 753664 bytes, which makes sense.
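Index sizes like these can be read from the collection stats; a sketch, assuming the index kept its default name blood_donor_1:

db.a.stats().indexSizes["blood_donor_1"]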

db.a.count({"blood_donor":{$exists:true}})
195593
                "executionTimeMillis" : 76,
                "totalKeysExamined" : 195594,
                "totalDocsExamined" : 0,

Now we’re good.

db.a.count({"blood_donor":{$exists:false}})
115407
                "executionTimeMillis" : 166,
                "totalKeysExamined" : 0,
                "totalDocsExamined" : 311000,

One may think that the way to go would be to skip the index entries we do have and scan only the remaining documents, but that’s not how it works, and we’ll have to live with that here.

There’s a third option though. We can create an index with a partial filter expression to index only chosen documents:

db.a.createIndex({"blood_donor":1},{"partialFilterExpression":{"blood_donor":{$exists:true}}})

This is a bit of an abuse, of course, but the result is seemingly identical: the index size is again 753664 bytes.

db.a.count({"blood_donor":{$exists:false}})
115407
                "executionTimeMillis" : 179,
                "totalKeysExamined" : 0,
                "totalDocsExamined" : 311000,

Fine.

db.a.count({"blood_donor":{$exists:true}})
195593
                "executionTimeMillis" : 315,
                "totalKeysExamined" : 195593,
                "totalDocsExamined" : 195593,

Docs disturbed again.

db.a.count({"blood_donor":{$in:[ "0+", "0-", "A+", "A-", "AB+", "AB-", "B+", "B-" ]}})
195593
                "executionTimeMillis" : 108,
                "totalKeysExamined" : 195593,
                "totalDocsExamined" : 0,

Fine. And now the funny bit:

db.a.count({"blood_donor":{$gt:MinKey}})
195593
                "executionTimeMillis" : 89,
                "totalKeysExamined" : 195594,
                "totalDocsExamined" : 0,

This says: give me all the documents with “blood_donor” greater than MinKey, the lowest possible BSON value. To me this looks like an equivalent of $exists:true, but here, for some reason, the magic works.
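If what you are really after is the number of documents missing the field, one possible workaround (a sketch; it is only as exact as the metadata-based total discussed at the top) is to subtract:

var total = db.a.count()                                // fast metadata-based count, possibly stale
var present = db.a.count({"blood_donor":{$gt:MinKey}})  // index-only, as shown above
total - present                                         // 311000 - 195593 = 115407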

…and yes, I know you noticed: “totalKeysExamined” is 195593 or 195594 depending on the query, and I have no idea why that is. If you know, please write it in the comments section.

Time to finish this post. I was free to spend some time writing it today as it’s:

db.d.insert({"date":ISODate()})
db.d.findOne().date
ISODate("2020-11-20T20:40:54.271Z")

> db.d.aggregate([{$project:{dayOfWeek: { $dayOfWeek: "$date" }}}])
{ "_id" : ObjectId("5fb8544ff1243285a822e0cc"), "dayOfWeek" : 6 }

Satur... wait

> db.d.findOne().date.getDay()
5

Friday!