We sharded on userId because it seemed obvious — our biggest collection was user events, and userId was in every query. Two weeks later we had a hot shard serving 70% of traffic while the other three sat mostly idle. We spent a weekend migrating.
Why userId was wrong for us
Our traffic wasn't evenly distributed across users. The top 1% of users generated 40% of events — power users who were active all day. Sharding on userId meant those users' data all landed on the same shard, and that shard got hammered.
MongoDB's balancer moves chunks, not documents. Once a shard key is set, you can't change it without dropping and rebuilding the collection (live resharding via reshardCollection only arrived in MongoDB 5.0, and wasn't an option for us). We had 200GB of data and no downtime window. The migration was ugly.
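If you are on 5.0 or newer, the in-place path looks roughly like this, run from mongos; the namespace and key here are illustrative, not what we ran:

sh.reshardCollection("analytics.events", { userId: 1, timestamp: 1 })

It still rewrites every document under the new key, so plan for extra disk and I/O while it runs, but it avoids the drop-and-rebuild.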
What we should have done
For time-series event data, a compound shard key of { userId: 1, timestamp: 1 } would have distributed writes more evenly: the timestamp suffix gives the key enough cardinality that a single user's events can split across multiple chunks instead of piling into one. Even better for pure write distribution would have been a hashed shard key on userId, which scatters users pseudo-randomly across shards (though any one user's documents still live together on a single shard), accepting that range queries would fan out to multiple shards.
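For reference, here's roughly what each option looks like in mongosh; the analytics.events namespace is made up for illustration, and you'd pick one or the other:

sh.enableSharding("analytics")
// Option 1, compound range key: a hot user's events can split across chunks and migrate
sh.shardCollection("analytics.events", { userId: 1, timestamp: 1 })
// Option 2, hashed key: users scatter pseudo-randomly, but range scans become scatter-gather
sh.shardCollection("analytics.events", { userId: "hashed" })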
The right question isn't "what field is in all my queries" — it's "will writes be evenly distributed across this key's value space". Those are different questions.
Checking shard balance before it becomes a problem
If you're running a sharded cluster, run this against mongos once a week:
db.collection.getShardDistribution()
If any shard is carrying more than about 30% above the average data size or chunk count, you have a hot shard developing. The balancer won't fix a bad shard key: it only moves chunks around, and if your key skews the data, the chunks end up skewed too, some of them jumbo chunks that can't be split or moved at all.
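If you'd rather script the 30% check than eyeball the output, here's a rough sketch you can paste into mongosh against mongos. It counts chunks cluster-wide rather than per collection, which is enough for a first pass:

// Total chunks per shard, compared against the cluster-wide average
const perShard = db.getSiblingDB("config").chunks.aggregate([
  { $group: { _id: "$shard", chunks: { $sum: 1 } } }
]).toArray();
const avg = perShard.reduce((sum, s) => sum + s.chunks, 0) / perShard.length;
perShard.forEach(s => {
  if (s.chunks > 1.3 * avg) {
    print(s._id + " is " + Math.round(100 * (s.chunks / avg - 1)) + "% above the average chunk count");
  }
});

The schema of config.chunks has changed between versions (it's keyed by collection UUID from 5.0 on), so treat this as a monitoring convenience, not something to build tooling on.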
We now do a shard key review before any new collection goes to production. It's a 10-minute conversation that could save a weekend.