Racing the cache

We hit a rather annoying bug at work that only ever triggered in the live environment; we could never reproduce it ourselves. The gist of it was that a group had been created but didn’t exist in the cached group tree, leading to an uncaught exception when trying to walk up the tree from that group.

Since the cache was being properly invalidated on group creation, and given our inability to reproduce the bug, we eventually narrowed it down to a race condition between two of the most annoying parts of a system to debug — the cross-request cache and database transactions.

When a group is created, a transaction is started and the group is inserted into the database. The insert clears the group tree cache. Then the group is moved so that it’s at the bottom of the ordering. Finally, the transaction is committed.

The bug happened when another request rebuilt the group tree between the cache being cleared and the transaction being committed. The new group wasn’t visible outside the transaction yet, so the rebuilt tree was cached without it. That was not fun to track down.
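To make the interleaving concrete, here is a minimal sketch that replays it step by step instead of using real threads. Everything in it is a made-up stand-in rather than our actual code: `ToyDatabase` and `ToyTransaction` only mimic the fact that uncommitted rows are invisible to other requests, and a plain dict plays the part of the cross-request cache.

```python
class ToyDatabase:
    """Hypothetical stand-in for a database: readers only see committed rows."""

    def __init__(self):
        self.committed = {"root": None}  # group name -> parent name

    def begin(self):
        return ToyTransaction(self)


class ToyTransaction:
    """Holds inserts privately until commit() publishes them."""

    def __init__(self, db):
        self.db = db
        self.pending = {}

    def insert_group(self, name, parent):
        self.pending[name] = parent

    def commit(self):
        self.db.committed.update(self.pending)


cache = {}  # stand-in for the cross-request cache


def get_group_tree(db):
    """Return the cached tree, rebuilding it from committed rows on a miss."""
    if "group_tree" not in cache:
        cache["group_tree"] = dict(db.committed)
    return cache["group_tree"]


db = ToyDatabase()
get_group_tree(db)  # warm the cache

# Request A: create the group and invalidate inside the transaction.
tx = db.begin()
tx.insert_group("new-group", "root")
cache.pop("group_tree", None)

# Request B lands before A commits and rebuilds the tree without the group.
get_group_tree(db)

# Request A commits; nothing clears the cache again.
tx.commit()

stale_tree = get_group_tree(db)
print("new-group" in db.committed)  # True
print("new-group" in stale_tree)    # False
```

The group exists in the database but not in the cached tree, which is the state that produced our exception.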

The solution we went with was to invalidate the group tree cache after committing the transaction. This is a fairly naive solution, but it’s the best option for us without putting in a large amount of work to build something nicer.
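As a rough sketch of that ordering, assuming the same kind of toy setup (a dict of committed rows and a dict as the cache, both hypothetical), the only change is where the invalidation sits. The `before_commit` hook exists purely so the example can replay a concurrent request at the awkward moment.

```python
committed = {"root": None}  # rows visible to every request (toy database)
cache = {}                  # stand-in for the cross-request cache


def get_group_tree():
    """Return the cached tree, rebuilding it from committed rows on a miss."""
    if "group_tree" not in cache:
        cache["group_tree"] = dict(committed)
    return cache["group_tree"]


def create_group(name, parent, before_commit=lambda: None):
    pending = {name: parent}       # insert, visible only inside the transaction
    before_commit()                # hypothetical hook: a concurrent request runs here
    committed.update(pending)      # commit
    cache.pop("group_tree", None)  # invalidate only once the group is visible


get_group_tree()  # warm the cache
create_group("new-group", "root", before_commit=get_group_tree)

tree = get_group_tree()
print("new-group" in tree)  # True
```

This is still naive: the old tree is served for the short gap between commit and invalidation, and a rebuild that read the database before the commit but stored its result after the invalidation could, in principle, still slip through.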

Some things I’d like to try building for a more robust solution, and may go over in later posts:

  • While in a database transaction, keep a list of all cache removals and re-run them automatically when the transaction commits.
  • Include transactional support in the caching layer. We’re using Redis, and the closest it has is pipelining and MULTI/EXEC, neither of which is tied to the database transaction, so we’d need a custom middle layer to be able to do this.
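The first idea might look something like the sketch below. It is only an illustration under assumptions: `TransactionAwareCache` is a hypothetical wrapper, an in-memory dict stands in for Redis, and the `commit` callable stands in for whatever actually commits the database transaction. In a real setup the deferred list would belong to the request, while the store would be shared.

```python
from contextlib import contextmanager


class TransactionAwareCache:
    """Hypothetical cache wrapper that replays deletes after a commit."""

    def __init__(self):
        self._store = {}       # stand-in for the shared cache
        self._deferred = None  # None means "not inside a transaction"

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value

    def delete(self, key):
        # Delete straight away, and remember the key if a transaction is open.
        self._store.pop(key, None)
        if self._deferred is not None:
            self._deferred.append(key)

    @contextmanager
    def transaction(self, commit):
        self._deferred = []
        try:
            yield
            commit()
            # Re-run every removal now that the changes are visible.
            for key in self._deferred:
                self._store.pop(key, None)
        finally:
            self._deferred = None


cache = TransactionAwareCache()
cache.set("group_tree", ["root"])
committed = []

with cache.transaction(commit=lambda: committed.append("new-group")):
    cache.delete("group_tree")
    # Simulate a concurrent request re-caching the tree before the commit.
    cache.set("group_tree", ["root"])

print(cache.get("group_tree"))  # None
```

If the block raises, `commit` is never called and the deferred list is simply dropped, so a rolled-back transaction doesn't replay anything.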

What about you? What would you suggest as a possible solution to this? Have you had experience with this sort of problem before? Let me know!