Thursday, November 25, 2010

What I learned from building on AppEngine

WordGap is an anagram search tool for Scrabble players. I built it over two weekends as a coding exercise to learn more about Python and Google AppEngine. I was aware of AppEngine's many limitations before designing the simple web app, yet I soon got bitten by a few undocumented quirks.

IOError: [Errno 24] Too many open files. The local development server (dev_appserver) leaks file handles and throws this "too many open files" IO exception once a few thousand entities are written to the local datastore. This seems to be a regression, which will hopefully be fixed in the next SDK release. What really surprised me is that dev_appserver rewrites the entire datastore file every time db.put() is called.

How not to store 178,000 words in Datastore

WordGap needs to quickly search anagrams among the 178,000+ words in TWL. It seemed reasonable to me to dump the word list into the datastore, so I wrote an incremental datastore initialization script that stores a thousand words at a time to work around AppEngine's 30-second request deadline. It took almost 40 minutes to construct the datastore in production, using up 2.55 CPU hours (39% of the free daily quota). I churned through another 2.97 CPU hours in 45 minutes to update the database schema. During that time I realized that the rudimentary capabilities of AppEngine's query engine, which essentially supports only inequality operators, are inadequate for WordGap's search needs: WordGap must find words composed of letters from a given set, including blank tiles (wildcard letters), preferably in under 500 milliseconds. I can't think of an efficient datastore query for that without creating an astronomical number of indexes.
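To make the search requirement concrete, here is a minimal Python sketch (my illustration, not WordGap's actual code; the names are mine) of the predicate a query would have to express: can a word be built from a given rack, with '?' standing for a blank tile?

```python
from collections import Counter

def can_form(word, rack):
    """Can `word` be built from the letters in `rack`?
    A '?' in the rack is a blank (wildcard) tile."""
    blanks = rack.count('?')
    need = Counter(word)
    have = Counter(c for c in rack if c != '?')
    # Counter subtraction keeps only positive counts, i.e. the
    # letters (with multiplicity) that the rack is missing.
    missing = sum((need - have).values())
    return missing <= blanks

print(can_form("stop", "potsx"))  # True: rack covers every letter
print(can_form("jazz", "jaz?"))   # True: the blank covers the second 'z'
print(can_form("jazz", "jaz"))    # False: no blank left for the second 'z'
```

Evaluating this needs per-letter counting across the whole rack, which inequality filters alone can't express without an index for every letter combination.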

How not to store 178,000 words in memory

It then occurred to me that I could simply keep all the data in memory. With pre-calculated bitmasks, the whole database is well under 5MB without any compression. AppEngine is surely capable of handling that, right? Well, not if you store all the words in a Python dictionary with 178,000+ keys, as I initially tried. That works flawlessly on the local dev_appserver but completely chokes AppEngine in production with this error message: "Exceeded soft memory limit with 299.965 MB after servicing 1 requests total". Apparently Python's dictionary (hash map) implementation has a huge per-entry memory overhead. I eventually converted the database to two sorted lists, which keeps the average memory consumption under 40MB and achieves an average request latency of ~200ms. The initial request can be twice as slow, though, because AppEngine needs extra time to warm up an app instance, and that happens pretty frequently for low-traffic apps like WordGap.
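For the curious, here's a minimal sketch of the bitmask-plus-sorted-list idea, assuming a 26-bit mask with one bit per distinct letter. This is my reconstruction, not WordGap's actual code; like any set-of-letters mask it ignores duplicate letters, and it treats the blank budget as the number of distinct letters the rack may be missing.

```python
def mask(word):
    """26-bit mask with one bit set per distinct letter in `word`."""
    m = 0
    for c in word:
        m |= 1 << (ord(c) - ord('a'))
    return m

# A flat sorted list of (mask, word) pairs carries far less
# per-entry overhead than a dict keyed by mask.
WORDS = 'pots stop tops spot united untied'.split()
TABLE = sorted((mask(w), w) for w in WORDS)

def search(rack, blanks=0):
    """Words whose distinct letters fit the rack, allowing up to
    `blanks` distinct letters the rack doesn't have."""
    rack_mask = mask(rack.replace('?', ''))
    # m & ~rack_mask leaves only the letters the rack lacks.
    return [w for m, w in TABLE
            if bin(m & ~rack_mask).count('1') <= blanks]

print(search("stop"))    # ['pots', 'spot', 'stop', 'tops']
print(search("untied"))  # ['united', 'untied']
```

A full word list would still need a per-letter count check (as in the comment below about duplicate letters), but the mask makes a cheap first-pass filter over the in-memory list.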

All that said, I'm still quite pleased with AppEngine overall. Deployment and maintenance couldn't be simpler. It's an awesome tool for rapid web application development, as long as you're fully aware of its limitations.


  1. FYI, dev_appserver has a --use_sqlite option that makes it store the local datastore in a SQLite db, which fixes the "Too many open files" error for me.

  2. Could you give some more details on your final data structure? I'm guessing something like this (which does not take into account duplicate letters in a word):

    >>> from itertools import imap
    >>> from operator import mul
    >>> def bitmap(word):
    ...     return sum(imap(mul, bitmap.bits, [c in set(word) for c in bitmap.letters]))
    ...
    >>> bitmap.bits = [2**i for i in range(26)]
    >>> bitmap.letters = map(chr, range(ord('a'), ord('z')+1))
    >>> twl = 'pots stop tops spot united untied'.split(' ')
    >>> d = dict()
    >>> for word in twl:
    ...     d.setdefault(bitmap(word), []).append(word)
    ...
    >>> d
    {835584: ['pots', 'stop', 'tops', 'spot'], 1581336: ['united', 'untied']}

  3. @samwyse: That's quite similar to my implementation except that I stored precomputed bit-masks in a list rather than a dictionary to reduce memory consumption.