Pages

Wednesday, March 30, 2011

Creating large in-memory database on AppEngine

For high performance AppEngine app development, it often comes in handy to store read-only database in memory. Accessing in-memory data is not only orders of magnitude faster than making DataStore or memcached calls but also a lot cheaper as it does not incur any API cost.

However, if you encode data as a Python dictionary or list with a million entries, your app will most likely crash on AppEngine, throwing a distasteful exceptions.MemoryError: Exceeded soft process size limit with 299.98 MB. "But I'm only loading 10MB of data!", you proclaim. Unfortunately, Python may temporarily consume over a gigabyte of memory while parsing and constructing a multi-megabyte dictionary or list.

The first thing you should consider is to simplify the data structure. If possible, flatten your database into one-dimensional lists, which enjoy a smaller memory footprint than dictionaries and multi-level nested lists.

Next, try data serialization using the pickle library. Be sure to use protocol version 2 for maximum efficiency and compactness. For example:
# To serialize data
pickle.dump(data, open('data.bin', 'w'), pickle.HIGHEST_PROTOCOL)
# To deserialize data
data = pickle.load(open(os.path.join(os.path.dirname(__file__), 'data.bin'), 'r'))


As AppEngine does not support the much faster cPickle module ("cPickle" is aliased to "pickle" on AppEngine), your app may time out if you try to unpickle millions of records. One effective solution is to store your data in homogeneous arrays to take advantage of array's highly efficient serialization implementation. Suppose you have a list of a million signed integers, you may first convert the list into a typed array and save it in a binary file:
array.array('i', data).tofile(open('data.bin', 'w'))
Deserializing the array literally takes just a few milliseconds on AppEngine:
data = array.array('i')
data.fromfile(open(os.path.join(os.path.dirname(__file__), 'data.bin'), 'r'), 1000000)


One caveat: To load more than 10MB of data, you will have to split the database into multiple files to work around AppEngine's size limit of static files.

No comments:

Post a Comment