A better iterator for couchdb-python

The most used python library for CouchDB seems to be couchdb-python. It is a simple little library that makes accessing a couch database a breeze, but it has a serious limitation when using views: The first thing it always does is to fetch all rows in the view. Then it processes them and keeps them as a list in memory. Because it is using simple data types (as opposed to a C extension or at least records), these lists have a huge overhead of about 1 KB per row (ouch!). And, this does not only affect custom views, but also the _all_docs view that is used to iterate over all documents in the database.

The bottom line is that a simple program such as the following can bring couchdb-python to a crawling halt as it tries to load all document ids into memory:

import couchdb
couch = couchdb.Server()
db = couch['mydatabase']

# This is dangerous.
for doc in db:
    pass

The problem is that this is the code shown in the documentation, and that it works just fine for small databases (say, up to 100,000 rows). But once you scale up the workload in your database, it starts to degrade. Python programmers will recognize that this is the same problem as with the range function, and that was fixed by xrange. Unfortunately, the authors of couchdb-python repeated this mistake. This is not good. Database software is all about scaling, and it is easy enough to design your software from ground up around that principle.

To show how it can be done, I wrote a paginating iterator for couchdb-python. Now, pagination in CouchDB is slightly convoluted, but not terribly so. It’s only a couple lines of code:

def couchdb_pager(db, view_name='_all_docs',
                  startkey=None, startkey_docid=None,
                  endkey=None, endkey_docid=None, bulk=5000):
    # Request one extra row to resume the listing there later.
    options = {'limit': bulk + 1}
    if startkey:
        options['startkey'] = startkey
        if startkey_docid:
            options['startkey_docid'] = startkey_docid
    if endkey:
        options['endkey'] = endkey
        if endkey_docid:
            options['endkey_docid'] = endkey_docid
    done = False
    while not done:
        view = db.view(view_name, **options)
        rows = []
        # If we got a short result (< limit + 1), we know we are done.
        if len(view) <= bulk:
            done = True
            rows = view.rows
        else:
            # Otherwise, continue at the new start position.
            rows = view.rows[:-1]
            last = view.rows[-1]
            options['startkey'] = last.key
            options['startkey_docid'] = last.id

        for row in rows:
            yield row.id

It is used like this (but could also be integrated in couchdb-python as the __iter__ method):

import couchdb
couch = couchdb.Server()
db = couch['mydatabase']

# This is always safe.
for doc in couchdb_pager(db):
    pass

The important concept in the implementation is the bulk size that specifies how many records should be requested at once. I have measured wall time and memory use of the above code as a function of the number of records per request. The couchdb-python implementation behaves as if bulk is infinite, which puts it at the far right of the diagram.

The diagram does not show the whole truth: Even though couchdb-python is at the far right, that does not mean it is the fastest solution, for two reasons:

  • At high memory pressure, the performance of the operating system degrades. I can’t show this in the diagram, as I have more RAM than python needs for my database. But for small RAM or large database configurations, the performance will get bad very quickly as the system starts to swap pages in and out of memory.
  • The diagram does not show the latency of the implementation. The latency is the time between processing any two rows. The higher the number of records per request is, the higher is the latency between the last row of a batch and the first row of the next one. There are various aspects to latency (maximum lag, jitter, etc). The latency behavior of couchdb-python is the most uneven possible: First it loads all rows into memory, then processes them all. That means you have to wait a long time until you can process the first row, although then you can process all rows immediately. Most of the time it’s better to process data more evenly in smaller bursts.

A couple of notes on the measured performance data: For small bulk sizes, the overhead due to the high number of small requests increases quickly. The workload shifts towards the CouchDB server (1.6 times the CPU usage of Python). For average bulk sizes, the throughput approaches its optimum, and the workload evens out at 1.2 times the CPU usage of Python for the CouchDB server. Memory usage in Python increases linearly with the bulk size once the initial allocation at 10 MB is exhausted. CouchDB has constant memory use at 20 MB throughout the whole measurement, no matter what the bulk size.

So, what is the optimum bulk size? Clearly, too small bulk sizes increase the overhead due to the high number of small requests. But too large bulk sizes increase the memory footprint and latency problems. The right value is somewhere in the middle. Where exactly? For the given machine configuration, I put it at 5000. There is no increase in throughput for larger bulk sizes, while for smaller bulk sizes, there is a measurable decrease in throughput. Also, I favor real time over memory use, for a simple reason: The only way I can reduce the wall time is by increasing the bulk size. But memory use can be optimized in Python by a factor of 100 by using better data structures to store the row data. So, throughput and latency are the important factors, while memory use is only a secondary concern.

This entry was posted in Uncategorized and tagged , , , , . Bookmark the permalink.

5 Responses to A better iterator for couchdb-python

  1. Daniel Gonzalez says:

    I have just integrated this into a branch of couchdb python and it works very well. I have also added a bulk_get_docs function, lacking in couchdb-python, to get several documents at the same time. This further improves the overall responsiveness. This is my code:

    def bulk_get_docs(self, doc_ids, **options):
    “”” Get several documents in one request, and yield them for consumption”””
    content = {‘keys’ : doc_ids}
    options.update(include_docs = True)
    _, _, data = self.resource.post_json(‘_all_docs’, body=content, headers={‘Content-Type': ‘application/json’}, **options)
    for row in data['rows']:
    yield row['id'], row['doc']

    I am considering a pull request to the couchdb-python mantainers, if you agree to include your code also for the couchdb_pager.

    • Marcus says:

      Oh, sure, I hereby waive whatever rights I might have on the couchdb_pager code, and place it in the public domain. Pull away :)

  2. Jason Priem says:

    Thanks so much for this! Helped me out a lot today.

  3. Wow. Perfect. Thanks!

  4. Wow, you totally saved my ass. Very nice function. =)

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>