Query Semantic MediaWiki with Angular through CORS

I have a private MediaWiki with the Semantic MediaWiki extension, to keep some personal data. Wouldn’t it be nice to query that data from some other server, or from a web app? Semantic MediaWiki has a nice API that lets us retrieve data in JSON format. But we need to defeat the Same-Origin Policy that protects our servers from evil code. JSONP is a well-known method that works, but only for anonymous requests on public wikis. Here is another approach that also works with closed wikis.

MediaWiki

In LocalSettings.php, you need to allow CORS. You can use the *-wildcard or a list of allowed domains to query your MediaWiki instance:

$wgCrossSiteAJAXdomains = array( '*' );
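Entries can be exact host names or wildcard patterns; for example, something like:

$wgCrossSiteAJAXdomains = array( 'www.example.org', '*.example.org' );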

Also, all API requests must include an origin parameter that repeats the origin from which the request came. This is somewhat annoying, but the MediaWiki developers were concerned about implementing caching properly and efficiently, and this is the solution they came up with.

I am running the example code in a local server with

python -m SimpleHTTPServer 8000

so the origin parameter should be http://localhost:8000 and I don’t need to disable strict origin policy checking for file URIs in my browser.

AngularJS

Here you need to configure the $http service provider to allow cross-domain requests, and to send credentials along with every request globally.

app.config(function($httpProvider) {
    $httpProvider.defaults.useXDomain = true;
    $httpProvider.defaults.withCredentials = true;
});

Logging in

To log in, you need to send a POST request to the MediaWiki API, and follow it up with another POST request to confirm (older versions only require one step). I put this in a controller:

$http.post('http://example.com/wiki/api.php', '',
           { params: { origin: 'http://localhost:8000',
                       format: 'json',
                       action: 'login',
                       lgname: 'marcus',
                       lgpassword: 'secretpassword' },
             /* Prevent CORS preflight.  */
             headers: { "Content-Type": "text/plain" }
            }).success(function(data) {
              if (data.login.result == 'NeedToken') {
                $http.post('http://example.com/wiki/api.php', '',
                           { params: { origin: 'http://localhost:8000',
                                       format: 'json',
                                       action: 'login',
                                       lgname: 'marcus',
                                       lgpassword: 'secretpassword',
                                       lgtoken: data.login.token },
                             /* Prevent CORS preflight.  */
                             headers: { "Content-Type": "text/plain" }
                            }).success(function(data) {
                              $http.get('http://example.com/wiki/api.php',
                                        { params: { origin: 'http://localhost:8000',
                                                    format: 'json',
                                                    action: 'ask',
                                                    query: '[[Category:Contact]]'
                                                  }
                                       }).success(function(data){
                                         pim.contacts = data.query.results;
                                       });
                            });
              }
            });

Not the prettiest of code. A lot of error handling is missing, and so on. But it should get you going. The login process stores session cookies in the $http service, and they are sent along with the following API requests. Of course, you can also query wiki pages with the parse action, etc., as normal.

The special content-type header prevents the pre-flight OPTIONS requests that are specified by CORS, and that are not supported by MediaWiki. If you see unhandled OPTIONS requests in your network log, then you need to take a closer look at the content-type header. I don’t know yet if that is a concern for downloading images from the MediaWiki server. If you try it, leave a comment!


Cython Trouble

Here are a couple of things I experienced using Cython to wrap my C++ library grakopp:

Assignment and Coercion

  • I couldn’t find a nice way to wrap boost::variant. The direct approach works, but assigning to the variant requires an unsafe cast, which also adds the overhead of a copy. To work around this, I used accessor functions (which requires changing the C++ implementation).
  • operator= is not supported, so custom assignment functions can not be declared.
  • There is no other way to add custom coercion rules. The support for STL container coercion is hardcoded in Cython/Compiler/PyrexTypes.py. This also makes the builtin coercions less useful.
  • string coercion seems to be unhappy quite often, so you have to cast even constant strings to string or char*.

Imports

  • Relative cimports are not supported.
  • Unintuitively, a corresponding pxd file is automatically included, and this can not be suppressed. So renaming its imports with “as” in a cimport is not possible.
  • There is no sensible place to put shared pxd files. [Wiki] [Forum]

References and Initializers

  • References can not be initialized with an assignment, so pointers or slow copies have to be used everywhere.
  • Only the default constructor can be used to instantiate stack or instance variables.

Overloading

  • Cython could not resolve function overloading which differed only in the constness of the return type.

Templates

  • It’s not possible to write meta-extension classes for template classes directly – only extension types for specific instantiations. [Forum]
  • It’s not possible to use non-type template arguments (as a workaround, you can use ctypedef int THREE "3"; see the sketch after this list). [Forum]
  • It’s also not possible to instantiate function templates, but there is a similar workaround. [Forum]
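To illustrate the non-type template argument workaround, here is a rough sketch using std::array as an example (compile as C++; the names are made up). Cython believes THREE is a type, but substitutes its C name "3" verbatim into the generated code:

cdef extern from "<array>" namespace "std" nogil:
    cdef cppclass array[T, N]:
        T& operator[](size_t)

ctypedef int THREE "3"

def first_element():
    # Instantiates std::array<int, 3>, although Cython never sees the "3".
    cdef array[int, THREE] a
    a[0] = 42
    return a[0]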

Manual symbol version override

I had to deal with a proprietary software library that wouldn’t run on CentOS 6.5, because the library was compiled against a newer glibc (>= 2.14) while CentOS was running on glibc 2.12. Actually, there was only one symbol versioned later than 2.11, which was memcpy@2.14. It turns out that this is due to a well-known optimization (there is a nice discussion with links here).

Normally, one would install an appropriate version of the library and set LD_LIBRARY_PATH accordingly, but for libc that ain’t so easy, because you also need a matching runtime linker, and the kernel will always use /lib64/ld-linux-x86-64.so.2 or whatever is found in the INTERP program header of the ELF executable. You can run the matching linker manually, but this only works for the invoked process, not for its children. It’s a huge PITA, and short of a chroot filesystem or other virtualization I don’t know a good way to replace the system C library (if you know one, leave a comment!).
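Running the linker manually looks like this, assuming a newer glibc has been unpacked under /opt/glibc-2.14 (a hypothetical path):

 $ /opt/glibc-2.14/lib/ld-linux-x86-64.so.2 --library-path /opt/glibc-2.14/lib ./program

Children started by ./program will again use the system linker from their own INTERP header.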

Anyway, I decided to patch the binary. First, I checked the older version of the library, and saw that it required memcpy@2.2.5. So here we go:

$ readelf -V libfoo.so

Version symbols section '.gnu.version' contains 3627 entries:
 Addr: 000000000003688e  Offset: 0x03688e  Link: 2 (.dynsym)
  000:   0 (*local*)       0 (*local*)       1 (*global*)      1 (*global*)
....
  d7c:   9 (GLIBC_2.14)    1 (*global*)      1 (*global*)      1 (*global*)
....

Version needs section '.gnu.version_r' contains 4 entries:
 Addr: 0x00000000000384e8  Offset: 0x0384e8  Link: 3 (.dynstr)
  000000: Version: 1  File: libgcc_s.so.1  Cnt: 1
  0x0010:   Name: GCC_3.0  Flags: none  Version: 8
  0x0020: Version: 1  File: libstdc++.so.6  Cnt: 3
  0x0030:   Name: CXXABI_1.3.1  Flags: none  Version: 7
  0x0040:   Name: CXXABI_1.3  Flags: none  Version: 6
  0x0050:   Name: GLIBCXX_3.4  Flags: none  Version: 4
  0x0060: Version: 1  File: libc.so.6  Cnt: 3
  0x0070:   Name: GLIBC_2.14  Flags: none  Version: 9
  0x0080:   Name: GLIBC_2.11  Flags: none  Version: 5
  0x0090:   Name: GLIBC_2.2.5  Flags: none  Version: 3
  0x00a0: Version: 1  File: libm.so.6  Cnt: 1
  0x00b0:   Name: GLIBC_2.2.5  Flags: none  Version: 2

Let’s check if reference 0xd7c (== 3452) really is memcpy:

$ readelf --dyn-syms libfoo.so|grep $((0xd7c))
  3452: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND memcpy@GLIBC_2.14 (9)

We look up the specification for the layout of these .gnu.version sections. The plan is to copy the entry for GLIBC_2.2.5 into the entry for GLIBC_2.14, so that all references to version “9” go to glibc 2.2.5 instead of 2.14.

$ od -j $((0x384e8+0x60)) -N $((0x40)) -Ax -t x1 libfoo.so
038548 01 00 03 00 ed b9 01 00 10 00 00 00 40 00 00 00
038558 94 91 96 06 00 00 09 00 43 ba 01 00 10 00 00 00
038568 91 91 96 06 00 00 05 00 4e ba 01 00 10 00 00 00
038578 75 1a 69 09 00 00 03 00 59 ba 01 00 00 00 00 00

We can see the verneed (“version needed”) entry for libc here, together with three vernaux (“version needed auxiliary”) entries. Each vernaux entry consists of a 4-byte hash of the version name (for faster comparison than strcmp; here 0x06969194 for GLIBC_2.14), 2 bytes of flags, a 2-byte version index (the number that entries in the .gnu.version section reference, 9 in this case), a 4-byte offset into the string table pointing at the human-readable version string, and a 4-byte offset to the next vernaux entry (0x10 here, 0 for the last entry).

We want to keep the indirectly referenced version number (“9”), so that there are no duplicate entries, but copy the hash and the string-table offset from the GLIBC_2.2.5 entry. The offset to the next entry stays untouched, too. You can make the change with any hex editor; the following minimal Python sketch performs the same edit (the offsets are the ones found above, with libfoo.so standing in for the real file name):
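import struct

# Offsets taken from the readelf/od output above.
VERNEED = 0x384e8 + 0x60      # verneed entry for libc.so.6
DST = VERNEED + 0x10          # vernaux entry for GLIBC_2.14 (version index 9)
SRC = VERNEED + 0x30          # vernaux entry for GLIBC_2.2.5 (version index 3)

with open('libfoo.so', 'r+b') as f:
    f.seek(SRC)
    src_hash, _, _, src_name, _ = struct.unpack('<IHHII', f.read(16))
    f.seek(DST)
    _, dst_flags, dst_other, _, dst_next = struct.unpack('<IHHII', f.read(16))
    f.seek(DST)
    # Copy hash and name offset; keep flags, version index and next offset.
    f.write(struct.pack('<IHHII', src_hash, dst_flags, dst_other, src_name, dst_next))

After the edit, we have: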

$ od -j $((0x384e8+0x60)) -N $((0x40)) -Ax -t x1 libfoo.so
038548 01 00 03 00 ed b9 01 00 10 00 00 00 40 00 00 00
038558 75 1a 69 09 00 00 09 00 59 ba 01 00 10 00 00 00
038568 91 91 96 06 00 00 05 00 4e ba 01 00 10 00 00 00
038578 75 1a 69 09 00 00 03 00 59 ba 01 00 00 00 00 00

Let’s see if this worked:

$ readelf -V libfoo.so
...
  0x0060: Version: 1  File: libc.so.6  Cnt: 3
  0x0070:   Name: GLIBC_2.2.5  Flags: none  Version: 9
  0x0080:   Name: GLIBC_2.11  Flags: none  Version: 5
  0x0090:   Name: GLIBC_2.2.5  Flags: none  Version: 3
...

$ readelf --dyn-syms libfoo.so|grep $((0xd7c))
  3452: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND memcpy@GLIBC_2.2.5 (9)

There is an extra memcpy@@2.14 reference, but no entry for it in the version table. I can get rid of that with strip --strip-unneeded, if I want to.

This seems to work just fine for me, and in fact it would have worked even if there hadn’t already been a GLIBC_2.2.5 entry, just an entry for some other version. However, if there are more symbols to deal with, we might actually need to edit the individual symbol versions in the .gnu.version section (changing the “9” into a “3” in this case), or do more complicated editing.
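For the record, here is a sketch of that edit in the same style, with the numbers from the readelf output above:

import struct

VERSYM = 0x3688e   # file offset of the .gnu.version section (from readelf -V)
SYM = 0xd7c        # dynamic symbol index of memcpy

with open('libfoo.so', 'r+b') as f:
    f.seek(VERSYM + 2 * SYM)       # each entry is a 2-byte version index
    f.write(struct.pack('<H', 3))  # 3 == GLIBC_2.2.5 in the verneed table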


Virtualization Technologies

Virtualization technology is moving fast, and what used to be hot yesterday is as cold as ice today. There is a lot of material to digest, and a lot of documentation that seems somewhat relevant but can be out of date. Surely, this blog post will suffer the same fate, but nevertheless, here it is: a quick list of the most relevant and up-to-date technology that I could find to set up a small cloud.

KVM provides low-level access to hardware virtualization.

Virtio provides I/O paravirtualization to grant guest systems faster, more direct access to host system peripherals.

Qemu has full support for KVM and virtio. As an extra bonus, it also supports legacy I/O virtualization as well as full system emulation (for example, running ARM systems on x86 hardware). It’s the glue that binds things together into a complete virtual machine (system board and peripherals).

libvirtd is a daemon that manages virtual machine instances and the underlying storage and network devices. This is the level at which users actually interact with the system (the components above are building blocks used only indirectly through the libvirtd interface). libvirtd uses PolicyKit for access control. In addition to the CLI tool virsh, there are many other tools that build on libvirtd, some of them graphical, such as virt-manager.

For a small personal cloud, libvirtd with its basic tools, command line and graphical, may be all you need. Of course, enterprisey users may require more complete management interfaces such as OpenStack etc.

As for the operating system images, it is possible to install from scratch, but that is very old style. Today, most vendors provide OpenStack images in qcow2 format, which can also be used with various cloud providers and are very convenient to use. These are basically pre-installed systems with the cloud-init utility, which runs at boot time and looks in various places for special configuration files. Besides fancy provisioning servers, it is possible to use a simple ISO image with meta-data and user-data files.
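A minimal user-data file could look like this (a made-up example; the #cloud-config marker on the first line is required):

#cloud-config
hostname: testvm
ssh_authorized_keys:
  - ssh-rsa AAAA... user@example.org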

And there you have it, a blueprint for your own personal cloud. Once you have picked up these basic building blocks, understanding integrated solutions such as OpenStack should be much less confusing. Or at least that’s what I am hoping.


Compiling FreeCad on Fedora 20

FreeCad is a very promising free and portable CAD program. Unfortunately, its dependency chain is a bit messy, and building those libraries is not for the faint of heart. Normally, GNU/Linux distributions do a good job on that for you, but in Fedora, the packaging is not quite up to date. The included FreeCad 0.13 works, kinda, but there are crashes and bugs like missing text rendering in Draft mode. As FreeCad is progressing fast, it is useful to build the latest version, and here is how to do just that on Fedora 20.

First, you install the dependencies, except for coin, soqt and python-pivy, which will be built below:

 $ sudo yum install cmake doxygen swig gcc-gfortran gettext dos2unix desktop-file-utils libXmu-devel freeimage-devel mesa-libGLU-devel OCE-devel python python-devel boost-devel tbb-devel eigen3-devel qt-devel qt-webkit-devel ode-devel xerces-c xerces-c-devel opencv-devel smesh-devel freetype freetype-devel libspnav-devel

Then you download and install the Coin3 source package by corsepiu:

 $ wget http://corsepiu.fedorapeople.org/packages/Coin3-3.1.3-4.fc19.src.rpm
 $ rpm -i Coin3-3.1.3-4.fc19.src.rpm

Which you can now build and install:

 $ rpmbuild -bb ~/rpm/SPECS/Coin3.spec
 $ sudo rpm -i ~/rpm/RPMS/x86_64/{Coin3-3.1.3-4.fc20.x86_64.rpm,Coin3-devel-3.1.3-4.fc20.x86_64.rpm}

Note that source packages are installed as a normal user, while binary packages are installed as root. Verify that Coin3 is your active alternative for coin-config:

 $ alternatives --display coin-config
 coin-config - status is manual.
  link currently points to /usr/lib64/Coin3/coin-config
 ...

If it is not, set it with `alternatives --set coin-config /usr/lib64/Coin3/coin-config`.

Now you have to rebuild and install a bunch of packages depending on Coin2 in Fedora 20. By rebuilding them, you make them depend on Coin3, which is what FreeCad expects.

 $ yumdownloader --source SoQt SIMVoleon python-pivy
 $ rpm -i SoQt-1.5.0-10.fc20.src.rpm SIMVoleon-2.0.1-16.fc20.src.rpm python-pivy-0.5.0-6.hg609.fc20.src.rpm
 $ rpmbuild -bb ~/rpm/SPECS/SoQt.spec
 $ sudo rpm -i ~/rpm/RPMS/x86_64/{SoQt-1.5.0-10.fc20.x86_64.rpm,SoQt-devel-1.5.0-10.fc20.x86_64.rpm}
 $ rpmbuild -bb ~/rpm/SPECS/SIMVoleon.spec
 $ sudo rpm -i ~/rpm/RPMS/x86_64/{SIMVoleon-2.0.1-16.fc20.x86_64.rpm,SIMVoleon-devel-2.0.1-16.fc20.x86_64.rpm}
 $ rpmbuild -bb ~/rpm/SPECS/python-pivy.spec
 $ sudo rpm -i ~/rpm/RPMS/x86_64/python-pivy-0.5.0-6.hg609.fc20.x86_64.rpm

Now you can finally download and build FreeCad:

 $ git clone git@github.com:lambdafu/FreeCAD_sf_master.git freecad
 $ mkdir build
 $ cd build
 $ cmake ../freecad
 $ make -j 8

This will take a while. When it has finished, you can start FreeCad with:

 $ bin/FreeCad

Have Fun!


OpenSSH authorized_key options by version

This should be in the official documentation, but for what it’s worth:

All versions of OpenSSH support the following options in authorized_keys:
command="", environment="", from="", no-agent-forwarding, no-port-forwarding, no-pty, no-X11-forwarding.

Starting with version 2.5.2, OpenSSH supports permitopen.

Starting with version 4.3, OpenSSH supports tunnel.

Starting with version 4.9, OpenSSH supports no-user-rc.

Starting with version 5.4, OpenSSH supports cert-authority.

Starting with version 5.6, OpenSSH supports principals.
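As an example, here is an authorized_keys line that restricts a key to a single command and disables most forwarding, using only options supported by all versions (path and key are made up and shortened):

command="/usr/local/bin/backup.sh",from="192.168.1.*",no-pty,no-port-forwarding,no-agent-forwarding,no-X11-forwarding ssh-rsa AAAA... backup@example.org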


Bayesian inference introduction

I wrote a small introduction to Bayesian inference, but because it is pretty heavy on math, I used the format of an IPython notebook. Bayesian inference is an important process in machine learning, with many real-world applications, but if you were born any time in the 20th century, you most likely learned probability theory from a frequentist point of view. One reason may be that calculating some of the integrals in Bayesian statistics was too difficult without computers, so frequentist statistics was more economical. Today, we have much better tools, and Bayesian statistics has become feasible. In 2010, the US Food and Drug Administration issued a guidance document explaining some of the situations where Bayesian statistics is appropriate. Overall, it seems a big change is happening in how we evaluate statistical data, with clearer models and more precise results that make better use of the available data, even in challenging situations.


6 things you didn’t know about MediaWiki

… (and were afraid to ask).

HTML Tag Scope: If you mix HTML tags with wikitext, which is allowed for so-called “transparent tags”, MediaWiki checks the element nesting in a preprocessing step, independently of the wikitext structure (include/Sanitizer.php::removeHTMLtags). Later on, when parsing the wikitext, some elements may be closed automatically (for example at the end of a block). The now-dangling close tag will then be ignored, although it is detached from its counterpart by that point:

<span style="color: red">test
 this</span>

will result in:

test

this

while

test
 this</span>

will result in:

test

this</span>

This can happen across a long part of the wikitext document, with many intermediate blocks, so the treatment of close tags has a wide context-sensitivity, which is generally bad for formal parsing.

Breaking the cssCheck: If a CSS style attribute contains character references to invalid Unicode code points, the page renderer terminates with a fatal error from include/normal/UtfNormalUtil.php::codepointToUtf8 called through include/Sanitizer.php::decodeCharReferencesCallback:

<span style="\110000">x</span>

leads to the fatal error:

Asked for code outside of range (1114112)

It’s a rare chance to see an uncaught exception leak through to the user, and it could be avoided by calling include/Sanitizer.php::validateCodepoint first and falling back to UTF8_REPLACEMENT.

Update: I submitted a patch for this to MediaWiki’s code review platform.

HTML attribute junk: You can write just about anything (except <, > or />) in the attribute space of an HTML opening tag, and MediaWiki will ignore it. This even includes a signature, like in the following example:

<strong !@#$%^&*()_foobar?,./';": style="color: red" ~~~~>test</strong>

yields

test

As long as attributes are separated from junk by whitespace, they are preserved (such as the style attribute above).

Missing block elements: You can avoid the generation of paragraph elements (<p>) around inline text by inserting an empty <div style="display:inline"> element on the same line, as in the example below. If you drop the style attribute, the text will be broken into two paragraphs by the browser, though, and the text before and after the div will not connect to preceding or following inline elements.
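For example, following the description above, the empty div keeps this line out of a paragraph element:

This line is not wrapped in a paragraph. <div style="display:inline"></div>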

Table header data synonymity: In a table, after a !, the table data cell separator || is synonymous with the table header cell separator !!:

{|
! header1 || header2
|}

yields

header1 header2

with two table headers. The opposite does not work, though:

{|
| data1 !! data2
|}

yields

data1 !! data2

Note that this example also introduces a non-breaking space character after “data1”, because MediaWiki interprets the following exclamation mark as French punctuation.

Using indent-pre where it is not allowed: In some places, indent-pre (creating <pre> elements by indenting text lines with a space) is disallowed for compatibility. This affects <blockquote>, <p>, <li>, <dt>, and <dd> elements, and it also prevents you from creating new paragraphs and <br/> elements with empty lines. The restriction is only active up to the first block-level element, though, so it is easy to get around:

<ul><li>test


 this
 <div></div>
and


 this
</li></ul>

yields

  • test this

    and

    this
    

which demonstrates the limitation of the restriction.


Replacing native code with Cython

Here is a little exercise in rewriting native code with Cython while not losing performance. It turns out that this requires pulling out all the stops and applying a lot of optimization magic provided by Cython. On the other hand, the resulting code is portable to Windows without worrying about compilers etc.

A real world example

The example code comes from a real-world project, OCRopus, a wonderful collection of OCR tools that uses the latest algorithms in machine learning (such as deep learning) to transform images to text. In ocropy/ocrolib/distance.py, the authors of OCRopus implement the function cdist, which calculates the matrix D(na, nb) of Euclidean distances between all d-dimensional vectors that are the rows of two matrices A(na, d) and B(nb, d); D(i,j) is the distance between the vectors A(i) and B(j). This function is also provided as scipy.spatial.distance.cdist, but that implementation is limited to a single thread, while the OCRopus version supports OpenMP and thus can run faster on several cores.

cdist_native_c = r'''
#include <math.h>
#include <stdio.h>
#include <assert.h>
#include <stdlib.h>
#include <omp.h>

int maxthreads = 1;

void cdist(int d,int na,int nb,float a[na][d],float b[nb][d],float result[na][nb]) {
    int n = na*nb;
#pragma omp parallel for num_threads (maxthreads)
    for(int job=0;job<n;job++) {
        int i = job/nb;
        int j = job%nb;
        double total = 0.0;
        for(int k=0;k<d;k++) {
            float delta = a[i][k]-b[j][k];
            total += delta*delta;
        }
        result[i][j] = sqrt(total);
    }
}
'''

If we run the above code on arrays A(1000, 100) and B(800, 100), we get the resulting distance matrix D(1000, 800) in 44ms on a quadcore Q9550 processor, while SciPy’s cdist function takes 67ms on a single core. Sure, the speedup is not linear by any means, which shows how brutally efficient NumPy’s vectorization strategies are – so either the OCRopus team has more powerful hardware than I do, or they decided that the reduced wall time is well worth the loss in overall efficiency.

Cython to the rescue

Unfortunately, OCRopus’ home-baked native code compiler is not very portable, so we would like to use Cython instead, which conveniently has support for NumPy arrays and OpenMP built in. Cython compiles (possibly annotated) Python code to C code, so we start with a slightly different implementation of the above. (A first version of the code used a[i][k] etc. to access the matrix elements; that was so horribly slow that I didn’t even try to measure it.)

import numpy as np
def native_cdist(a, b, out):
    na = a.shape[0]
    nb = b.shape[0]
    d = a.shape[1]
    for i in range(na):
        for j in range(nb):
            total = 0
            for k in range(d):
                value = a[i,k] - b[j,k]
                total = total + value * value
            out[i,j] = np.sqrt(total)

The code is similar to the native code in that it does not make use of any vectorization in NumPy. Also, because we didn’t use any type annotations, Cython can not perform any optimizations. The performance is consequently very bad at 10s, 150 times slower than SciPy. A few simple type annotations help a lot:

import numpy as np
cimport numpy as np
DTYPE = np.float32
ctypedef np.float32_t DTYPE_t

cdef native_cdist(np.ndarray[DTYPE_t, ndim=2] a, np.ndarray[DTYPE_t, ndim=2] b, np.ndarray[DTYPE_t, ndim=2] out):
    cdef Py_ssize_t na, nb, n, job, d, i, j, k
    cdef DTYPE_t total, value

    na = a.shape[0]
    nb = b.shape[0]
    n = a.shape[1]
    for i in range(na):
        for j in range(nb):
            total = 0
            for k in range(n):
                value = a[i, k] - b[j, k]
                total = total + value * value
            out[i, j] = np.sqrt(total)

This helps tremendously, and the execution time is now down to 130ms, which gets us within a factor of two of SciPy’s code. Cython’s HTML report informs us that we are still going through Python code for the array accesses and for the sqrt function. Let’s deal with the latter first:

cdef extern from "math.h":
    double sqrt(double x) nogil

These statements import the C function sqrt, which avoids a round-trip through Python, saving a whopping 2ms (sqrt is not called in the innermost loop), and we are at 128ms. Before worrying about the NumPy array access, let’s pick some low-hanging fruit:

@cython.boundscheck(False)
@cython.wraparound(False)

Disabling bounds checking reduces the runtime to 95ms, and disabling negative array indices (which we don’t use) reduces it again to 73ms. We have not quite matched SciPy’s performance, but we are getting there. Really, the only thing left to do is to optimize the array access. For that, we have to move on to Cython’s newer memoryview interface, which allows a more precise description of what an array looks like:

cdef void native_cdist(DTYPE_t [:, :] a, DTYPE_t [:, :] b, DTYPE_t [:, :] out):
    pass

Alas, we can not measure any performance gain. The reason is that we have not yet told Cython what restrictions we want to enforce on the array layout. In C, a 2-dimensional array occupies a contiguous chunk of memory, and a[i][j] is found by calculating i * d + j. The strides (the offsets between consecutive elements along each axis) in this case are (d, 1): there are d elements from one row to the next, and 1 element from one column to the next. For such a C-contiguous array, the stride of an axis is simply the product of the dimensions of all following axes. But in Python, slicing operations can create different stride layouts, allowing zero-copy access to the slice:

aa = np.arange(60).reshape(5,4,3)
np.array(aa.strides)/aa.itemsize # Out: array([12,  3,  1])
a = aa[:,:,1]
np.array(a.strides)/a.itemsize # Out: array([12,  3])

If we can assume that the input matrices to our function do not make use of this feature, we can tell Cython to use native C array access to the data, which saves one multiplication for every access:

cdef void native_cdist(DTYPE_t [:, ::1] a, DTYPE_t [:, ::1] b, DTYPE_t [:, ::1] out):
    pass

This now gives us a single-threaded execution time of 66ms, which is equivalent to SciPy’s cdist implementation.

Making it parallel

To make use of several threads, and thus to match the performance of the OCRopus implementation, we need a GIL-free version of the above code (the GIL is the global interpreter lock that ensures that only a single thread runs in the Python interpreter at any time). Luckily, the above optimizations, in particular the move to the memoryview interface, have made our code GIL-free automatically; we just have to declare it as such by appending nogil to the function prototype:

cdef void native_cdist(DTYPE_t [:, ::1] a, DTYPE_t [:, ::1] b, DTYPE_t [:, ::1] out) nogil:
    pass

We are now ready to use the cython.parallel.prange function to run the outer loops in parallel. To do this in an optimal way, we rewrite the two outer loops in the same way the OCRopus code does it:

import numpy as np
cimport numpy as np
cimport cython
from cython.parallel import prange
DTYPE = np.float32
ctypedef np.float32_t DTYPE_t
cdef extern from "math.h":
    double sqrt(double x) nogil

@cython.boundscheck(False)
@cython.wraparound(False)
cdef void native_cdist(DTYPE_t [:, ::1] a, DTYPE_t [:, ::1] b, DTYPE_t [:, ::1] out) nogil:
    cdef Py_ssize_t na, nb, n, job, d, i, j, k
    cdef DTYPE_t total, value

    na = a.shape[0]
    nb = b.shape[0]
    n = na * nb
    d = a.shape[1]
    for job in prange(n):
        i = job / nb
        j = job % nb
        total = 0
        for k in range(d):
            value = a[i,k] - b[j,k]
            total = total + value * value
        out[i,j] = sqrt(total)

When I ran this code, I was surprised: although it used all four cores alright, it actually took 76ms, slower than the single-threaded version. The HTML view in Cython revealed that the division and modulo operators went through Python code. That makes sense, as these operators behave quite differently in Python than in C. Luckily, there is a compiler directive available to change them to the plain C operators:

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cdef void native_cdist(DTYPE_t [:, ::1] a, DTYPE_t [:, ::1] b, DTYPE_t [:, ::1] out) nogil:
    pass

With this change, the runtime dropped to 38ms, which is even faster than the native version in OCRopus by 10%, a resounding success!

Update: The reason for the improvement over the native code is revealed by comparing the generated assembly in both cases: I used a single-precision float for the total variable, while the OCRopus implementation above uses a double. Changing the type in the Cython code brings the runtime up to 45ms, a 3% slow-down compared to OCRopus. Some minor differences remain, which would require even closer scrutiny to figure out.

Testing

Here are the complete source files:

# cdist.pyx
import numpy as np
cimport numpy as np
cimport cython
from cython.parallel import prange
DTYPE = np.float32
ctypedef np.float32_t DTYPE_t
cdef extern from "math.h":
    double sqrt(double x) nogil

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cdef void native_cdist(DTYPE_t [:, ::1] a, DTYPE_t [:, ::1] b, DTYPE_t [:, ::1] out) nogil:
    cdef Py_ssize_t na, nb, n, job, d, i, j, k
    cdef DTYPE_t total, value

    na = a.shape[0]
    nb = b.shape[0]
    n = na * nb
    d = a.shape[1]
    for job in prange(n):
        i = job / nb
        j = job % nb
        total = 0
        for k in range(d):
            value = a[i,k] - b[j,k]
            total = total + value * value
        out[i,j] = sqrt(total)

def cdist(a not None, b not None, out=None):
    # Some additional sanity checks from OCRopus are missing here.
    assert a.dtype == DTYPE and b.dtype == DTYPE
    if out is None:
        out = np.zeros((len(a),len(b)), dtype=DTYPE)
    native_cdist(a,b,out)
    return out

# cdist.pyxbld
def make_ext(modname, pyxfilename):
    from distutils.extension import Extension
    return Extension(name=modname,
                     sources=[pyxfilename],
                     extra_link_args=['-fopenmp'],
                     extra_compile_args=['-fopenmp'])
# test driver
import pyximport
pyximport.install()
import cdist
from pylab import *
a = array(randn(1000,100),'f')
b = array(randn(800,100),'f')
o1 = cdist.cdist(a, b)
from scipy.spatial.distance import cdist as scipy_cdist
o2 = scipy_cdist(a, b)
amin(o1-o2)
amax(o1-o2)
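To compare the timings of the two implementations, here is a rough sketch that can be appended to the test driver above (the names follow the listing; absolute numbers will of course differ per machine):

import timeit
t1 = timeit.timeit(lambda: cdist.cdist(a, b, out=o1), number=10) / 10
t2 = timeit.timeit(lambda: scipy_cdist(a, b), number=10) / 10
print('cython: %.0f ms, scipy: %.0f ms' % (t1 * 1e3, t2 * 1e3))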

Including binary file in executable

A friend asked how to include a binary file in an executable. Under Windows, one would use resource files, but under Linux the basic tools are sufficient to include arbitrary binary data in object files and access them as extern symbols. Here is my example file. To make it more fun, the same file is also a Makefile and a shell script, and the program prints itself when run (without requiring the source file to be present).

Note that the first character in most of these lines is a TAB, not spaces. Get the raw source file from pastebin.com instead of copy-pasting from here.

# /* Self-printer using objcopy. To build, save as foo.c and run "sh foo.c".
dummy= exec make -f foo.c
all:
        objcopy -I binary -O elf64-x86-64 -B i386:x86-64 foo.c foo-src.o
        gcc -g -o foo foo.c foo-src.o

dummy:
#  */
        #include <stdio.h>

        extern char _binary_foo_c_start;
        extern char _binary_foo_c_size;
        extern char _binary_foo_c_end;

        char *bin_start = &_binary_foo_c_start;
        char *bin_end = &_binary_foo_c_end;
        ssize_t bin_size = (ssize_t) &_binary_foo_c_size;

        int
        main (int argc, char *argv[])
        {
          printf ("%*s\n", (int) bin_size, bin_start);
          return 0;
        }