Monday, January 30, 2012

Cause’s Levenshtein Distance Challenge

I hadn’t seen this before, but I bumped into a Python version on the web:

His time of 11 seconds - 12 seconds on my laptop - seems a little slow :)

From the beginning, two things jump out: that he’s using the same splices a lot and that he’s using yield.

So lets tackle the re-use of slices first: if you want the left-part of the current word then you can just getting the slice more than once is redudant; simply put b[:i] into a local variable called left:

import string
w = set(open("word.list").read().splitlines())
f, nf = set(), set(["causes"])

#from b, yield all unused words where levdist==1
def nextgen(b):
    for i in xrange(len(b)): #for each index in b
        left, ch, at, right = b[:i], b[i], b[i:], b[i+1:]
        for c in string.ascii_lowercase: #for letters [a..z]
            if c != ch:
                #substitute b[i] with c
                x = left + c + right
                if x in w: yield x
                #inject c before b[i]
                x = left + c + at
                if x in w: yield x
        #remove b[i]
        x = left + right
        if x in w: yield x
   
    for c in string.ascii_lowercase: #for letters [a..z]
        x = b + c
        if x in w: yield x #append c after b

while len(nf):
    cf = nf
    nf = set(j for i in cf for j in nextgen(i) if j not in f)
    w -= nf
    f |= nf

print len(f)
assert (len(f) == 78482)

There, runtime down to 9.45 seconds on my laptop.

Next, lets tackle yield.  If we just put them into a temporary set instead:

f, nf, w = set(), ["causes"], set(open("word.list").read().splitlines())

def next():
    f = []
    for b in nf:
        for i in range(len(b)): #for each index in b
            left, at, right = b[:i], b[i:], b[i+1:]
            for c in 'abcdefghijklmnopqrstuvwxyz':
                #substitute b[i] with c
                x = left + c + right
                if x in w: f.append(x)
                #inject c before b[i]
                x = left + c + at
                if x in w: f.append(x)
            #remove b[i]
            x = left + right
            if x in w: f.append(x)
        for c in 'abcdefghijklmnopqrstuvwxyz':
            x = (b + c)
            if x in w: f.append(x) #append c after b
    return set(f)

while nf:
    nf = next()
    w -= nf
    f |= nf

print len(f)
assert (len(f) == 78482)

Hmm, only a marginal improvement.  Under 9 seconds.

It would be fun with stackless python or pypy to test out a recursive approach.  But I’m not going to zap my daily python for that test!

With small tweaks, like renaming next(), it runs in shedskin:

    w.difference_update(nf)
    f.update(nf)

Hmm.  22 seconds.  Not going in the right direction :)  I’ll send it in to the shedskin list so they can consider how its hurting.


 ↓ click the "share" button below!