Comments on: lost in the haystack

By: Sugerir resultados de búsqueda con Levenshtein | ¿y por qué no?

Sugerir resultados de búsqueda con Levenshtein | ¿y por qué no? — Mon, 06 Sep 2010 10:25:05 +0000

[...] http://en.nicoptere.net/?p=854 [...]

By: grgrdvrt

grgrdvrt — Sun, 22 Aug 2010 23:25:16 +0000

Bonsoir :D

While I was working on my project at Gobelins, I had to make a simple auto-completion system based on a French dictionary (336,532 words). Quite challenging but Philippe (a guy who is not too bad at programming, you know him?) was here to give me a hand.

We simply did this :
-load a *.txt containing the raw list
-split it on \n
-play with àäèéù….
-make an object containing 26 arrays

Once the file is loaded it takes 5 seconds to my 2-years old laptop to process the 300,000 words.
And Philip found an idea to save this processing time !
He wrote a simple AIR application that uses FileStream.writeObject to store the object into an external file !

The final application just have to load he file and use ByteArray.readObject to get the data with the good structure (and without any stack overflow :D)

For the second part (searching a word in the list) a simple for-each loop in the appropriate array did the job.

See you at Brighton (or before?)

(par respect pour tes lecteurs j’ai écrit en anglais, je crois pas que je le ferai deux fois :D)

By: nicoptere

nicoptere — Sat, 21 Aug 2010 10:07:03 +0000

@Miller Medeiros thanks for the hints, I wasn’t even curious enough to check the improvements. glad I’ve spontaneously implemented some of them (O(m ) instead of O(m,n )), makes me feel smart ^^.
they’re talking about a threshold too, I had that in mind as in my case, the stem is a substring of the candidate.

you’re right about the special chars & the storage, I was thiking of special-char-replacing all the candidates and store the original strings somewhere.
then convert all the candidates into vectors of ints to perform both the Levenshtein & the ‘distance checks’.
this might be outrageously faster despite the pre process, should look into it :)

nice StringUtils class you have there btw :)

@philipp and you haven’t seen the worst names, sometimes I wonder about the parents’ sanity … btw it’s mostly girl’s name that are fucked up, I suspect some undergoing machism behind all that ( like “after all, a girl’s name doesn’t have to be serious if she’s to cook and look after the kids, let’s call her honey-cherry-love” ).

otherwise, I understand your point, it seems more ‘natural’ to have them sorted by length, it’s still possible to sort the results on length + alphabetically or to add their length to the score.

thanks to both of you for passing by :)

By: philipp

philipp — Sat, 21 Aug 2010 09:42:22 +0000

indeed, very interesting … even when i didn’t understand it fully (still a bit sleepy this morning) :)

but just like peter mentioned, i tried some names and was wondering why i went from 10 matches to 1 to 5 to 1 to X. it’s a bit confusing – especially if you type sth like “ge” – didn’t know that this is an actual name ;) – and it only returns the “ge”. seemed broken at first sight – until i read the comments.

i prefer to add some potential matches at the end like so:

ge (as the exact match)
gees
geena
geete
geeske
geesje
geenia
geertrui

but may it’s just me or the way google has been conditioning me over the last years :D

keep up the geek stuff.
cheers phil

By: Miller Medeiros

Miller Medeiros — Fri, 20 Aug 2010 15:56:14 +0000

very nice approach and explanation, the wikipedia entry for Levenshtein distance has some ideas for improvements... one thing that I kind wondered is that maybe it was going to be faster/easier to replace all the special chars of all the names before converting into an Array and keeping a copy of the original strings, you would use more memory but don't know if it would affect that much... - maybe you don't even need 2 arrays, just keep the original as a string and retrieve the original name based on the number of commas before it. I know that the string methods (indexOf, etc...) are faster than RegExp but since you are doing on a loop maybe the RegExp approach would be faster. Here is a simple implementation of a special char replacement: StringUtils.replaceAccents. Cheers!

By: nicoptere

nicoptere — Fri, 20 Aug 2010 07:59:47 +0000

@seraf bisou

@marpi true, I could have done that too :) I my case it was mandatory to have it all embedded.

@og2t I guess you could pile up a test where you write down the candidate in a textfield and store the textfield.textWidth property to alter the score :)

@cay I’m not at ease with regexp actually, it would surely save a lot of process time. I’ll give your option a thourough try on monday, thanks :)

@peter not working would mean ‘*no match found* BUT ;) in that case ‘ge’ is a perfect match and rule n°3 specifies that a perfect match returns a single string which is what occurs here :)

By: Peter Strømberg

Peter Strømberg — Fri, 20 Aug 2010 06:07:28 +0000

Sweet, BUT there seems to be a bug. When I search for georgine, “g” gives a load of results, “ge” gives “0″ results, “geo” again gives a list of results. Other 2 letter combinations work, BUT for some reason the algorithm is biased against General Electric…

By: Cay

Cay — Thu, 19 Aug 2010 23:20:33 +0000

As I tweeted before, great post!… I did a quick test with regexp and it seems pretty fast… don’t know if it suits you though:
var str:String=tf.text;
var pattern:String = “(?<=,)" + str + "[^,]+?(?=,)";
var exp:RegExp= new RegExp(pattern, "g");
trace(a.match(exp));
Searching "ar" in your "a" CSV string took 6ms instead of 20 in your testing app…
Also, indexOf might be even faster if you just need a few results… or if you need all of them and your string is in alphabetical order, you could find the first and last occurance of, for instance, ",ar", and retrieve all words in between…
Of course, you lose the niceness of Levenshtein's "relevancy", but maybe in your case it's not that necessary :)

By: Og2t

Og2t — Thu, 19 Aug 2010 21:48:47 +0000

Awesome stuff Nico! I am gonna use it for one of my old projects, basically need to use all polish words. Additional difficulty is that each word has to have it’s width (for a given font and size), so I can swap the words that have the same width. Do you think I could use your discoverings and search by width as well? (once everything is processed/measured).

Cheers!

By: marpi

marpi — Thu, 19 Aug 2010 21:35:16 +0000

did you try loading the list on runtime?
i didn’t have any problem with loading 15.7 MB .txt file. 1273819 lines.
var arr:Array = txt.split(“\n”)
i’ll send you the link via mail