Re: Speller data structures

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view

Re: Speller data structures

Bram Moolenaar

Olaf Seibert wrote:

> Recently I wrote about a trie data structure for spelling word lists.
> There was some doubt as to the memory efficiency. Therefore I did a
> small test with the Polish wordlist, which was reputed to be 60
> megabytes. The one I found was only 40 but was 60 in some expanded form
> for Aspell. My file was only 1 megabyte.

I now (finally!) had time to look closer at this code.  I must say it
looks good.  The extra tricks to share the tails of the tree help a lot
to keep the storage low.  For the languages that I tried the storage is
far less than what was used until now.

The code is a bit low on comments, thus it's difficult to understand
what exactly is going on.  Hopefully I didn't understand something

Main drawback that I see is the linear search to lookup a character in a
list of siblings.  Since most checked words will find a match, this will
be much slower than using a hash lookup.

I think the linear search can be replaced by a binary search.  The
siblings are already sorted, this just requires storing the siblings in
a way to allow the binary search.  This probably disallows sharing
siblings, but that is a small price to pay.

The code becomes a lot simpler, since all the stuff for affix
compression can be omitted.  And the trouble with word vs non-word
characters is gone.  This will avoid bugs and complexity.

Of course there are still a lot of things to do, such as figuring out an
efficient way to store multi-byte text, support for regions, flags for
allcaps, etc.

How To Keep A Healthy Level Of Insanity:
9. As often as possible, skip rather than walk.

 /// Bram Moolenaar -- [hidden email] --   \\\
///        Sponsor Vim, vote for features -- \\\
\\\              Project leader for A-A-P --        ///
 \\\     Buy LOTR 3 and help AIDS victims --   ///