Spell checking using a Bloom filter


Brandt, Servatius
Hello all -

At the weekend I read something about Bloom filters.  I wondered if
they could be used to store the word lists for spell checking and
how much this would reduce the size of memory needed at run-time.
I made some investigation: there is considerable potential for
reduction.

Using a Bloom filter would mean that all good words according to the
spell file are accepted as correct spelling, but there is a small number
of bad words that would be accepted as well.  These are called false
positives.  You can select the probability of false positives as you
like.  The smaller the value you choose, the greater the size of the
Bloom filter in memory.

Is it acceptable that some badly spelled words slip through?  I think
yes, since this will happen anyway:  your spell file might contain a few
misspellings or some words in an alternative spelling you don't like,
and some misspellings are good words in their own right, so you can't
find them at all:  I recently typed "her" somewhere instead of "here";
such typos cannot be marked by spell checking.  As long as the
probability of false positives due to the Bloom filter is somewhat
smaller than the rate of spelling errors that go unmarked for those
other reasons, the Bloom filter will work quite well.

We might think of an argument to ":mkspell" that gives the probability
of false positives you want.  It would also determine the size of the
Bloom filter.  If you want a probability of exactly zero, the spell file
would be stored as it is currently: as a (bigger) trie.  (In fact, since
the number of exact words is needed to compute the size of the Bloom
filter, it would be reasonable to build the trie first, then build the
Bloom filter in the appropriate size and fill it from the trie, which
can then be released.)  If the Bloom filter happens to accept your
favorite typo, use the zw or :spellwrong command to add it as a bad word
to 'spellfile'.

The theory of Bloom filters is explained very well at
http://en.wikipedia.org/wiki/Bloom_filter.  The calculator at
http://www.cc.gatech.edu/fac/Pete.Manolios/bloom-filters/calculator.html
allows you to enter the number n of entries to be stored and the
probability p of false positives (more exactly: the probability that
a bad word is recognized as good).  The calculator will then compute
the number m of bits the Bloom filter will take in memory and the
number k of hash functions it will use.
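
For reference, these are the standard approximation formulas behind such
a calculator (a small sketch; the calculator itself fixes an integral k
first and then solves p = (1 - e^(-kn/m))^k exactly, so its numbers
differ slightly from this approximation):

    import math

    def bloom_size(n, p):
        "Approximate m (bits) and k (hash functions) for n entries at probability p."
        m = int(math.ceil(-n * math.log(p) / math.log(2) ** 2))
        k = int(round(m * math.log(2) / n))
        return m, k

    # For the German word list used below (n = 303407):
    for p in (0.01, 0.001):
        m, k = bloom_size(303407, p)
        print("p = %g: about %d bits (%d bytes), k = %d" % (p, m, (m + 7) // 8, k))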

Well, how much can we reduce memory usage for a reasonable probability
of false positives?  First we need to know the number n of the exact
words accepted by the spell checking file.  There might be a tool to
extract the list of these words from the .dic and .aff files, but for
me it was simpler to hack some code into the store_word() function of
src/spell.c.  A patch is included below; search for the SNIP lines.
Since you are on the development list, you'll know how to produce
a temporary version of Vim that writes out the word list.

I first tried the German spell checking file de_OLDSPELL.zip found at
http://de.openoffice.org/spellcheck/about-spellcheck-detail.html.
":mkspell" reports an estimated memory use of 1324250 bytes for the
trie.  The WORDLIST file produced by running ":mkspell" in the patched
Vim contains 303407 words.  If you enter this as the number n in the
Bloom filter calculator, with a false-positive probability p of 0.01,
it tells you that you need a Bloom filter of 2910570 bits (363822
bytes) with 7 hash functions.  This is a reduction in memory use of
72.5 percent compared to the trie.  For p = 0.001 you need a Bloom
filter of 4362277 bits (545285 bytes) with 10 hash functions, which is
still a reduction of 58.8 percent.
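
Just to show how those reduction percentages are computed, a quick
sketch using the numbers above:

    trie_bytes = 1324250               # ":mkspell" estimate for the trie
    for bits in (2910570, 4362277):    # calculator output for p = 0.01 and 0.001
        bloom_bytes = (bits + 7) // 8
        saved = 100.0 * (1 - float(bloom_bytes) / trie_bytes)
        print("%d bytes, %.1f percent less than the trie" % (bloom_bytes, saved))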

Is this just a promise?  I wanted to try it in practice.
I found a Bloom filter implementation written in Python at
http://www.imperialviolet.org/pybloom.html.  It can insert
only numbers, so I wrote a Python script blspell.py that inserts
the hash() values of strings into it.  This means that the Bloom
filter's hash functions all operate on the integer result of Python's
builtin string hash() function (function composition).  The quality of
the hash functions matters, but the Python builtin hash algorithm seems
to be fairly good.

As an input file to process I needed some text with lots of spelling
errors.  Since I'm doing German spell checking, I used the concatenation
of all the English Vim help files src/runtime/doc/*.txt.  My program
fills the Bloom filter with all the words from the WORDLIST generated by
":mkspell" in the patched Vim, then parses the (English) input file and
tests whether its words are accepted by the Bloom filter.  It writes
two output files, one containing the accepted words and one the
rejected ones.  The positive list may contain false positives, but the
negative list doesn't contain any false negatives.

To estimate the quality of the filter, I loaded the list of positives
into Vim and used German spell checking to find the false positives,
which are just the words that get undercurled.  The number of false
positives divided by the sum of the false positives and the recognized
negatives (that is, the number of all real negatives) gives the rate r
of false positives for the input file used.  It should match the
probability of false positives the Bloom filter promises.

In the following table, "pos" is the number of all positives, "false"
the number of false positives, and "neg" the number of all recognized
negatives.  The rate is

    r = 1. * false / (false + neg)

These are the results:

    p       n       m       k       pos     false     neg          r

  0.01  303407  2910570     7      49355     7501    601885     0.01231
  0.01  303407  2910570     7        813      117     16517     0.00708
  0.001 303407  4362277    10      42152      426    609088     0.00070
  0.001 303407  4362277    10        714       21     16616     0.00126

For the first and third lines I evaluated the lists of positives and
negatives as written by the blspell.py script; for the second and
fourth lines I filtered these lists through "sort -u" before the
evaluation, which removes multiple occurrences of identical words.

The rate r of false positives matches the probability p rather well.
For p = 0.01, the rate is a bit too large when counting multiple
occurrences, but it is considerably lower than p for the unique'd
lists.  For p = 0.001, it's the other way round.

If you want to run my blspell.py script yourself, it is included below;
again, search for the SNIP lines.  How to do it is explained further
down (see "Using blspell.py").

If you load the list of negatives into Vim, all its words should be
undercurled.  You may find some that are not.  This is not a problem
with the Bloom filter, which cannot produce false negatives; it is
a flaw in my Vim patch, which doesn't care about case folding when
writing the WORDLIST file.  It should really generate two word lists,
one for case-folded words and one for keep-case words, and then two
Bloom filters would be needed.  (Bram's implementation uses two tries
as well.)  I found this a bit too complicated for this investigation.

But back to the list of positives: if you see the false positives
undercurled in Vim, you may think this method of spell checking is bad.
But that's just for a psychological reason: these are all well-known
English words, so it's hard to see them as typos for German words.  Had
I used a list of random letter sequences instead of the English text
for the German spell checking, you wouldn't care about the occasional
undercurled ajf4e8jk or QeuREjd.

But how do Bloom filters behave on real text?  Hopefully Bram has not
yet corrected all the typos in the help files, so we can use them to
test the Bloom filter with English spell checking.

I removed the original spell files from the Vim runtime directory and
tried the en_US spell file (20040623 release) from
http://lingucomponent.openoffice.org/spell_dic.html.  The ":mkspell"
command of the patched Vim generates a WORDLIST of 162580 words and
reports an estimated runtime memory use of 822380 bytes for the trie.
I again used the values of m and k computed by the Bloom filter
calculator for n = 162580 and p = 0.01 or p = 0.001.  The Bloom filter
needs 194953 or 292190 bytes, respectively, which is a memory use
reduction of 76.3 or 64.5 percent.

    p       n       m       k       pos     false     neg          r

  0.01  162580  1559623     7     522182      608    127952     0.00473
  0.01  162580  1559623     7       6862       60     11641     0.00513
  0.001 162580  2337517    10     521318       72    128816     0.00056
  0.001 162580  2337517    10       6790        6     11713     0.00051

For the second and fourth lines, the output files of blspell.py have
again been passed through "sort -u".  The list of positives is much
larger than with German spell checking, because the input text was
English in both cases.  The English spell checking still produces lots
of negatives because the help files contain many option names like
'autoindent', function names like strcmp, and so on.

The rate r of false positives is about half the probability p.  I
didn't know why it is so good when I wrote this, but a few lines below
I will.  The 60 false positives for p = 0.01 were mainly names of
variables, functions or persons.  The only one that could be seen as
a real typo is "thirtyfour" instead of "thirty four".  The 6 false
positives for p = 0.001 are: BufWritePost, Haritsis, Roemer, dalet,
gpc, oin (which comes from ":[range]j[oin][!] [flags]").  So there is
not a single real typo in the help files that would go unnoticed
because of using a Bloom filter!

As a long and more realistic English text I downloaded the whole Bible
from http://www.bible.org/netbible.  After stripping off (most) HTML
tags, I used the complete text as input for blspell.py.  These are the
results:

    p       n       m       k       pos     false     neg          r

  0.01  162580  1559623     7     738085      150    713622     0.00021
  0.01  162580  1559623     7      13781       37      7165     0.00514
  0.001 162580  2337517    10     737841       10     71606     0.00014
  0.001 162580  2337517    10      13729        5      7217     0.00069

When evaluating the blspell.py output files directly (first and third
lines), the rate r is much smaller than p.  This is because the text
has obviously been spell-checked already, so all the false positives
are names, which repeat in the text much less frequently than other
words.  When removing duplicates before the evaluation, r is again
about half of p for p = 0.01, and about 70 percent of p for p = 0.001.

The reason for the rate being better than the probability is this:
because of the flaw in my Vim patch for extracting the WORDLIST file,
some correct words are written to the list of negatives, mainly because
of capital letters at the beginning of words and the use of "'".  These
are not really wrong negatives, since they are correctly rejected by
the kind of spell checking defined by the WORDLIST file.  However, some
such words might be erroneously (with regard to the WORDLIST) accepted
by the Bloom filter and written to the list of positives.  When
counting false positives in Vim with the ]s command, I don't count
those words, since Vim uses the original spell files, which don't have
the flaws of the extracted WORDLIST.

To compensate for this, I converted the whole Bible text to lower case
and removed "'s", "n't" and single "'".  Then I get the following
results:

    p       n       m       k       pos     false     neg          r

  0.01  162580  1559623     7     769308      171     38838     0.00438
  0.01  162580  1559623     7      12939       26      4433     0.00583
  0.001 162580  2337517    10     769172       35     38974     0.00090
  0.001 162580  2337517    10      12918        5      4454     0.00112

Now there is not a single word in the list of negatives produced by
blspell.py that is not undercurled when spell-checked in Vim.  So the
difference between the original spell files and my WORDLIST doesn't
matter for this input text.  The rate of false positives matches the
probability perfectly for p = 0.001; for p = 0.01 it is even half that
value.  Five false positives for the whole Bible text seems very low.

It would be very interesting to use a large English input text with lots
of typos.  Did anyone on the net write a book???  Anyway, in my opinion
the large number of unusual names in the Bible can be seen as
representatives for real typos in a normal English text.

Since the Python code takes rather long to run because it is
interpreted, some thoughts about performance and about the work that
would have to be done to add spell checking via a Bloom filter:  Bloom
filters are easy to implement: one hash function for string-to-integer
conversion, and a set of k integer hash functions, each delivering an
index into a large bit-array of m bits (see the code in pybloom.py).
That's all.  Inserting an element into the Bloom filter or a lookup for
a given value always takes the same time: computing the 1 + k hash
values and accessing the k bit positions in the large bit-array.  It
doesn't depend on the value (except maybe for the string hash function)
or on the number of items that have already been inserted into the
Bloom filter.
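
To make that "that's all" a bit more concrete, here is a minimal
stand-alone sketch in Python.  This is not the pybloom code used above,
and deriving the k bit positions from a single MD5 digest ("double
hashing") is just one simple possibility:

    import hashlib
    import struct

    class SimpleBloom(object):
        "Bit array of m bits; each word sets/tests k positions."

        def __init__(self, m, k):
            self.m, self.k = m, k
            self.bits = bytearray((m + 7) // 8)

        def _positions(self, word):
            # Two 64-bit values from one digest; the k indices are
            # derived from them, which is cheap and good enough here.
            h1, h2 = struct.unpack(">QQ",
                                   hashlib.md5(word.encode("utf-8")).digest())
            h2 |= 1
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def insert(self, word):
            for pos in self._positions(word):
                self.bits[pos >> 3] |= 1 << (pos & 7)

        def __contains__(self, word):
            return all(self.bits[pos >> 3] >> (pos & 7) & 1
                       for pos in self._positions(word))

    # Example: bf = SimpleBloom(2910570, 7); bf.insert("hallo");
    # "hallo" in bf is True, "xyzzq" in bf is almost certainly False.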

Now what about the Polish language?  For the pl_PL.zip files from
http://lingucomponent.openoffice.org/spell_dic.html, the ":mkspell"
command reports an estimated memory use of 2441220 bytes.  The WORDLIST
file produced with the patched version has 3295017 words.  The Bloom
filter calculator shows the following results:

    p = 0.01:   k =  7,  m = 29885690 bits = 3735712 bytes
    p = 0.001:  k = 10,  m = 44791796 bits = 5598975 bytes

This is much more than the memory the trie needs.  The trie can take
a big advantage from the common parts of the various words that just
have different suffixes.

I don't know whether the Polish language allows identifying groups of
suffixes so that if a word root can be combined with one of the suffixes
of a group it can also be combined with all other suffixes in that
group.  If yes, the hash function could first replace any member of such
a group by a single group-identifying character before doing the usual
hash algorithm.  Same for prefixes (like ne-, nej-, nejne-).  Then the
Bloom filter could win against the trie with regard to memory consumption,
maybe even if the same normalization is done before storing/looking up
words in the trie.
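
Purely as an illustration of that normalization idea (the suffix groups
below are invented for the example, they are not real Polish
morphology):

    # Each member of a group maps to one marker, so "root + any suffix
    # of the group" is stored and looked up as a single entry.
    SUFFIX_GROUPS = {
        "#1": ("ach", "ami", "om"),    # invented group
        "#2": ("ego", "emu", "ym"),    # invented group
    }

    def normalize(word):
        for marker, suffixes in SUFFIX_GROUPS.items():
            for suffix in suffixes:
                if word.endswith(suffix) and len(word) > len(suffix):
                    return word[:-len(suffix)] + marker
        return word

    # The Bloom filter (or the trie) would then store and look up
    # normalize(word) instead of word itself.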

But even if this is possible, it wouldn't be a generic solution, since
it requires a special hash function that is aware of Polish grammar.

For other languages, however, it could be worth doing spell checking
with Bloom filters to reduce memory usage at the cost of very few,
practically irrelevant false positives.  But even if Bram decides not
to do so, it was great fun to try this out.

- Servatius

                          ====================

Using blspell.py:

    1)  Produce a patched Vim that extracts the WORDLIST file when
        using ":mkspell".  See the patch below.

    2)  Move any additional 'spellfile's you have to a place where they
        are not found by Vim.

    3)  Invoke the patched Vim.  Use ":mkspell" to produce your spell
        file from the .dic and .aff files.  This will also produce a
        file named "WORDLIST" in the current directory that contains
        the list of all accepted words.

    4)  Use "wc -l WORDLIST" to compute the number of n accepted words.

    5)  Use the internet Bloom filter calculator to compute the values
        m and k.  Select p as you want.

    6)  Select an appropriate input file.  Something that causes lots of
        spelling errors, for instance a file in a different language:

             cat /whatever/vim7/runtime/doc/*.txt >IN

    7)  Invoke blspell.py in the following form:

             blspell.py m k WORDLIST IN POS NEG

        This doesn't work for every encoding.  All characters should fit
        in 8 bits and the letters a...z and A...Z should use the ASCII
        code (latin1, for instance).  Python will have to interpret its
        byte code, so be patient.

    8)  Load POS into Vim and use the spell checking for which WORDLIST
        was generated.  Any false positives will be marked as spelling
        errors.  Count them using the ]s command.  You can use
        a sequence of gg10000]s, gg5000]s, gg2000]s, gg1000]s ...
        commands until you no longer get an error flash (set 'vb'),
        then increase the ]s count again...  It takes about half a
        minute to determine the number of false positives.

    9)  Count the number of words in the input file by "wc -l POS NEG".
        Use the total count.  Don't use "wc -w" on the input file, since
        this doesn't count correctly because of punctuation characters
        etc.

    10) Compute the rate: multiply the number of false positives from 8)
        by 100 and divide by the count from 9).

=== SNIP == Vim patch =================================================
--- spell.c.old 2005-06-20 00:14:30.187500000 +0200
+++ spell.c 2005-06-18 19:18:42.078125000 +0200
@@ -2453,6 +2453,8 @@
     hash_clear(&aff->af_suff);
 }
 
+static FILE *wlfd;  /* WORDLIST */
+
 /*
  * Read dictionary file "fname".
  * Returns OK or FAIL;
@@ -2510,6 +2512,8 @@
     if (!vim_isdigit(*skipwhite(line)))
  EMSG2(_("E760: No word count in %s"), fname);
 
+    wlfd = mch_fopen((char *)"WORDLIST", "w");
+
     /*
      * Read all the lines in the file one by one.
      * The words are converted to 'encoding' here, before being added to
@@ -2624,6 +2628,8 @@
  }
     }
 
+    fclose(wlfd);
+
     if (spin->si_ascii && non_ascii > 0)
  smsg((char_u *)_("Ignored %d words with non-ASCII characters"),
          non_ascii);
@@ -3061,6 +3067,9 @@
     char_u foldword[MAXWLEN];
     int res;
 
+    if (wlfd != NULL)
+ fprintf(wlfd, "%s\n", word);
+
     (void)spell_casefold(word, len, foldword, MAXWLEN);
     res = tree_add_word(foldword, spin->si_foldroot, ct | flags, region,
             &spin->si_blocks);
=== SNIP ==============================================================

=== SNIP == blspell.py ================================================
#!/usr/bin/env python

import sys
import pybloom


class Error(Exception): pass


class StringBloom(pybloom.Bloom):
    """ A Bloom filter for strings.

    Uses the integer Bloom filter it is derived from by inserting/checking
    the hash values of strings.  Python's builtin string hash function is
    smart enough that the composition of the Bloom filter's integer hash
    functions with it produces good results.
    """
    def insert(self, val):
        "Hash string to number, then insert using the k hash functions."
        return pybloom.Bloom.insert(self, hash(val))

    def __contains__ (self, val):
        "Hash string to number, then check using the k hash functions."
        return pybloom.Bloom.__contains__(self, hash(val))


def main(m, k, wordlist, textin, posout, negout):
    bloomfilt = StringBloom((m+7) // 8, k) # pass m bits as bytes

    sys.stderr.write("Collecting non-ASCII letters...\n")
    wordfile = open(wordlist, "r")
    wordchars = ""
    for word in wordfile:
        for c in word.strip():
            if c not in wordchars:
                wordchars += c
    wordfile.close()
    sys.stderr.write("Done.\n")

    sys.stderr.write("Building Bloom filter...\n")
    wordfile = open(wordlist, "r")
    for word in wordfile:
        bloomfilt.insert(word.strip())
    wordfile.close()
    sys.stderr.write("Done.\n")

    sys.stderr.write("Processing text input file...\n")
    pos = open(posout, "w")
    neg = open(negout, "w")
    text = open(textin, "r")
    wordmap = "".join(map(
            lambda(x): (" ", chr(x)) [chr(x).isalpha() or chr(x) in
wordchars],
            range(0, 256)))
    for line in text:
        for word in line.translate(wordmap).split():
            if (word in bloomfilt):
                pos.write(word + "\n")
            else:
                neg.write(word + "\n")
    text.close()
    pos.close()
    neg.close()
    sys.stderr.write("Done.\n")


USAGE = """Usage: %s m k WORDLIST TEXTIN POSOUT NEGOUT

WORDLIST contains the list of all correct words, one per line, extracted
from a spell file.  It is used to build a Bloom filter.  All words in
TEXTIN are checked against the Bloom filter.  All the accepted words are
written to POSOUT.  Some may be false positives.  Rejected words are
written to NEGOUT.  None of them is a false negative.

To test the Bloom filter use a TEXTIN with lots of bad words (for
instance some text in the wrong language).  The POSOUT should be small,
though.  Load it into Vim and check it with the 'spelllang' option set
to the spell file from which WORDLIST was extracted.  All the false
positives will be undercurled.  Compare their number against the number
of all lines in POSOUT and NEGOUT.

'm' is the number of bits in the Bloom filter, 'k' the number
of hash functions used by it.  Use the calculator at
http://www.cc.gatech.edu/fac/Pete.Manolios/bloom-filters/calculator.html
to compute these values.  Enter the number of words in WORDLIST as 'n'
and the probability for false positives as 'p'.
"""
def usage():
    sys.stderr.write(USAGE % sys.argv[0])

if __name__ == "__main__":
    try:
        if len(sys.argv) != 7:
            raise Error, "wrong number of arguments"
        try:
            m = int(sys.argv[1])
            if m <= 0:
                raise Error
        except:
            raise Error, "m must be a positive integer"
        try:
            k = int(sys.argv[2])
            if k <= 0:
                raise Error
        except:
            raise Error, "k must be a positive integer"

        try:
            main(m, k, *sys.argv[3:7])
        except IOError, e:
            sys.stderr.write("Error: %s\n\n" % str(e))

    except Error, e:
        sys.stderr.write("Error: %s\n\n" % str(e))
        usage()
=== SNIP ==============================================================

------------------------------------------------------------------------
Servatius Brandt             Phone: +49 89 636-41504
Fujitsu Siemens Computers    Fax:   +49 89 636-48716
EP SW AD C++                 Email: [hidden email]

Re: Spell checking using a Bloom filter

Bram Moolenaar

Servatius Brandt wrote:

> At the weekend I read something about Bloom filters.  I wondered if
> they could be used to store the word lists for spell checking and
> how much this would reduce the size of memory needed at run-time.
> I made some investigation: there is considerable potential for
> reduction.

The size of the tree currently is not really a problem.  It's quite
efficient already.  Not that we shouldn't try to make it smaller, but I
wouldn't want to sacrifice quality or functionality for it.

If I understand the Bloom filter correctly, it requires computing hash
values from the word.  The problem with spell checking is that we don't
always know where the word starts or ends.  My previous implementation
was using a hash function and I had to add lots of code to handle words
that contain non-word characters (most common is the single quote, such
as in "shouldn't", but spaces and dots are also possible).

Guessing where the word starts and ends can be problematic, since we
need to find the longest match.  For example "et al." is used as a word.
That's because "et" and "al" separately are not correct (well, they
should be) and the dot is required.  To find this you would at least
try "et", "et al" and "et al.".  You don't really know what to add more,
since you don't have a hint that a longer word won't be found.

I don't see a way to find spelling suggestions using a Bloom filter.
We would have to add something for that, this kind of defeats the gain
that the Bloom filter would have.

> There might be a tool to extract the list of these words from the .dic
> and .aff files.

The "unmunch" program in the Myspell code does this.  I use it to verify
the .dic and .aff files from Myspell are used correctly.


Interesting investigation snipped...

> Now what about the Polish language?  For the pl_PL.zip files from
> http://lingucomponent.openoffice.org/spell_dic.html, the ":mkspell"
> command reports an estimated memory use of 2441220 bytes.  The WORDLIST
> file produced with the patched version has 3295017 words.  The Bloom
> filter calculator shows the following results:
>
>     p = 0.01:   k =  7,  m = 29885690 bits = 3735712 bytes
>     p = 0.001:  k = 10,  m = 44791796 bits = 5598975 bytes
>
> This is much more than the memory the trie needs.  The trie can take
> a big advantage from the common parts of the various words that just
> have different suffixes.

It's indeed surprising how efficiently the trie works for some languages.
The opposite is Dutch: relatively few words require a lot of nodes.
Hebrew also has lots of words but has a very efficient trie.  The .spl
file ends up being smaller than the original .aff file!  But the paint
is still wet on the prefix mechanism, there might be a bug in there.

> I don't know whether the Polish language allows identifying groups of
> suffixes so that if a word root can be combined with one of the suffixes
> of a group it can also be combined with all other suffixes in that
> group.  If yes, the hash function could first replace any member of such
> a group by a single group-identifying character before doing the usual
> hash algorithm.  Same for prefixes (like ne-, nej, nejne-). Then the
> Bloom filter could win against the trie in regard of memory consumption,
> maybe even if the same normalization is done before storing/looking up
> words in the trie.

Polish indeed has lots of prefixes and suffixes.  The problem of
handling them separately is that it becomes very complex.  My previous
implementation with a hash table ran into this problem.  Also keep in
mind that affixes can have non-word characters (French uses "d'" and
"l'" a lot).


If you want another challenge: I'm still not happy with the special case
to keep prefixes for Hebrew.  I suspect that if it were possible to load
all the words into the trie and do the compression, the resulting tree
would be small enough to use.  It's the trie size before compression
that is the problem...  Somehow it should be possible to use the two
tries, one with the prefixes and one with all the words, and combine
them without having to build the whole uncompressed tree first.

--
hundred-and-one symptoms of being an internet addict:
101. U can read htis w/o ny porblm and cant figur eout Y its evn listd.

 /// Bram Moolenaar -- [hidden email] -- http://www.Moolenaar.net   \\\
///        Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\              Project leader for A-A-P -- http://www.A-A-P.org        ///
 \\\     Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html   ///

Re: Spell checking using a Bloom filter

Mikołaj Machowski
In reply to this post by Brandt, Servatius
On Monday, 20 June 2005, at 10:25, [hidden email] wrote:

> Hello all -
>
> At the weekend I read something about Bloom filters.  I wondered if
> they could be used to store the word lists for spell checking and
> how much this would reduce the size of memory needed at run-time.
> I made some investigation: there is considerable potential for
> reduction.
>
> Using a Bloom filter would mean that all good words according to the
> spell file are accepted as correct spelling, but there is a small number
> of bad words that would be accepted as well.  These are called false
> positives.  You can select the probability of false positives as you
> like.  The smaller the value you choose, the greater the size of the
> Bloom filter in memory.
>
> Is it acceptable that some badly spelled words slip through?  I think
> yes, since this will happen anyway:  your spell file might contain a few

I think not.  Yes, there are false positives in the normal way,
especially when writing names, but they do not decrease awareness.

> It would be very interesting to use a large English input text with lots
> of typos.  Did anyone on the net write a book???  Anyway, in my opinion
> the large number of unusual names in the Bible can be seen as
> representatives for real typos in a normal English text.
>
You can check the Baen Free Library: http://www.baen.com

Wouldn't expect many typos there :)

> I don't know if the Polish language allows to identify groups of
> suffixes so that if a word root can be combined with one of the suffixes
> of a group it can also be combined with all other suffixes in that
> group.  If yes, the hash function could first replace any member of such

Yes.  That's called affix compression, and it is the normal way spell
checkers work for Polish.


m.
--
LaTeX + Vim = http://vim-latex.sourceforge.net/
Vim-list(s) Users Map: (last change 15 May)
 http://skawina.eu.org/mikolaj/vimlist
CLEWN - http://clewn.sf.net


RE: Spell checking using a Bloom filter

Brandt, Servatius
In reply to this post by Brandt, Servatius
Bram Moolenaar wrote:

> If I understand the Bloom filter correctly, it requires computing hash
> values from the word.  The problem with spell checking is that we don't
> always know where the word starts or ends.  My previous implementation
> was using a hash function and I had to add lots of code to handle words
> that contain non-word characters (most common is the single quote, such
> as in "shouldn't", but spaces and dots are also possible).
>
> Guessing where the word starts and ends can be problematic, since we
> need to find the longest match.  For example "et al." is used as a word.
> That's because "et" and "al" separately are not correct (well, they
> should be) and the dot is required.  To find this you would at least
> try "et", "et al" and "et al.".  You don't really know what to add more,
> since you don't have a hint that a longer word won't be found.

I see.  Neither "et" nor "al" nor "et al" would be inserted in the Bloom
filter, only "et al.".  When an "et al." appears in the text during
spell checking, the word "et" would be checked and rejected by the Bloom
filter (with high probability).  Then you could check one word more:
"et al", which also would be rejected.  And then "et al." which would be
accepted.  So it is possible.  The problem just is that the main spell
checking routine doesn't know anything about what word combinations have
to be checked.  So it has to check "first second" whenever the Bloom
lookup for an arbitrary word "first" fails.  The number of additional
checks "first second third" etc. should also be sufficient for every
language, so maybe something between 5 and 10.  This doesn't sound
attractive.
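
A rough sketch of that brute-force lookahead (the joining limit is an
arbitrary assumption, and the tokens would have to keep trailing
punctuation so that the dot of "et al." survives the word splitting):

    MAXJOIN = 3   # assumed upper bound on how many words to join

    def longest_accepted(tokens, i, bloomfilt):
        "Longest run of tokens starting at index i that the filter accepts."
        best = None
        for j in range(i + 1, min(i + MAXJOIN, len(tokens)) + 1):
            candidate = " ".join(tokens[i:j])
            if candidate in bloomfilt:
                best = candidate
        return best

    # This would have to run for every word whose single-word lookup
    # fails, which is what makes it unattractive.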

The advantage of the trie is that you can read its contents.  If "et"
fails as a whole word, you can do it the other way round: look at what
continuations of "et" are stored in the trie (maybe just "et al.") and
check whether one of those is actually used in the text.  I haven't
looked at the spell code very deeply so far, but I bet that's what you
are actually doing.

> I don't see a way to find spelling suggestions using a Bloom filter.
> We would have to add something for that, this kind of defeats the gain
> that the Bloom filter would have.

Agreed.  Again, the problem is that you cannot read out the Bloom filter
and look for words that are alike.

> It's indeed surprising how efficiently the trie works for some languages.
[...]
> Polish indeed has lots of prefixes and suffixes.  The problem of
> handling them separately is that it becomes very complex.  My previous
> implementation with a hash table ran into this problem.  Also keep in
> mind that affixes can have non-word characters (French uses "d'" and
> "l'" a lot).

A colleague of mine is Czech, and he says all Slavic languages use lots
of prefixes and suffixes.  Czech declension has seven cases,
Lithuanian even fifteen.  So what's special about Polish?

Does it matter whether the spell files make use of the .aff file to
define the prefixes and suffixes or whether all the combinations are
simply listed in the .dic file?

- Servatius

------------------------------------------------------------------------
Servatius Brandt             Phone: +49 89 636-41504
Fujitsu Siemens Computers    Fax:   +49 89 636-48716
EP SW AD C++                 Email: [hidden email]

RE: Spell checking using a Bloom filter

Brandt, Servatius
In reply to this post by Brandt, Servatius
Mikolaj Machowski wrote:

> [hidden email] wrote:
>
> > Using a Bloom filter would mean that all good words according to the
> > spell file are accepted as correct spelling, but there is a small
> > number of bad words that would be accepted as well.  These are called
> > false positives.  You can select the probability of false positives
> > as you like.  The smaller the value you choose, the greater the size
> > of the Bloom filter in memory.
> >
> > Is it acceptable that some badly spelled words slip through?  I think
> > yes, since this will happen anyway:  your spell file might contain a few
>
> I think not.  Yes, there are false positives in the normal way,
> especially when writing names, but they do not decrease awareness.

The last few times I wrote an English text with Vim's spell checking
switched on and reread it afterwards, I noticed some typos that Vim had
not found: I had typed "her" instead of "here" or "than" instead of
"then".  I estimate these were about 20 percent of the typos I made.
(For difficult words I usually take care of the correct spelling as
I type; it's only with easy words that I get sloppy.)  So there is
a rate of false positives anyway.

If I had used a Bloom filter with a false-positive probability of
1 percent, the rate of errors that slipped through would have increased
from 20 to 21 percent.  In practice this means that, for most of the
texts I write, there is not a single typo that would slip through due
to the use of a Bloom filter; just every now and then I would write
a text with one such typo.  On rereading the text to find things like
"her(e)", I would find that typo as well.

For someone who doesn't really know the spelling of many words, the
rate of misspellings that happen to match correct words might be much
lower, of course.  Say about 1 percent.  Using a Bloom filter that adds
0.1 percent to this rate should be good enough, but it can still reduce
the memory needed by about 60 percent.

Only if it's not real text you want to spell-check, but you want to
use the spell checker as a reference to look up single words, would you
need 100 percent correctness.  But then you would be better off with
a professional dictionary.  I don't think that any free spell checking
files are good enough for that.  If you check the spell files of your
native language, you might find lots of rare or maybe even wrong
spelling variants.  At least this was the case for me with the German
language files.  A professional dictionary explaining the variants
(local forms, old forms etc.) would be the better choice.  Bram's idea
of showing dialect and rare forms is very good, but the spell files
have to support this distinction.

> > It would be very interesting to use a large English input text with
> > lots of typos.  Did anyone on the net write a book???  Anyway, in my
> > opinion the large number of unusual names in the Bible can be seen as
> > representatives for real typos in a normal English text.
> >
> You can check the Baen Free Library: http://www.baen.com
>
> Wouldn't expect many typos there :)

Nice book titles on the home page!  Are these books worth doing
something else with than feeding them into a spell checker? :-)

> > I don't know whether the Polish language allows identifying groups
> > of suffixes so that if a word root can be combined with one of the
> > suffixes of a group it can also be combined with all other suffixes
> > in that group.  If yes, the hash function could first replace any
> > member of such
>
> Yes.  That's called affix compression, and it is the normal way spell
> checkers work for Polish.

So this means that if we did affix compression before inserting or
looking up words in a Bloom filter, it could even be used to reduce
memory use for the Polish language.  Unless, of course, we can't do
this compression by rules but need all the variants of a word to do it;
then it's not possible.

Anyway, Bram gave some reasons why Bloom filters cannot provide the
full functionality (spelling suggestions et al.), so their use is off
the table now.

- Servatius

------------------------------------------------------------------------
Servatius Brandt             Phone: +49 89 636-41504
Fujitsu Siemens Computers    Fax:   +49 89 636-48716
EP SW AD C++                 Email: [hidden email]


RE: Spell checking using a Bloom filter

Bram Moolenaar
In reply to this post by Brandt, Servatius

Servatius Brandt wrote:

> The advantage of the trie is that you can read its contents.  If "et"
> fails as a whole word, you can do it the other way round: look at what
> continuations of "et" are stored in the trie (maybe just "et al.") and
> check whether one of those is actually used in the text.  I haven't
> looked at the spell code very deeply so far, but I bet that's what you
> are actually doing.

When going through the tree you only need a pointer to the first
character.  Then follow the nodes until a possible end-of-word is found
(a NUL byte).  This position is remembered, but first the search in the
tree continues to find the longest match.  Then all the remembered
positions are checked for an actual match (checking the case, region,
affixes and a non-word character following), starting with the longest
possible word and continuing with shorter words if there is no full
match.

This is very efficient, especially since most words are good and will be
found quickly.
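
For illustration only, a much simplified sketch of that walk over
a dict-based trie in Python (nothing like Vim's actual byte-array trie,
and without the case/region/affix checks):

    def build_trie(words):
        "Dict-of-dicts trie; the key None marks a possible end of word."
        root = {}
        for w in words:
            node = root
            for c in w:
                node = node.setdefault(c, {})
            node[None] = True
        return root

    def longest_word_at(trie, text, start):
        "Length of the longest stored word starting at text[start], 0 if none."
        node, ends, i = trie, [], start
        while True:
            if None in node:
                ends.append(i)        # remember a possible end of word
            if i == len(text) or text[i] not in node:
                break
            node = node[text[i]]
            i += 1
        # The remembered positions would be checked longest-first for
        # case, region, affixes and a following non-word character;
        # here we simply take the longest.
        return (ends[-1] - start) if ends else 0

    # longest_word_at(build_trie(["et al.", "etc"]), "et al. etc", 0) -> 6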

> > It's indeed surprising how efficiently the trie works for some languages.
> [...]
> > Polish indeed has lots of prefixes and suffixes.  The problem of
> > handling them separately is that it becomes very complex.  My previous
> > implementation with a hash table ran into this problem.  Also keep in
> > mind that affixes can have non-word characters (French uses "d'" and
> > "l'" a lot).
>
> A colleague of mine is Czech, and he says all Slavic languages use lots
> of prefixes and suffixes.  Czech declension has seven cases,
> Lithuanian even fifteen.  So what's special about Polish?
>
> Does it matter whether the spell files make use of the .aff file to
> define the prefixes and suffixes or whether all the combinations are
> simply listed in the .dic file?

For Vim it doesn't matter, since when using a .dic and .aff file all
combinations are generated internally and put in the tree.  Reading
a list of words would result in exactly the same tree, provided the
files specify the same words, of course.

Thus someone maintaining the word list can edit the .dic and .aff
files if that's easier than maintaining a full word list.  There are
tools to convert one into the other, so you can change your mind and/or
verify the results.  You can also put words in the .dic file without an
affix, so you can still list words explicitly if you want to.

--
Laughing helps. It's like jogging on the inside.

 /// Bram Moolenaar -- [hidden email] -- http://www.Moolenaar.net   \\\
///        Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\              Project leader for A-A-P -- http://www.A-A-P.org        ///
 \\\     Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html   ///