Spell suggestions with postponed prefixes

Bram Moolenaar

I have now also implemented using postponed prefixes for making
suggestions.  Currently only Hebrew uses this.

Please give it a try and see how well it works.  I didn't do any
specific scoring for prefixes, since I don't know how to do that.

It does look like the suggestions are valid words or combinations with
allowed prefixes.  But you better check that no suggestions are actually
wrongly spelled words.

It's in the snapshot that I will upload in a couple of hours.

I also implemented a different method for sound folding.  It's simpler
and faster.  I'm using it for Dutch to try out.  Should be simple to add
to any language.

--
hundred-and-one symptoms of being an internet addict:
177. You log off of your system because it's time to go to work.

 /// Bram Moolenaar -- [hidden email] -- http://www.Moolenaar.net   \\\
///        Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\              Project leader for A-A-P -- http://www.A-A-P.org        ///
 \\\     Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html   ///

Re: Spell suggestions with postponed prefixes

Moshe Kaminsky
* Bram Moolenaar <[hidden email]> [30/06/05 00:29]:

>
> I have now also implemented using postponed prefixes for making
> suggestions.  Currently only Hebrew uses this.
>
> Please give it a try and see how well it works.  I didn't do any
> specific scoring for prefixes, since I don't know how to do that.
>
> It does look like the suggestions are valid words or combinations with
> allowed prefixes.  But you better check that no suggestions are actually
> wrongly spelled words.
I checked it on several examples, it looks fine. I noticed that getting
a suggestion which is a rare word is quite rare... I guess it can be
tuned.

I don't know if my original suggestions are used, but I saw the code a
few days ago, and it appears to me there is an implementation which is
both simpler and should give better results: simply run the state
machine with both trees, the prefixes and the words, and when there is a
word split, continue each time with the other tree. When passing from
the word tree, there should be an actual splitting and a penalty, but
not vice versa (when the prefixes are not postponed, both can point to
the same tree). This way, the length of the prefix need not be
considered, since long prefixes will be penalised anyway for being rare.
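
The two-tree walk described above can be sketched roughly as follows.
This is only an illustration, not the code in Vim: it does exact
matching only (no edit operations, so it recognises rather than
suggests), and the names make_trie, walk and SPLIT_PENALTY as well as
the penalty value are hypothetical. The empty string is listed as a
prefix so that a word may appear without any prefix.

```python
END = None           # marker key: "an entry ends at this node"
SPLIT_PENALTY = 10   # hypothetical cost of a real word split


def make_trie(entries):
    """Build a nested-dict trie from a list of strings."""
    root = {}
    for entry in entries:
        node = root
        for ch in entry:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def walk(text, prefixes, words):
    """Return sorted (penalty, pieces) for every way to read text as
    a sequence of [prefix]word items, switching trees at each boundary."""
    results = []

    def step(i, node, in_words, current, pieces, penalty):
        if END in node:
            if in_words:
                done = pieces + [current]
                if i == len(text):
                    results.append((penalty, done))
                else:
                    # word tree -> prefix tree: actual split, with penalty
                    step(i, prefixes, False, "", done,
                         penalty + SPLIT_PENALTY)
            else:
                # prefix tree -> word tree: no split and no penalty
                step(i, words, True, current, pieces, penalty)
        if i < len(text) and text[i] in node:
            step(i + 1, node[text[i]], in_words,
                 current + text[i], pieces, penalty)

    step(0, prefixes, False, "", [], 0)
    return sorted(results)


prefixes = make_trie(["", "un", "re"])
words = make_trie(["do", "done", "one", "redo"])
print(walk("undone", prefixes, words))
print(walk("doone", prefixes, words))
```

Note that nothing in the walk looks at the prefix length; only the
word-to-prefix transition adds a cost.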

>
> It's in the snapshot that I will upload in a couple of hours.
>
> I also implemented a different method for sound folding.  It's simpler
> and faster.  I'm using it for Dutch to try out.  Should be simple to add
> to any language.

Is it correct that the (original) SAL mechanism is mainly useful when
there are combinations of several letters that sound like one? When
there are only several letters such that one sounds similar to another
(like c and k), the new method is equivalent to the original? Also, what
is the advantage/disadvantage in this case over specifying, say,
REP c k
(I guess it saves writing when there are more than two that sound the
same, but is there any other difference?)

Thanks,
Moshe

--
I love deadlines. I like the whooshing sound they make as they fly by.
                                        -- Douglas Adams
   
    Moshe Kaminsky <[hidden email]>
    Home: 08-9456841


Re: Spell suggestions with postponed prefixes

Bram Moolenaar

Moshe Kaminsky wrote:

> * Bram Moolenaar <[hidden email]> [30/06/05 00:29]:
> >
> > I have now also implemented using postponed prefixes for making
> > suggestions.  Currently only Hebrew uses this.
> >
> > Please give it a try and see how well it works.  I didn't do any
> > specific scoring for prefixes, since I don't know how to do that.
> >
> > It does look like the suggestions are valid words or combinations with
> > allowed prefixes.  But you better check that no suggestions are actually
> > wrongly spelled words.
>
> I checked it on several examples, it looks fine. I noticed that getting
> a suggestion which is a rare word is quite rare... I guess it can be
> tuned.
>
> I don't know if my original suggestions are used, but I saw the code a
> few days ago, and it appears to me there is an implementation which is
> both simpler and should give better results: simply run the state
> machine with both trees, the prefixes and the words, and when there is a
> word split, continue each time with the other tree. When passing from
> the word tree, there should be an actual splitting and a penalty, but
> not vice versa (when the prefixes are not postponed, both can point to
> the same tree). This way, the length of the prefix need not be
> considered, since long prefixes will be penalised anyway for being rare.

The mechanism simply considers every valid word, including words with
prefixes.  The edit distance from the badly spelled word is computed,
and only words with a small distance are used.  This doesn't take the
length of the prefix into account; it doesn't matter where the prefix
stops and the basic word starts.  Would it be good to give a penalty to
longer prefixes?  You would need to experiment with this, using a list
of actual misspellings.  Just trying a few artificial misspellings may
give a wrong impression.
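
As a rough illustration of that idea (not the actual Vim code, which
walks the word trees instead of listing every combination), the sketch
below builds each prefix+word string, computes a plain Levenshtein
distance to the bad word, and keeps only the close ones.  The function
names, the word lists and the max_dist threshold are assumptions made
up for the example.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def suggest(bad, prefixes, words, max_dist=2):
    """Return sorted (distance, candidate) pairs within max_dist.

    A candidate is just the joined string, so where the prefix stops
    and the basic word starts plays no role in the score."""
    candidates = {p + w for p in prefixes for w in words}
    scored = ((edit_distance(bad, c), c) for c in candidates)
    return sorted((d, c) for d, c in scored if d <= max_dist)


print(suggest("undne", ["", "un", "re"], ["do", "done"]))
```

A penalty for longer prefixes would have to be added on top of the
distance; whether that helps is exactly what would need testing against
real misspellings.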

> > It's in the snapshot that I will upload in a couple of hours.
> >
> > I also implemented a different method for sound folding.  It's simpler
> > and faster.  I'm using it for Dutch to try out.  Should be simple to add
> > to any language.
>
> Is it correct that the (original) SAL mechanism is mainly useful when
> there are combinations of several letters that sound like one? When
> there are only several letters such that one sounds similar to another
> (like c and k), the new method is equivalent to the original? Also, what
> is the advantage/disadvantage in this case over specifying, say,
> REP c k
> (I guess it saves writing when there are more than two that sound the
> same, but is there any other difference?)

The SAL mechanism is only useful if you can turn a word into its
"sound-a-like" equivalent.  For English the mechanism is to leave out
all vowels and do some tricks with "th", "gh", etc.  In Dutch I would
have "sch" sound the same as "s".  In general the sound-a-like word is
much shorter than the original, so more words look alike.
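
A much simplified sketch of this kind of folding is shown below.  The
rule table is invented for the example and is not the real SAL data for
English or Dutch; it only shows how multi-letter rules plus dropping
vowels make differently spelled words fold to the same short string.

```python
# Hypothetical rules: multi-letter patterns are listed before single letters.
RULES = [("sch", "s"), ("th", "t"), ("gh", ""), ("c", "k")]
VOWELS = set("aeiou")


def fold(word):
    """Fold a word to a rough sound-a-like form."""
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        for pattern, replacement in RULES:
            if word.startswith(pattern, i):
                out.append(replacement)
                i += len(pattern)
                break
        else:
            # No rule matched here: keep consonants, drop vowels.
            if word[i] not in VOWELS:
                out.append(word[i])
            i += 1
    return "".join(out)


for w in ("school", "sool", "night", "nite"):
    print(w, "->", fold(w))
```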

REP items with single letters aren't very useful, since single-letter
replacements are tried anyway.  The only effect is that they may get a
slightly better score that way.  REP items count for a lot more when
they replace several characters at once.
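
To make the difference concrete, here is a small sketch that applies
REP-style pairs to a bad word.  The pairs and the function name are
hypothetical and the scoring is left out.  The point is that "fone" to
"phone" takes two single-character edits (substitute and insert) but
only one multi-character replacement, so a REP item can rank such a
candidate much higher than plain edit distance would.

```python
def rep_candidates(bad, rep_items):
    """Return every string obtained by applying one REP pair at one
    position in the bad word, sorted alphabetically."""
    found = set()
    for src, dst in rep_items:
        start = bad.find(src)
        while start != -1:
            found.add(bad[:start] + dst + bad[start + len(src):])
            start = bad.find(src, start + 1)
    return sorted(found)


# Hypothetical REP pairs, written as (from, to).
REP = [("f", "ph"), ("k", "ch")]

print(rep_candidates("fone", REP))
print(rep_candidates("fotograf", REP))
```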
