Thesaurus

Thesaurus

Bill McCarthy
Hello Vim Developers,

I thought I had successfully used Mobythes.aur (a 24
meg comma-separated thesaurus with meaningful spaces)
in the past.  It has some lines as long as 12k characters.

One place to get this file is:

http://www.gtoal.com/wordgames/oxford_wordlists/Moby/mthes/

I now have bad results with Ctrl-X Ctrl-T.

Is there a fairly simple way to modify Vim to handle
this fairly comprehensive thesaurus?

--
Best regards,
Bill



Re: Thesaurus

Bill McCarthy
Continuing my own thread: the latest build (97) appears
to handle very long lines (much longer than the 510
bytes specified in the documentation - :help tsr).

I don't know how you did it Bram (LSIZE is still 512)
in the Thesaurus section of edit.c, but it appears to
work!

Before I use <c-x><c-t> I have found that I must set
'noinfercase' (in addition to 'infercase', both
'ignorecase' and 'smartcase' are set) and, to deal with
the space and dash separated "words", I set
iskeyword+=32,-

[I have failed to write an inoremap to set and reset
these options around a <c-x><c-t>.]

UNWANTED BEHAVIOUR OF <C-X><C-T>
--------------------------------

It appears to operate by gathering the target word and
all words to the right of the target word in lines
containing the target word - excluding duplicates.

This is far too many words - many of which are only
faintly related to the target.
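
The gathering behaviour described above can be sketched in Python.  This
only illustrates the description in this post; it is not Vim's actual C
code, and the function name gather_right_of_match and the three sample
lines are made up for the example.

```python
def gather_right_of_match(lines, target):
    """Collect the target and every word to its right, on each line
    that contains the target, skipping duplicates."""
    seen = []
    for line in lines:
        words = line.strip().split(",")
        if target in words:
            start = words.index(target)
            for word in words[start:]:
                if word not in seen:
                    seen.append(word)
    return seen


# Hypothetical three-line thesaurus, root word first on each line.
thesaurus = [
    "fat,obese,plump,rich",
    "obese,corpulent,stout",
    "wealthy,rich,affluent",
]

print(gather_right_of_match(thesaurus, "obese"))
```

Note that 'rich' is gathered for 'obese' only because it happens to
follow 'obese' on the line whose root is 'fat'.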

In the mobythes.aur file specified in the documentation
http reference, each line starts with a target word and
is followed by fairly closely related words (all comma
separated).

For example, <c-x><c-t> gathers, for the word 'obese',
178 words in version 6.3 (with line truncation).
However in 7.0aa.0097, it gathers 1,556 unique words!
The Moby thesaurus would gather only 51 words.  [In all
cases I am calling 'words' those things between
commas.]

While your shotgun approach may be useful if none of
the 50 alternatives is good enough, there should
be a way of just gathering the words on the root line.
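
A minimal sketch of the requested lookup, in Python: use only the line
whose first word is the target.  The name root_line_synonyms and the
sample lines are hypothetical, and this is a description of the
proposal, not of anything Vim currently does.

```python
def root_line_synonyms(lines, target):
    """Return the alternatives on the line whose first (root) word is
    the target, or an empty list if no such line exists."""
    for line in lines:
        words = line.strip().split(",")
        if words and words[0] == target:
            return words[1:]
    return []


# Hypothetical three-line thesaurus, root word first on each line.
thesaurus = [
    "fat,obese,plump,rich",
    "obese,corpulent,stout",
    "wealthy,rich,affluent",
]

print(root_line_synonyms(thesaurus, "obese"))
```

With this lookup, words that merely share a line with the target under
some other root are never offered.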

Thanks for listening.

--
Best regards,
Bill



Re: Thesaurus

Bram Moolenaar

Bill McCarthy wrote:

> Continuing my own thread: the latest build (97) appears
> to handle very long lines (much longer than the 510
> bytes specified in the documentation - :help tsr).
>
> I don't know how you did it Bram (LSIZE is still 512)
> in the Thesaurus section of edit.c, but it appears to
> work!

Vim will still be able to read the file, but the lines are truncated at
510 characters.  You don't see words that come after column 512 being
used, right?

> Before I use <c-x><c-t> I have found that I must set
> 'noinfercase' (in addition to 'infercase', both
> 'ignorecase' and 'smartcase' are set) and, to deal with
> the space and dash separated "words", I set
> iskeyword+=32,-
>
> [I have failed to write an inoremap to set and reset
> these options around a <c-x><c-t>.]
>
> UNWANTED BEHAVIOUR OF <C-X><C-T>
> --------------------------------
>
> It appears to operate by gathering the target word and
> all words to the right of the target word in lines
> containing the target word - excluding duplicates.
>
> This is far too many words - many of which are only
> faintly related to the target.
>
> In the mobythes.aur file specified in the documentation
> http reference, each line starts with a target word and
> is followed by fairly closely related words (all comma
> separated).
>
> For example, <c-x><c-t> gathers, for the word 'obese',
> 178 words in version 6.3 (with line truncation).
> However in 7.0aa.0097, it gathers 1,556 unique words!
> The Moby thesaurus would gather only 51 words.  [In all
> cases I am calling 'words' those things between
> commas.]
>
> While your shotgun approach may be useful if none of
> the 50 alternatives is good enough, there should
> be a way of just gathering the words on the root line.

The completion is very simplistic; it just uses all matches it can find.
You have to trim your thesaurus file if you get too many matches.  Thus
the intelligence is in building the file, not in Vim reading it.

Anyway, I don't think anything changed in this area, I'm surprised Vim 7
finds more matches than 6.3.  Perhaps there is a mistake somewhere.  Can
you find a simple way to reproduce the difference?  Perhaps just using a
few lines in the thesaurus file already shows the problem?

--
hundred-and-one symptoms of being an internet addict:
180. You maintain more than six e-mail addresses.

 /// Bram Moolenaar -- [hidden email] -- http://www.Moolenaar.net   \\\
///        Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\              Project leader for A-A-P -- http://www.A-A-P.org        ///
 \\\     Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html   ///

Re: Thesaurus

Bill McCarthy
Bram,

On Thu 30-Jun-05 4:35am -0500, you wrote:

> Bill McCarthy wrote:

>> Continuing my own thread: the latest build (97) appears
>> to handle very long lines (much longer than the 510
>> bytes specified in the documentation - :help tsr).
>>
>> I don't know how you did it Bram (LSIZE is still 512)
>> in the Thesaurus section of edit.c, but it appears to
>> work!

> Vim will still be able to read the file, but the lines are truncated at
> 510 characters.  You don't see words that come after column 512 being
> used, right?

Not here.  Vim 6.3 truncates to 510 characters.  Vim
7.0aa.0097 does not appear to truncate.  Here is an
example.  Create a file named 'thes', in your current
directory, with this single line:

adipose,beefy,big-bellied,bloated,blowzy,blubbery,bosomy,brawny,burly,buttery,butyraceous,buxom,chrismal,chrismatory,chubby,chunky,corpulent,distended,dumpy,fat,fattish,fatty,fleshy,full,greasy,gross,heavyset,hefty,hippy,imposing,lardaceous,lardy,lusty,meaty,mucoid,obese,oily,oleaginous,oleic,overweight,paunchy,plump,podgy,portly,potbellied,pudgy,puffy,pursy,rich,roly-poly,rotund,sebaceous,sleek,slick,slippery,smooth,soapy,square,squat,squatty,stalwart,stocky,stout,strapping,suety,swollen,tallowy,thick-bodied,thickset,top-heavy,tubby,unctuous,unguent,unguentary,unguentous,well-fed

Open Gvim 6.3, and type the following:

   :se noinf isk+=32,- trs=thes<CR>
   iobese<c-x><c-t>

You should see, at the bottom of your screen:

-- Thesaurus completion (^T^N^P) match 1 of 33

Hit <c-p> twice to go to the last (33rd) word.  It is
'thick-bod' (that is thick-bodied with 510 character
line truncation).

Now repeat this simple process with 7.0aa.0097.  This
time you should see, at the bottom of your screen:

-- Thesaurus completion (^T^N^P) match 1 of 41

Hitting <c-p> twice shows the last word of the physical
line.
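
The effect of line truncation on the match count can be sketched in
Python.  This is only an illustration: the function name candidates, the
limit parameter, and the short sample line (with a limit of 46 instead
of 510) are assumptions chosen to keep the example small.

```python
def candidates(line, target, limit=None):
    """Return the target and the words to its right, optionally after
    cutting the line to at most `limit` characters."""
    if limit is not None:
        line = line[:limit]
    words = line.split(",")
    if target not in words:
        return []
    return words[words.index(target):]


# Hypothetical short line; a limit of 46 cuts it inside 'thick-bodied'.
line = "adipose,fat,obese,plump,portly,stout,thick-bodied,tubby"

print(candidates(line, "obese", limit=46))
print(candidates(line, "obese"))
```

With the limit, the last candidate is the cut-off fragment 'thick-bod'
and 'tubby' is lost; without it, all six words are offered.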

>> Before I use <c-x><c-t> I have found that I must set
>> 'noinfercase' (in addition to 'infercase', both
>> 'ignorecase' and 'smartcase' are set) and, to deal with
>> the space and dash separated "words", I set
>> iskeyword+=32,-
>>
>> [I have failed to write an inoremap to set and reset
>> these options around a <c-x><c-t>.]
>>
>> UNWANTED BEHAVIOUR OF <C-X><C-T>
>> --------------------------------
>>
>> It appears to operate by gathering the target word and
>> all words to the right of the target word in lines
>> containing the target word - excluding duplicates.
>>
>> This is far too many words - many of which are only
>> faintly related to the target.
>>
>> In the mobythes.aur file specified in the documentation
>> http reference, each line starts with a target word and
>> is followed by fairly closely related words (all comma
>> separated).
>>
>> For example, <c-x><c-t> gathers, for the word 'obese',
>> 178 words in version 6.3 (with line truncation).
>> However in 7.0aa.0097, it gathers 1,556 unique words!
>> The Moby thesaurus would gather only 51 words.  [In all
>> cases I am calling 'words' those things between
>> commas.]
>>
>> While your shotgun approach may be useful if none of
>> the 50 alternatives is good enough, there should
>> be a way of just gathering the words on the root line.

> The completion is very simplistic; it just uses all matches it can find.
> You have to trim your thesaurus file if you get too many matches.  Thus
> the intelligence is in building the file, not in Vim reading it.

If I correctly understand what you are doing (see
above), a possible restructure of the 24 meg Moby file
would be to replace each line of n+1 "words" with n
lines containing the root and the alternative.  For
example, replace:

    a,b,c,d,e

with

    a,b
    a,c
    a,d
    a,e

Merely because, using the a,b,c,d,e line above, c or e,
perhaps in different contexts, may be substituted for
a, it does not mean that c may ever be a good
substitute for e.
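
The restructuring above could be done with a few lines of Python.  This
is a sketch only; the function name split_into_pairs is hypothetical.

```python
def split_into_pairs(line):
    """Turn 'root,w1,w2,...' into one 'root,wN' line per alternative."""
    words = line.strip().split(",")
    root, alternatives = words[0], words[1:]
    return [f"{root},{alt}" for alt in alternatives]


for pair in split_into_pairs("a,b,c,d,e"):
    print(pair)
```

Each output line carries exactly one alternative, so a match on c can no
longer pull in d or e.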

As I have attempted to point out, a much better, IMHO,
solution is to have Gvim treat the thesaurus file as a
thesaurus file:  find the root word starting in the
first column of a line and use the words on that same
line as the alternatives.

> Anyway, I don't think anything changed in this area, I'm surprised Vim 7
> finds more matches than 6.3.  Perhaps there is a mistake somewhere.  Can
> you find a simple way to reproduce the difference? Perhaps just using a
> few lines in the thesaurus file already shows the problem?

I certainly could have made a mistake in my analysis.
I hope the above example, using the root word adipose,
is reproducible.

--
Best regards,
Bill



Re: Thesaurus

Bill McCarthy
On Thu 30-Jun-05 8:41am -0500, I wrote:

>    :se noinf isk+=32,- trs=thes<CR>

Sorry, that line should be:

    :se noinf isk+=32,- tsr=thes<CR>

--
Best regards,
Bill



Re: Thesaurus

Bram Moolenaar
In reply to this post by Bill McCarthy

Bill McCarthy wrote:

> Not here.  Vim 6.3 truncates to 510 characters.  Vim
> 7.0aa.0097 does not appear to truncate.  Here is an
> example.  Create a file named 'thes', in your current
> directory, with this single line:
>
> adipose,beefy,big-bellied,bloated,blowzy,blubbery,bosomy,brawny,burly,buttery,butyraceous,buxom,chrismal,chrismatory,chubby,chunky,corpulent,distended,dumpy,fat,fattish,fatty,fleshy,full,greasy,gross,heavyset,hefty,hippy,imposing,lardaceous,lardy,lusty,meaty,mucoid,obese,oily,oleaginous,oleic,overweight,paunchy,plump,podgy,portly,potbellied,pudgy,puffy,pursy,rich,roly-poly,rotund,sebaceous,sleek,slick,slippery,smooth,soapy,square,squat,squatty,stalwart,stocky,stout,strapping,suety,swollen,tallowy,thick-bodied,thickset,top-heavy,tubby,unctuous,unguent,unguentary,unguentous,well-fed
>
> Open Gvim 6.3, and type the following:
>
>    :se noinf isk+=32,- trs=thes<CR>

Ehm, adding a space to 'iskeyword'?   Not a good idea...

>    iobese<c-x><c-t>
>
> You should see, at the bottom of your screen:
>
> -- Thesaurus completion (^T^N^P) match 1 of 33
>
> Hit <c-p> twice to go to the last (33rd) word.  It is
> 'thick-bod' (that is thick-bodied with 510 character
> line truncation).

Yes.

> Now repeat this simple process with 7.0aa.0097.  This
> time you should see, at the bottom of your screen:
>
> -- Thesaurus completion (^T^N^P) match 1 of 41
>
> Hitting <c-p> twice shows the last word of the physical
> line.

Nope, I get exactly the same results as with 6.3.  Perhaps somehow
compilation caused LSIZE to be defined to another number?

> If I correctly understand what you are doing (see
> above), a possible restructure of the 24 meg Moby file
> would be to replace each line of n+1 "words" with n
> lines containing the root and the alternative.  For
> example, replace:
>
>     a,b,c,d,e
>
> with
>
>     a,b
>     a,c
>     a,d
>     a,e
>
> Merely because, using the a,b,c,d,e line above, c or e,
> perhaps in different contexts, may be substituted for
> a, it does not mean that c may ever be a good
> substitute for e.
>
> As I have attempted to point out, a much better, IMHO,
> solution is to have Gvim treat the thesaurus file as a
> thesaurus file:  find the root word starting in the
> first column of a line and use the words on that same
> line as the alternatives.

I understand that when "a" has alternatives "b" and "c" this doesn't
necessarily mean "c" is an alternative for "b".

I hesitate to change how it works; some people might depend on how it's
currently working.  You are the first to make the remark that you would
like to see it work in another way.  I don't know how many people like
to keep it the way it is...

--
I'm in shape.  Round IS a shape.


Re: Thesaurus

Bill McCarthy
On Thu 30-Jun-05 1:22pm -0500, Bram Moolenaar wrote:

>
> Bill McCarthy wrote:
>
>> Not here.  Vim 6.3 truncates to 510 characters.  Vim
>> 7.0aa.0097 does not appear to truncate.  Here is an
>> example.  Create a file named 'thes', in your current
>> directory, with this single line:
>>
>> adipose,beefy,big-bellied,bloated,blowzy,blubbery,bosomy,brawny,burly,buttery,butyraceous,buxom,chrismal,chrismatory,chubby,chunky,corpulent,distended,dumpy,fat,fattish,fatty,fleshy,full,greasy,gross,heavyset,hefty,hippy,imposing,lardaceous,lardy,lusty,meaty,mucoid,obese,oily,oleaginous,oleic,overweight,paunchy,plump,podgy,portly,potbellied,pudgy,puffy,pursy,rich,roly-poly,rotund,sebaceous,sleek,slick,slippery,smooth,soapy,square,squat,squatty,stalwart,stocky,stout,strapping,suety,swollen,tallowy,thick-bodied,thickset,top-heavy,tubby,unctuous,unguent,unguentary,unguentous,well-fed
>>
>> Open Gvim 6.3, and type the following:
>>
>>    :se noinf isk+=32,- trs=thes<CR>
>
> Ehm, adding a space to 'iskeyword'?   Not a good idea...

It is needed with the current version of mobythes.aur,
as are $, ', - and /.

I have rebuilt mobythes.aur to have only two words per
line by transforming:  a,b,c,d
to:
   a,b
   a,c
   a,d

I have also eliminated all lines containing space,
dash or '/' (this also eliminated all lines with "'" or
'$').

I found your method of selecting any found word,
and associating with it any word to the right, to be
most unsatisfactory.

I had to add a "b,a" line for every "a,b" (since if a
associates with b, then, for the same nuance, b
associates with a).

This, unfortunately, increases the size of my new
binary thesaurus to 42.96 million bytes (the original
file is 24.85 million bytes).
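
The rebuild described above (two words per line, no lines containing a
space, dash or '/', and a "b,a" line for every "a,b") can be sketched in
Python.  The function name rebuild, the banned parameter, and the
removal of duplicate lines are assumptions for the illustration; this is
not the script actually used.

```python
def rebuild(lines, banned=(" ", "-", "/")):
    """Emit 'root,alt' and 'alt,root' for every alternative, dropping any
    pair that contains a banned character and any repeated pair."""
    out = []
    seen = set()
    for line in lines:
        words = line.strip().split(",")
        root = words[0]
        for alt in words[1:]:
            for pair in (f"{root},{alt}", f"{alt},{root}"):
                if any(ch in pair for ch in banned):
                    continue
                if pair not in seen:  # assumption: duplicates are dropped
                    seen.add(pair)
                    out.append(pair)
    return out


print(rebuild(["a,b,c-d", "b,a"]))
```

Emitting both directions is what roughly doubles the file; a lookup that
also treated the root word as an association would make the reversed
lines unnecessary.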

Although my new file works beautifully, it would be
half the size (and about twice as fast) if you also
used the first word on a line (the root word) as a
valid association (instead of the somewhat illogical
all words to the right).

> I hesitate to change how it works; some people might depend on how it's
> currently working.  You are the first to make the remark that you would
> like to see it work in another way.  I don't know how many people like
> to keep it the way it is...

Even if there are people out there that love the way it
currently works, adding the root word as an association
(which it is, and a far better one than an arbitrary one
on the right) should not break anything.

BTW, I love the latest additions to spell checking.  zG
satisfies the need to mark words OK temporarily and the
multiple files with individual access in 'spellfile' is
wonderfully flexible and useful.

--
Best regards,
Bill



Re: Thesaurus

Bram Moolenaar

Bill McCarthy wrote:

> I have rebuilt mobythes.aur to have only two words per
> line by transforming:  a,b,c,d
> to:
>    a,b
>    a,c
>    a,d
>
> I have also eliminated all lines containing space,
> dash or '/' (this also eliminated all lines with "'" or
> '$').
>
> I found your method of selecting any found word,
> and associating with it any word to the right, to be
> most unsatisfactory.
>
> I had to add a "b,a" line for every "a,b" (since if a
> associates with b, then, for the same nuance, b
> associates with a).
>
> This, unfortunately, increases the size of my new
> binary thesaurus to 42.96 million bytes (the original
> file is 24.85 million bytes).
>
> Although my new file works beautifully, it would be
> half the size (and about twice as fast) if you also
> used the first word on a line (the root word) as a
> valid association (instead of the somewhat illogical
> all words to the right).
>
> > I hesitate to change how it works; some people might depend on how it's
> > currently working.  You are the first to make the remark that you would
> > like to see it work in another way.  I don't know how many people like
> > to keep it the way it is...
>
> Even if there are people out there that love the way it
> currently works, adding the root word as an association
> (which it is, and a far better one than an arbitrary one
> on the right) should not break anything.

You never know how people actually use this feature.  You could also use
it for programming with names of library functions...

Nevertheless, I still want to find out how many people would run into
trouble versus how many people are helped when changing this.

> BTW, I love the latest additions to spell checking.  zG
> satisfies the need to mark words OK temporarily and the
> multiple files with individual access in 'spellfile' is
> wonderfully flexible and useful.

I'm glad you like it.  I'm done with adding features to spelling for
now.  Now let people use it for a while to find out any problems.

--
hundred-and-one symptoms of being an internet addict:
221. Your wife melts your keyboard in the oven.
