[PATCH] better detection of fileencoding=windows-1250 and iso-8859-2


Lubomir Host
Hi vimmers,

I have a problem with fileencoding detection. The files on my system
are in both windows-1250 (cp1250) and iso-8859-2 encodings.  My locale
is set to UTF-8 (though that is not important here).

I tried both of the following settings:

:set fileencodings=ucs-bom,utf-8,iso-8859-2,windows-1250,latin1
:set fileencodings=ucs-bom,utf-8,windows-1250,iso-8859-2,latin1

but without success. A file in windows-1250 encoding is not detected
correctly. The file is always treated as iso-8859-2, because the
iso-8859-2 and windows-1250 character tables are very similar.

When I compared the character tables (mappings to Unicode) for these
encodings (see links [1] and [2]), I saw the difference: a file that is
correctly encoded in iso-8859-2 *DOESN'T* contain characters from the
0x7F to 0xA0 range. You can see this difference with the command
'vimdiff unic1250.txt unic8859-2.txt' (these files are included in the
attached zip archive).
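
The same comparison can be reproduced without the attached files. This is
a minimal Python sketch, not part of the patch. It assumes that Python's
built-in cp1250 and iso8859_2 codecs match the tables in [1] and [2]; the
helper names decode_byte and differing_bytes are made up for illustration.

```python
def decode_byte(value, codec):
    """Return the character a single byte maps to, or None if unassigned."""
    try:
        return bytes([value]).decode(codec)
    except UnicodeDecodeError:
        return None


def differing_bytes(codec_a="cp1250", codec_b="iso8859_2"):
    """Collect byte values that the two codecs decode to different characters."""
    return [
        value
        for value in range(256)
        if decode_byte(value, codec_a) != decode_byte(value, codec_b)
    ]


# Print the differing byte values in hex, e.g. for a quick manual check.
print(" ".join(f"0x{value:02X}" for value in differing_bytes()))
```

With Python's tables, every byte from 0x80 to 0x9F differs (iso8859_2
maps that block to control characters), while 0x7F and 0xA0 are the same
in both encodings. Some bytes above 0xA0 differ as well, so the block
0x80 to 0x9F is the part that separates the two most clearly.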

List of files in zip archive:

fenc_windows-1250-detection.patch -- patch for vim7.x
test-iso-8859-2.txt     -- example file in iso-8859-2   encoding
test-windows-1250.txt   -- example file in windows-1250 encoding
unic1250.txt            -- character table
unic8859-2.txt          -- character table

Please try

:set fileencodings=ucs-bom,utf-8,iso-8859-2,windows-1250,latin1

and check whether the detected fileencoding matches.

Bram, can you check my patch please? If it is correct, please apply it
to the 6.x release as well. Thanks.

Happy vimming, Lubomir Host 'rajo'

Resources:

[1] - http://cestina.cz/kodovani/unicode/unic1250.txt
[2] - http://cestina.cz/kodovani/unicode/unic8859-2.txt

--
Lubomir Host 'rajo' <rajo AT platon.sk>   ICQ #:  257322664   ,''`.
Platon Software Development Group         http://platon.sk/  : :' :
GnuPG key: http://rajo.platon.sk/en/show,gpgkey              `. `'
http://www.gnu.org/philosophy/no-word-attachments.html         `-  

Attachment: windows-1250-fenc-patch.zip (9K)

Re: [PATCH] better detection of fileencoding=windows-1250 and iso-8859-2

Bram Moolenaar

Although characters 0x80 to 0xa0 officially do not appear in iso-8859-2,
they may appear in a file anyway.  They don't make the file unreadable,
the way illegal bytes in UTF-8 would.

It would be useful to detect the difference between cp1250 and
iso-8859-2 if they both appear in 'fencs', even though this depends on
certain characters appearing in the file and on iso-8859-2 appearing
first.

I think we should at least check that another 8-bit encoding follows
iso-8859-2.  If it's the last one it should be used despite characters
that don't appear in that encoding.  Those characters also don't appear
in latin1, thus when latin1 comes after iso-8859-2 you might still want
to use iso-8859-2.
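
The rule in the last paragraph could be sketched roughly as follows. This
is Python for illustration only, not Vim's implementation: the function
name pick_fileencoding, the set USES_80_TO_9F and the final latin1
fallback are assumptions, and the ucs-bom step only checks for a UTF-8
BOM.

```python
# Hypothetical set: 8-bit encodings that put printable characters in 0x80-0x9F.
USES_80_TO_9F = {"cp1250", "windows-1250"}
LATIN2_NAMES = {"iso-8859-2", "latin2"}


def has_bytes_80_to_9f(data):
    """True if the data contains any byte in the range 0x80..0x9F."""
    return any(0x80 <= byte <= 0x9F for byte in data)


def pick_fileencoding(data, fencs):
    """Walk the list like 'fileencodings' and return the first accepted entry."""
    for index, name in enumerate(fencs):
        if name == "ucs-bom":
            if data.startswith(b"\xef\xbb\xbf"):  # simplified: UTF-8 BOM only
                return "utf-8"
            continue
        if name == "utf-8":
            try:
                data.decode("utf-8")
            except UnicodeDecodeError:
                continue  # illegal byte sequence: try the next entry
            return name
        if name in LATIN2_NAMES:
            later = fencs[index + 1:]
            better_follows = any(entry in USES_80_TO_9F for entry in later)
            if has_bytes_80_to_9f(data) and better_follows:
                continue  # a later entry can represent these bytes
            return name  # last suitable entry, or only latin1 follows
        return name  # any other 8-bit encoding accepts every byte
    return "latin1"  # assumed fallback for this sketch


fencs = ["ucs-bom", "utf-8", "iso-8859-2", "windows-1250", "latin1"]
print(pick_fileencoding(b"\x8a\x9e", fencs))
```

With this rule iso-8859-2 is skipped only when a later entry in the list
assigns characters to 0x80 to 0x9F; if only latin1 follows, iso-8859-2 is
still used.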

--
From "know your smileys":
 :-D Big smile

 /// Bram Moolenaar -- [hidden email] -- http://www.Moolenaar.net   \\\
///        Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\              Project leader for A-A-P -- http://www.A-A-P.org        ///
 \\\     Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html   ///

Re: [PATCH] better detection of fileencoding=windows-1250 and iso-8859-2

Lubomir Host
On Thu, Jun 16, 2005 at 07:30:25PM +0200, Bram Moolenaar wrote:
> Although characters 0x80 to 0xa0 officially do not appear in iso-8859-2,
> they may appear in a file anyway.  They don't make the file unreadable,
> like what would happen for using illegal bytes in UTF-8.

If there is a "mess" in a file, falling back to latin1 should be the
best approach. My patch affects *only* those users who use both latin2
and cp1250 encodings. For these people the simplest way to distinguish
between latin2 and cp1250 is to search for characters between 0x7F and
0xA0. In a typical cp1250 file these characters appear (with more than
99% probability):

cp1250:
0x8A 0x0160 # LATIN CAPITAL LETTER S WITH CARON
0x8C 0x015A # LATIN CAPITAL LETTER S WITH ACUTE
0x8D 0x0164 # LATIN CAPITAL LETTER T WITH CARON
0x8E 0x017D # LATIN CAPITAL LETTER Z WITH CARON
0x8F 0x0179 # LATIN CAPITAL LETTER Z WITH ACUTE
0x9A 0x0161 # LATIN SMALL LETTER S WITH CARON
0x9B 0x203A # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
0x9C 0x015B # LATIN SMALL LETTER S WITH ACUTE
0x9D 0x0165 # LATIN SMALL LETTER T WITH CARON
0x9E 0x017E # LATIN SMALL LETTER Z WITH CARON

Note: these byte values *don't* appear in an iso-8859-2 file.
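
The check itself is a single scan over the bytes. The following Python
sketch shows the idea; it is not the attached patch, the function name
guess_latin2_or_cp1250 is invented here, and it assumes the file is
already known to be in one of the two encodings.

```python
# Byte -> code point pairs copied from the cp1250 list above.
CP1250_ONLY = {
    0x8A: 0x0160, 0x8C: 0x015A, 0x8D: 0x0164, 0x8E: 0x017D, 0x8F: 0x0179,
    0x9A: 0x0161, 0x9B: 0x203A, 0x9C: 0x015B, 0x9D: 0x0165, 0x9E: 0x017E,
}


def guess_latin2_or_cp1250(data):
    """Return 'cp1250' if any listed byte occurs, else 'iso-8859-2'."""
    if any(byte in CP1250_ONLY for byte in data):
        return "cp1250"
    return "iso-8859-2"


text = "žluťoučký kůň"
print(guess_latin2_or_cp1250(text.encode("cp1250")))     # cp1250
print(guess_latin2_or_cp1250(text.encode("iso8859_2")))  # iso-8859-2
```

A cp1250 file that happens to contain none of these bytes (for example
one whose only accented letter is "á", which is 0xE1 in both encodings)
cannot be told apart from iso-8859-2 this way.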

> It would be useful to detect the difference between cp1250 and
> iso-8859-2 if they both appear in 'fencs'.  Even though this does depend
> on certain characters to appear in the file and iso-8859-2 appearing
> first.

There are also other possibilities. Please take a look at
http://trific.ath.cx/software/enca/ . Maybe linking with the libenca
library would be useful. Enca guesses the right encoding on a
statistical(?) basis. This is CPU-expensive, but the best way.

> I think we should at least check that another 8-bit encoding follows
> iso-8859-2.  If it's the last one it should be used despite of
> characters that don't appear in that encoding.  Those characters also
> don't appear in latin1, thus when latin1 comes after iso-8859-2 you
> might still want to use iso-8859-2.

This sounds very complicated and too hard to explain in the manual (for
non-native English speakers ;-) ).

We have 2 ways:

1. - for stable 6.x version:
   - use my hack (we can distinguish between cp1250/latin2)
   - keep 'fileencodings' option

2. - for development 7.x version:
   - use the libenca library or include the guessing algorithm from this library
   - remove 'fileencodings' option (useless in this case)



A little off topic: as an input method for different encodings I'm using
my diacritics.vim plugin, available as part of the vimconfig project. See
http://platon.sk/projects/vimconfig/ for further details and take a look
at the :DiacriticsOn function.

Hint:

"LATIN SMALL LETTER T WITH CARON" can be written as '+t' key sequence in
insert mode.
"LATIN SMALL LETTER A WITH ACUTE" can be written as '=a' key sequence
in insert mode.

For this reason, guessing latin1 instead of latin2/cp1250 (the fallback
to latin1 in the "messed file" case) doesn't bother me, because I insert
characters with diacritics using a Vim script. Once I add such
characters to the file, Vim will detect the right encoding the next time
I open it.

Best regards, thanks

rajo


Re: [PATCH] better detection of fileencoding=windows-1250 and iso-8859-2

Bram Moolenaar

Lubomir Host wrote:

> On Thu, Jun 16, 2005 at 07:30:25PM +0200, Bram Moolenaar wrote:
> > Although characters 0x80 to 0xa0 officially do not appear in iso-8859-2,
> > they may appear in a file anyway.  They don't make the file unreadable,
> > like what would happen for using illegal bytes in UTF-8.
>
> If there is a "mess" in a file, fallback to latin1 should be the best
> way. My patch affects *only* those users, who are using latin2/cp1250
> (both) encodings. For these peoples the simplest way to distinguish
> between latin2 and cp1250 is searching for characters between 0x7F and
> 0xA0. In general cp1250 file these characters appear in a file (with
> more than 99% probability):
>
> cp1250:
> 0x8A 0x0160 # LATIN CAPITAL LETTER S WITH CARON
> 0x8C 0x015A # LATIN CAPITAL LETTER S WITH ACUTE
> 0x8D 0x0164 # LATIN CAPITAL LETTER T WITH CARON
> 0x8E 0x017D # LATIN CAPITAL LETTER Z WITH CARON
> 0x8F 0x0179 # LATIN CAPITAL LETTER Z WITH ACUTE
> 0x9A 0x0161 # LATIN SMALL LETTER S WITH CARON
> 0x9B 0x203A # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
> 0x9C 0x015B # LATIN SMALL LETTER S WITH ACUTE
> 0x9D 0x0165 # LATIN SMALL LETTER T WITH CARON
> 0x9E 0x017E # LATIN SMALL LETTER Z WITH CARON
>
> Note: these characters *doesn't* appear in iso-8859-2 file.

This solution is very specific for one encoding.  I prefer a generic
solution that can handle multiple encodings.

> > It would be useful to detect the difference between cp1250 and
> > iso-8859-2 if they both appear in 'fencs'.  Even though this does depend
> > on certain characters to appear in the file and iso-8859-2 appearing
> > first.
>
> They are also other possibilities. Please, take a look at
> http://trific.ath.cx/software/enca/ . Maybe linking with libenca library
> should be usefull. Enca guess the right encoding on statistics(?) basis.
> This is CPU expensive, but the best way.

I don't like guessing.  Behavior must be predictable.  I think this kind
of guessing requires prompting the user to make a choice (showing some
text to illustrate how the result would look).  Otherwise the user would
blindly think Vim always picks the right fileencoding and only find out
afterwards that things went wrong.  That can be done with an extra item
in 'fileencodings', so that it's only done when the user wants it (e.g.,
after detecting utf-8).
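
The "show some text and let the user choose" step could look roughly like
this. It is a Python sketch for illustration only: preview_candidates and
its width parameter are hypothetical names, and nothing like it exists in
Vim.

```python
def preview_candidates(data, candidates, width=40):
    """Decode data under each candidate encoding and keep a short preview.

    A value of None means the bytes are not valid in that encoding.
    """
    previews = {}
    for name in candidates:
        try:
            previews[name] = data.decode(name)[:width]
        except UnicodeDecodeError:
            previews[name] = None
    return previews


sample = "Příliš žluťoučký kůň".encode("cp1250")
for name, text in preview_candidates(sample, ["utf-8", "iso8859_2", "cp1250"]).items():
    print(f"{name:10} {text!r}")
```

The user would then pick the line that reads correctly, so the choice
stays explicit instead of being guessed silently.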

> > I think we should at least check that another 8-bit encoding follows
> > iso-8859-2.  If it's the last one it should be used despite of
> > characters that don't appear in that encoding.  Those characters also
> > don't appear in latin1, thus when latin1 comes after iso-8859-2 you
> > might still want to use iso-8859-2.
>
> This sounds very complicated and to hard to explain in manual (for
> non-native English speakers ;-) ).
>
> We have 2 ways:
>
> 1. - for stable 6.x version:
>    - use my hack (we can distinguish between cp1250/latin2)
>    - keep 'fileencodings' option

I'm not going to change Vim 6.  It's in bugfix mode.

> 2. - for development 7.x version:
>    - use libenca library or include guessing algoritmus from this library
>    - remove 'fileencodings' option (useless in this case)

Of course we do need the 'fileencodings' option.

--
There's no place like $(HOME)!


Re: [PATCH] better detection of fileencoding=windows-1250 and iso-8859-2

Mikołaj Machowski
On Friday, 17 June 2005 10:54, Bram Moolenaar wrote:
> >
> > Note: these characters *doesn't* appear in iso-8859-2 file.
>
> This solution is very specific for one encoding.  I prefer a generic
> solution that can handle multiple encodings.

Heh. I thought you would say so ;)

Another idea:

With the recent (vim7) expansion of collections it is possible to use
the [\x7f-\xa0] range in VimL. What about adding a function to the
default script collection?

Provide a special variable
g:check_latin2_cp1250

to set in vimrc. If it is non-zero *and* both cp1250 and iso-8859-2 are
in fencs *and* Vim already knows one of them should be used (no utf-8),
an additional function to sort out the mess will be called.

Users could also tune this function for their personal needs, unlike
with core code.

Maybe too many *and*s, but I suppose (heck, I **know** :D) Eastern
European users will be happy for any help. But, please, give it an
official status.

Quick draft:
function! CheckLatin2cp1250()
    if &fencs =~? 'cp1250\|windows-1250' && &fencs =~? 'iso-8859-2\|latin2'
        " The pattern must be a quoted string; a bare [...] is a List.
        if search('[\x7f-\xa0]', 'n')
            set encoding=cp1250          " what about 8bit
                                         " required on non-Windows
                                         " systems
        else
            " Exactly - what else? The absence of such characters
            " doesn't automatically mean the file is latin2.
        endif
    endif
endfunction

m.

--
LaTeX + Vim = http://vim-latex.sourceforge.net/
Vim-list(s) Users Map: (last change 15 May)
 http://skawina.eu.org/mikolaj/vimlist
CLEWN - http://clewn.sf.net



[PATCH] better detection of fileencoding=windows-1250 and iso-8859-2

Valery Kondakoff
In reply to this post by Bram Moolenaar
Hello, Bram!

Friday, June 17, 2005, 12:54:33 PM, you wrote:

>> They are also other possibilities. Please, take a look at
>> http://trific.ath.cx/software/enca/ . Maybe linking with libenca library
>> should be usefull. Enca guess the right encoding on statistics(?) basis.
>> This is CPU expensive, but the best way.
BM> I don't like guessing.  Behavior must be predictable.  I think this kind
BM> of guessing requires prompting the user to make a choice (showing some
BM> text how the result would look like).  Otherwise the user would blindly
BM> think Vim always picks the right fileencoding and only finds out
BM> afterwards that things went wrong.  That can be done with an extra item
BM> in 'fileencodings', so that it's only done when the user wants it (e.g.,
BM> after detecting utf-8).

I would really like to see some automatic encoding guessing (with or
without user prompting). Here in Russia we use _four_ encodings: cp1251,
koi8r, cp866 and UTF-8, so it is a real problem when you need to
switch encodings manually.
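
To illustrate what such guessing might involve for these four encodings,
here is a toy Python sketch. It is not how enca or any editor actually
works; the scoring rule (count lowercase Cyrillic letters after decoding)
and the names russian_score and guess_russian_encoding are assumptions
made for this example, and short or unusual texts can easily fool it.

```python
CANDIDATES = ["cp1251", "koi8-r", "cp866"]


def russian_score(text):
    """Count lowercase Cyrillic letters (U+0430..U+044F) in decoded text."""
    return sum(1 for ch in text if "\u0430" <= ch <= "\u044f")


def guess_russian_encoding(data):
    """Prefer valid UTF-8; otherwise pick the best-scoring 8-bit candidate."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    scores = {
        name: russian_score(data.decode(name, errors="replace"))
        for name in CANDIDATES
    }
    return max(scores, key=scores.get)


print(guess_russian_encoding("привет мир".encode("koi8-r")))
```

The idea is that ordinary Russian text is mostly lowercase, and decoding
with the wrong table tends to turn it into uppercase letters or
box-drawing symbols, which lowers the score.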

Many other editors offer encoding guessing. I hope someday Vim
will do the same...

--
Best regards,
 Valery Kondakoff                            mailto:[hidden email]

PGP key: mailto:[hidden email]?subject=GET%[hidden email]