devanagari again

devanagari again

Hari Krishna Dara

I would like to know if anyone was successful editing devanagari script
in Vim on windows. I have searched the archive and found at least one
thread for the same issue which didn't seem to conclude anything.

http://article.gmane.org/gmane.editors.vim/17809/match=devanagari

I have a UTF-8 file which, when opened in MS Word, displays properly after
selecting the Courier New font (I first added the language support to MS
Office), but doing the same in Vim makes no difference. I changed
'guifont' to "Courier_New:h10:cDEFAULT" but saw no difference. Maybe I am
not choosing the right character set. How do I know what the name
of the character set should be?

I am using gvim 6.3.86 on Windose XP.

--
Thank you,
Hari


               

Re: devanagari again

A.J.Mechelynck
----- Original Message -----
From: "Hari Krishna Dara" <[hidden email]>
To: <[hidden email]>
Sent: Thursday, September 15, 2005 1:40 AM
Subject: devanagari again


>
> I would like to know if anyone was successful editing devanagari script
> in Vim on windows. I have searched the archive and found at least one
> thread for the same issue which didn't seem to conclude anything.
>
> http://article.gmane.org/gmane.editors.vim/17809/match=devanagari
>
> I have a utf-8 file when opened in MS Word shows it properly after
> selecting the Courier New font (I first added the language support to MS
> Office), but doing the same in vim makes no difference. I changed the
> 'guifont' to "Courier_New:h10:cDEFAULT" but no difference. May be I am
> not choosing the right character set. How do I know what the name
> of the character set should be?
>
> I am using gvim 6.3.86 on Windose XP.
>
> --
> Thank you,
> Hari

1. MS Word files are not text files, they have a proprietary format. To read
the contents of such a file in gvim, you need to first save the file in a
text format. If WordPad can read the file, it can save it in UTF-16le with
BOM using "Save as..." and selecting "Unicode text". The markup (bold or
italics, font face and size, margins, page sizes, etc.) will be lost. Such a
file can be read in gvim if 'encoding' is "utf-8" and 'fileencodings' starts
with "ucs-bom".

2. If you have a text file in some Unicode encosing such as UTF-8 or UTF-16,
and 'encoding' is UTF-8, and on opening the file gvim sets 'fileencoding' to
the proper encoding (if 'fileencoding' is set to the null string then the
value of 'encoding' is assumed), and if your Courier New font has glyphs for
whatever is in the file, then setting 'guifont' to
"Courier_New:h10:cDEFAULT" ought to be able in theory to display what is in
your file.

3. The above is for gvim. Console vim is of course always restricted to what
the terminal in which it runs can display.

4. ***However*** in the case of nagari script, there may be problems with
presentation forms, since some letters have different shapes at the beginning
of a word, some letter sequences have shapes different from the component
letters written next to each other, and, even though the general direction
of the script is left-to-right, some vowels (such as, IIRC, short i) are
written to the left of the consonant they follow. To address these problems
to the full, I guess a special "nagari" feature, similar to the "arabic" and
"farsi" features but probably simpler, would be necessary; and AFAIK nobody
has yet written such a feature for gvim.
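
As an aside on point 1: if the "Unicode text" file saved from WordPad needs to become plain UTF-8 outside of Vim, the conversion is small. The following is only a sketch in Python (not something used in this thread); the function name and the temporary sample file are made up for illustration, and it assumes the input really does start with a UTF-16 BOM.

```python
import codecs
import os
import tempfile


def utf16_to_utf8(src_path, dst_path):
    """Re-encode a UTF-16 text file (with BOM) as UTF-8 without a BOM."""
    # The "utf-16" codec reads the BOM to pick the byte order, then drops it.
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src_path, "r", encoding="utf-16", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
    return text


# Self-contained demonstration using a temporary file (hypothetical names).
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "sample-utf16.txt")
dst = os.path.join(workdir, "sample-utf8.txt")

# "namaste" in devanagari, followed by a DOS line ending.
sample = "\u0928\u092e\u0938\u094d\u0924\u0947\r\n"
with open(src, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + sample.encode("utf-16-le"))

converted = utf16_to_utf8(src, dst)
with open(dst, "rb") as f:
    utf8_bytes = f.read()
```

Inside Vim the same result should be obtainable by reading the file with 'fileencodings' starting with "ucs-bom" and then writing it with 'fileencoding' set to "utf-8".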


Best regards,
Tony.



Re: devanagari again

Hari Krishna Dara
In reply to this post by Hari Krishna Dara

Thanks Tony. Please see my response below.

On Thu, 15 Sep 2005 at 2:26am, Tony Mechelynck wrote:

> ----- Original Message -----
> From: "Hari Krishna Dara" <[hidden email]>
> To: <[hidden email]>
> Sent: Thursday, September 15, 2005 1:40 AM
> Subject: devanagari again
>
>
> >
> > I would like to know if anyone was successful editing devanagari script
> > in Vim on windows. I have searched the archive and found at least one
> > thread for the same issue which didn't seem to conclude anything.
> >
> > http://article.gmane.org/gmane.editors.vim/17809/match=devanagari
> >
> > I have a utf-8 file when opened in MS Word shows it properly after
> > selecting the Courier New font (I first added the language support to MS
> > Office), but doing the same in vim makes no difference. I changed the
> > 'guifont' to "Courier_New:h10:cDEFAULT" but no difference. May be I am
> > not choosing the right character set. How do I know what the name
> > of the character set should be?
> >
> > I am using gvim 6.3.86 on Windose XP.
> >
> > --
> > Thank you,
> > Hari
>
> 1. MS Word files are not text files, they have a proprietary format. To read
> the contents of such a file in gvim, you need to first save the file in a
> text format. If WordPad can read the file, it can save it in UTF-16le with
> BOM using "Save as..." and selecting "Unicode text". The markup (bold or
> italics, font face and size, margins, page sizes, etc.) will be lost. Such a
> file can be read in gvim if 'encoding' is "utf-8" and 'fileencodings' starts
> with "ucs-bom".

I understand that native MS Word format is not text, but it can be used
as a glorified notepad to edit text files too (if you can handle the
file conversion dialog boxes before and after working with the file),
which is what I did. When I open the utf-8 file in Word, it detects the
encoding correctly and has it already selected in the file conversion
dialog. However, the default font that I had set doesn't have the
glyphs, so when I select another font such as Courier New, it displays
correctly.

>
> 2. If you have a text file in some Unicode encosing such as UTF-8 or UTF-16,
> and 'encoding' is UTF-8, and on opening the file gvim sets 'fileencoding' to
> the proper encoding (if 'fileencoding' is set to the null string then the
> value of 'encoding' is assumed), and if your Courier New font has glyphs for
> whatever is in the file, then setting 'guifont' to
> "Courier_New:h10:cDEFAULT" ought to be able in theory to display what is in
> your file.

When I open this file in gvim, it doesn't detect 'encoding' or
'fileencoding' correctly and displays the contents as if it is a binary
file (though 'binary' is not set). When I manually set the 'encoding' to
"utf-8" it seems to recognize the characters correctly, as I now see one
box for each character, but selecting any font makes no difference. This
is almost like what I first see in Word, before selecting the right
font, except that in the very first word, I see <feff> followed by two
blocks instead of the four "blocks" that Word shows. The "<feff>" seems
to be some symbol, as I can't move cursor into it.

If I set 'fileencoding' before opening the file using the "++enc=utf-8"
option, and set 'encoding' also to "utf-8", I see a bunch of <bf>'s for
each of the non-english characters in it. What does the <bf> stand for?

I have downloaded some special fonts for devanagari, but unfortunately
none of them are fixed width so I can't try them in vim. However,
selecting "Courier New" in Word shows the characters as expected, but
the same doesn't work in gvim.

>
> 3. The above is for gvim. Console vim is of course always restricted to what
> the terminal in which it runs can display.

I understand.

>
> 4. ***However*** in the case of nagari script, there may be problems with
> presentation forms, since some letters have different shapes at the begin of
> a word, some letter sequences have shapes different than the component
> letters written next to each other, and, even though the general direction
> of the script is left-to-right, some vowels (such as, IIRC, long i) are
> written to the left of the consonant they follow. To address these problems
> to the full, I guess a special "nagari" feature, similar to the "arabic" and
> "farsi" features but probably simpler, would be necessary; and AFAIK nobody
> has yet written such a feature for gvim.

For now, I just wanted to be able to view such files and hope that full
support will be added soon.

Thanks a lot,
Hari


       
               

Re: devanagari again

A.J.Mechelynck
This e-mail is in UTF-8, it contains devanagari letters and digits
----- Original Message -----
From: "Hari Krishna Dara" <[hidden email]>
To: "Tony Mechelynck" <[hidden email]>
Cc: <[hidden email]>
Sent: Thursday, September 15, 2005 8:42 PM
Subject: Re: devanagari again


>
> Thanks Tony. Please see my response below.
>
> On Thu, 15 Sep 2005 at 2:26am, Tony Mechelynck wrote:
>
>> ----- Original Message -----
>> From: "Hari Krishna Dara" <[hidden email]>
>> To: <[hidden email]>
>> Sent: Thursday, September 15, 2005 1:40 AM
>> Subject: devanagari again
>>
>>
>> >
>> > I would like to know if anyone was successful editing devanagari script
>> > in Vim on windows. I have searched the archive and found at least one
>> > thread for the same issue which didn't seem to conclude anything.
>> >
>> > http://article.gmane.org/gmane.editors.vim/17809/match=devanagari
>> >
>> > I have a utf-8 file when opened in MS Word shows it properly after
>> > selecting the Courier New font (I first added the language support to
>> > MS
>> > Office), but doing the same in vim makes no difference. I changed the
>> > 'guifont' to "Courier_New:h10:cDEFAULT" but no difference. May be I am
>> > not choosing the right character set. How do I know what the name
>> > of the character set should be?
>> >
>> > I am using gvim 6.3.86 on Windose XP.
>> >
>> > --
>> > Thank you,
>> > Hari
>>
>> 1. MS Word files are not text files, they have a proprietary format. To
>> read
>> the contents of such a file in gvim, you need to first save the file in a
>> text format. If WordPad can read the file, it can save it in UTF-16le
>> with
>> BOM using "Save as..." and selecting "Unicode text". The markup (bold or
>> italics, font face and size, margins, page sizes, etc.) will be lost.
>> Such a
>> file can be read in gvim if 'encoding' is "utf-8" and 'fileencodings'
>> starts
>> with "ucs-bom".
>
> I understand that native MS Word format is not text, but it can be used
> as a glorified notepad to edit text files too (if you can handle the
> file conversion dialog boxes before and after working with the file),
> which is what I did. When I open the utf-8 file in Word, it detects the
> encoding correctly and has it already selected in the file conversion
> dialog. However, the default font that I had set doesn't have the
> glyphs, so when I select another font such as Courier New, it displays
> correctly.
>
>>
>> 2. If you have a text file in some Unicode encosing such as UTF-8 or
>> UTF-16,

oops! s/\<encosing\>/encoding/

>> and 'encoding' is UTF-8, and on opening the file gvim sets 'fileencoding'
>> to
>> the proper encoding (if 'fileencoding' is set to the null string then the
>> value of 'encoding' is assumed), and if your Courier New font has glyphs
>> for
>> whatever is in the file, then setting 'guifont' to
>> "Courier_New:h10:cDEFAULT" ought to be able in theory to display what is
>> in
>> your file.
>
> When I open this file in gvim, it doesn't detect 'encoding' or
> 'fileencoding' correct and displays the contents as if it is a binary
> file (though 'binary' is not set). When I manually set the 'encoding' to
> "utf-8" it seems to recognize the characters correctly, as I now see one
> box for each character, but selecting any font makes no difference. This
> is almost like what I first see in Word, before selecting the right
> font, except that in the very first word, I see <feff> followed by two
> blocks instead of the four "blocks" that Word shows. The "<feff>" seems
> to be some symbol, as I can't move cursor into it.

<FEFF> as the first codepoint in a file is the BOM ("byte order mark", but
it indicates not only the byte order but also the specific Unicode encoding,
see below). If your 'fileencodings' (plural), which is a comma-separated
list, _starts_ with the element "ucs-bom", the BOM will be recognised, and
Vim will not only set the correct Unicode encoding (among the 5 possible)
but also do the equivalent of ":setlocal bomb".

IOW the following ought to display the file more or less correctly:

    " first, let's not blow up our keyboard charset
    :if &termencoding == "" | let &termencoding = &encoding | endif
    " make sure Vim uses Unicode internally
    :set encoding=utf-8
    " if 'fileencodings' (plural) is set, Vim will try to guess
    " 'fileencoding' (singular); add the locale's charset as a fallback
    " default if nothing better is found
    " (note the "&": ":let" needs it in order to assign to an option)
    :let &fileencodings = "ucs-bom,utf-8," . &termencoding
    " the following guifont format is for non-X11 systems
    :set guifont=Courier_New:h12:cDEFAULT

Notes:
1. My mail client or yours may have "beautified" one or more of the above
lines: any line starting at the left margin should be joined with the
previous one.
2. If your version of gvim allows it, :set guifont=* will open a font
selection menu.
3. In all current versions of gvim, with 'nocompatible' set, ":set
guifont=<Tab>" adds the current value on the command-line so you can
conveniently edit it, then hit Enter to accept or Esc to cancel.
4. I have tried entering a few devanagari characters in Vim; e.g. here are
the 10 digits: ०१२३४५६७८९ and here are a few consonants: ??????????? All
these digits and letters were composed in Vim (using ^Vuxxxx where xxxx is
the codepoint in hex) then pasted into this email via the clipboard.
However, none of the fonts I have show them as nagari glyphs in Vim: for
instance, Courier New displays them all as empty boxes. In Outlook Express I
see them correctly using Times New Roman, a proportional font. If you get
the same results then I guess you don't have a fixed-width font with nagari
glyphs.
5. Apparently my fixed-width fonts cover European scripts (various kinds of
Latin plus Cyrillic and Greek), West-Asian (Arabic, Hebrew), East-Asian
(Simplified Chinese, Traditional Chinese, Japanese), but no Indic scripts.

>
> If I set 'fileencoding' before opening the file using the "++enc=utf-8"
> option, and set 'encoding' also to "utf-8", I see a bunch of <bf>'s for
> each of the non-english characters in it. What does the <bf> stand for?

First possibility: the _codepoint_ U+00BF, represented on disk as the
two-byte sequence C2 BF, is a reversed question mark, as used in Spanish at
the start of an interrogative clause. If your 'isprint' option is set
correctly, Vim should display it as ¿

Second possibility: the _byte_ 0xBF, if present in the disk file other than
as a non-initial byte in a proper multibyte sequence, is _illegal_ in UTF-8.
Legal sequences are as follows (with the exclusion of a few "reserved" or
"forbidden" code points):

    Codepoints             #bytes   Representation (binary)
    U+0000 to U+007F       1        0xxxxxxx
    U+0080 to U+07FF       2        110xxxxx 10xxxxxx
    U+0800 to U+FFFD       3        1110xxxx 10xxxxxx 10xxxxxx
    U+10000 to U+10FFFD    4        11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

(Sequences of up to 6 bytes, according to the same scheme, had originally been
foreseen, supporting codepoints up to U+7FFFFFFF, and Vim still supports
that range, but it has later been decided that no characters will be
allocated above U+10FFFD. So until or unless that decision is reversed --
which will not happen in any foreseeable future -- UTF-8 byte sequences are
limited to 4 bytes maximum. All codepoints ending in FFFE or FFFF have been
made "illegal", though Vim will let you enter them into a file.)
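
To make the table concrete, here is a small sketch in Python (an illustration only, not anything Vim does internally). The helper names utf8_encode and is_valid_utf8 are invented for this example. The first follows the four rows of the table by hand; the second simply asks Python's own decoder whether a byte string is legal UTF-8.

```python
def utf8_encode(codepoint):
    """Encode one codepoint into UTF-8 bytes, row by row as in the table.

    Simplification: surrogates (U+D800 to U+DFFF) and the "illegal"
    codepoints mentioned above are not rejected here.
    """
    if codepoint < 0 or codepoint > 0x10FFFF:
        raise ValueError("codepoint out of range")
    if codepoint <= 0x7F:                      # 0xxxxxxx
        return bytes([codepoint])
    if codepoint <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ka = utf8_encode(0x0915)            # DEVANAGARI LETTER KA
inverted_q = utf8_encode(0x00BF)    # INVERTED QUESTION MARK
bom = utf8_encode(0xFEFF)           # the BOM codepoint

lone_bf_ok = is_valid_utf8(b"\xbf")        # continuation byte on its own
paired_bf_ok = is_valid_utf8(b"\xc2\xbf")  # proper two-byte sequence
```

Note that a BF byte also occurs legitimately as the last byte of the UTF-8 encoding of U+FEFF (EF BB BF), so a stray <bf> near the start of a file may be worth checking against that.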

With 'encoding' set to "utf-8" and with the cursor on a character, use ga to
see its alpha and numeric values, and use g8 to see its byte representation.
If "composing characters" are used (characters which get superimposed on the
preceding character, e.g. in nagari the small oblique line indicating the
absence of a vowel following a given consonant), then each spacing character
will be shown with all composing characters (up to 3) immediately following
it.
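
For inspecting a file outside Vim, something roughly equivalent to ga and g8 can be sketched in Python with the standard unicodedata module. The function name describe is invented here, and treating every character whose general category starts with "M" as "composing" is an assumption that only approximates Vim's own rule.

```python
import unicodedata


def describe(text):
    """Return one tuple per character:
    (character, codepoint, Unicode name, UTF-8 bytes in hex, is_mark)."""
    rows = []
    for ch in text:
        utf8_hex = " ".join("%02x" % b for b in ch.encode("utf-8"))
        rows.append((
            ch,
            "U+%04X" % ord(ch),
            unicodedata.name(ch, "<unnamed>"),
            utf8_hex,
            # Mn/Mc/Me categories: an approximation of "composing character".
            unicodedata.category(ch).startswith("M"),
        ))
    return rows


# The consonant KA followed by the sign that suppresses its inherent vowel.
rows = describe("\u0915\u094d")
for row in rows:
    print(row[1], row[2], row[3], "mark" if row[4] else "spacing")
```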

The Unicode Consortium has allocated codepoints U+0901 to U+097D to
devanagari, see http://www.unicode.org/charts/PDF/U0900.pdf ; just after
that, U+0981 to U+09FD are allocated to bengali
http://www.unicode.org/charts/PDF/U0980.pdf ; then in sequence: Gurmukhi,
Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala.
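
A quick way to check whether a file really contains devanagari is to test codepoints against that range. The sketch below is in Python; the constants use the range quoted above, the function names are invented for the example, and the digit positions U+0966 to U+096F are taken from the Unicode chart rather than from this thread.

```python
import unicodedata

# Range of allocated devanagari codepoints as quoted in the text above.
DEVANAGARI_FIRST = 0x0901
DEVANAGARI_LAST = 0x097D


def is_devanagari(ch):
    """True if the single character ch lies in the quoted devanagari range."""
    return DEVANAGARI_FIRST <= ord(ch) <= DEVANAGARI_LAST


def devanagari_digits():
    """Return the ten devanagari digits, zero through nine."""
    return "".join(chr(cp) for cp in range(0x0966, 0x0970))


def count_devanagari(text):
    """Count how many characters of text fall in the devanagari range."""
    return sum(1 for ch in text if is_devanagari(ch))


digits = devanagari_digits()
mixed_count = count_devanagari("abc \u0915\u0916 123")
```

In Vim the same digits can be typed one at a time with ^Vu0966 through ^Vu096f.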

UTF-32 uses one 32-bit doubleword for each codepoint and can be big-endian
or little-endian. UTF-16 uses one 16-bit word for codepoints U+0000 to
U+FFFD; it uses two 16-bit words ("high and low surrogates") for the rest;
it can also be big-endian or little-endian. In UTF-8 the highest data bits
always come first (in the "x" spots in the table above). The BOM is
codepoint U+FEFF "ZERO-WIDTH NO-BREAK SPACE" which has a different
representation in each encoding (assuming that a file in little-endian
UTF-16 with BOM will never have a NULL as its first data byte):

FE FF          UTF-16be
FF FE          UTF-16le
EF BB BF       UTF-8
00 00 FE FF    UTF-32be
FF FE 00 00    UTF-32le

Because of that use as BOM, the use of ZWNBSP as a "real" character is
deprecated in favour of another codepoint, U+2060 WORD JOINER.
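
The BOM table can be reproduced, and used for detection, with a few lines of Python. This is only a sketch of the idea, not a description of how Vim implements "ucs-bom"; the names BOM_SIGNATURES, PYTHON_CODECS and detect_bom are invented for the example. The four-byte UTF-32le signature is tested before the two-byte UTF-16le one, which is exactly the assumption stated above about a UTF-16le file never starting with a NULL.

```python
# Longest signatures first: FF FE 00 00 (UTF-32le) begins with FF FE (UTF-16le).
BOM_SIGNATURES = [
    (b"\xff\xfe\x00\x00", "utf-32le"),
    (b"\x00\x00\xfe\xff", "utf-32be"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16le"),
    (b"\xfe\xff", "utf-16be"),
]

# Python codec names corresponding to the encoding names used in the table.
PYTHON_CODECS = {
    "utf-16be": "utf-16-be",
    "utf-16le": "utf-16-le",
    "utf-8": "utf-8",
    "utf-32be": "utf-32-be",
    "utf-32le": "utf-32-le",
}


def detect_bom(data):
    """Return (encoding name, BOM length) for the start of data,
    or (None, 0) if no BOM is recognised."""
    for signature, name in BOM_SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0


# Rebuild the table by encoding U+FEFF in each encoding.
bom_table = {name: "\ufeff".encode(codec)
             for name, codec in PYTHON_CODECS.items()}

for name, raw in bom_table.items():
    print(" ".join("%02X" % b for b in raw), name)
```

For example, detect_bom applied to the first bytes of a file saved by WordPad as "Unicode text" would be expected to report "utf-16le".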

>
> I have downloaded some special fonts for devanagari, but unfortunately
> none of them are fixed width so I can't try them in vim, however
> selecting "Courier New" in Word shows the characters as expected, but
> the same doesn't work in gvim. [...]

I'm not sure what Word does if a file contains characters from "strange"
charsets. Maybe it looks for a font possessing a glyph for that codepoint.
Anyway I suspect that that's what browsers are doing, since when browsing
multilingual pages I sometimes see glyphs which don't "marry" very well from
the point of view of style, serif, thins and fats, etc.


Best regards,
Tony.