How to tell encoding without open the file?

Sean-130

Hello,

My script opens a file through:
    let lines = readfile(my_data_file)

Is it possible to tell the file encoding? At least, I want to know if
it is utf-8 or not.

The file can be converted successfully when I open it with utf-8
setting.

Thanks

Sean
--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_use" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---

Re: How to tell encoding without open the file?

Tony Mechelynck

On 23/01/09 09:23, Sean wrote:

> Hello,
>
> My script opens a file through:
>      let lines = readfile(my_data_file)
>
> Is it possible to tell the file encoding? At least, I want to know if
> it is utf-8 or not.
>
> The file can be converted successfully when I open it with utf-8
> setting.
>
> Thanks
>
> Sean

If it can be read successfully when opened with UTF-8 setting, then it
is probably UTF-8, or at least compatible with UTF-8. By "compatible
with UTF-8", I mean that any file containing only 7-bit ASCII data is
recorded identically in US-ASCII, Latin1 or UTF-8 so it is not possible
(from the file's data) to tell which of them was the intention.

There are a few exceptions: for instance, mail files (complete with
headers, e.g. *.eml files) may include a header line such as

Content-Type: text/plain; charset=iso-8859-1

before the first empty line, to say that they are in Latin1; HTML files
may include, between <HEAD> and </HEAD>, a tag such as

<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252" />

to say that they're in cp1252, etc. But I've seen badly constructed
mails and HTML pages that lied about their encoding.

In general, Vim cannot easily tell which 8-bit encoding was used to
record a file, because in 8-bit encodings all 256 byte values represent
valid characters. For multibyte encodings, it depends which one you're
trying. Valid UTF-8 bytes and byte sequences are as follows:

- Bytes 0x00 to 0x7F each represent one codepoint in the range U+0000 to
U+007F.
- All other codepoints are represented by one "leader byte" in the range
0xC0 to 0xFD (binary 11xxxxxx) followed by one or more "trailer bytes"
in the range 0x80 to 0xBF (binary 10xxxxxx). In addition, the leader
byte indicates how long the sequence must be to be valid, as follows:
   * 0xC0 to 0xDF (110xxxxx) are leaders for two-byte sequences
   * 0xE0 to 0xEF (1110xxxx) are leaders for three-byte sequences
   * 0xF0 to 0xF7 (11110xxx) are leaders for four-byte sequences
   * 0xF8 to 0xFB (111110xx) are leaders for five-byte sequences
   * 0xFC and 0xFD (1111110x) are leaders for six-byte sequences
   * 0xFE and 0xFF (1111111x) are never valid bytes in UTF-8.
IOW, the length of the sequence is equal to the number of consecutive
one bits at the top of the leader byte, before the first zero bit. In
both leader and trailer bytes, bits after the first zero bit in each
byte represent the data, always in big-endian order (highest bits first).
Some time after this standard (still used by Vim) was defined, it was
decided that codepoints above U+10FFFD would never be used, which means
that five- and six-byte sequences (and four-byte sequences higher than
F4 8F BF BD) ought not to be used; but Vim still follows the earlier
standard, where sequences of one to six bytes could be used to represent
codepoints in the range U+0000 to U+7FFFFFFF. It was also decided that
codepoints U+xxFFFE and U+xxFFFF (where xx can be any hex number between
00 and 10) would never be used either. There are some other "not a
character" codepoints but their values are less "systematic".
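
The byte rules above can be sketched in Vim script. This is only a minimal sketch, assuming the one-to-six-byte scheme exactly as described above; IsUtf8Structure() is a name made up for illustration, not something Vim provides. It checks leader/trailer structure only and does not reject overlong forms or the "not a character" codepoints.

```vim
" Sketch only: IsUtf8Structure() is a hypothetical helper name.
" Returns 1 if a:str consists solely of well-formed sequences under the
" one-to-six-byte scheme, 0 otherwise.
function! IsUtf8Structure(str)
  let i = 0
  let n = strlen(a:str)
  while i < n
    " a:str[i] is a single byte; char2nr() of a lone byte is its value.
    let b = char2nr(a:str[i])
    if b < 0x80
      let need = 0          " 7-bit ASCII, no trailer bytes
    elseif b < 0xC0
      return 0              " trailer byte without a leader
    elseif b < 0xE0
      let need = 1          " two-byte sequence
    elseif b < 0xF0
      let need = 2          " three-byte sequence
    elseif b < 0xF8
      let need = 3          " four-byte sequence
    elseif b < 0xFC
      let need = 4          " five-byte sequence
    elseif b < 0xFE
      let need = 5          " six-byte sequence
    else
      return 0              " 0xFE and 0xFF are never valid
    endif
    let i += 1
    while need > 0
      if i >= n
        return 0            " sequence cut short at end of string
      endif
      let t = char2nr(a:str[i])
      if t < 0x80 || t > 0xBF
        return 0            " expected a trailer byte (10xxxxxx)
      endif
      let i += 1
      let need -= 1
    endwhile
  endwhile
  return 1
endfunction
```

For example, IsUtf8Structure("\xe4\xb8\xad") returns 1, while IsUtf8Structure("\xd6\xd0") returns 0 because 0xD0 is not a trailer byte. Some text in a double-byte encoding can still pass by coincidence, so a "1" means "structurally possible UTF-8", not proof.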

For UTF-16 and UTF-32 there are far fewer "invalid" sequences, though a
human can tell that

Mary had a little lamb

is probably Latin1 or UTF-8, that

M^@a^@r^@y^@ ^@h^@a^@d^@ ^@a^@ ^@l^@i^@t^@t^@l^@e^@ ^@l^@a^@m^@b^@

is probably in UTF-16le, and that

^@M^@a^@r^@y^@ ^@h^@a^@d^@ ^@a^@ ^@l^@i^@t^@t^@l^@e^@ ^@l^@a^@m^@b

is probably in UTF-16be (where ^@, usually shown in blue in Vim,
represents a null byte).

For CJK encodings, I don't know how to tell whether "only valid bytes
and byte sequences" were used.

All this means that if your 'fileencodings' (plural) is set to
"ucs-bom,utf-8,latin1" (without the quotes) Vim ought to detect
correctly the encoding of any Unicode file that has a BOM and of any
UTF-8 (or UTF-8-compatible) file even if it has no BOM; anything else
will be treated as Latin1 unless you open the file with (for instance
for a GB18030 file)

        :e ++enc=GB18030 filename.ext

(see ":help ++opt") in which case you are _telling_ Vim which
'fileencoding' (singular) to use and the 'fileencodings' (plural)
heuristics don't come into play.

As for the BOM, it is a code which appears before the first character of
many Unicode files; it represents the codepoint U+FEFF ZERO-WIDTH
NO-BREAK SPACE with the following disk bytes:

EF BB BF        UTF-8
FE FF           UTF-16be
FF FE           UTF-16le
00 00 FE FF     UTF-32be
FF FE 00 00     UTF-32le

As you can see, it is assumed that no UTF-16le file (with BOM) will ever
start with a null codepoint. I believe this is a reasonable assumption.
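
From a script, the BOM table above could be checked roughly as follows. This is a sketch under stated assumptions: BomEncoding() is a hypothetical name, and it relies on readfile() in binary mode keeping the BOM and storing each NUL byte as "\n" inside a line. The UTF-32le test comes before the UTF-16le test, which is the same assumption as above about a leading null codepoint.

```vim
" Sketch only: BomEncoding() is a hypothetical helper name.
" Returns the encoding implied by the BOM of a:fname, or '' if none.
function! BomEncoding(fname)
  " Binary mode keeps the BOM; one line is enough to see the first bytes.
  let first = get(readfile(a:fname, 'b', 1), 0, '')
  " readfile() represents each NUL byte as "\n" within a line.
  if first[0:3] ==# "\n\n\xfe\xff"
    return 'utf-32be'
  elseif first[0:3] ==# "\xff\xfe\n\n"
    return 'utf-32le'
  elseif first[0:2] ==# "\xef\xbb\xbf"
    return 'utf-8'
  elseif first[0:1] ==# "\xfe\xff"
    return 'utf-16be'
  elseif first[0:1] ==# "\xff\xfe"
    return 'utf-16le'
  endif
  return ''
endfunction
```

An empty result only means "no BOM"; the file may still be UTF-8 without one.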


Best regards,
Tony.
--
Life is like a penis: when it's soft you can't beat it, and when it's
hard you get fucked.

Re: How to tell encoding without open the file?

Sean-130



Thanks Tony for help.

> For CJK encodings, I don't know how to tell whether "only valid bytes
> and byte sequences" were used.

I am working on a Vim script, and I don't open the file. But I am sure
that the file being parsed by my script is either valid UTF-8 or valid
cp936.

The question is: for a file in a CJK encoding, can we tell from a script
that it is not valid UTF-8? I cannot find any API for it.

Thanks

Sean

Re: How to tell encoding without open the file?

Tony Mechelynck

On 23/01/09 19:36, Sean wrote:

> I am working on a Vim script, and I don't open the file. But I am sure
> that the file being parsed by my script is either valid UTF-8 or valid
> cp936.
>
> The question is: for a file in a CJK encoding, can we tell from a script
> that it is not valid UTF-8? I cannot find any API for it.
>
> Thanks
>
> Sean

Hm, you could tell by examining all bytes from start to end, and
checking that you encounter only valid UTF-8 sequences as described
above; but it would be expensive.

If you know that your file is either valid UTF-8 or valid cp936, you
could set 'fileencodings' (a global option!) as follows

        let save_fencs = &fencs
        set fencs=ucs-bom,utf-8,prc,latin1

and then open the file, for instance (assuming we're in a function) with

        exe "new" a:fname
        if &fenc !~? '^u' && &fenc != 'cp936' && &fenc != 'euc-cn'
                let &fencs = save_fencs
                q!
                echoerr 'Encoding not recognized'
                return -1
        endif

The above (untested) assumes that we accept Unicode files with BOM (also
e.g. UTF-16le which is typical of Windows), and paves the way to
recognition of euc-cn (preferred to cp936 on Unix IIUC: "prc" is an
alias for cp936 on Windows and for euc-cn on Unix). You may of course
vary from the above if you have other requirements.


Best regards,
Tony.
--
Gyroscope, n.:
        A wheel or disk mounted to spin rapidly about an axis and also
free to rotate about one or both of two axes perpendicular to each
other and the axis of spin so that a rotation of one of the two
mutually perpendicular axes results from application of torque to the
other when the wheel is spinning and so that the entire apparatus
offers considerable opposition depending on the angular momentum to any
torque that would change the direction of the axis of spin.
                -- Webster's Seventh New Collegiate Dictionary

Re: How to tell encoding without open the file?

Sean-130

Hi Tony,

I cannot assume anything on their encoding/fileencoding or BOM
settings. The best I can try is:

> Hm, you could tell by examining all bytes from start to end, and
> checking that you encounter only valid UTF-8 sequences as described
> above; but it would be expensive.

I don't need to examine the whole file. I believe I only need to
take out one line to test if it is UTF-8 encoding or not. Do we have
such an API ready in vim?

I am playing with the following logic, but it looks not so pretty:

    let lines = readfile(datafile)
    let seed = lines[1]

    let apple =  len(seed)
    let orange = len(substitute(iconv(seed,"utf-8",&enc),".","x","g"))
    " ------------------------------------------
    " datafile_utf8: len(apple)=3  len(orange)=1
    " datafile_cjk:  len(apple)=2  len(orange)=2
    " ------------------------------------------
    let localization = 0
    if apple == orange
       let localization = 1
    endif

Please tell me if you have better idea.

Thanks

Sean

Re: How to tell encoding without open the file?

Tony Mechelynck

On 23/01/09 21:12, Sean wrote:

> Hi Tony,
>
> I cannot assume anything on their encoding/fileencoding or BOM
> settings. The best I can try is:
>
>> Hm, you could tell by examining all bytes from start to end, and
>> checking that you encounter only valid UTF-8 sequences as described
>> above; but it would be expensive.
>
> I don't need to examine the whole file. I believe I only need to
> take out one line to test if it is UTF-8 encoding or not. Do we have
> such an API ready in vim?

AFAIK, we don't (other than what I already said); but if the first line
is in 7-bit ASCII (or is empty), it will be recognized as "valid UTF-8
as far as we tested" even if there is data that is not va

>
> I am playing with the following logic, but it looks not so pretty:
>
>      let lines = readfile(datafile)
>      let seed = lines[1]
>
>      let apple =  len(seed)
>      let orange = len(substitute(iconv(seed,"utf-8",&enc),".","x","g"))
>      " ------------------------------------------
>      " datafile_utf8: len(apple)=3  len(orange)=1
>      " datafile_cjk:  len(apple)=2  len(orange)=2

len(orange) will be zero if the first line is not valid for the current
'encoding', see ":help iconv()". It will always be equal to len(seed) if
it is 7-bit US-ASCII (which is valid for any recognized encoding except
EBCDIC, which requires +ebcdic compiled-in).

>      " ------------------------------------------
>      let localization = 0
>      if apple == orange
>         let localization = 1
>      endif
>
> Please tell me if you have better idea.
>
> Thanks
>
> Sean

OK, I get the gist of your idea. Here's another try. It should be put in
a function to avoid name clashes.

let lines=readfile(a:datafile)
let l = len(lines)
let i = 0
let u = 1
let c = 1
while (i < l) && (u || c)
        let thisline = lines[i]
        if thisline != ""
                let u = u && iconv(thisline, "UTF-8", "GB18030") != ""
                let c = c && iconv(thisline, "CP936", "UTF-8") != ""
        endif
        let i += 1
endwhile
if u && c
        echo 'ambiguous'
elseif u
        echo 'utf-8'
elseif c
        echo 'cp936'
else
        echo 'invalid'
endif


The idea of the two iconv calls is that neither is a do-nothing, but any
valid cp936 can be converted to UTF-8 and any valid UTF-8 can be
converted to GB18030 (even if it contains "rare hanzi" not representable
in cp936). You should of course have tested first that has('iconv')
returns a true (nonzero) value. The fourfold "if" at the end is of
course just a demo of the values set by the "while" loop in the four
possible cases.


Best regards,
Tony.
--
ARTHUR: Did you say shrubberies?
ROGER:  Yes.  Shrubberies are my trade.  I am a shrubber.  My name is Roger
         the Shrubber.  I arrange, design, and sell shrubberies.
                  "Monty Python and the Holy Grail" PYTHON (MONTY)
PICTURES LTD

Re: How to tell encoding without open the file?

Sean-130

Tony,

Thank you for your inspiration. I now have working code:

http://maxiangjiang.googlepages.com/vimim.vim.html
The code block is between line 513 and line 542

The idea is to take out one character from my datafile, which has to
be:
line_1: alphabeta CJK
line_2: alphabeta CJK
line_3: alphabeta CJK

When 'encoding' is not utf-8 and the datafile is UTF-8, I could not
even trust split() in
let seed = split(seed,'\s\+')[1]

Anyway, it is good enough for my purpose:
http://maxiangjiang.googlepages.com/vimim.gif

Sean


Re: How to tell encoding without open the file?

Yongwei Wu
In reply to this post by Sean-130

2009/1/23 Sean <[hidden email]>:

>
> Hello,
>
> My script opens a file through:
>    let lines = readfile(my_data_file)
>
> Is it possible to tell the file encoding? At least, I want to know if
> it is utf-8 or not.
>
> The file can be converted successfully when I open it with utf-8
> setting.

I have a small open-source program that does this.  Search for tellenc
in http://wyw.dcweb.cn/.

Best regards,

Yongwei

--
Wu Yongwei
URL: http://wyw.dcweb.cn/

Re: How to tell encoding without open the file?

bill lam
In reply to this post by Sean-130
On Fri, 23 Jan 2009, Sean wrote:

> The idea is to take out one character from my datafile, which has to
> be:
> line_1: alphabeta CJK
> line_2: alphabeta CJK
> line_3: alphabeta CJK
>
> When encoding is non-utf8, and the datafile is utf8, I even could not
> trust split() in
> let seed = split(seed,'\s\+')[1]
>

Why don't you just require all authors to put the name of the encoding
used in the first line of the datafile?
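
A rough sketch of how a script might honour such a declaration. Everything specific here is an assumption made up for illustration: the "encoding=NAME" first-line format, and the function name ReadDeclared(). Only readfile(), matchstr(), map() and iconv() are real Vim functions.

```vim
" Sketch only: the 'encoding=NAME' first line and ReadDeclared() are
" hypothetical conventions, not an existing datafile format.
function! ReadDeclared(fname)
  let lines = readfile(a:fname)
  let l:from = matchstr(get(lines, 0, ''), '^encoding=\zs[-_A-Za-z0-9]\+$')
  if empty(l:from)
    " No declaration found: hand the lines back untouched.
    return lines
  endif
  " Drop the declaration and convert the rest to the current 'encoding'.
  " iconv() yields '' for a line it cannot convert.
  return map(lines[1:], 'iconv(v:val, l:from, &encoding)')
endfunction
```

As noted earlier in the thread, a declaration can lie, so a script may still want a sanity check on the converted lines.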

--
regards,
====================================================
GPG key 1024D/4434BAB3 2008-08-24
gpg --keyserver subkeys.pgp.net --recv-keys 4434BAB3
唐詩188 杜甫  宿府
    清秋幕府井梧寒  獨宿江城蠟炬殘  永夜角聲悲自語  中天月色好誰看
    風塵荏苒音書絕  關塞蕭條行陸難  已忍伶俜十年事  強移棲息一枝安

Re: How to tell encoding without open the file?

Guido Milanese-4
In reply to this post by Yongwei Wu

On Jan 24, 5:56 am, Yongwei Wu <[hidden email]> wrote:
> I have a small open-source program that does this.  Search for tellenc
> in http://wyw.dcweb.cn/.

Very nice! But I think that for simple tasks, even a simple "file
namefile" can be enough (of course in any Unix-like OS). I use 'file'
to detect the encoding and send the file to 'convert' if it is not
UTF-8.

May be simplistic, but if the required task is basic it works also in
a complex script with many if/else branches.

Best regards,
guido, italy

Re: How to tell encoding without open the file?

Benjamin Sergeant
I'd recommend chardet, a Python library that is a port of Mozilla's encoding detection code to Python; web browsers have to be good at encoding detection.

http://chardet.feedparser.org/

- Benjamin

Re: How to tell encoding without open the file?

Tony Mechelynck
In reply to this post by Guido Milanese-4

On 24/01/09 12:34, Guido Milanese wrote:

> On Jan 24, 5:56 am, Yongwei Wu <[hidden email]> wrote:
>> I have a small open-source program that does this.  Search for tellenc
>> in http://wyw.dcweb.cn/.
>
> Very nice! But I think that for simple tasks, even a simple "file
> namefile" can be enough (of course in any Unix-like OS). I use 'file'
> to detect the encoding and send the file to 'convert' if it is not
> UTF-8.
>
> May be simplistic, but if the required task is basic it works also in
> a complex script with many if/else branches.
>
> Best regards,
> guido, italy

I tried it with one of my files, and when I asked for its encoding, with

        file --mime-encoding ~/pub/index.htm

it answered:

pub/index.htm: text/html

meaning, IIU the manpage correctly, that my file has an _encoding_ of
text/html. (I intentionally did _not_ ask for the mime _type_).

This file is the local copy of
http://users.skynet.be/antoine.mechelynck/ which I know is in UTF-8 --
but file didn't say so, which means it fails my first sanity test.

I'm on openSUSE 11.1 with file 4.24 and "magic file from
/etc/magic:/usr/share/misc/magic".


Best regards,
Tony.
--
We've sent a man to the moon, and that's 29,000 miles away.  The center
of the Earth is only 4,000 miles away.  You could drive that in a week,
but for some reason nobody's ever done it.
                -- Andy Rooney

Re: How to tell encoding without open the file?

Guido Milanese-4

On Jan 25, 5:13 am, Tony Mechelynck <[hidden email]>
wrote:

> I tried it with one of my files, and when I asked for its encoding, with
>
>         file --mime-encoding ~/pub/index.htm
>
> it answered:
>
> pub/index.htm: text/html
>
> meaning, IIU the manpage correctly, that my file has an _encoding_ of
> text/html. (I intentionally did _not_ ask for the mime _type_).
>
> This file is the local copy of
> http://users.skynet.be/antoine.mechelynck/ which I know is in UTF-8 --
> but file didn't say it, which  meant it fails my first sanity test.

You are right (of course): my simplistic solution works only for plain
text files. With HTML, LaTeX and so on, use 'enca', as suggested by Mr
Lutrin <[hidden email]>.
Example using your homepage:

        enca -L none am-accueil.html
        Universal transformation format 8 bits; UTF-8

More compact answer:

        enca -m -L none am-accueil.html
        UTF-8

The switch "-L none" means: I am not interested in the language but
only in the encoding -- enca knows only eastern European languages and
Chinese: the author is  David Necas (Yeti) <[hidden email]>.

I think it is a good solution to the problem; much better than 'file',
it seems to me!
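
Since the original question was about a Vim script, here is a rough sketch of calling enca from one. It assumes enca is installed and on the PATH, and EncaEncoding() is a made-up name; the switches are the ones shown above.

```vim
" Sketch only: EncaEncoding() is a hypothetical helper name and assumes
" the external enca program is available.
function! EncaEncoding(fname)
  if !executable('enca')
    return ''
  endif
  " Same switches as in the compact example: -m -L none
  let out = system('enca -m -L none ' . shellescape(a:fname))
  if v:shell_error
    " enca failed or could not recognize the encoding.
    return ''
  endif
  " Strip the trailing newline from the command output.
  return substitute(out, '\n\+$', '', '')
endfunction
```

The result is whatever name enca prints, so a script would probably compare it case-insensitively, e.g. EncaEncoding(fname) ==? 'utf-8'.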

guido