How to identify different encodings on files without the use of a BOM

I have a problem identifying the encoding of a file without a BOM, particularly when the file begins with non-ASCII characters.

I found the following two topics about how to identify encodings for files.

Currently, I have created a class to identify different encodings for files (e.g. UTF-8, UTF-16, UTF-32, and UTF-16 without a BOM), as follows:

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class UnicodeReader extends Reader {
    private static final int BOM_SIZE = 4;
    private final InputStreamReader reader;

    /**
     * Construct UnicodeReader
     * @param in Input stream.
     * @param defaultEncoding Default encoding to be used if BOM is not found,
     * or <code>null</code> to use system default encoding.
     * @throws IOException If an I/O error occurs.
     */
    public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
        byte[] bom = new byte[BOM_SIZE];
        String encoding;
        int unread;
        PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
        int n = pushbackStream.read(bom, 0, bom.length);

        // Read ahead four bytes and check for BOM marks. The four-byte
        // UTF-32 BOMs must be tested before the two-byte UTF-16 BOMs:
        // FF FE 00 00 starts with FF FE and would otherwise be
        // misdetected as UTF-16LE.
        if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00) && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else {
            // No BOM detected, but the stream could still be UTF-16: for
            // Latin-script text, every other byte of UTF-16 is 0x00.
            int found = 0;
            for (int i = 0; i < 4; i++) {
                if (bom[i] == (byte) 0x00) {
                    found++;
                }
            }

            if (found >= 2) {
                if (bom[0] == (byte) 0x00) {
                    encoding = "UTF-16BE";
                } else {
                    encoding = "UTF-16LE";
                }
            } else {
                encoding = defaultEncoding;
            }
            unread = n;
        }

        // Unread bytes if necessary and skip BOM marks.
        if (unread > 0) {
            pushbackStream.unread(bom, (n - unread), unread);
        } else if (unread < -1) {
            pushbackStream.unread(bom, 0, 0);
        }

        // Use the given encoding, or fall back to the platform default.
        if (encoding == null) {
            reader = new InputStreamReader(pushbackStream);
        } else {
            reader = new InputStreamReader(pushbackStream, encoding);
        }
    }

    public String getEncoding() {
        return reader.getEncoding();
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    public void close() throws IOException {
        reader.close();
    }
}

The above code works properly in all cases except when the file has no BOM and begins with non-ASCII characters. In that situation, the logic for checking whether the file might still be UTF-16 without a BOM does not work correctly, and the encoding falls back to UTF-8 by default.

Is there a way to detect the encoding of a file that has no BOM and begins with non-ASCII characters, especially a UTF-16 file with no BOM?

Thanks; any ideas would be appreciated.

heuristics... It's done by a lot of programs in a lot of cases (the Un*x file command being an amazing example). I've done it "manually" (re-inventing my own wheel which works fine) but nowadays I'd simply take "Stephen C"'s answer: re-use existing code already doing it. – SyntaxT3rr0r Apr 14 at 8:42
@SyntaxT3rr0r: Yeah, that's a good way to solve this problem. Because of our restrictions on introducing third-party libraries into products, I would prefer to use my own wheel and improve the code I provided. – Eason Wu Apr 15 at 1:26

3 Answers

Generally speaking, there is no way to know encoding for sure if it is not provided.

You may guess UTF-8 from the specific byte pattern in the text (a lead byte with its high bits set in a fixed pattern, followed by the matching number of continuation bytes), but it is still a guess.
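The byte pattern this answer alludes to can be checked mechanically. A minimal sketch (a hypothetical helper of my own, not a full validator; it ignores overlong sequences and the surrogate range):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Heuristic {
    // Returns true if every byte fits UTF-8's lead/continuation pattern:
    // 0xxxxxxx, 110xxxxx, 1110xxxx, or 11110xxx lead bytes, each followed
    // by the right number of 10xxxxxx continuation bytes.
    static boolean looksLikeUtf8(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;
            int trailing;
            if (b < 0x80) trailing = 0;                 // ASCII
            else if ((b & 0xE0) == 0xC0) trailing = 1;  // 2-byte sequence
            else if ((b & 0xF0) == 0xE0) trailing = 2;  // 3-byte sequence
            else if ((b & 0xF8) == 0xF0) trailing = 3;  // 4-byte sequence
            else return false;                          // invalid lead byte
            if (i + trailing >= data.length) return false; // truncated sequence
            for (int j = 1; j <= trailing; j++) {
                if ((data[i + j] & 0xC0) != 0x80) return false; // not 10xxxxxx
            }
            i += trailing + 1;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeUtf8("h\u00e9llo".getBytes(StandardCharsets.UTF_8)));
        System.out.println(looksLikeUtf8(new byte[]{(byte) 0xC0, 0x41}));
    }
}
```

As the answer says, a pass here only means the bytes are consistent with UTF-8, not that the file actually is UTF-8.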

UTF-16 is a hard one: the same stream can often be parsed successfully as either BE or LE, and both ways it will produce some characters (though the resulting text may be meaningless).

Some code out there uses statistical analysis to guess the encoding from the frequency of symbols, but that requires assumptions about the text (e.g. "this is Mongolian text") and frequency tables (which may not match the text). At the end of the day this remains just a guess, and cannot help in 100% of cases.
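In the same spirit, the OP's zero-byte count can be made somewhat more robust by looking at where the zero bytes fall rather than merely counting them. A rough sketch of my own (not from the thread; it only works for text that is mostly Latin script, exactly the case the original heuristic targets):

```java
import java.nio.charset.StandardCharsets;

public class Utf16Heuristic {
    // For Latin-script text encoded as UTF-16, roughly every other byte is
    // 0x00: at even offsets for big-endian, at odd offsets for little-endian.
    // Returns "UTF-16BE", "UTF-16LE", or null if no clear pattern emerges.
    static String guessUtf16ByteOrder(byte[] data) {
        int zerosEven = 0, zerosOdd = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == 0) {
                if (i % 2 == 0) zerosEven++;
                else zerosOdd++;
            }
        }
        int pairs = data.length / 2;
        // Require zeros in a clear majority of code units, on one side only.
        if (zerosEven > pairs / 2 && zerosEven > 2 * zerosOdd) return "UTF-16BE";
        if (zerosOdd > pairs / 2 && zerosOdd > 2 * zerosEven) return "UTF-16LE";
        return null;
    }

    public static void main(String[] args) {
        System.out.println(guessUtf16ByteOrder("Hello".getBytes(StandardCharsets.UTF_16BE)));
        System.out.println(guessUtf16ByteOrder("Hello".getBytes(StandardCharsets.UTF_16LE)));
    }
}
```

It still fails for non-Latin scripts, where UTF-16 code units rarely contain a zero byte, which is the limitation this answer is pointing at.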

As you said, the source file's encoding cannot be known for sure, since our product supports multiple languages. – Eason Wu Apr 14 at 4:33

The best approach is not to try to implement this yourself. Instead, use an existing library to do it; see Java : How to determine the correct charset encoding of a stream.

It should be noted that the best that can be done is to guess at the most likely encoding for the file. In the general case, it is impossible to be 100% sure that you've figured out the correct encoding; i.e. the encoding that was used when creating the file.


I would say these third-party libraries are also unable to identify encodings for the file I encountered [...] they could be improved to meet my requirement.

Alternatively, you could recognize that your requirement is exceedingly hard to meet ... and change it; e.g.

  • restrict yourself to a certain set of encodings,
  • insist that the person who provides / uploads the file correctly state what its encoding (or primary language) is, and/or
  • accept that your system is going to get it wrong a certain percent of the time, and provide the means whereby someone can correct incorrectly stated / guessed encodings.
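The first suggestion, restricting yourself to a known set of encodings, combines well with strict decoding: try each candidate with a CharsetDecoder that reports (rather than silently replaces) malformed input, and keep the first candidate that decodes cleanly. A sketch under that assumption (the candidate list and its ordering are hypothetical; note that ISO-8859-1 accepts any byte sequence, so it must come last):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class RestrictedGuesser {
    // Returns the first candidate charset that decodes the data without any
    // malformed or unmappable sequences, or null if none does.
    static String firstCleanDecode(byte[] data, String... candidates) {
        for (String name : candidates) {
            CharsetDecoder decoder = Charset.forName(name).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                decoder.decode(ByteBuffer.wrap(data));
                return name; // decoded cleanly
            } catch (CharacterCodingException e) {
                // this candidate rejected the bytes; try the next one
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9}; // "é" in ISO-8859-1, invalid as UTF-8
        System.out.println(firstCleanDecode(latin1, "UTF-8", "ISO-8859-1"));
    }
}
```

This is still a guess, as the answer stresses, but a wrong guess now at least fails loudly instead of producing replacement characters.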

Face the facts: this is a THEORETICALLY UNSOLVABLE problem.

I would say these third-party libraries are also unable to identify encodings for the file I encountered. Anyway, thanks for your information; they could be improved to meet my requirement. – Eason Wu Apr 15 at 1:15

If you are certain that it is a valid Unicode stream, it must be UTF-8 if it has no BOM (since a BOM is neither required nor recommended), and if it does have one, then you know what it is.

If it is just some random encoding, there is no way to know for certain. The best you can hope for is then to be wrong only sometimes, since it is impossible to guess correctly in all cases.

If you can limit the possibilities to a very small subset, it is possible to improve the odds of your guess being right.

The only reliable way is to require the provider to tell you what they are providing. If you want complete reliability, that is your only choice. If you do not require reliability, then you guess, but accept that you will sometimes guess wrong.

I have the feeling that you must be a Windows person, since the rest of us seldom have cause for BOMs in the first place. I know that I regularly deal with terabytes of text (on Macs, Linux, Solaris, and BSD systems), more than 99% of it UTF-8, and only twice have I come across BOM-laden text. I have heard Windows people get stuck with it all the time, though. If true, this may, or may not, make your choices easier.

Yeah, you are right. Since the files I'm going to handle are XML, I will try to read the encoding from the XML declaration (processing instruction) in the file and pass it to the InputStream object. If I cannot get the encoding from an XML file without a BOM, I'll take the default encoding (i.e. UTF-8) as its encoding. – Eason Wu Apr 15 at 1:13
@Eason: Then it’s easy, because XML has to have the encoding. Lucky you! – tchrist Apr 15 at 3:03
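The plan in the last comment, reading the encoding from the XML declaration, can be sketched like this (a rough sketch of my own, not code from the thread; it assumes the declaration, if present, is in an ASCII-compatible encoding, so a BOM/UTF-16 check as in the OP's class would still need to run first):

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlEncodingSniffer {
    private static final Pattern ENCODING_PATTERN =
            Pattern.compile("<\\?xml[^>]*encoding=[\"']([^\"']+)[\"']");

    // Looks for encoding="..." in the XML declaration at the start of the
    // document. Returns the declared name, or null if there is none.
    static String declaredEncoding(byte[] head) {
        // Decode the prolog as ISO-8859-1: every byte maps to one char, so
        // the ASCII of the declaration survives regardless of the real encoding.
        String prolog = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING_PATTERN.matcher(prolog);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root/>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(declaredEncoding(doc));
    }
}
```

When the method returns null, falling back to UTF-8 matches the XML specification, which defines UTF-8 as the default for documents with no BOM and no encoding declaration.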