How to identify different encodings on files without the use of a BOM

I have a problem identifying the encoding of a file without a BOM, particularly when the file begins with non-ASCII characters.

I found the following two topics about how to identify encodings for files.

Currently, I have created a class to identify different encodings for files (e.g. UTF-8, UTF-16, UTF-32, and UTF-16 without a BOM), as follows:

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class UnicodeReader extends Reader {
    private static final int BOM_SIZE = 4;
    private final InputStreamReader reader;

    /**
     * Construct UnicodeReader
     * @param in Input stream.
     * @param defaultEncoding Default encoding to be used if BOM is not found,
     * or <code>null</code> to use system default encoding.
     * @throws IOException If an I/O error occurs.
     */
    public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
        byte[] bom = new byte[BOM_SIZE];
        String encoding;
        int unread;
        PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
        int n = pushbackStream.read(bom, 0, bom.length);

        // Read ahead four bytes and check for BOM marks. The four-byte
        // UTF-32 BOMs must be tested before the two-byte UTF-16 BOMs:
        // FF FE 00 00 starts with FF FE and would otherwise be
        // misdetected as UTF-16LE.
        if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00) && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else {
            // No BOM detected, but the stream could still be UTF-16: for
            // Latin-script text, every other byte of UTF-16 is 0x00.
            int found = 0;
            for (int i = 0; i < 4; i++) {
                if (bom[i] == (byte) 0x00) {
                    found++;
                }
            }

            if (found >= 2) {
                if (bom[0] == (byte) 0x00) {
                    encoding = "UTF-16BE";
                } else {
                    encoding = "UTF-16LE";
                }
            } else {
                encoding = defaultEncoding;
            }
            unread = n;
        }

        // Unread bytes if necessary and skip BOM marks.
        if (unread > 0) {
            pushbackStream.unread(bom, (n - unread), unread);
        } else if (unread < -1) {
            pushbackStream.unread(bom, 0, 0);
        }

        // Use the given encoding, or fall back to the platform default.
        if (encoding == null) {
            reader = new InputStreamReader(pushbackStream);
        } else {
            reader = new InputStreamReader(pushbackStream, encoding);
        }
    }

    public String getEncoding() {
        return reader.getEncoding();
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    public void close() throws IOException {
        reader.close();
    }
}

The above code works properly in all cases except when the file has no BOM and begins with non-ASCII characters. In that situation, the logic for checking whether the file might still be UTF-16 without a BOM does not work correctly, and the encoding falls back to UTF-8 by default.

Is there a way to detect the encoding of a file that has no BOM and begins with non-ASCII characters, especially a UTF-16 file with no BOM?

Thanks; any ideas would be appreciated.

heuristics... It's done by a lot of programs in a lot of cases (the Un*x file command being an amazing example). I've done it "manually" (re-inventing my own wheel which works fine) but nowadays I'd simply take "Stephen C"'s answer: re-use existing code already doing it. – SyntaxT3rr0r Apr 14 at 8:42
@SyntaxT3rr0r: Yeah, that's a good way to solve this problem. Because of our restrictions on introducing third-party libraries into products, I would prefer to use my own wheel and improve the code I provided. – Eason Wu Apr 15 at 1:26

3 Answers

Generally speaking, there is no way to know encoding for sure if it is not provided.

You may guess UTF-8 from the specific byte pattern in the text (a lead byte with its high bits set in a fixed pattern, followed by the matching number of continuation bytes), but it is still a guess.
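The byte pattern this answer alludes to can be checked mechanically. A minimal sketch (a hypothetical helper of my own, not a full validator; it ignores overlong sequences and the surrogate range):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Heuristic {
    // Returns true if every byte fits UTF-8's lead/continuation pattern:
    // 0xxxxxxx, 110xxxxx, 1110xxxx, or 11110xxx lead bytes, each followed
    // by the right number of 10xxxxxx continuation bytes.
    static boolean looksLikeUtf8(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;
            int trailing;
            if (b < 0x80) trailing = 0;                 // ASCII
            else if ((b & 0xE0) == 0xC0) trailing = 1;  // 2-byte sequence
            else if ((b & 0xF0) == 0xE0) trailing = 2;  // 3-byte sequence
            else if ((b & 0xF8) == 0xF0) trailing = 3;  // 4-byte sequence
            else return false;                          // invalid lead byte
            if (i + trailing >= data.length) return false; // truncated sequence
            for (int j = 1; j <= trailing; j++) {
                if ((data[i + j] & 0xC0) != 0x80) return false; // not 10xxxxxx
            }
            i += trailing + 1;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeUtf8("h\u00e9llo".getBytes(StandardCharsets.UTF_8)));
        System.out.println(looksLikeUtf8(new byte[]{(byte) 0xC0, 0x41}));
    }
}
```

As the answer says, a pass here only means the bytes are consistent with UTF-8, not that the file actually is UTF-8.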

UTF-16 is a hard one: the same stream can often be parsed successfully as either BE or LE, and both ways it will produce some characters (though the resulting text may be meaningless).

Some code out there uses statistical analysis to guess the encoding from the frequency of symbols, but that requires assumptions about the text (e.g. "this is Mongolian text") and frequency tables (which may not match the text). At the end of the day this remains just a guess, and cannot help in 100% of cases.
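In the same spirit, the OP's zero-byte count can be made somewhat more robust by looking at where the zero bytes fall rather than merely counting them. A rough sketch of my own (not from the thread; it only works for text that is mostly Latin script, exactly the case the original heuristic targets):

```java
import java.nio.charset.StandardCharsets;

public class Utf16Heuristic {
    // For Latin-script text encoded as UTF-16, roughly every other byte is
    // 0x00: at even offsets for big-endian, at odd offsets for little-endian.
    // Returns "UTF-16BE", "UTF-16LE", or null if no clear pattern emerges.
    static String guessUtf16ByteOrder(byte[] data) {
        int zerosEven = 0, zerosOdd = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == 0) {
                if (i % 2 == 0) zerosEven++;
                else zerosOdd++;
            }
        }
        int pairs = data.length / 2;
        // Require zeros in a clear majority of code units, on one side only.
        if (zerosEven > pairs / 2 && zerosEven > 2 * zerosOdd) return "UTF-16BE";
        if (zerosOdd > pairs / 2 && zerosOdd > 2 * zerosEven) return "UTF-16LE";
        return null;
    }

    public static void main(String[] args) {
        System.out.println(guessUtf16ByteOrder("Hello".getBytes(StandardCharsets.UTF_16BE)));
        System.out.println(guessUtf16ByteOrder("Hello".getBytes(StandardCharsets.UTF_16LE)));
    }
}
```

It still fails for non-Latin scripts, where UTF-16 code units rarely contain a zero byte, which is the limitation this answer is pointing at.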

As you said, the source file's encoding cannot be known for sure, since our product supports multiple languages. – Eason Wu Apr 14 at 4:33

The best approach is not to try to implement this yourself. Instead, use an existing library to do it; see Java : How to determine the correct charset encoding of a stream.

It should be noted that the best that can be done is to guess at the most likely encoding for the file. In the general case, it is impossible to be 100% sure that you've figured out the correct encoding; i.e. the encoding that was used when creating the file.


I would say these third-party libraries are also unable to identify encodings for the file I encountered [...] they could be improved to meet my requirement.

Alternatively, you could recognize that your requirement is exceedingly hard to meet ... and change it; e.g.

  • restrict yourself to a certain set of encodings,
  • insist that the person who provides / uploads the file correctly state what its encoding (or primary language) is, and/or
  • accept that your system is going to get it wrong a certain percent of the time, and provide the means whereby someone can correct incorrectly stated / guessed encodings.
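The first suggestion, restricting yourself to a known set of encodings, combines well with strict decoding: try each candidate with a CharsetDecoder that reports (rather than silently replaces) malformed input, and keep the first candidate that decodes cleanly. A sketch under that assumption (the candidate list and its ordering are hypothetical; note that ISO-8859-1 accepts any byte sequence, so it must come last):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class RestrictedGuesser {
    // Returns the first candidate charset that decodes the data without any
    // malformed or unmappable sequences, or null if none does.
    static String firstCleanDecode(byte[] data, String... candidates) {
        for (String name : candidates) {
            CharsetDecoder decoder = Charset.forName(name).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                decoder.decode(ByteBuffer.wrap(data));
                return name; // decoded cleanly
            } catch (CharacterCodingException e) {
                // this candidate rejected the bytes; try the next one
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9}; // "é" in ISO-8859-1, invalid as UTF-8
        System.out.println(firstCleanDecode(latin1, "UTF-8", "ISO-8859-1"));
    }
}
```

This is still a guess, as the answer stresses, but a wrong guess now at least fails loudly instead of producing replacement characters.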

Face the facts: this is a THEORETICALLY UNSOLVABLE problem.

I would say these third-party libraries are also unable to identify encodings for the file I encountered. Anyway, thanks for your information; they could be improved to meet my requirement. – Eason Wu Apr 15 at 1:15

If you are certain that it is a valid Unicode stream, it must be UTF-8 if it has no BOM (since a BOM is neither required nor recommended), and if it does have one, then you know what it is.

If it is just some random encoding, there is no way to know for certain. The best you can hope for is then to be wrong only sometimes, since it is impossible to guess correctly in all cases.

If you can limit the possibilities to a very small subset, it is possible to improve the odds of your guess being right.

The only reliable way is to require the provider to tell you what they are providing. If you want complete reliability, that is your only choice. If you do not require reliability, then you guess, but accept that you will sometimes guess wrong.

I have the feeling that you must be a Windows person, since the rest of us seldom have cause for BOMs in the first place. I know that I regularly deal with terabytes of text (on Macs, Linux, Solaris, and BSD systems), more than 99% of it UTF-8, and only twice have I come across BOM-laden text. I have heard Windows people get stuck with it all the time, though. If true, this may, or may not, make your choices easier.

Yeah, you are right. Since the files I'm going to handle are XML, I will try to read the encoding from the XML declaration (processing instruction) in the file and pass it to the InputStream object. If I cannot get the encoding from an XML file without a BOM, I'll take the default encoding (i.e. UTF-8) as its encoding. – Eason Wu Apr 15 at 1:13
@Eason: Then it’s easy, because XML has to have the encoding. Lucky you! – tchrist Apr 15 at 3:03
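The plan in the last comment, reading the encoding from the XML declaration, can be sketched like this (a rough sketch of my own, not code from the thread; it assumes the declaration, if present, is in an ASCII-compatible encoding, so a BOM/UTF-16 check as in the OP's class would still need to run first):

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlEncodingSniffer {
    private static final Pattern ENCODING_PATTERN =
            Pattern.compile("<\\?xml[^>]*encoding=[\"']([^\"']+)[\"']");

    // Looks for encoding="..." in the XML declaration at the start of the
    // document. Returns the declared name, or null if there is none.
    static String declaredEncoding(byte[] head) {
        // Decode the prolog as ISO-8859-1: every byte maps to one char, so
        // the ASCII of the declaration survives regardless of the real encoding.
        String prolog = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING_PATTERN.matcher(prolog);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root/>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(declaredEncoding(doc));
    }
}
```

When the method returns null, falling back to UTF-8 matches the XML specification, which defines UTF-8 as the default for documents with no BOM and no encoding declaration.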