Saturday, December 13 2003 @ 08:18 AM GMT
Contributed by: cocoa
Is it possible to auto-detect string encoding in a Cocoa program?
There is no specific routine to do that in Cocoa (Carbon has one; see the end of the example code). As a matter of good practice, the encoding of a text should be determined by a higher-level protocol outside the text itself (such as MIME types and others…). Scanning the text to determine its encoding can prove difficult and unreliable in the general case.
However, many standard text formats (HTML, XML, …) include the definition of the encoding used in their document structure. If you know what kind of text you have, you can parse the beginning of it to determine its encoding.
If you are writing a text editor and want to provide an auto-detecting file-reading routine, you can try to read the file in the most "high-level" encoding first and fall back to simpler ones on failure: try UTF-32 first; if that fails, fall back to UTF-16, then UTF-8, then ASCII, ...
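As an illustration of parsing the beginning of a document, here is a minimal sketch in plain C (usable from Objective-C) that looks for the encoding pseudo-attribute in an XML declaration. The function name xml_declared_encoding is hypothetical, and the sketch assumes the declaration is stored in an ASCII-compatible encoding so it can be read byte by byte; it does not handle UTF-16 or UTF-32 files, nor HTML meta tags.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: if buf starts with an XML declaration that contains
   encoding="...", copy the declared name into out and return 1.
   Return 0 when no declaration or no encoding name is found. */
static int xml_declared_encoding(const char *buf, size_t len,
                                 char *out, size_t outSize)
{
    static const char key[] = "encoding";
    const size_t keyLen = sizeof(key) - 1;
    size_t i, end, n;
    char quote;

    if (len < 5 || outSize == 0 || memcmp(buf, "<?xml", 5) != 0)
        return 0;

    /* Locate the "?>" that closes the declaration. */
    for (end = 5; end + 1 < len; end++)
        if (buf[end] == '?' && buf[end + 1] == '>')
            break;
    if (end + 1 >= len)
        return 0;

    /* Search for the keyword inside the declaration only. */
    for (i = 5; i + keyLen < end; i++) {
        if (memcmp(buf + i, key, keyLen) != 0)
            continue;
        i += keyLen;
        while (i < end && (buf[i] == ' ' || buf[i] == '\t')) i++;
        if (i >= end || buf[i] != '=') return 0;
        i++;
        while (i < end && (buf[i] == ' ' || buf[i] == '\t')) i++;
        if (i >= end || (buf[i] != '"' && buf[i] != '\'')) return 0;
        quote = buf[i++];
        /* Copy the value up to the matching quote. */
        for (n = 0; i < end && buf[i] != quote; i++) {
            if (n + 1 >= outSize) return 0;   /* name too long for out */
            out[n++] = buf[i];
        }
        if (i >= end) return 0;               /* unterminated value */
        out[n] = '\0';
        return n > 0;
    }
    return 0;
}
```

The returned name (for example "ISO-8859-1") still has to be mapped to an NSStringEncoding value by the caller; that mapping is left out of this sketch.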
A sequence of
NSString *myString = nil ;
// candidate encodings, tried in order until one succeeds
NSStringEncoding myEncodingToTest[] = { ...put here the encoding you want to test...};
int i, howManyEncodings = sizeof(myEncodingToTest) / sizeof(NSStringEncoding) ;
for(i=0; (i< howManyEncodings) && (myString == nil) ; i++) {
    NS_DURING
        // initWithData:encoding: returns nil if theData is not valid in this encoding
        myString = [[NSString alloc] initWithData:theData encoding:myEncodingToTest[i]] ;
    NS_HANDLER
        // ignore any exception and move on to the next encoding
        myString = nil ;
    NS_ENDHANDLER
}
if (myString == nil) {
    // failed to parse the string in a "valid" encoding
    return NO ;
}
// success
return YES ;
will be a simplistic way of solving the problem. Once you are down to the ASCII level, you may still need help from the user, because nothing tells you whether the file was created with a Windows, ISO or Mac encoding, etc. Line endings may give a hint, but even that is not 100% reliable...
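To make the line-ending hint concrete, here is a small sketch in plain C that counts the three common line-ending styles in a byte buffer. The type LineEndingCounts and the function count_line_endings are hypothetical names. The association of CRLF with Windows, lone LF with Unix and lone CR with classic Mac OS is only a convention, so the counts are a clue and not a proof of the encoding.

```c
#include <stddef.h>

/* Hypothetical result type: how often each line-ending style occurs. */
typedef struct {
    size_t crlf;  /* "\r\n", conventionally DOS/Windows */
    size_t lf;    /* lone "\n", conventionally Unix */
    size_t cr;    /* lone "\r", conventionally classic Mac OS */
} LineEndingCounts;

static LineEndingCounts count_line_endings(const unsigned char *bytes, size_t len)
{
    LineEndingCounts c = {0, 0, 0};
    size_t i;

    for (i = 0; i < len; i++) {
        if (bytes[i] == '\r') {
            if (i + 1 < len && bytes[i + 1] == '\n') {
                c.crlf++;
                i++;          /* skip the '\n' of the pair */
            } else {
                c.cr++;
            }
        } else if (bytes[i] == '\n') {
            c.lf++;
        }
    }
    return c;
}
```

A caller could, for instance, try a Windows code page first when crlf dominates and a Mac encoding first when cr dominates, and still let the user override the guess.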
A more customizable version of the same principle is possible. You may adapt it to fit your needs, in particular by deciding which characters below ' ' are acceptable and whether 0x7F (often used as the DEL character) is valid in a string.
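As a rough sketch of such a customizable check (this is not the original example code, and all names here are hypothetical), the following plain C function scans a byte buffer and rejects it when it contains a control character that the chosen policy does not allow. Bytes of 0x80 and above are accepted without further inspection, which is an assumption suited to 8-bit encodings only.

```c
#include <stddef.h>

/* Hypothetical policy: which control bytes count as acceptable text. */
typedef struct {
    int allowTab;       /* 0x09 */
    int allowNewlines;  /* 0x0A and 0x0D */
    int allowFormFeed;  /* 0x0C */
    int allowDel;       /* 0x7F */
} ControlCharPolicy;

/* Return 1 if every byte is acceptable under the policy, 0 otherwise. */
static int looks_like_text(const unsigned char *bytes, size_t len,
                           const ControlCharPolicy *p)
{
    size_t i;

    for (i = 0; i < len; i++) {
        unsigned char b = bytes[i];

        if (b >= ' ' && b != 0x7F) continue;   /* printable or high byte */
        if (b == '\t' && p->allowTab) continue;
        if ((b == '\n' || b == '\r') && p->allowNewlines) continue;
        if (b == '\f' && p->allowFormFeed) continue;
        if (b == 0x7F && p->allowDel) continue;
        return 0;                               /* disallowed control byte */
    }
    return 1;
}
```

With a policy such as { 1, 1, 0, 0 }, tabs and line endings pass while NUL bytes, other control characters and DEL cause the buffer to be rejected.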
Text examples in various encoding
W3C about XML encoding detection
Charguess for Ruby