Changing source files encoding and some fun with PowerShell

One day I was asked to assist in creating a PDF document with all the source code of one of our libraries. This weird task was needed for patenting our product. It's possible to do it manually, of course, but that's humiliating for a dev :). Obviously it can easily be automated in many ways. First of all I'll describe the context - the library is ...

The first issue I encountered was that the solution contained files in different encodings: some files were in an ASCII-based encoding (windows-1251), while others were in UTF-8. So the first idea was to normalize all files to UTF-8, even though the comments themselves are not needed for this task. But besides comments, the sources contain strings with national letters, so the encoding is important anyway. By "important" I mean that it has to be known in order to read and process a file. In this post I'll talk about converting encodings, and in the next one about generating Word/PDF files.

There are three approaches to detecting an encoding:

- Use the byte order mark (BOM). This is a naive approach that only distinguishes Unicode from ASCII, and in practice it doesn't work, since it's common practice not to put a BOM in UTF-8 files.
- Use a platform API; on Windows, for example, there is the MLang COM component (mlang.dll).
- Try to detect the encoding by heuristics on our own.

I found a nice lib/tool wrapper for MLang here. Next I found this C# port of Mozilla Universal Charset Detector. In some cases it didn't see the difference between windows-1251 and mac-Cyrillic content. The PowerShell sketches below show how these pieces fit together.
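First, the BOM check. This is a minimal sketch of the first approach from the list above; the function name and the sample path are mine, not from any library. As noted, it can only classify files that actually start with a BOM - everything else falls through.

```powershell
function Get-EncodingFromBom {
    param([string]$Path)

    # Read at most the first 4 bytes - enough for any BOM.
    $stream = [System.IO.File]::OpenRead($Path)
    try {
        $buf  = New-Object byte[] 4
        $read = $stream.Read($buf, 0, 4)
    }
    finally {
        $stream.Dispose()
    }

    if ($read -ge 3 -and $buf[0] -eq 0xEF -and $buf[1] -eq 0xBB -and $buf[2] -eq 0xBF) {
        return 'utf-8'
    }
    if ($read -ge 2 -and $buf[0] -eq 0xFE -and $buf[1] -eq 0xFF) {
        return 'utf-16be'
    }
    if ($read -ge 2 -and $buf[0] -eq 0xFF -and $buf[1] -eq 0xFE) {
        # FF FE 00 00 is the UTF-32 LE BOM; plain FF FE is UTF-16 LE.
        if ($read -ge 4 -and $buf[2] -eq 0x00 -and $buf[3] -eq 0x00) { return 'utf-32le' }
        return 'utf-16le'
    }

    # No BOM: could be ASCII, windows-1251, BOM-less UTF-8, anything.
    return $null
}

Get-EncodingFromBom 'C:\src\Program.cs'
```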
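The heuristic route can be driven from PowerShell through the C# port of the Mozilla Universal Charset Detector (the Ude library). The sketch below assumes you have a copy of Ude.dll and that its `CharsetDetector` API (`Feed`/`DataEnd`/`Charset`/`Confidence`) is available; the DLL path and file path are placeholders.

```powershell
# Load the Ude library (C# port of Mozilla Universal Charset Detector).
# The path is a placeholder - point it at your copy of Ude.dll.
Add-Type -Path 'C:\tools\Ude.dll'

function Get-FileCharset {
    param([string]$Path)

    $bytes = [System.IO.File]::ReadAllBytes($Path)

    $detector = New-Object Ude.CharsetDetector
    $detector.Feed($bytes, 0, $bytes.Length)
    $detector.DataEnd()

    # Charset is $null when the detector gives up; Confidence is in [0..1].
    [pscustomobject]@{
        Path       = $Path
        Charset    = $detector.Charset
        Confidence = $detector.Confidence
    }
}

Get-FileCharset 'C:\src\Program.cs'
```

Since, as mentioned above, the detector can confuse windows-1251 with mac-Cyrillic on some inputs, it's worth checking the `Confidence` value before trusting the result.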
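Finally, once the encoding of each file is known, the normalization to UTF-8 itself is short. A sketch, assuming (as in this post) that the non-UTF-8 files are all windows-1251; the directory and file filter are placeholders for your own solution layout.

```powershell
function Convert-FileToUtf8 {
    param(
        [string]$Path,
        [string]$SourceCharset = 'windows-1251'  # assumption taken from this post
    )

    # Decode with the source encoding, then rewrite the file as UTF-8.
    $source = [System.Text.Encoding]::GetEncoding($SourceCharset)
    $text   = [System.IO.File]::ReadAllText($Path, $source)

    # Encoding.UTF8 writes a BOM; pass New-Object System.Text.UTF8Encoding($false)
    # instead if BOM-less output is preferred.
    [System.IO.File]::WriteAllText($Path, $text, [System.Text.Encoding]::UTF8)
}

# Re-encode every C# file under the solution directory in place.
Get-ChildItem 'C:\src\MyLibrary' -Recurse -Filter *.cs |
    ForEach-Object { Convert-FileToUtf8 $_.FullName }
```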