Saving to RTF with unicode characters

Post by **AlecBergamini** » Mon Apr 30, 2012 6:57 pm

Using the demo program at ...\Help\Demos\DelphiUnicode\Editors\Editor 1, paste the following into the editor:
標準
勉強
不具合
変更
報告

and then save it as an RTF file. Then open the RTF file in another program that can read RTF. The result is that the file contains extra characters in between the correct characters.

Is there something I can do to properly save RTF files with Unicode characters?

Post by **Sergey Tkachenko** » Tue May 01, 2012 5:55 pm

I copied this text to this demo and saved to RTF.
The result is here: http://www.trichview.com/support/forumfiles/1.rtf
This file can be opened in MS Word or WordPad without problems.

However, your file may be different. Please send it to richviewgmailcom.

Post by **AlecBergamini** » Tue May 01, 2012 8:50 pm

Thanks for looking at this. I just sent you an email with the same title as this post. It has an RTF file with the problem characters attached, and the body of the email shows what was pasted.

Thanks
Alec Bergamini

Post by **Sergey Tkachenko** » Thu May 03, 2012 1:34 pm

I received it.
Well, the code generated by TRichView may be considered wrong: some RTF readers display it correctly, some do not.
Fix: open RVFuncs.pas, find the line

Code:

ansi := RVMakeRTFStr(ansi, SpecialCode, True, RTFControlsToCodes);

(in the latest version, this is line 1241) and change it to

Code:

ansi := RVMakeRTFStr(ansi, SpecialCode, False, RTFControlsToCodes);

Post by **AlecBergamini** » Thu May 03, 2012 7:11 pm

I tried this change in the source code, but our team members with Japanese Windows 7 and keyboards still have the problem.

One of our Japanese developers just sent me the following analysis.

>>>>>>>>>>>>>>>>>>>>>

The cause is the WideCharToMultiByte calls in the RVU_UnicodeToAnsi function in Components/WiredRed/wrUI/TRichView/RVUni.pas.
Usually, WideCharToMultiByte is called with the argument CodePage = CP_ACP (= 0).
But when CodePage is CP_ACP, the result depends on a system setting.
In particular, in a Japanese environment, the result is encoded as Shift_JIS.
Unfortunately, the second byte of a Shift_JIS character sometimes overlaps with the ASCII printable characters (0x20-0x7E).
Because TRichView writes out the bytes that match ASCII printable characters again, this problem occurs.

For example, try adding a line as follows:
--- C:/Users/kudoya/AppData/Local/Temp/star3799181419797552171.pas Wed May 02 15:59:13 2012
+++ C:/Components/WiredRed/wrUI/TRichView/RVUni.pas Wed May 02 15:56:58 2012
@@ -1465,12 +1465,13 @@
 function RVU_UnicodeToAnsi(CodePage: Cardinal; const s: TRVRawByteString): TRVAnsiString;
 var l: Integer;
     DefChar: Char;
     Flags: Integer;
     Len: Integer;
 begin
+  CodePage := 1252;
   if Length(s)=0 then begin
     Result := '';
     exit;
   end;
   RVCheckUni(Length(s));
   DefChar := RVDEFAULTCHARACTER;

It should work with this modification.
However, I do not think this modification is a good one.
I would like you to find a proper place to specify the code page.
<<<<<<<<<<<<<<<<<<<<<<

Based on this, we tried setting all of our Styles to use CharSet = ANSI_CHARSET instead of DEFAULT_CHARSET. This works and the problem goes away. I am a bit concerned, though, that not letting the system choose the charset based on the system settings may hurt us later. I must say that I don't really understand the implications of choosing ANSI_CHARSET.
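
The overlap described in the quoted analysis can be checked outside TRichView. Below is a minimal Python sketch (Python rather than Delphi only so that it runs without the component). It assumes that Python's `cp932` codec gives the same bytes as WideCharToMultiByte with the Japanese system code page for these characters; the helper name `ascii_trail_bytes` is made up for illustration.

```python
# Illustration only: not TRichView code. Assumes Python's "cp932" codec
# matches the Japanese system code page (Shift_JIS) for these characters.

SAMPLE = "標準勉強不具合変更報告"  # the text pasted in the first post


def ascii_trail_bytes(text, codec="cp932"):
    """Return (char, encoded bytes) pairs whose second byte is printable ASCII."""
    hits = []
    for ch in text:
        raw = ch.encode(codec)
        # Double-byte character whose trail byte falls in 0x20-0x7E.
        if len(raw) == 2 and 0x20 <= raw[1] <= 0x7E:
            hits.append((ch, raw))
    return hits


if __name__ == "__main__":
    for ch, raw in ascii_trail_bytes(SAMPLE):
        hex_bytes = " ".join("%02x" % b for b in raw)
        print(ch, hex_bytes, "-> trail byte is", repr(chr(raw[1])))
```

Running it lists the sample characters whose second byte would look like an ordinary Western character to a reader that does not treat the two bytes as one unit.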

Post by **Sergey Tkachenko** » Fri May 04, 2012 11:31 am

When saving a Unicode character to RTF, its Unicode code (independent of the code page/charset) and its non-Unicode alternative are written.
RTF readers that support Unicode just ignore this non-Unicode alternative and read only Unicode text. This alternative is saved only for RTF readers that do not understand Unicode.

It is absolutely ok that WideCharToMultiByte returns two bytes, and that some of these bytes are the same as western characters. A Unicode RTF reader just skips them; a non-Unicode reader must compose a character from these bytes.
The problem with the original RTF code was the following: one of these bytes was the same as the "bullet" character, and it was saved not by its code but using the \bullet keyword. Some RTF readers treated \bullet as a separate character, so they were not able to compose a 2-byte character from these bytes, and you could see some garbage characters in the result. After the change that I posted above, all non-Unicode alternatives are saved without using named keywords like \bullet, so the problem should be fixed.
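
To make the two forms concrete, here is a minimal Python sketch of the general RTF convention: `\uN` carries the Unicode code as a signed 16-bit decimal, `\ucN` says how many fallback bytes follow, and each fallback byte can be written as `\'hh`. This is not TRichView's actual output; the function name and the `use_bullet_keyword` switch are hypothetical and exist only to contrast a byte saved by its code with a byte saved as a named keyword.

```python
# Illustration only: a sketch of the RTF convention, not TRichView's actual output.

def rtf_unicode_char(ch, codec="cp932", use_bullet_keyword=False):
    """Encode one BMP character as \\uN followed by its non-Unicode alternative."""
    code = ord(ch)
    if code > 0xFFFF:
        raise ValueError("this sketch handles BMP characters only")
    if code > 32767:
        code -= 65536  # \uN takes a signed 16-bit decimal value
    fallback = ch.encode(codec)
    parts = []
    for b in fallback:
        if use_bullet_keyword and b == 0x95:
            # Named keyword instead of the byte code (the problematic form).
            parts.append("\\bullet ")
        else:
            # Byte written by its hex code.
            parts.append("\\'%02x" % b)
    # \ucN tells a Unicode-aware reader how many fallback bytes to skip.
    return "\\uc%d\\u%d" % (len(fallback), code) + "".join(parts)


if __name__ == "__main__":
    print(rtf_unicode_char("標"))
    print(rtf_unicode_char("標", use_bullet_keyword=True))
```

In the second line of output the first fallback byte is no longer a byte code, which is the situation where some readers fail to put the 2-byte character back together.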

In order to write a non-Unicode alternative for Unicode text, we need to convert each Unicode character to non-Unicode. To do this, we need to know which code page to convert to. TRichView determines the code page in the following way:
1) if this text style's Charset <> DEFAULT_CHARSET, the code page is determined from the Charset. For example, if Charset=SHIFTJIS_CHARSET, then CodePage=932 (Japanese). But for Unicode text in TRichView, it is likely that all text has DEFAULT_CHARSET.
2) in the case of the default charset, TRVStyle.DefCodePage is used. You can assign the code page you need to this property, so you do not need to modify TRichView code for this. The default value is CP_ACP, i.e. the default system code page. With CP_ACP, the result depends on the system where the RTF is created. But only for RTF readers that do not support Unicode! As I said, Unicode RTF readers just skip this non-Unicode content. Other readers will be able to read it correctly only if their system code page is the same as on the computer that saved the file.
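
The two rules above can be sketched as a small lookup. This is a Python illustration, not TRichView code: the function name `resolve_code_page` and the partial table are assumptions, while the charset and code page numbers are the standard Windows values.

```python
# Illustration only: a Python sketch of the two rules above, not TRichView code.
# Charset and code page numbers are the standard Windows values.

CP_ACP = 0             # "use the system's ANSI code page"
ANSI_CHARSET = 0
DEFAULT_CHARSET = 1
SHIFTJIS_CHARSET = 128
RUSSIAN_CHARSET = 204

# Partial table for the sketch; other charsets would need their own entries.
CHARSET_TO_CODE_PAGE = {
    ANSI_CHARSET: 1252,
    SHIFTJIS_CHARSET: 932,
    RUSSIAN_CHARSET: 1251,
}


def resolve_code_page(charset, def_code_page=CP_ACP):
    """Pick the code page used for the non-Unicode alternative."""
    if charset != DEFAULT_CHARSET:
        # Rule 1: an explicit charset decides the code page.
        return CHARSET_TO_CODE_PAGE[charset]
    # Rule 2: the default charset falls back to the style-wide default
    # (TRVStyle.DefCodePage in the post); CP_ACP means "system dependent".
    return def_code_page


if __name__ == "__main__":
    print(resolve_code_page(SHIFTJIS_CHARSET))
    print(resolve_code_page(DEFAULT_CHARSET))
    print(resolve_code_page(DEFAULT_CHARSET, def_code_page=1252))
```

Under this reading, switching every style to ANSI_CHARSET takes the first branch and always yields 1252, which would explain why that workaround behaves the same on every system.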

Can you send me an example of this problematic RTF?

PS: There is a very simple solution. You can simply exclude rvrtfDuplicateUnicode from RTFOptions, and only Unicode characters will be saved, without a non-Unicode alternative.