(I'll preface this tip by saying that we're going to make a change in 2011 that will render it outdated pretty quickly, but in the meantime, this should help people who are having trouble extracting Unicode text.)
For all of the text extraction functions there is a sentence that often gets ignored:
"The result is encoded using UTF-8 in the Delphi and DLL editions of the library."
I've forgotten about this several times myself. The strings returned by the DLL and Delphi editions are UTF-8 strings, and as such they need to be decoded (or written out as UTF-8) before you will see the Unicode characters. The reason this issue is so easy to overlook is that if there are no Unicode characters in your string, then GetPageText will appear to function completely normally. If you're using Delphi, you can save the UTF-8 string to a text file, prefixed with a UTF-8 byte order mark so that editors detect the encoding, with some code like this:
var
  QP: TQuickPDF;
  S: AnsiString;
  FS: TFileStream;
  UTF8BOM: AnsiString;
begin
  QP := TQuickPDF.Create;
  try
    QP.UnlockKey(' license key here ');
    QP.LoadFromFile('license.pdf');
    // In the Delphi edition, GetPageText returns UTF-8 encoded bytes
    S := QP.GetPageText(0);
    FS := TFileStream.Create('license.txt', fmCreate);
    try
      // Write the UTF-8 byte order mark first so editors detect the encoding
      UTF8BOM := #$EF#$BB#$BF;
      FS.Write(UTF8BOM[1], Length(UTF8BOM));
      // Write the raw UTF-8 bytes unchanged
      if Length(S) > 0 then
        FS.Write(S[1], Length(S));
    finally
      FS.Free;
    end;
  finally
    QP.Free;
  end;
end;
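If you need the text in memory rather than in a file, here is a minimal sketch of decoding it with Delphi's built-in UTF8Decode function. The helper name PageTextToWide is made up for illustration, and the sketch assumes GetPageText hands back an AnsiString holding UTF-8 bytes, as in the example above. On Delphi 2009 and later the compiler may flag UTF8Decode as deprecated in favour of UTF8ToWideString or UTF8ToString.

```delphi
// Hypothetical helper (not part of Quick PDF Library): returns the text
// of the selected page as a UTF-16 WideString.
function PageTextToWide(QP: TQuickPDF; Options: Integer): WideString;
var
  Raw: AnsiString;
begin
  // Raw holds UTF-8 encoded bytes in the Delphi and DLL editions
  Raw := QP.GetPageText(Options);
  // UTF8Decode converts the UTF-8 bytes into a UTF-16 WideString
  Result := UTF8Decode(Raw);
end;
```

Calling PageTextToWide(QP, 0) after LoadFromFile should give a string you can assign to a Unicode-aware control or pass to a wide-string API.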
In a future version of Quick PDF (maybe 7.24) we will move away from 8-bit strings so that the Delphi, DLL and ActiveX editions all use 16-bit strings... this should help avoid a lot of the confusion.
Originally there were only 8-bit strings in QPL. The current situation is a compromise: most of the functions still use 8-bit strings, while some of the functions return Unicode strings. For the ActiveX edition these are UTF-16 strings. For the Delphi and DLL editions we had to keep compatibility, so the solution was to use UTF-8 strings.
Sorry for any confusion that this has caused any of you.
Cheers, - Rowan.