Tip for extracting unicode text from PDF files

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: General Discussion
Forum Description: Discussion board for Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=1660
Printed Date: 03 Nov 25 at 4:10AM

Posted By: Rowan
Date Posted: 29 Nov 10 at 1:12PM

(I'll preface this tip by saying that we're going to make a change in 2011 that will make it outdated pretty quickly, but in the meantime it should help people who are having trouble extracting Unicode text.)

For all of the text extraction functions there is a sentence that often gets ignored:

"The result is encoded using UTF-8 in the Delphi and DLL editions of the library."

I've forgotten about this several times myself. The strings returned by the DLL and Delphi editions are UTF-8 strings and as such need to be decoded before you will see the Unicode characters. The reason this issue is so easy to overlook is that if your string contains only ASCII characters, GetPageText will appear to function completely normally, because ASCII characters are encoded identically in UTF-8. If you're using Delphi, one way to handle the UTF-8 string is to write its bytes to a text file preceded by a UTF-8 byte order mark, so that editors decode the file correctly, with some code like this:

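var
  QP: TQuickPDF;
  S: AnsiString;        // UTF-8 encoded bytes returned by GetPageText
  FS: TFileStream;
  UTF8BOM: AnsiString;
begin
  QP := TQuickPDF.Create;
  try
    QP.UnlockKey(' license key here ');
    QP.LoadFromFile('license.pdf');
    S := QP.GetPageText(0);
    FS := TFileStream.Create('license.txt', fmCreate);
    try
      // Write the UTF-8 byte order mark first so editors detect the encoding
      UTF8BOM := #$EF#$BB#$BF;
      FS.Write(UTF8BOM[1], Length(UTF8BOM));
      // Then write the UTF-8 bytes unchanged
      if Length(S) > 0 then
        FS.Write(S[1], Length(S));
    finally
      FS.Free;  // release the file even if a write fails
    end;
  finally
    QP.Free;
  end;
end;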
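
If you need the text in memory rather than in a file, the UTF-8 bytes can be decoded into a wide string. The following is only a minimal sketch, not part of the library: the sample bytes are a hypothetical stand-in for whatever GetPageText returns, and it assumes the standard UTF8Decode function from Delphi's System unit (newer Delphi versions may prefer UTF8ToString instead).

```pascal
program DecodeUtf8Sketch;
{$APPTYPE CONSOLE}

var
  S: AnsiString;   // stand-in for the UTF-8 bytes GetPageText would return
  W: WideString;   // the decoded UTF-16 text
begin
  // Hypothetical sample data: 'caf' followed by U+00E9 encoded as UTF-8 (C3 A9)
  S := 'caf';
  S := S + AnsiChar($C3) + AnsiChar($A9);

  // Decode the UTF-8 bytes into a 16-bit string
  W := UTF8Decode(S);

  WriteLn('Bytes in UTF-8 string: ', Length(S));       // 5 bytes
  WriteLn('Characters after decoding: ', Length(W));   // 4 characters
end.
```

The two lengths differ because the accented character takes two bytes in UTF-8 but is a single character once decoded. With ASCII-only text the lengths would match, which is why the problem is easy to miss.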

In a future version of Quick PDF (maybe 7.24) we will move away from 8-bit strings so that the Delphi, DLL and ActiveX editions all use 16-bit strings... this should help avoid a lot of the confusion.

Originally there were only 8-bit strings in QPL. The current situation is a compromise: most of the functions still use 8-bit strings, and some of the functions return Unicode strings. For the ActiveX edition these are UTF-16 strings. But for the Delphi and DLL editions we had to keep compatibility, so the solution was to use UTF-8 strings.

Sorry for any confusion that this has caused any of you.

Cheers,

- Rowan.

Replies:

Posted By: hbarclay
Date Posted: 14 Dec 10 at 6:20PM

I assume there will be overloaded functions for Delphi so that existing code using AnsiStrings will continue to work and people can make the change to wide strings when they are ready.

Thanks
Harry

Posted By: Rowan
Date Posted: 15 Dec 10 at 1:38PM

Hi Harry,

Yes, we always try to maintain backwards compatibility, so we'll do our best not to make anyone's life difficult in upgrading to the new version when we make the change to wide strings.

Cheers,

- Rowan.