
Tip for extracting unicode text from PDF files

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: General Discussion
Forum Description: Discussion board for Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=1660
Printed Date: 22 Nov 24 at 7:44PM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com


Topic: Tip for extracting unicode text from PDF files
Posted By: Rowan
Subject: Tip for extracting unicode text from PDF files
Date Posted: 29 Nov 10 at 1:12PM
(I'll preface this tip by saying that we're going to make a change in 2011 that will render it outdated fairly quickly, but in the meantime it should help people who are having trouble extracting Unicode text.)

For all of the text extraction functions there is a sentence that often gets ignored:

"The result is encoded using UTF-8 in the Delphi and DLL editions of the library."

I've forgotten about this several times myself. The strings returned by the DLL and Delphi editions are UTF-8 strings, so they need to be decoded before you will see the Unicode characters. The issue is easy to overlook because if your string contains no Unicode characters, GetPageText will appear to work completely normally. If you're using Delphi, you can handle the UTF-8 string with code like this, which writes it to a text file preceded by a UTF-8 byte order mark so that editors display the Unicode characters correctly:

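var
 QP: TQuickPDF;
 S: AnsiString;
 FS: TFileStream;
 UTF8BOM: AnsiString;
begin
 QP := TQuickPDF.Create;
 try
   QP.UnlockKey(' license key here ');
   QP.LoadFromFile('license.pdf');
   // In the Delphi edition the returned string holds UTF-8 encoded bytes
   S := QP.GetPageText(0);
   FS := TFileStream.Create('license.txt', fmCreate);
   try
     // Write the UTF-8 byte order mark so editors detect the encoding
     UTF8BOM := #$EF#$BB#$BF;
     FS.Write(UTF8BOM[1], Length(UTF8BOM));
     // Write the text bytes unchanged; they are already UTF-8
     if Length(S) > 0 then
       FS.Write(S[1], Length(S));
   finally
     // Release the stream even if a write fails
     FS.Free;
   end;
 finally
   QP.Free;
 end;
end;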
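
If you need the text in memory rather than in a file, something along these lines may help. This is only a minimal sketch, assuming the standard Delphi RTL function UTF8Decode is available in your compiler version (newer Delphi versions also offer UTF8ToString). The byte string below is a hand-built stand-in for what GetPageText returns, not real output from the library, and the program name is made up for illustration.

```pascal
program DecodeUtf8Sketch;
{$APPTYPE CONSOLE}

var
  Raw: AnsiString;      // stand-in for the UTF-8 bytes returned by GetPageText
  Decoded: WideString;  // the same text as 16-bit characters
begin
  // Build the UTF-8 bytes for 'cafe' with an accented e (U+00E9),
  // byte by byte, so the result does not depend on the source file encoding.
  SetLength(Raw, 5);
  Raw[1] := 'c';
  Raw[2] := 'a';
  Raw[3] := 'f';
  Raw[4] := AnsiChar($C3);  // first byte of U+00E9 in UTF-8
  Raw[5] := AnsiChar($A9);  // second byte of U+00E9 in UTF-8

  // Convert the UTF-8 bytes into a 16-bit string
  Decoded := UTF8Decode(Raw);

  Writeln(Length(Raw));      // 5 bytes of UTF-8
  Writeln(Length(Decoded));  // 4 characters after decoding
end.
```

In real code you would pass the result of GetPageText to UTF8Decode instead of the hand-built string. The difference between the two lengths shows why the undecoded string looks wrong as soon as it contains a non-ASCII character.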

In a future version of Quick PDF (maybe 7.24) we will move away from 8-bit strings so that the Delphi, DLL and ActiveX editions all use 16-bit strings... this should help avoid a lot of the confusion.

Originally there were only 8-bit strings in QPL. The current situation is a compromise: most of the functions still use 8-bit strings, while some of the functions return Unicode strings. For the ActiveX edition these are UTF-16 strings, but for the Delphi and DLL editions we had to keep compatibility, so the solution was to use UTF-8 strings.

Sorry for any confusion that this has caused any of you.

Cheers,
- Rowan.



Replies:
Posted By: hbarclay
Date Posted: 14 Dec 10 at 6:20PM
Originally posted by Rowan:

In a future version of Quick PDF (maybe 7.24) we will move away from 8-bit strings so that the Delphi, DLL and ActiveX editions all use 16-bit strings... this should help avoid a lot of the confusion.



I assume there will be overloaded functions for Delphi, so that existing code using AnsiStrings will continue to work and people can make the change to wide strings when they are ready.

Thanks
Harry



Posted By: Rowan
Date Posted: 15 Dec 10 at 1:38PM
Hi Harry,

Yes, we always try to maintain backwards compatibility, so we'll do our best not to make anyone's life difficult in upgrading to the new version when we make the change to wide strings.

Cheers,
- Rowan.


