Tip for extracting unicode text from PDF files

Rowan (Moderator Group), posted 29 Nov 10 at 1:12PM

(I'll preface this tip by saying that we're going to make a change in 2011 that will render this tip outdated pretty quickly, but in the meantime, this should help people who are experiencing trouble extracting unicode text.)

For all of the text extraction functions there is a sentence that often gets ignored:

"The result is encoded using UTF-8 in the Delphi and DLL editions of the library."

I've forgotten about this several times myself. The strings returned by the DLL and Delphi editions are UTF-8 strings and as such need to be decoded before you will see the unicode characters. The reason this issue is so easy to overlook is that if there are no unicode characters in your string, then GetPageText will appear to function completely normally. If you're using Delphi, you can keep the unicode characters intact by writing the UTF-8 string to a text file with a UTF-8 byte order mark, using code like this:

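var
 QP: TQuickPDF;
 S: AnsiString;
 FS: TFileStream;
 UTF8BOM: AnsiString;
begin
 QP := TQuickPDF.Create;
 try
   QP.UnlockKey(' license key here ');
   QP.LoadFromFile('license.pdf');
   // In the Delphi edition the returned string holds UTF-8 encoded bytes
   S := QP.GetPageText(0);
   FS := TFileStream.Create('license.txt', fmCreate);
   try
     // Write the UTF-8 byte order mark so editors detect the encoding
     UTF8BOM := #$EF#$BB#$BF;
     FS.Write(UTF8BOM[1], Length(UTF8BOM));
     // Write the UTF-8 bytes unchanged; skip empty strings, where S[1] is invalid
     if Length(S) > 0 then
       FS.Write(S[1], Length(S));
   finally
     // Release the file handle even if a write fails
     FS.Free;
   end;
 finally
   QP.Free;
 end;
end;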
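
If you want the text in memory as a 16-bit string rather than in a file, a minimal sketch along these lines may help. It assumes the Delphi RTL function UTF8Decode is available in your compiler version. DecodePageText and SampleUtf8 are hypothetical names used only for illustration, and the sample bytes stand in for a real GetPageText result so that the program runs without the library.

```pascal
program DecodeUtf8Demo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: converts the UTF-8 bytes returned by the text
// extraction functions (Delphi and DLL editions) to a 16-bit string.
function DecodePageText(const Utf8Text: AnsiString): WideString;
begin
  Result := UTF8Decode(Utf8Text);
end;

// Stand-in for a GetPageText result: the UTF-8 bytes of "cafe" with an
// acute accent on the final letter (U+00E9 is encoded as $C3 $A9).
function SampleUtf8: AnsiString;
begin
  SetLength(Result, 5);
  Result[1] := 'c';
  Result[2] := 'a';
  Result[3] := 'f';
  Result[4] := AnsiChar($C3);
  Result[5] := AnsiChar($A9);
end;

var
  W: WideString;
begin
  W := DecodePageText(SampleUtf8);
  // The byte count and the character count differ once a non-ASCII
  // character is present, which is why plain ASCII text hides the issue.
  WriteLn('UTF-8 bytes: ', Length(SampleUtf8));
  WriteLn('Decoded characters: ', Length(W));
end.
```

In this sketch the five UTF-8 bytes decode to four characters. With real data you would pass the result of GetPageText to the helper instead of the sample string.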

In a future version of Quick PDF (maybe 7.24) we will move away from 8-bit strings so that the Delphi, DLL and ActiveX editions all use 16-bit strings... this should help avoid a lot of the confusion.

Originally there were only 8-bit strings in QPL. The current situation is a compromise: most of the functions still use 8-bit strings, and some of the functions return Unicode strings. For the ActiveX edition these are UTF-16 strings. But for the Delphi and DLL editions we had to keep compatibility, so the solution was to use UTF-8 strings.

Sorry for any confusion that this has caused any of you.

Cheers,
- Rowan.

hbarclay (Team Player, United States), posted 14 Dec 10 at 6:20PM

Originally posted by Rowan:

In a future version of Quick PDF (maybe 7.24) we will move away from 8-bit strings so that the Delphi, DLL and ActiveX editions all use 16-bit strings... this should help avoid a lot of the confusion.



I assume there will be overloaded functions for Delphi, so that existing code using AnsiStrings will continue to work and people can make the change to wide strings when they are ready.

Thanks
Harry

Rowan (Moderator Group), posted 15 Dec 10 at 1:38PM

Hi Harry,

Yes, we always try to maintain backwards compatibility, so we'll do our best not to make anyone's life difficult in upgrading to the new version when we make the change to wide strings.

Cheers,
- Rowan.