General Discussion - Unicode text extraction?

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: General Discussion
Forum Description: Discussion board for Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=1237
Printed Date: 22 Jun 26 at 7:20PM

Topic: Unicode text extraction?

Posted By: phildick
Subject: Unicode text extraction?
Date Posted: 14 Oct 09 at 2:36PM

Welcome,

I need a PDF component for Delphi 2009 to extract text from PDF files. I installed and tested QuickPDF. I tried both the Delphi 2009 and ActiveX versions, and both extracted only ASCII text, without any international characters (Polish in my case).

I am a little disappointed, especially because there is a note "Full Unicode support" in the feature list ( http://www.quickpdflibrary.com/products/quickpdf/features.php ).

Is there any way I can extract full text with all characters?

Best regards,

Bartek

Replies:

Posted By: shimax
Date Posted: 15 Oct 09 at 12:41AM

Hello, Bartek

As discussed in

http://www.quickpdf.org/forum/problem-with-span-classhighlightrussia-spann-text_topic1183_post5464.html ,

it seems that Unicode text extraction does not work as well as expected.

In my case too, Japanese characters are not extracted at all.

I contacted support, but a week later I still have no answer other than a confirmation that they received my email. So I think either implementing Unicode support is a very difficult task for some reason, or they are too busy with other problems or with developing new features.

There seem to be many Unicode-related problems in QuickPDF, not only in text extraction but in other features as well. Regrettably, the claim of full Unicode support is not true, at least as of version 7.16.

Posted By: Wheeley
Date Posted: 15 Oct 09 at 1:35AM

The next release should have more support for unicode. I was told they are removing the function ToPDFUnicode. If they do that, then unicode support must be enhanced somehow.

Wheeley

Posted By: Michel_K17
Date Posted: 15 Oct 09 at 3:55AM

I have received the same assurances as well. They (Debenu) have been very good at addressing specific issues as we bring them up. On the unicode front, at least we can now save/merge PDF files with unicode characters in the path.

Support for unicode characters as part of the metadata is coming with the next beta (which is what I was waiting for). :-)

For text extraction, I don't know.

Michel


Posted By: Ingo
Date Posted: 15 Oct 09 at 9:03AM

Hi All!

QuickPDF is a very complete and extensive library,
and Unicode support has to touch nearly all of its modules.
So please be a bit patient. I'm pretty sure that it's only a matter of time ;-)

Cheers, Ingo

Posted By: phildick
Date Posted: 15 Oct 09 at 9:41AM

Hi Ingo,

Maybe it is, but I installed the demo modules in my Delphi 2009 (which is fully Unicode now), and all the string parameters are declared as AnsiString, not String. Even if that is for backward compatibility, which I completely understand, there could be a "wide string" version of every string routine, as was done in the Windows API years ago. BTW the last non-Unicode Windows OS was released in 2000 (Windows ME), so it's been almost ten years since.
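
To illustrate the general concern with ANSI-typed string parameters, here is a minimal sketch in plain Python (not Delphi, and not the QuickPDF API). It assumes a hypothetical situation where text is forced through a single-byte Windows code page on its way to the caller; whether that is what actually happens inside the library is not known from this thread. The sample strings and variable names are made up for the example.

```python
# Illustration only: plain Python, not Delphi and not the QuickPDF API.
# It shows what narrowing Unicode text to a single-byte "ANSI" code page does.

polish = "Zażółć gęślą jaźń"   # sample Polish text (hypothetical input)
japanese = "日本語"             # sample Japanese text (hypothetical input)

# A "wide" UTF-16 representation round-trips every character.
wide_bytes = polish.encode("utf-16-le")
wide_roundtrip = wide_bytes.decode("utf-16-le")

# cp1252 (Western European) cannot represent most Polish letters,
# so they are replaced with "?" when errors="replace" is used.
ansi_western = polish.encode("cp1252", errors="replace").decode("cp1252")

# cp1250 (Central European) does cover Polish ...
ansi_central = polish.encode("cp1250", errors="replace").decode("cp1250")

# ... but no single-byte code page covers Japanese.
ansi_japanese = japanese.encode("cp1250", errors="replace").decode("cp1250")

print(wide_roundtrip)
print(ansi_western)
print(ansi_central)
print(ansi_japanese)
```

The point of the sketch is only that a narrow string type makes the result depend on the active code page, while a wide string type does not. It does not explain why the WideString-based ActiveX version also returns ANSI-only text.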

Furthermore, I imported the ActiveX version in which all parameters are passed as WideString (so it should be fully Unicode), and it produced the same result as earlier - only ANSI characters in the extracted text.

Best regards,

Bartek