ExtractFilePageText Inconsistencies (ANSI/Unicode)
aitchisj
Beginner Joined: 01 Jun 12 Status: Offline Points: 6 |
Posted: 05 Jun 12 at 11:37PM |
Hi There,
I have some code that extracts text from a PDF document like this:

    for ll_page = 1 to QuickPDFPageCount(il_quickpdf_instance)
        ls_text = ls_text + QuickPDFExtractFilePageText(il_quickpdf_instance, ls_filename, "", ll_page, 7)
    next

This is working, and I really like how ExtractOption = 7 preserves the formatting of the text in the PDF. After scrutinizing the result, though, I realized there is a bit of a problem. For documents that contain telephone numbers that look something like "555-1234", using ExtractOption = 7 ends up excluding the phone number altogether. I soon realized it has nothing to do with it being a phone number; rather, the hyphen is the problem, and it causes the entire word (or phone number) to be removed from the extracted text.

Here is a snippet of the text that is extracted:
Here is a snippet of the text I'd expect:
Digging even further, I've realized that it's not the hyphen's fault either; this is an ANSI vs. Unicode issue. The 'hyphen' isn't actually a hyphen, it's an en dash character, which is Unicode and not ANSI. It seems that the entire word is removed if it contains a Unicode character. This is inconsistent, because if I change my code to use ExtractOption = 0, it has no problem dealing with the Unicode character: it discards just that character, resulting in text that looks like this:
To me, this scenario is much more desirable than the previous scenario; however, there is clearly an inconsistency in how this is working. Is there anything I can do so that I can use ExtractOption = 7 and have it discard the Unicode characters (as is done for ExtractOption = 0) rather than discarding the entire word?

Thanks in advance for any help that someone might be able to provide.

-John
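For readers who want to confirm which dash a piece of text actually contains, here is a minimal Python sketch. It is independent of Quick PDF Library; the helper name `describe_dashes` is hypothetical, and it relies only on Python's standard `unicodedata` module to report the code point and Unicode name of each dash-like character.

```python
import unicodedata


def describe_dashes(text):
    """List (character, code point, Unicode name) for every dash-like character.

    "Dash-like" here means the Unicode general category Pd (dash punctuation),
    which covers both the ASCII hyphen-minus and the en dash.
    """
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if unicodedata.category(ch) == "Pd"
    ]


if __name__ == "__main__":
    # The first string uses an ASCII hyphen-minus, the second an en dash.
    print(describe_dashes("555-1234"))
    print(describe_dashes("555\u20131234"))
```

Running a check like this over text copied from the PDF would show whether the separator is the ASCII hyphen-minus (U+002D) or the en dash (U+2013).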
AndrewC
Moderator Group Joined: 08 Dec 10 Location: Geelong, Aust Status: Offline Points: 841 |
There will be some fixes in the 8.16 beta 3 release to improve this.
The PDF was using a composite font, and the hyphen character was not defined in the PDF font. It will now be replaced with a space character.

Options 0, 1 and 2 use a totally different method for text extraction than options 3 - 8.

Andrew.
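The three behaviours discussed in this thread can be compared with a small Python sketch. It is only an illustration on plain strings, not the library's extraction code: the function names are hypothetical, and the set of "undefined" characters is supplied by the caller as an assumption standing in for characters missing from the PDF font.

```python
def drop_word(text, undefined):
    """Remove every space-separated word that contains an undefined character
    (the behaviour reported for ExtractOption = 7)."""
    words = text.split(" ")
    return " ".join(w for w in words if not any(c in undefined for c in w))


def drop_char(text, undefined):
    """Remove only the undefined characters themselves
    (the behaviour reported for ExtractOption = 0)."""
    return "".join(c for c in text if c not in undefined)


def replace_with_space(text, undefined):
    """Replace each undefined character with a space
    (the behaviour described for the upcoming fix)."""
    return "".join(" " if c in undefined else c for c in text)


if __name__ == "__main__":
    sample = "Call 555\u20131234 today"  # the separator is an en dash
    undefined = {"\u2013"}               # assumed missing from the font
    print(drop_word(sample, undefined))
    print(drop_char(sample, undefined))
    print(replace_with_space(sample, undefined))
```

With the sample string, dropping the word loses the phone number entirely, dropping the character joins the digits, and replacing with a space keeps both halves of the number as separate tokens.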
aitchisj
Beginner Joined: 01 Jun 12 Status: Offline Points: 6 |
Andrew,
I appreciate the quick response and hope that this will be resolved in a future release of QPL. Have a great day, John