Debenu Quick PDF Library - PDF SDK Community Forum : ExtractFilePageText Inconsistencies (ANSI/Unicode)

http://www.quickpdf.org/forum/extractfilepagetext-inconsistencies-ansi-unicode_topic2293_post9743.html#9743
Author: aitchisj
Subject: 2293
Posted: 07 Jun 12 at 5:12PM

Andrew,

I appreciate the quick response and hope that this will be resolved in a future release of QPL.

Have a great day,

John

http://www.quickpdf.org/forum/extractfilepagetext-inconsistencies-ansi-unicode_topic2293_post9739.html#9739
Author: AndrewC
Subject: 2293
Posted: 07 Jun 12 at 2:08PM

There will be some fixes in the 8.16 beta 3 release to improve this.

The PDF was using a composite font and the hyphen character was not defined in the PDF font. It will now be replaced with a space character.

Options 0, 1 and 2 use a totally different method for text extraction than options 3 to 8.

Andrew.

http://www.quickpdf.org/forum/extractfilepagetext-inconsistencies-ansi-unicode_topic2293_post9725.html#9725
Author: aitchisj
Subject: 2293
Posted: 05 Jun 12 at 11:37PM

Hi There,

I have some code which is trying to extract text from a PDF document as such:

for ll_page = 1 to QuickPDFPageCount(il_quickpdf_instance)
    // ExtractOption 7 keeps the layout of the text on the page
    ls_text = ls_text + QuickPDFExtractFilePageText(il_quickpdf_instance, ls_filename, "", ll_page, 7)
next

This is working and I really like how ExtractOption = 7 is able to preserve the formatting of text in the PDF. After scrutinizing the result, I realize there is a bit of a problem. For documents which contain telephone numbers that look something like "555-1234", using ExtractOption = 7 ends up excluding the phone number altogether. I soon realized it has nothing to do with it being a phone number, but rather the hyphen is the problem and causes the entire word (or phone number) to be removed from the extracted text. Here is a snippet of the text that is extracted:

lf you have any difficulties or questions, please call the Teleplan Support Centre at
or (250) (Victoria).

Here is a snippet of the text I'd expect:

lf you have any difficulties or questions, please call the Teleplan Support Centre at
1-800-663-7206 or (250) 952-2668 (Victoria).

Digging even further, I've realized that it's not the hyphen's fault either; this is an ANSI vs. Unicode issue. The 'hyphen' isn't actually a hyphen, it's an en dash character, which is Unicode and not ANSI. It seems that the entire word is being removed if it contains a Unicode character.
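One way to confirm which characters in a piece of text fall outside a given encoding is to try encoding each character and report the ones that fail. The sketch below is only an illustration in Python, not part of Quick PDF Library; the function name `chars_outside` and the sample string are made up for the example. Note that whether the en dash counts as "ANSI" depends on the code page meant: it is outside ASCII and Latin-1, but Windows-1252 does define it.

```python
import unicodedata

def chars_outside(text, encoding="ascii"):
    """List each distinct character of `text` that `encoding` cannot represent.

    Hypothetical helper for illustration; returns tuples of
    (character, code point, Unicode name) in order of first appearance.
    """
    found = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            entry = (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
            if entry not in found:
                found.append(entry)
    return found

# Sample phone number written with en dashes (U+2013) instead of hyphens.
sample = "1\u2013800\u2013663\u20137206"
print(chars_outside(sample))            # en dash is reported for ASCII
print(chars_outside(sample, "cp1252"))  # empty list: Windows-1252 has an en dash
print(chars_outside("555-1234"))        # empty list: a real hyphen is plain ASCII
```

Running a check like this on text copied from the PDF shows quickly whether a "hyphen" is really U+002D or some other dash.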

This is inconsistent because if I change my code to use ExtractOption = 0, it has no problem dealing with the Unicode character and simply discards it, resulting in text that looks like this:

lf you have any difficulties or questions, please call the Teleplan Support Centre at
18006637206 or (250) 9522668 (Victoria).

To me, this scenario is much more desirable than the previous scenario; however, there is clearly an inconsistency with how this is working.
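The difference between the two results can be modelled as dropping the offending character versus dropping the whole word that contains it. The Python sketch below is only a model of the observed output under that assumption, not a description of how Quick PDF Library works internally; the names `drop_chars` and `drop_words` are hypothetical, and ASCII is assumed as the target character set.

```python
def _encodable(ch, encoding="ascii"):
    """True if `ch` can be represented in `encoding`."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

def drop_chars(text, encoding="ascii"):
    """Remove only the characters that cannot be encoded (like the option 0 result)."""
    return "".join(ch for ch in text if _encodable(ch, encoding))

def drop_words(text, encoding="ascii"):
    """Remove every space-separated word containing an unencodable character
    (like the option 7 result)."""
    words = text.split(" ")
    kept = [w for w in words if all(_encodable(ch, encoding) for ch in w)]
    return " ".join(kept)

# Second line of the expected text, with en dashes (U+2013) in the phone numbers.
line = "1\u2013800\u2013663\u20137206 or (250) 952\u20132668 (Victoria)."

print(drop_chars(line))  # 18006637206 or (250) 9522668 (Victoria).
print(drop_words(line))  # or (250) (Victoria).
```

Both printed lines match the snippets quoted above, which is consistent with the en dash being the only character involved.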

Is there anything I can do to make it so that I can use ExtractOption = 7 and have it discard the Unicode characters (as is done for ExtractOption = 0) rather than discarding the entire word?

Thanks in advance for any help that someone might be able to provide.

-John
