Hi There,
I have some code which is trying to extract text from a PDF document as such:
for ll_page = 1 to QuickPDFPageCount(il_quickpdf_instance)
    ls_text = ls_text + QuickPDFExtractFilePageText(il_quickpdf_instance, ls_filename, "", ll_page, 7)
next
This works, and I really like how ExtractOption = 7 preserves the formatting of the text in the PDF. After scrutinizing the result, though, I noticed a problem. For documents that contain telephone numbers such as "555-1234", ExtractOption = 7 excludes the phone number altogether. I soon realized it has nothing to do with it being a phone number; rather, the hyphen causes the entire word (or phone number) to be removed from the extracted text. Here is a snippet of the text that is extracted:
lf you have any difficulties or questions, please call the Teleplan Support Centre at
Here is a snippet of the text I'd expect:
lf you have any difficulties or questions, please call the Teleplan Support Centre at 1-800-663-7206 or (250) 952-2668 (Victoria).
Digging even further, I've realized that it's not the hyphen's fault either; this is an ANSI vs. Unicode issue. The 'hyphen' isn't actually a hyphen but an en dash (U+2013), which is a Unicode character and not ANSI. It seems that the entire word is removed if it contains a Unicode character.
This is inconsistent, because if I change my code to use ExtractOption = 0, it has no problem with the Unicode character: it simply discards that character, resulting in text that looks like this:
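For anyone who wants to check their own documents for the same thing, here is a rough sketch of how the two dash characters can be told apart. It is written in Python rather than PowerScript purely for illustration, it does not use Quick PDF at all, and the helper name describe_dashes is made up for this post:

```python
import unicodedata


def describe_dashes(text):
    """Return (character, code point, Unicode name) for each dash-like character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for ch in text
        if unicodedata.category(ch) == "Pd"  # Pd = dash punctuation
    ]


# A phone number typed with a plain keyboard hyphen
print(describe_dashes("555-1234"))

# The same number with an en dash, which looks almost identical on screen
print(describe_dashes("555\u20131234"))
```

The two strings look nearly the same when printed, but the first contains HYPHEN-MINUS (U+002D) and the second contains EN DASH (U+2013).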
lf you have any difficulties or questions, please call the Teleplan Support Centre at 18006637206 or (250) 9522668 (Victoria).
To me, this scenario is much more desirable than the previous scenario; however, there is clearly an inconsistency with how this is working.
Is there anything I can do to make it so that I can use ExtractOption = 7 and have it discard the Unicode characters (as is done for ExtractOption = 0) rather than discarding the entire word?
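To make the difference concrete, here is a small sketch of the two behaviours I am describing. It is Python rather than PowerScript, it only imitates the behaviour on an already extracted string, and it says nothing about how Quick PDF works internally. It also assumes that "ANSI" effectively means plain ASCII here, which is my guess. The function names and the sample string are made up for illustration:

```python
def drop_chars(text):
    """Remove only the non-ASCII characters and keep the rest of each word."""
    return "".join(ch for ch in text if ord(ch) < 128)


def drop_words(text):
    """Remove every space-delimited word that contains a non-ASCII character."""
    return " ".join(word for word in text.split(" ") if word.isascii())


# Sample text using en dashes (U+2013) instead of plain hyphens
sample = "call 1\u2013800\u2013663\u20137206 or (250) 952\u20132668 (Victoria)."

print(drop_chars(sample))  # the behaviour I would like to get
print(drop_words(sample))  # roughly the behaviour I want to avoid
```

The first function is the behaviour I am hoping for with ExtractOption = 7. The second is only an approximation of what I am seeing, since my real output loses even more of the line than this does.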
Thanks in advance for any help that someone might be able to provide. -John