<?xml version="1.0" encoding="utf-8" ?>
<?xml-stylesheet type="text/xsl" href="RSS_xslt_style.asp" version="1.0" ?>
<rss version="2.0" xmlns:WebWizForums="http://syndication.webwiz.co.uk/rss_namespace/">
 <channel>
  <title>Debenu Quick PDF Library - PDF SDK Community Forum : ExtractFilePageText Inconsistencies (ANSI/Unicode)</title>
  <link>http://www.quickpdf.org/forum/</link>
  <description><![CDATA[This is an XML content feed of: Debenu Quick PDF Library - PDF SDK Community Forum : I need help - I can help : ExtractFilePageText Inconsistencies (ANSI/Unicode)]]></description>
  <copyright>Copyright (c) 2006-2013 Web Wiz Forums - All Rights Reserved.</copyright>
  <pubDate>Mon, 11 May 2026 21:36:07 +0000</pubDate>
  <lastBuildDate>Thu, 07 Jun 2012 17:12:04 +0000</lastBuildDate>
  <docs>http://blogs.law.harvard.edu/tech/rss</docs>
  <generator>Web Wiz Forums 11.01</generator>
  <ttl>360</ttl>
  <WebWizForums:feedURL>www.quickpdf.org/forum/RSS_post_feed.asp?TID=2293</WebWizForums:feedURL>
  <image>
   <title><![CDATA[Debenu Quick PDF Library - PDF SDK Community Forum]]></title>
   <url>http://www.quickpdf.org/forum/forum_images/QPDF_Forum_Title.png</url>
   <link>http://www.quickpdf.org/forum/</link>
  </image>
  <item>
   <title><![CDATA[ExtractFilePageText Inconsistencies (ANSI/Unicode) : Andrew,I appreciate the quick...]]></title>
   <link>http://www.quickpdf.org/forum/extractfilepagetext-inconsistencies-ansi-unicode_topic2293_post9743.html#9743</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=1957">aitchisj</a><br /><strong>Subject:</strong> 2293<br /><strong>Posted:</strong> 07 Jun 12 at 5:12PM<br /><br />Andrew,<div><br></div><div>I appreciate the quick response and hope that this will be resolved in a future release of QPL. &nbsp;</div><div>Have a great day,</div><div>John</div>]]>
   </description>
   <pubDate>Thu, 07 Jun 2012 17:12:04 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/extractfilepagetext-inconsistencies-ansi-unicode_topic2293_post9743.html#9743</guid>
  </item> 
  <item>
   <title><![CDATA[ExtractFilePageText Inconsistencies (ANSI/Unicode) : There will be some fixes in the...]]></title>
   <link>http://www.quickpdf.org/forum/extractfilepagetext-inconsistencies-ansi-unicode_topic2293_post9739.html#9739</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=1483">AndrewC</a><br /><strong>Subject:</strong> 2293<br /><strong>Posted:</strong> 07 Jun 12 at 2:08PM<br /><br />There will be some fixes in the 8.16 beta 3 release to improve this.<div><br></div><div>The PDF was using a composite font and the hyphen character was not defined in the PDF font. &nbsp;It will now be replaced with a space character.</div><div><br></div><div>Options 0, 1 and 2 use a totally different text extraction method than options 3 to 8.</div><div><br></div><div>Andrew.</div>]]>
   </description>
   <pubDate>Thu, 07 Jun 2012 14:08:34 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/extractfilepagetext-inconsistencies-ansi-unicode_topic2293_post9739.html#9739</guid>
  </item> 
  <item>
   <title><![CDATA[ExtractFilePageText Inconsistencies (ANSI/Unicode) : Hi There,I have some code which...]]></title>
   <link>http://www.quickpdf.org/forum/extractfilepagetext-inconsistencies-ansi-unicode_topic2293_post9725.html#9725</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=1957">aitchisj</a><br /><strong>Subject:</strong> 2293<br /><strong>Posted:</strong> 05 Jun 12 at 11:37PM<br /><br />Hi There,<div><br></div><div>I have some code that extracts text from a PDF document like this:</div><div><br></div><div><div>for ll_page = 1 to QuickPDFPageCount(il_quickpdf_instance)&nbsp;</div><div><span class="Apple-tab-span" style="white-space:pre">	</span>ls_text = ls_text + QuickPDFExtractFilePageText(il_quickpdf_instance,ls_filename,"",ll_page,7)</div><div>next</div></div><div><br></div><div>This is working and I really like how ExtractOption = 7 is able to preserve the formatting of text in the PDF. &nbsp;After scrutinizing the result, I realized there is a problem. &nbsp;For documents which contain telephone numbers that look something like "555-1234", using ExtractOption = 7 ends up excluding the phone number altogether. &nbsp;I soon realized it has nothing to do with it being a phone number; rather, the hyphen is the problem and causes the entire word (or phone number) to be removed from the extracted text. 
&nbsp;Here is a snippet of the text that is extracted:</div><div><br></div><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><div><div><i>lf you have any difficulties or questions, please call the Teleplan Support Centre at</i></div></div><div><div><i>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; or (250) &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (Victoria).</i></div></div></blockquote><div><br></div><div>Here is a snippet of the text I'd expect:</div><div><br></div><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><div><div><i>lf you have any difficulties or questions, please call the Teleplan Support Centre at</i></div></div><div><div><i>1-800-663-7206 or (250) 952-2668 (Victoria).</i></div></div></blockquote><div><br></div><div>Digging even further, I've realized that it's not the hyphen's fault either; this is an ANSI vs. Unicode issue. &nbsp;The 'hyphen' isn't actually a hyphen; it's an en dash character, which is Unicode and not ANSI. 
&nbsp;It seems that the entire word is being removed if it contains a Unicode character.</div><div><br></div><div>This is inconsistent, because if I change my code to use ExtractOption = 0, it has no problem dealing with the Unicode character and simply discards it, resulting in text that looks like this:</div><div><br></div><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><div><div><i>lf you have any difficulties or questions, please call the Teleplan Support Centre at</i></div></div><div><div><i>18006637206 or (250) 9522668 (Victoria).</i></div></div></blockquote><div><i><br></i></div><div>To me, this scenario is much more desirable than the previous one; however, there is clearly an inconsistency in how the two options behave.</div><div><br></div><div>Is there any way to use ExtractOption = 7 and have it discard the Unicode characters (as ExtractOption = 0 does) rather than discarding the entire word?</div><div><br></div><div>Thanks in advance for any help that someone might be able to provide.</div><div>-John</div><div><br></div>]]>
   </description>
   <pubDate>Tue, 05 Jun 2012 23:37:03 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/extractfilepagetext-inconsistencies-ansi-unicode_topic2293_post9725.html#9725</guid>
  </item> 
 </channel>
</rss>