<?xml version="1.0" encoding="utf-8" ?>
<?xml-stylesheet type="text/xsl" href="RSS_xslt_style.asp" version="1.0" ?>
<rss version="2.0" xmlns:WebWizForums="http://syndication.webwiz.co.uk/rss_namespace/">
 <channel>
  <title>Debenu Quick PDF Library - PDF SDK Community Forum : Bug when extracting text</title>
  <link>http://www.quickpdf.org/forum/</link>
  <description><![CDATA[This is an XML content feed of: Debenu Quick PDF Library - PDF SDK Community Forum : General Discussion : Bug when extracting text]]></description>
  <copyright>Copyright (c) 2006-2013 Web Wiz Forums - All Rights Reserved.</copyright>
  <pubDate>Mon, 04 May 2026 09:28:21 +0000</pubDate>
  <lastBuildDate>Wed, 12 Aug 2009 22:18:44 +0000</lastBuildDate>
  <docs>http://blogs.law.harvard.edu/tech/rss</docs>
  <generator>Web Wiz Forums 11.01</generator>
  <ttl>360</ttl>
  <WebWizForums:feedURL>www.quickpdf.org/forum/RSS_post_feed.asp?TID=1171</WebWizForums:feedURL>
  <image>
   <title><![CDATA[Debenu Quick PDF Library - PDF SDK Community Forum]]></title>
   <url>http://www.quickpdf.org/forum/forum_images/QPDF_Forum_Title.png</url>
   <link>http://www.quickpdf.org/forum/</link>
  </image>
  <item>
   <title><![CDATA[Bug when extracting text : Thanks for all your information...]]></title>
   <link>http://www.quickpdf.org/forum/bug-when-extracting-text_topic1171_post5413.html#5413</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=1137">AIM</a><br /><strong>Subject:</strong> 1171<br /><strong>Posted:</strong> 12 Aug 09 at 10:18PM<br /><br /><P>Thanks for all your information and suggestions. I think I now understand that the real "problem" in these 3 examples arises at PDF creation time.</P><P>It seems that I have to invest some time and implement a fully working text extraction myself...</P>]]>
   </description>
   <pubDate>Wed, 12 Aug 2009 22:18:44 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/bug-when-extracting-text_topic1171_post5413.html#5413</guid>
  </item> 
  <item>
   <title><![CDATA[Bug when extracting text : Martin,  Ingo is correct in...]]></title>
   <link>http://www.quickpdf.org/forum/bug-when-extracting-text_topic1171_post5412.html#5412</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=173">swb1</a><br /><strong>Subject:</strong> 1171<br /><strong>Posted:</strong> 12 Aug 09 at 9:21PM<br /><br /><P>Martin,</P><P>Ingo is correct in that this is not a bug but rather the nature of the way that the PDF is constructed. There is no rule that says text that appears to be one word when displayed by Acrobat or rendered by QuickPDF must actually be stored as one word inside the PDF. I have seen PDFs that were constructed one letter at a time! Each letter would appear as a single text element complete with location and font information. While this is an extremely inefficient way to construct a PDF, it works nonetheless and appears to be just fine from the outside when rendered by Acrobat.</P><P>The text extraction routines of QuickPDF do not re-assemble the words. These routines merely extract the text as it is stored in the document and tell you where it should appear on the page and how it should be formatted.</P><P>A text extraction routine that is smart enough to re-assemble the words and tell me their origins would be a terrific enhancement to the library, but as far as I know no such routine exists here today.</P><P>If Debenu does not add such a feature soon (hint, hint Karl ;-) I will probably have to write one of my own.</P><P>Best of luck to you,</P><P>Steve</P><span style="font-size:10px"><br /><br />Edited by swb1 - 12 Aug 09 at 9:31PM</span>]]>
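    <![CDATA[<P>A rough sketch of what such a re-assembly step might look like for the single-character case discussed in this thread, where one fragment ("Quick DF Library") contains a blank where another fragment ("P") belongs. This is not a QuickPDF routine: <code>fill_gap</code> and its parameters are hypothetical names, and it assumes the missing character shows up as a single space in the surrounding fragment, as in the output posted here. The position is only estimated proportionally from the x coordinates, so with proportional fonts it is a heuristic, not a guarantee.</P>

```python
from typing import Optional


def fill_gap(outer_text: str, outer_left: float, outer_right: float,
             inner_text: str, inner_left: float, window: int = 2) -> Optional[str]:
    """Place inner_text into the blank of outer_text that lies closest to its x position.

    Returns the merged string, or None when no blank is found near the estimate.
    """
    width = outer_right - outer_left
    if width <= 0 or not outer_text:
        return None
    # Estimate the character index from the relative horizontal offset.
    guess = round((inner_left - outer_left) / width * len(outer_text))
    # Only blanks within `window` characters of the estimate are considered.
    candidates = [i for i, ch in enumerate(outer_text)
                  if ch == " " and abs(i - guess) <= window]
    if not candidates:
        return None
    index = min(candidates, key=lambda c: abs(c - guess))
    # Replace the single blank with the embedded fragment.
    return outer_text[:index] + inner_text + outer_text[index + 1:]


if __name__ == "__main__":
    # Coordinates taken from the Option 3 output posted in this thread.
    print(fill_gap("Quick DF Library", 56.8, 147.392, "P", 86.2))
```

<P>The function deliberately returns None instead of guessing when no blank is close to the estimated position, so the caller can fall back to keeping the fragments separate.</P>]]>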
   </description>
   <pubDate>Wed, 12 Aug 2009 21:21:55 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/bug-when-extracting-text_topic1171_post5412.html#5412</guid>
  </item> 
  <item>
   <title><![CDATA[Bug when extracting text :   So the best way is to use option...]]></title>
   <link>http://www.quickpdf.org/forum/bug-when-extracting-text_topic1171_post5411.html#5411</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=1137">AIM</a><br /><strong>Subject:</strong> 1171<br /><strong>Posted:</strong> 12 Aug 09 at 8:40PM<br /><br /><P><table width="99%"><tr><td class="BBquote"><img src="forum_images/quote_box.png" title="Quote" alt="Quote" style="vertical-align: text-bottom;" /> So the best way is to use option 3 and concatenate the single strings together regarding the values for row and column. A pdf-page is created as a 842 x 595 matrix. These single points are called PSUnits.</td></tr></table></P><P>Do you have any code snippets or demos?&nbsp; <BR>But I think this is too complicated a way to get functionality like text extraction, which should be part of QuickPDF (where it nearly works).</P><P>I'm sorry, but I still believe that this is a bug in QuickPDF&nbsp;<img src="http://www.quickpdf.org/forum/smileys/smiley18.gif" height="17" width="17" border="0" alt="Ouch" title="Ouch" /> and if the "Ingo" example #2 behaved like examples #1 and #3, there wouldn't be a problem and everything would work perfectly.</P><P>JFYI, I tried your pdftext.dll and it has the same problem! Here is the output of your DLL&nbsp;<img src="http://www.quickpdf.org/forum/smileys/smiley9.gif" height="17" width="17" border="0" alt="Embarrassed" title="Embarrassed" /><BR>&nbsp;<BR><table width="99%"><tr><td><pre class="BBcode">page 1 / 1<BR>&nbsp;<BR>Quick DF Library.<BR>P<BR></pre></td></tr></table><BR></P>]]>
   </description>
   <pubDate>Wed, 12 Aug 2009 20:40:07 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/bug-when-extracting-text_topic1171_post5411.html#5411</guid>
  </item> 
  <item>
   <title><![CDATA[Bug when extracting text : Hi Martin! So the best way is to...]]></title>
   <link>http://www.quickpdf.org/forum/bug-when-extracting-text_topic1171_post5409.html#5409</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=111">Ingo</a><br /><strong>Subject:</strong> 1171<br /><strong>Posted:</strong> 12 Aug 09 at 7:16PM<br /><br />Hi Martin!<br><br>So the best way is to use option 3 and concatenate the single strings together regarding the values for row and column. A pdf-page is created as a 842 x 595 matrix. These single points are called PSUnits. The first PSUnit is at the bottom of the page on the left side. Each thing (pictures, text strings, ...) can be put on this page at any time. The coordinates inside the pdf say where the objects shall appear. Please keep in mind that below the surface of the pdf it doesn't look as nice as later in the pdf-reader ;-)<br><br>Cheers, Ingo<br>&nbsp;<br>]]>
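    <![CDATA[<P>A minimal Python sketch of the "concatenate by row and column" idea described above, not part of QuickPDF. It assumes each fragment has already been reduced to a hypothetical (left x, baseline y, text) tuple, treats fragments whose y values differ by no more than a tolerance as one row, and relies on the origin being at the bottom left (so larger y means higher on the page). The function name and the tolerance value are illustrative assumptions.</P>

```python
from typing import List, Tuple

# (left x, baseline y, text) in page units with the origin at the bottom-left.
Fragment = Tuple[float, float, str]


def reading_order(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Group fragments into rows by baseline, then join each row from left to right."""
    rows: List[List[Fragment]] = []
    # Largest y first, because the top of the page has the highest y value.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0][1] - frag[1]) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    lines = []
    for row in rows:
        ordered = sorted(row, key=lambda f: f[0])  # left to right within the row
        lines.append("".join(text for _, _, text in ordered))
    return lines


if __name__ == "__main__":
    # Fragments from the "PD" example in this thread, deliberately out of order.
    sample = [
        (101.5, 776.692, "F Library"),
        (56.8, 776.692, "Quick"),
        (86.2, 776.692, "PD"),
    ]
    print(reading_order(sample))
```

<P>Sorting alone only helps when the fragments do not overlap. A fragment that already contains a blank where another fragment belongs still needs separate handling.</P>]]>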
   </description>
   <pubDate>Wed, 12 Aug 2009 19:16:24 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/bug-when-extracting-text_topic1171_post5409.html#5409</guid>
  </item> 
  <item>
   <title><![CDATA[Bug when extracting text :   If you're inserting first...]]></title>
   <link>http://www.quickpdf.org/forum/bug-when-extracting-text_topic1171_post5407.html#5407</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=1137">AIM</a><br /><strong>Subject:</strong> 1171<br /><strong>Posted:</strong> 12 Aug 09 at 6:19PM<br /><br /><P><table width="99%"><tr><td class="BBquote"><img src="forum_images/quote_box.png" title="Quote" alt="Quote" style="vertical-align: text-bottom;" /> If you're inserting first "ngo" and then "I" ... the extraction will be first string "ngo" and second string "I".<BR>If you're inserting first "I" and then "ngo" ... the extraction will be first string "I" and second string "ngo".<BR>That's the way pdf-text-contents will be managed. This has nothing to do with QuickPDF.<BR>If you're writing a whole page with text and at the end you're inserting a single character at the top, left position... the extraction WITH OPTION 3 will extract this character as the very last string... First in first out ;-)</td></tr></table></P><P>OK, I understand your answer, but it doesn't explain the different behavior of QuickPDF for the 3 examples I gave (I always entered the text in Open Office and colored the letters afterwards, then I created the PDF). So either one or two of them are not handled correctly.</P><P><table width="99%"><tr><td class="BBquote"><img src="forum_images/quote_box.png" title="Quote" alt="Quote" style="vertical-align: text-bottom;" /> If you're using option 0 for example you can avoid this behavior. Option 0 concatenates the strings like they should be ...</td></tr></table></P><P>The other options seem to be a bit buggy; Option 3 always extracts the most text (except for this annoyance with single characters).</P><P>OK, back to the "<strong>Quick<FONT color=#ff0000>P</FONT>DF Library.</strong>" example.</P><P>Option 0 gives the following output:</P><P><table width="99%"><tr><td><pre class="BBcode">k&nbsp; .</pre></td></tr></table></P><P>Options 1 and 2 give the following output:</P><P><table width="99%"><tr><td><pre class="BBcode">56.80,774.10,#000000,12.00,"BAAAAA+TimesNewRomanPSMT","k"<BR>93.50,774.10,#000000,12.00,"BAAAAA+TimesNewRomanPSMT","."</pre></td></tr></table></P><P>Option 3 gives the following output:</P><P><table width="99%"><tr><td><pre class="BBcode">"BAAAAA+TimesNewRomanPSMT",#000000,12.00,56.8000,776.6920,149.9120,776.6920,149.9120,784.7920,56.8000,784.7920,"Quick DF Library."<BR>"CAAAAA+TimesNewRomanPS-BoldMT",#000000,12.00,86.2000,776.6920,93.5200,776.6920,93.5200,784.7920,86.2000,784.7920,"P"</pre></td></tr></table></P><P>Options 0, 1 and 2 are completely useless in that example. Option 4 would work here but didn't extract as much as Option 3 from several other PDFs I have tried (so not a real solution in my case).</P><P>So what would you suggest to fully extract these two words? Or is it impossible?</P><P>Thanks for any tips,<BR>Martin</P><span style="font-size:10px"><br /><br />Edited by AIM - 12 Aug 09 at 6:21PM</span>]]>
   </description>
   <pubDate>Wed, 12 Aug 2009 18:19:23 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/bug-when-extracting-text_topic1171_post5407.html#5407</guid>
  </item> 
  <item>
   <title><![CDATA[Bug when extracting text : Hi Martin! If you're inserting...]]></title>
   <link>http://www.quickpdf.org/forum/bug-when-extracting-text_topic1171_post5406.html#5406</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=111">Ingo</a><br /><strong>Subject:</strong> 1171<br /><strong>Posted:</strong> 12 Aug 09 at 1:39PM<br /><br />Hi Martin!<br><br>If you're inserting first "ngo" and then "I" ... the extraction will be first string "ngo" and second string "I".<br>If you're inserting first "I" and then "ngo" ... the extraction will be first string "I" and second string "ngo".<br>That's the way pdf-text-contents will be managed. This has nothing to do with QuickPDF.<br>If you're writing a whole page with text and at the end you're inserting a single character at the top, left position... the extraction WITH OPTION 3 will extract this character as the very last string... First in first out ;-)<br>If you're using option 0 for example you can avoid this behavior. Option 0 concatenates the strings like they should be ... so if you want to do a text search you shouldn't use option 3.<br><br>Cheers, Ingo<br><br>]]>
   </description>
   <pubDate>Wed, 12 Aug 2009 13:39:17 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/bug-when-extracting-text_topic1171_post5406.html#5406</guid>
  </item> 
  <item>
   <title><![CDATA[Bug when extracting text :   If i'm looking on your...]]></title>
   <link>http://www.quickpdf.org/forum/bug-when-extracting-text_topic1171_post5405.html#5405</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=1137">AIM</a><br /><strong>Subject:</strong> 1171<br /><strong>Posted:</strong> 12 Aug 09 at 12:53PM<br /><br /><P><table width="99%"><tr><td class="BBquote"><img src="forum_images/quote_box.png" title="Quote" alt="Quote" style="vertical-align: text-bottom;" /> If i'm looking on your sample it's like sorted by row and beginning column...<BR>and this would make sense ;-)<BR>Instead it's so: First in - first out ... Last in - last out... and it doesn't matter where's the position of a string.</td></tr></table></P><P>Ingo, I'm not sure if I fully understand your answer. But if you have for example "<strong>In<FONT color=#ff0000>g</FONT>o</strong>" in your PDF, you would get "<strong>In o</strong>" and "<strong>g</strong>". I don't know how or why this would make sense, e.g. if I want to search a PDF for "ingo".</P><P>In my opinion, "In" + "g" + "o" would be the only correct solution. At least, that is how it works when two or more letters are in red.</P><P>In the meantime I also saw that it happens only with single characters in the middle of a word, not at the beginning. </P><P>OK, let's use the following examples:</P><P>- "<strong><FONT color=#ff0000>I</FONT>ngo</strong>" extracts "I" + "ngo"&nbsp;..... OK<BR>- "<strong>In<FONT color=#ff0000>g</FONT>o</strong>" extracts "In o" + "g"&nbsp;..... error in my opinion<BR>- "<strong>I<FONT color=#ff0000>ng</FONT>o</strong>" extracts "I" + "ng" + "o"&nbsp;..... OK</P><DIV>In all three tests I entered "Ingo" and colored a character in red afterwards.</DIV><DIV>&nbsp;</DIV><DIV><table width="99%"><tr><td class="BBquote"><img src="forum_images/quote_box.png" title="Quote" alt="Quote" style="vertical-align: text-bottom;" /> If you would insert "QuickPDF Library" and if you would make "Qui" in red later then "Qui" will be the last string.</td></tr></table></DIV><P>Do you mean that QuickPDF should extract "ckPDF Library" + "Qui" ?</P><P>But in that case you get "Qui" + "ckPDF Library" (which is correct in my opinion).</P><P>Thanks,<BR>Martin</P>]]>
   </description>
   <pubDate>Wed, 12 Aug 2009 12:53:34 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/bug-when-extracting-text_topic1171_post5405.html#5405</guid>
  </item> 
  <item>
   <title><![CDATA[Bug when extracting text : Hi AIM! If i'm looking on...]]></title>
   <link>http://www.quickpdf.org/forum/bug-when-extracting-text_topic1171_post5404.html#5404</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=111">Ingo</a><br /><strong>Subject:</strong> 1171<br /><strong>Posted:</strong> 12 Aug 09 at 10:43AM<br /><br />Hi AIM!<br><br>If i'm looking on your sample it's like  sorted by row and beginning column...<br>and this would make sense ;-)<br>Instead it's so: First in - first out ... Last in - last out... and it doesn't matter where's the position of a string.<br><br>If you would insert "QuickPDF Library" and if you would make "Qui" in red later then "Qui" will be the last string.<br><br>Cheers, Ingo<br>&nbsp;<br>]]>
   </description>
   <pubDate>Wed, 12 Aug 2009 10:43:47 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/bug-when-extracting-text_topic1171_post5404.html#5404</guid>
  </item> 
  <item>
   <title><![CDATA[Bug when extracting text : Hi, I use QuickPDF 7.15 with...]]></title>
   <link>http://www.quickpdf.org/forum/bug-when-extracting-text_topic1171_post5403.html#5403</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="http://www.quickpdf.org/forum/member_profile.asp?PF=1137">AIM</a><br /><strong>Subject:</strong> 1171<br /><strong>Posted:</strong> 12 Aug 09 at 10:09AM<br /><br /><P>Hi,</P><P>I use QuickPDF 7.15 with Option #3 to extract text from PDF files and ran into an annoying bug.</P><P>Create a simple PDF file that contains the text "QuickPDF Library" and use another color for the character "P".</P><P>Then QuickPDF extracts the following content from "<strong>Quick<FONT color=#ff0000>P</FONT>DF Library</strong>":</P><P><table width="99%"><tr><td><pre class="BBcode">"BAAAAA+TimesNewRomanPSMT",#000000,12.00,56.8000,776.6920,147.3920,776.6920,147.3920,784.7920,56.8000,784.7920,"Quick DF Library"<BR>"BAAAAA+TimesNewRomanPSMT",#FF0000,12.00,86.2000,776.6920,92.8720,776.6920,92.8720,784.7920,86.2000,784.7920,"P"</pre></td></tr></table></P><P>As you can see, "P" is extracted after "Quick DF Library", which is missing its "P", but the output should definitely be:<BR><table width="99%"><tr><td><pre class="BBcode">...,"Quick"<BR>...,"P"<BR>...,"DF Library"</pre></td></tr></table></P><P>However, when more than one character is in another color, it works correctly. Use another color for "PD", then the text extraction from "<strong>Quick<FONT color=#ff0000>PD</FONT>F Library</strong>" works in the correct order:</P><P><table width="99%"><tr><td><pre class="BBcode">"BAAAAA+TimesNewRomanPSMT",#000000,12.00,56.8000,776.6920,86.1040,776.6920,86.1040,784.7920,56.8000,784.7920,"Quick"<BR>"BAAAAA+TimesNewRomanPSMT",#FF0000,12.00,86.2000,776.6920,101.5600,776.6920,101.5600,784.7920,86.2000,784.7920,"PD"<BR>"BAAAAA+TimesNewRomanPSMT",#000000,12.00,101.5000,776.6920,147.1960,776.6920,147.1960,784.7920,101.5000,784.7920,"F Library"</pre></td></tr></table></P><P>So it seems that this happens only for single characters. Any chance to get this fixed in the next version?</P>]]>
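    <![CDATA[<P>For anyone who wants to post-process the Option 3 output shown above, here is a minimal Python sketch that reads one output line into a small record. The field order (font, color, size, four x/y corner pairs, text) is inferred from the sample output in this post, not taken from the QuickPDF documentation, and <code>Fragment</code> and <code>parse_fragment</code> are hypothetical names. How the library escapes a double quote inside the text field is not shown in this thread, so that case is not handled.</P>

```python
import csv
from dataclasses import dataclass


@dataclass
class Fragment:
    font: str
    color: str
    size: float
    left: float
    bottom: float
    right: float
    top: float
    text: str


def parse_fragment(line: str) -> Fragment:
    """Parse one line of the comma-separated output into a Fragment."""
    # csv copes with the quoted font and text fields, including commas in the text.
    fields = next(csv.reader([line]))
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    coords = [float(value) for value in fields[3:11]]
    xs, ys = coords[0::2], coords[1::2]  # alternating x and y of the four corners
    return Fragment(
        font=fields[0],
        color=fields[1],
        size=float(fields[2]),
        left=min(xs),
        bottom=min(ys),
        right=max(xs),
        top=max(ys),
        text=fields[11],
    )


if __name__ == "__main__":
    sample = ('"BAAAAA+TimesNewRomanPSMT",#FF0000,12.00,86.2000,776.6920,'
              '92.8720,776.6920,92.8720,784.7920,86.2000,784.7920,"P"')
    print(parse_fragment(sample))
```

<P>Reducing the four corners to left/bottom/right/top keeps later steps (sorting, overlap checks) simple, at the cost of ignoring rotated text.</P>]]>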
   </description>
   <pubDate>Wed, 12 Aug 2009 10:09:48 +0000</pubDate>
   <guid isPermaLink="true">http://www.quickpdf.org/forum/bug-when-extracting-text_topic1171_post5403.html#5403</guid>
  </item> 
 </channel>
</rss>