Bug when extracting text
AIM
Beginner Joined: 12 Aug 09 Status: Offline Points: 10 |
Posted: 12 Aug 09 at 10:09AM
Hi, I use QuickPDF 7.15 with Option #3 to extract text from PDF files and ran into an annoying bug. Create a simple PDF file that contains the text "QuickPDF Library" and use another color for the character "P". Then QuickPDF extracts the following content from "QuickPDF Library":
As you can see, "P" is extracted after "Quick DF Library" with a missing "P", but the output should definitely be:
However, when more than one character uses another color, it works correctly. Use another color for "PD", and the text extraction from "QuickPDF Library" returns the text in the correct order:
So it seems that this happens only with single characters. Any chance of getting this fixed in the next version?
Ingo
Moderator Group Joined: 29 Oct 05 Status: Offline Points: 3524 |
Hi AIM!
Looking at your sample, you seem to expect the strings sorted by row and starting column, and that would make sense ;-) Instead it works like this: first in, first out; last in, last out. The position of a string doesn't matter. If you insert "QuickPDF Library" and later make "Qui" red, then "Qui" will be the last string. Cheers, Ingo
AIM
Ingo, I'm not sure I fully understand your answer. But if you have, for example, "Ingo" in your PDF, you would get "In o" and "g". I don't see how or why this would make sense, e.g. if I want to search a PDF for "ingo". In my opinion, "In" + "g" + "o" would be the only correct result. At least, that is how it works when two or more letters are red. In the meantime I also saw that it happens only with single characters in the middle of a word, not at the beginning. OK, let's use the following examples:
- "Ingo" extracts "I" + "ngo" ..... OK
In all three tests I entered "Ingo" and colored a character red afterwards.
Do you mean that QuickPDF should extract "ckPDF Library" + "Qui"? But in that case you get "Qui" + "ckPDF Library" (which is correct in my opinion). Thanks,
Ingo
Hi Martin!
If you insert "ngo" first and then "I", the extraction returns "ngo" as the first string and "I" as the second. If you insert "I" first and then "ngo", the extraction returns "I" as the first string and "ngo" as the second. That's how PDF text content is managed. It has nothing to do with QuickPDF. If you write a whole page of text and at the end insert a single character at the top left position, the extraction WITH OPTION 3 will return this character as the very last string... First in, first out ;-) If you use option 0, for example, you can avoid this behavior. Option 0 concatenates the strings as they should be, so if you want to do a text search you shouldn't use option 3. Cheers, Ingo
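The first-in, first-out behavior described here can be modeled in a few lines of Python. This is only a sketch of the idea, not QuickPDF code: `TextRun` and `extract_in_stream_order` are hypothetical names, and the coordinates are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    text: str
    x: float  # left edge in points (assumed value)
    y: float  # baseline in points, measured from the page bottom (assumed value)


def extract_in_stream_order(runs):
    """Return the strings exactly in the order they were written to the page."""
    return [run.text for run in runs]


# "ngo" was inserted first, "I" afterwards, although "I" sits further left.
page = [TextRun("ngo", 106.0, 700.0), TextRun("I", 100.0, 700.0)]
stream_order = extract_in_stream_order(page)  # ["ngo", "I"]
```

The extraction order follows the insertion order, regardless of where each string appears on the page.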
AIM
OK, I understand your answer, but it doesn't explain the different behavior of QuickPDF for the 3 examples I gave (I always entered the text in Open Office and colored the letters afterwards, then I created the PDF). So either one or two of them do not work correctly.
The other options seem to be a bit buggy; Option 3 always extracts the most text (except for this annoyance with single characters). OK, back to the "QuickPDF Library" example. Option 0 gives the following output:
Option 1 and 2 give the following output:
Option 3 gives the following output:
Options 0, 1 and 2 are completely useless in that example. Option 4 would work here, but it didn't extract as much as Option 3 from several other PDFs I tried (so it is not a real solution in my case). So what would you suggest to fully extract these two words? Or is it impossible? Thanks for any tips,
Edited by AIM - 12 Aug 09 at 6:21PM
Ingo
Hi Martin!
So the best way is to use option 3 and concatenate the single strings according to their row and column values. A pdf-page (A4, for example) is created as an 842 x 595 matrix (rows x columns). These single points are called PSUnits. The first PSUnit is at the bottom of the page on the left side. Each thing (pictures, text strings, ...) can be put on this page at any time. The coordinates inside the pdf say where the objects shall appear. Please keep in mind that below the surface a pdf doesn't look as nice as it does later in the pdf-reader ;-) Cheers, Ingo
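A minimal Python sketch of that suggestion, assuming each extracted string is available as a `(text, x, y)` tuple with the origin at the bottom left of the page. The tuple layout, the function name `lines_by_position`, and the `row_tolerance` value are assumptions for illustration; they are not part of the QuickPDF API.

```python
def lines_by_position(fragments, row_tolerance=2.0):
    """Group fragments into rows by y, then join each row from left to right.

    fragments: iterable of (text, x, y) tuples, origin at the bottom left.
    row_tolerance: maximum y difference (in points) within one row (assumed value).
    """
    rows = []  # each entry: [reference_y, [(x, text), ...]]
    # Higher y means higher on the page, so sort descending to read top-down.
    for text, x, y in sorted(fragments, key=lambda f: -f[2]):
        if rows and abs(rows[-1][0] - y) <= row_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    # Within a row, order by x (the column) and concatenate.
    return ["".join(text for _, text in sorted(row)) for _, row in rows]


# Strings in the order they were written: "Qui" was recolored last.
fragments = [
    ("ckPDF Library", 118.0, 700.0),
    ("Second line", 100.0, 686.0),
    ("Qui", 100.0, 700.5),
]
lines = lines_by_position(fragments)  # ["QuickPDF Library", "Second line"]
```

The tolerance is there because strings on the same visual line do not always share exactly the same baseline value.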
AIM
Do you have any code snippets or demos? I'm sorry, but I still believe that this is a bug in QuickPDF. If "Ingo" example #2 behaved like examples #1 and #3, there wouldn't be a problem and everything would work perfectly. JFYI, I tried your pdftext.dll and it has the same problem! Here is the output of your DLL:
swb1
Debenu Quick PDF Library Expert Joined: 05 Dec 05 Location: United States Status: Offline Points: 100 |
Martin, Ingo is correct in that this is not a bug but rather the nature of the way that the PDF is constructed. There is no rule that says text that appears to be one word when displayed by Acrobat or rendered by QuickPDF must actually be stored as one word inside the PDF.

I have seen PDFs that were constructed one letter at a time! Each letter would appear as a single text element complete with location and font information. While this is an extremely inefficient way to construct a PDF, it works nonetheless and appears to be just fine from the outside when rendered by Acrobat.

The text extraction routines of QuickPDF do not re-assemble the words. These routines merely extract the text as it is stored in the document and tell you where it should appear on the page and how it should be formatted. A text extraction routine that is smart enough to re-assemble the words and tell me their origins would be a terrific enhancement to the library, but as far as I know no such routine exists here today. If Debenu does not add such a feature soon (hint, hint Karl ;-) ) I will probably have to write one of my own.

Best of luck to you, Steve
Edited by swb1 - 12 Aug 09 at 9:31PM
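A word re-assembly routine of the kind wished for here could start from something like the following Python sketch. It assumes every text element on a line is known as a `(text, x, width)` tuple and that a gap wider than `space_gap` points means a word break. Both the data layout and the threshold are assumptions for illustration; nothing here is a QuickPDF function.

```python
def merge_runs(runs, space_gap=1.5):
    """Join the text elements of one line, ordered by their x position.

    runs: list of (text, x, width) tuples for a single line.
    space_gap: horizontal gap (in points) treated as a word break (assumed value).
    """
    line = ""
    prev_end = None
    for text, x, width in sorted(runs, key=lambda r: r[1]):
        # A visible gap between two elements is treated as a space.
        if prev_end is not None and x - prev_end > space_gap:
            line += " "
        line += text
        prev_end = x + width
    return line


# The recolored "P" was stored as its own element, here listed first.
runs = [
    ("P", 130.0, 6.0),
    ("Quick", 100.0, 30.0),
    ("DF", 136.0, 13.0),
    ("Library", 153.0, 40.0),
]
merged = merge_runs(runs)  # "QuickPDF Library"
```

A real routine would also need the font size to pick a sensible gap threshold, and it would have to handle rotated text and multiple lines.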
AIM
Thanks for all your information and suggestions. I think I now understand that my real "problem" with these 3 examples arises at PDF creation time. It seems that I have to invest some time and implement a fully working text extraction myself...