Debenu Quick PDF Library - PDF SDK Community Forum Homepage

DASetTextExtractionWordGap

Papajin
Posted: 08 Feb 12 at 8:16PM
I'm using library version 8.13 and I'm extracting text from PDFs using DAExtractPageText.  I've run into a few documents that aren't extracting quite the way I'd like using an "Options" setting of 3 (3 = Return a CSV string for each piece of text on the page with the following format:  Font Name, Text Color, Text Size, X1, Y1, X2, Y2, X3, Y3, X4, Y4, Text).  It's putting some "words" together that should be separate.  In order to address this issue, I figured I'd use the fairly new DASetTextExtractionWordGap function to try and clean things up a bit.
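
For readers working with the same output, here is a minimal Python sketch of splitting an Options = 3 result into fields. The field order comes from the format described above; the sample line, the quoting style, the color notation, and the names `TextPiece` and `parse_option3` are assumptions made for illustration and are not taken from the library documentation.

```python
import csv
from typing import List, NamedTuple, Tuple


class TextPiece(NamedTuple):
    font: str
    color: str          # kept as a raw string; the color notation is not assumed
    size: float
    corners: Tuple[Tuple[float, float], ...]  # (x1, y1) .. (x4, y4)
    text: str


def parse_option3(output: str) -> List[TextPiece]:
    """Parse lines of: Font Name, Text Color, Text Size, X1, Y1, ..., X4, Y4, Text."""
    pieces = []
    for row in csv.reader(output.splitlines(), skipinitialspace=True):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        nums = [float(v) for v in row[3:11]]
        corners = tuple(zip(nums[0::2], nums[1::2]))
        # Re-join the tail in case the text itself contained unquoted commas.
        text = ",".join(row[11:])
        pieces.append(TextPiece(row[0], row[1], float(row[2]), corners, text))
    return pieces


# Hypothetical sample line, shaped like the format described in the post.
sample = '"Helvetica","#000000",8.0,72.0,700.0,110.5,700.0,110.5,708.0,72.0,708.0,"to Buy"\n'
pieces = parse_option3(sample)
```

Having the corner coordinates as numbers makes it easier to see which pieces the extractor merged and how far apart they actually sit on the page.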

Unfortunately so far, I haven't been able to get this function to have any impact on what's being extracted at all.  Has anybody had any success using this command and if so, what sort of wordgap values were you using?  By default I've been using 0.7, which I _think_ is what the default is, but I'm not 100% certain of that anymore.  Adjusting that value both higher and lower seems to have no impact.  What's the trick to getting this to work?

Thanks!

AndrewC (Moderator Group)
Posted: 09 Feb 12 at 5:57AM
There are some new PDFs that have become more common lately that have no space character defined when the text is drawn.  The PDF just places the words in their correct locations and does not rely on the space character.  This forces Quick PDF Library to guess where the spaces are, and this gets quite tricky.

The text extraction code is handling many different types of PDFs quite well now.  If you look inside the text drawing commands of some PDFs you would wonder how the text and words can be put back together.

We have added some improved code for this into the 8.14 betas and it fixes most of these problem PDFs.  Quick PDF Beta 4 can be downloaded from

 - http://www.quickpdflibrary.com/blog/2012/02/quick-pdf-library-8-14-beta-4-released/

I will be interested to see if this fixes your problem.

Andrew.

Papajin
Posted: 11 Feb 12 at 8:17PM
Originally posted by AndrewC:

There are some new PDFs that have become more common lately that have no space character defined when the text is drawn.  The PDF just places the words in their correct locations and does not rely on the space character.  This forces Quick PDF Library to guess where the spaces are, and this gets quite tricky.

Trust me, I know this quite well. :)

I went ahead and redid the documents I was working with using the word-based extraction and put all the "phrases" together myself, so I have some idea what a pain it is.  My advantage though was that I could tailor the spacing for my specific need, so it was less likely to put spaces where it shouldn't or vice versa.  Still, I prefer to use the built-in version when I can as it makes things easier on my end.
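
The do-it-yourself approach described above can be sketched roughly as follows. This is only an illustration of the idea, not Papajin's actual code: the `Word` tuple layout, the `join_words` name, and the sample coordinates are all hypothetical, and the words are assumed to sit on a single line.

```python
from typing import List, Tuple

# (left x, right x, text height, text) for one word; a hypothetical layout.
Word = Tuple[float, float, float, str]


def join_words(words: List[Word], gap_ratio: float) -> List[str]:
    """Group the words of one line into phrases.

    A new phrase starts whenever the horizontal gap to the previous word
    is larger than gap_ratio * text height.
    """
    phrases: List[str] = []
    current: List[str] = []
    prev_right = None
    for left, right, height, text in sorted(words):
        if prev_right is not None and left - prev_right > gap_ratio * height:
            phrases.append(" ".join(current))
            current = []
        current.append(text)
        prev_right = right
    if current:
        phrases.append(" ".join(current))
    return phrases


line = [
    (72.0, 80.0, 8.0, "to"),
    (82.0, 96.0, 8.0, "Buy"),
    (120.0, 124.0, 8.0, "0"),
    (140.0, 144.0, 8.0, "0"),
]
print(join_words(line, gap_ratio=1.0))  # ['to Buy', '0', '0']
```

Because `gap_ratio` is under the caller's control, it can be tuned per document set, which is the advantage mentioned above over a one-size-fits-all built-in heuristic.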

Quote: We have added some improved code for this into the 8.14 betas and it fixes most of these problem PDFs.  Quick PDF Beta 4 can be downloaded from

 - http://www.quickpdflibrary.com/blog/2012/02/quick-pdf-library-8-14-beta-4-released/

I will be interested to see if this fixes your problem.


Thanks!  I'm doing a new set of documents that are having similar issues, and I'll give it a try.


Edited by Papajin - 11 Feb 12 at 8:42PM

Papajin
Posted: 12 Feb 12 at 7:05PM
Yes, I can confirm that the 8.14 beta worked MUCH better than 8.13.  Many thanks!

HNRSoftware
Posted: 27 May 12 at 6:49PM
I am decoding some pretty complex, high-density text pages.  Library version 8.13 did a pretty good job, but not perfect, so I upgraded to 8.15 today.  8.15 is definitely better, but there are still some puzzling aspects to the ones that didn't come out clean.

I can't get any change in the output by using SetTextExtractionWordGap.  I also tried the DA version, thinking that there might be some difference in internal processing, but no matter what I set the wordgap to, it gives the same results.  I see two likely interpretations.  The first would be that the processing never got to a point where it needed to examine a string based on word gap.  The other would be that it is getting tested, but not correctly.

I can certainly provide the test file, but I can describe the problem areas as ones (in a pretty small font) that appear to Adobe Reader (and also QPDF rendering) as:

"             M  J  J  A  S  O  N  D  J"        
"to Buy   0  0  0  0  0  0  0  1  0"
"Options 0  0  1  1  0  5  3 10 1"
"to Sell    0  0  0 1  0   0  2  0  1"

There are some minor differences because of trying to show this in the Forum font, but you get the picture.  This gets extracted as:

"M J JASONDJ"         - as a single string entry.  The related lines below it are interpreted as
"to Buy 000000010"
"Options0 0 1 1 0 5 3101"
"to Sell 000100201"

Because of the interesting interpretations of these as strings, it looks to me as if it is trying to do something with a word-gap type of processing.

Another oddity is that the string returns are very slightly different between modes 0, 3 and 4.

mode 0, the string comes back as "Options 0 1 1 0 5 3101"
mode 3  returns two strings: "Options" and "0011053 10 1"
mode 4 returns four "Options" "0011053" "10" and "1", and these stay the same no matter what I set the word gap to.

Interestingly, "to Buy" gets correctly broken up by mode 4 into "to" and "Buy", but the gap between those two words is visibly quite a bit less than the gaps between the digits on the rest of the line.

One possibility is that the processing doesn't like single-character "words" and assumes they have been kerned, or somehow spaced out, but still belong together.

I'm not sure what all this means, but it is one of very few flaws in otherwise very impressive processing.

Thanks - Howard

HNRSoftware
Posted: 27 May 12 at 7:02PM
Significant test - I was able to get "to Buy" to join together in mode 4 with a wordgap setting of 0.9, instead of the default 0.7.  I don't have a fine enough ruler, but the gap on the printed page looks like about half a millimeter - extremely small, but apparently about the difference between wordgap settings of 0.8 and 0.9.

The documentation talks about the wordgap as a ratio between the gap and the text height.  That "feels" reasonable for this example, but it must not come into play for the single digit lines because the gap ratio is much larger in those cases.

To me, the important thing is that setting the word gap does have a (reasonable) effect on the processing.

AndrewC (Moderator Group)
Posted: 28 May 12 at 5:55AM
It gets very difficult to determine the correct width of a space character.

Both Acrobat and Quick PDF Library have to make a best guess, as many PDFs place the characters individually and leave it up to the extraction process to determine whether a gap is 0, 1 or more spaces.  WordGap will allow you to fine tune what is defined as a space and what is not.  The default of 0.7 has changed between versions but it actually translates to 0.105 * text height.  So if a gap is larger than about 10% of the cell height then it is considered to be a space character.  There are some overriding factors, such as if there is an implicit space defined in the output then it will be maintained.  It is only when spaces are not drawn or when each character is drawn individually that we need to rely a little more on wordgap.  Setting word gap to 1.4 will double the allowed width of a space to about 20% of character height.
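
The arithmetic in the paragraph above can be written out as a small Python sketch. It assumes the threshold scales linearly with the word gap setting: 0.7 mapping to 0.105 * text height implies a factor of 0.15, and doubling the setting to 1.4 then gives 0.21, roughly the 20% mentioned. The factor, the linear scaling, and the function names are inferences from this post, not documented library behaviour.

```python
# Factor implied by "0.7 translates to 0.105 * text height" (0.105 / 0.7).
ASSUMED_FACTOR = 0.15


def space_threshold(word_gap: float, text_height: float) -> float:
    """Smallest gap treated as a space, assuming linear scaling with word_gap."""
    return word_gap * ASSUMED_FACTOR * text_height


def is_space(gap: float, word_gap: float, text_height: float) -> bool:
    """True if a horizontal gap would count as a space under this assumption."""
    return gap > space_threshold(word_gap, text_height)


# For 10-unit-high text, a 1.5-unit gap counts as a space at the default
# setting of 0.7 but not at 1.4.
default_result = is_space(1.5, 0.7, 10.0)
doubled_result = is_space(1.5, 1.4, 10.0)
```

Under this reading, raising the setting makes the extractor less willing to insert a space, which matches the earlier observation that "to Buy" joined together at 0.9 but not at 0.7.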

It is not possible to get 100% accuracy on all PDFs and, as mentioned, QPL does a pretty good job even compared to Acrobat.  QPL is using OCR-like techniques to determine words and character spacing and so cannot be 100% correct for all PDFs.

Option 0 is not as exact as options 3 - 8.  Options 3 - 8 all use the same internal logic to extract the results.

HNRSoftware
Posted: 28 May 12 at 2:10PM
Hi Andrew - thank you for the clarifications.  As I said, QPL already does an amazing job, so I'm really just trying to figure out why it isn't perfect.  I do NOT have strong familiarity with the internals of PDF structure, but what I do know makes me not want to dig deeper than I have to.

The puzzling aspect of this is that the strings like "Options 0  0  1  1  0  5  3 10 1" really do look like that in the Adobe and QPL rendering.  As I said, very small font, so measuring is impractical, but the space is clearly larger than the character height, so it would really seem like a space should be imputed.  Additionally,  the fact that the three similar lines detect some spaces but not others is odd.  The fact that changing the wordgap doesn't have any effect on these strings at all is odd - there is obviously more going on here.

I will play with ExtractFilePageContentToString a little more and see if I can get clues as to why there is trouble.  If I can decode it enough to locate the strings in question, I may get some clues.

This is not a big problem to me, but it nags at me.  These strings are not ones I really need to decode, and the other 98% of the extraction works fine.

Thanks - Howard

HNRSoftware
Posted: 28 May 12 at 3:36PM
Hi Andrew - I really hate to ask, but Google searches on PDF internal structure are turning up nothing useful.  I just need some sort of clue as to how to parse ExtractFilePageContentToString results.  Is this pretty much the internal PDF structure, or is this more of an intermediate level of QPL processing?

It looks a lot like each line is a separate command element and the last two characters on the line are a "command".  The rest of the line is probably decoded based on the "command".
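
That guess matches how PDF content streams are written: operands come first and the operator comes last. The Python sketch below splits such lines into operands and operator. It is an assumption that the ExtractFilePageContentToString output follows this syntax with one command per line; the sample lines are hypothetical, although `BT`, `Tf`, `Tm`, `Tj` and `ET` are standard PDF text operators.

```python
from typing import Tuple


def split_command(line: str) -> Tuple[str, str]:
    """Split one line into (operands, operator), taking the last token as the operator."""
    parts = line.strip().rsplit(None, 1)
    if not parts:
        return "", ""            # blank line
    if len(parts) == 1:
        return "", parts[0]      # operator with no operands, e.g. BT
    return parts[0], parts[1]


# Hypothetical lines in PDF content-stream style.
sample_lines = [
    "BT",
    "/F1 8 Tf",
    "1 0 0 1 72 700 Tm",
    "(Options) Tj",
    "ET",
]

commands = [split_command(line) for line in sample_lines]
# Operands of every "show text" (Tj) command.
shown_text = [operands for operands, op in commands if op == "Tj"]
```

Taking the last whitespace-separated token, rather than the last two characters, also copes with operators of other lengths, such as the one-letter `m` or the three-letter `BDC`.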

I don't know if you recall "magic pictures" of the 1980s or 90s which are a page of various dots, and if you stare at them a certain way, you see a picture.  That is what this seems like.  I can almost see the structure, but not quite.

If I am on the right track, please just reply "yes".  If you have a handy link, I would appreciate it, but this is way beyond anything you should have to spend any time on.  I am pursuing this to keep my brain exercised - it just "feels" like something useful.

Thanks - Howard

PS. I did locate some of the strings in question and I see the difficulty in interpreting them.

HNRSoftware
Posted: 29 May 12 at 12:35AM
I figured out most of the ExtractFilePageContentToString output.  It has to be one of your intermediate level outputs rather than raw PDF because it is amazingly easy to understand once I did a little guessing.  Is this format used for input to your rendering code, or is it something else?

Between it and the mode 4 text output I can create a surprisingly good rendering to a bitmap.  (I know - you already provide that, but this gives me a way to quickly visually verify that I am interpreting the various strings correctly.)  It has been very entertaining.  I think I can now go back to work and efficiently extract the PDF text strings that I was interested in in the first place.

Once again, thanks for an excellent product.  Howard