Height of the extracted text |
emgi
Beginner Joined: 21 Aug 12 Status: Offline Points: 10 |
Posted: 21 Aug 12 at 11:10AM |
Is it possible to get the real bounding box of the text using the text extraction functions?
The values returned by the GetPageText() function are the "maximum" values for the font, so the height of an extracted text item, as given in the CSV string, is larger than the rendered text.
Thanks for your help.
|
Ingo
Moderator Group Joined: 29 Oct 05 Status: Offline Points: 3524 |
Hi emgi!
If you use the "word by word" extraction option, the font height should be correct. Otherwise, have a look at the x/y values of the string boxes. See the online reference here: http://www.quickpdflibrary.com/help/quickpdf/ExtractFilePageText.php
Cheers and welcome here,
Ingo
Edited by Ingo - 21 Aug 12 at 2:21PM
|
emgi
Beginner Joined: 21 Aug 12 Status: Offline Points: 10 |
Hi Ingo,
That's what I do (as you can see in the code below). But the boxes (in blue) are taller than the rendered words (in red).
Regards
Edited by emgi - 21 Aug 12 at 2:44PM
|
Ingo
Moderator Group Joined: 29 Oct 05 Status: Offline Points: 3524 |
So you should subtract a little bit. Try a few values until you find a percentage that matches. Where's the problem?
If you think it's an error, you should post it on the official support pages. This is the user-to-user forum. QP is a stable library with many years of development behind it, and I've never had a question like yours ;-)
Cheers,
Ingo
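Ingo's suggestion of trimming a fixed percentage could be sketched roughly as below. This is only an illustration: `shrink_box` and its parameters are hypothetical names, not part of Quick PDF Library, and the box is assumed to be in PDF coordinates (y increasing upwards). The fractions have to be found by trial and error for each font.

```python
def shrink_box(left, bottom, right, top, top_fraction=0.15, bottom_fraction=0.20):
    """Trim a fraction of the box height from its top and bottom edges.

    Assumes PDF-style coordinates where y grows upwards, so top > bottom.
    The default fractions are placeholders to be tuned by experiment.
    """
    height = top - bottom
    new_bottom = bottom + height * bottom_fraction
    new_top = top - height * top_fraction
    return (left, new_bottom, right, new_top)


# Hypothetical word box, 20 points tall, trimmed by 10% (top) and 20% (bottom).
trimmed = shrink_box(0.0, 0.0, 100.0, 20.0, top_fraction=0.10, bottom_fraction=0.20)
print(trimmed)
```

A single pair of fractions will not fit every word, because a word with no ascenders or descenders occupies much less of the cell than one containing letters such as "b" or "g".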
|
emgi
Beginner Joined: 21 Aug 12 Status: Offline Points: 10 |
Thank you.
Emmanuel
Edited by emgi - 21 Aug 12 at 4:04PM
|
AndrewC
Moderator Group Joined: 08 Dec 10 Location: Geelong, Aust Status: Offline Points: 841 |
Quick PDF Library returns the full font cell height. The cell height is defined as the Font Ascent + Font Descent. Using these values makes it much easier to group characters into words and words into lines for the advanced text extraction options.
I am wondering why you need the actual character bounding boxes of each word?
Andrew.
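To make the difference concrete, here is a small arithmetic sketch of why a box built from ascent + descent is taller than the visible word. The metric values are assumptions chosen for illustration (em fractions loosely typical of a sans-serif font), not values read from any particular font or returned by Quick PDF Library, and the function names are hypothetical.

```python
def cell_height(font_size, ascent, descent):
    """Full font cell height in points: (ascent + descent) scaled by the size.

    ascent and descent are positive fractions of the em square.
    """
    return font_size * (ascent + descent)


def ink_height(font_size, ink_above, ink_below):
    """Height actually covered by a word's glyphs, in points.

    ink_above / ink_below are em fractions above / below the baseline.
    """
    return font_size * (ink_above + ink_below)


FONT_SIZE = 12.0
ASCENT, DESCENT = 0.905, 0.212  # assumed font-wide metrics (illustrative)
X_HEIGHT = 0.519                # assumed height of lowercase letters like "o"

cell = cell_height(FONT_SIZE, ASCENT, DESCENT)
ink = ink_height(FONT_SIZE, X_HEIGHT, 0.0)  # a word such as "one": no ascenders or descenders

print(f"cell height: {cell:.3f} pt, visible height: {ink:.3f} pt")
```

With these assumed numbers the cell is roughly twice as tall as the visible glyphs, which matches the kind of gap emgi describes between the blue boxes and the red words.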
|
|
emgi
Beginner Joined: 21 Aug 12 Status: Offline Points: 10 |
Hi Andrew,
I'm writing a tool that captures and analyses text using graphical areas on rendered pages. That's why I need this data.
Regards,
Emmanuel
|
|
AndrewC
Moderator Group Joined: 08 Dec 10 Location: Geelong, Aust Status: Offline Points: 841 |
I have just realised that the individual character bounding boxes are not easily available in the font files. We don't need the individual character heights when rendering fonts, as this is taken care of by the font renderer built into Windows. Every font stores this information differently, and it would take considerable effort to extract and store the required values.
The character widths are freely available directly from the PDF structure itself, but the character bounding boxes would need to be extracted from each different font type. This would also slow down the rendering process.
It would not be a quick fix to extract this information, and it is very unlikely that I can get the developers to implement this feature at the moment.
Andrew.
Edited by AndrewC - 29 Aug 12 at 2:04PM
|
emgi
Beginner Joined: 21 Aug 12 Status: Offline Points: 10 |
Thank you for your answer.
It would be really useful for my tool, which detects and verifies the content of various documents. To do this, the user defines graphical areas and a list of rules for each area. My other option is to analyse the rendered image and deduce the character size from it. However, the processing time may be very long.
Emmanuel
Edited by emgi - 29 Aug 12 at 2:14PM
|
AndrewC
Moderator Group Joined: 08 Dec 10 Location: Geelong, Aust Status: Offline Points: 841 |
If it is graphical then I suspect you are rendering the PDF to an image. You could use this image and the bounding box to extract the word into a smaller image and then analyse the smaller image to find the extent of the whitespace. You can then adjust the values from QPL by the whitespace values that you have calculated.
Andrew.
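The whitespace-trimming idea Andrew describes might look something like the sketch below. It is a minimal illustration under several assumptions: the rendered page is available as rows of grayscale values, the box from the library has already been converted from PDF points to pixel coordinates (scaled by DPI / 72 and flipped vertically, since image rows run top to bottom), and the page background is a known uniform value. `tighten_box` is a hypothetical helper, not a Quick PDF Library function.

```python
def tighten_box(pixels, box, background=255):
    """Shrink a pixel box to the non-background pixels inside it.

    pixels: list of rows of grayscale values, row 0 being the top of the image.
    box: (left, top, right, bottom) in pixels, with right and bottom exclusive.
    Returns the tightened box in the same convention, or None if the region
    contains only background.
    """
    left, top, right, bottom = box
    # Rows and columns inside the box that contain at least one inked pixel.
    rows = [y for y in range(top, bottom)
            if any(pixels[y][x] != background for x in range(left, right))]
    cols = [x for x in range(left, right)
            if any(pixels[y][x] != background for y in range(top, bottom))]
    if not rows:
        return None
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


# Tiny synthetic "render": a 6x6 white image with three dark pixels.
image = [[255] * 6 for _ in range(6)]
image[2][2] = image[2][3] = image[3][2] = 0

tight = tighten_box(image, (1, 1, 5, 5))
print(tight)
```

Because only the pixels inside each word box are scanned, the cost grows with the total area of the boxes and not with the whole page. Anti-aliased renders would probably need a threshold instead of an exact comparison with the background value.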
|
|
emgi
Beginner Joined: 21 Aug 12 Status: Offline Points: 10 |
That's exactly it!
|
|
Copyright © 2017 Debenu. Debenu Quick PDF Library is a PDF SDK. All rights reserved.