Debenu Quick PDF Library - PDF SDK Community Forum Homepage

Height of the extracted text

emgi (Beginner) - Posted: 21 Aug 12 at 11:10AM
Is it possible to get the real text bounding box using the text extraction functions?
The values returned by the GetPageText() function are the "maximum" values for the font.
The height of an extracted text box, as given by the "csv" string, is larger than the rendered text.
Thanks for your help.
 
Ingo (Moderator Group) - Posted: 21 Aug 12 at 2:20PM
Hi emgi!

If you use the "word by word" extraction option, the font height should be correct.
Otherwise, have a look at the x/y values of the string boxes.
See the online reference here:
http://www.quickpdflibrary.com/help/quickpdf/ExtractFilePageText.php

Cheers and welcome here,
Ingo



Edited by Ingo - 21 Aug 12 at 2:21PM

emgi (Beginner) - Posted: 21 Aug 12 at 2:43PM

Hi Ingo,
Thank you for your response.

That's what I do (see the code below).
But the boxes (in blue) are taller than the rendered words (in red).
 
String txt = pdf.GetPageText(4);
 
 
Regards


Edited by emgi - 21 Aug 12 at 2:44PM
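
For anyone reading along, here is a minimal sketch of how the box height could be read out of that CSV string. It assumes, and this should be checked against the online reference linked above, that each line has the layout font name, text color, text size, then four corner points X1,Y1 to X4,Y4, then the text. The class name WordBoxParser and the sample line are made up for illustration; they are not part of Quick PDF Library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Quick PDF Library.
final class WordBoxParser {

    // Splits one CSV line into fields. Double-quoted fields may contain
    // commas, and "" inside quotes stands for a literal quote character.
    static List<String> splitCsv(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    cur.append('"');   // escaped quote
                    i++;
                } else if (c == '"') {
                    inQuotes = false;  // closing quote
                } else {
                    cur.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());
        return fields;
    }

    // Returns {left, bottom, right, top}.
    // ASSUMPTION: fields 3..10 hold the corner points X1,Y1 ... X4,Y4.
    // Taking min/max over all corners avoids relying on the corner order.
    static double[] boxOf(String csvLine) {
        List<String> f = splitCsv(csvLine);
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < 4; k++) {
            double x = Double.parseDouble(f.get(3 + 2 * k).trim());
            double y = Double.parseDouble(f.get(4 + 2 * k).trim());
            minX = Math.min(minX, x);
            maxX = Math.max(maxX, x);
            minY = Math.min(minY, y);
            maxY = Math.max(maxY, y);
        }
        return new double[] {minX, minY, maxX, maxY};
    }

    public static void main(String[] args) {
        // Made-up sample line in the assumed layout.
        String sample =
            "\"Arial\",#000000,12,72,700,110,700,110,713.8,72,713.8,\"Hello, world\"";
        double[] box = boxOf(sample);
        System.out.println("cell height = " + (box[3] - box[1]));
    }
}
```

For rotated text the result is the axis-aligned box around the four corners, which may be larger than the word's own box.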

Ingo (Moderator Group) - Posted: 21 Aug 12 at 3:15PM
Then you should subtract a little.
Experiment a bit to find a matching percentage.
Where's the problem?
If you think it's a bug, you should post it on the official support pages.
This is the user-to-user forum.
QP is a stable library with many years of development behind it; I've never seen a question like yours before ;-)

Cheers, Ingo
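
The "subtract a little" idea could look roughly like the sketch below. The class name BoxShrink and the trim fractions are hypothetical placeholders that would have to be tuned by trial for each font; the code also assumes y grows upward (bottom < top), so the roles swap if the origin is at the top of the page.

```java
// Hypothetical helper, not part of Quick PDF Library.
final class BoxShrink {

    // Trims a box vertically and returns {newBottom, newTop}.
    // trimBottom and trimTop are fractions of the full box height.
    static double[] shrink(double bottom, double top,
                           double trimBottom, double trimTop) {
        double height = top - bottom;
        return new double[] {
            bottom + height * trimBottom,
            top - height * trimTop
        };
    }

    public static void main(String[] args) {
        // Placeholder fractions: 20% off the bottom, 10% off the top.
        double[] tight = shrink(700.0, 713.8, 0.20, 0.10);
        System.out.println("bottom = " + tight[0] + ", top = " + tight[1]);
    }
}
```

A single pair of fractions can only approximate the result, since words with descenders (g, p, y) and words without them fill the font cell differently.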

emgi (Beginner) - Posted: 21 Aug 12 at 3:43PM

Thank you.
Sure, Quick PDF Library is a stable library; I have been using it successfully for a long time!
I don't think it is a bug, but I had never done this before.
So I will run some more tests and post my question on the official support pages.
Best regards,

Emmanuel

 


Edited by emgi - 21 Aug 12 at 4:04PM

AndrewC (Moderator Group) - Posted: 29 Aug 12 at 3:11AM
Quick PDF Library returns the full font cell height. The cell height is defined as the Font Ascent + Font Descent. Using these values makes it much easier to group characters into words and words into lines for the advanced text extraction options.

I am wondering why you need the actual character bounding boxes of each word?

Andrew.
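
To make the cell-height definition concrete, here is a small sketch. It assumes the font's ascent, descent and cap height are known in the usual 1/1000-em glyph units and that the returned box is exactly the font cell; both are assumptions, and the numbers in main are illustrative only. The class name FontCell is hypothetical.

```java
// Hypothetical helper, not part of Quick PDF Library.
final class FontCell {

    // Cell height = (ascent + descent) scaled to the font size.
    // The descent is often stored as a negative number, so its
    // magnitude is used here.
    static double cellHeight(double ascent, double descent, double fontSize) {
        return (ascent + Math.abs(descent)) / 1000.0 * fontSize;
    }

    // Estimated {bottom, top} of capital letters without descenders,
    // given the bottom edge of the font cell (y growing upward).
    static double[] capExtent(double cellBottom, double descent,
                              double capHeight, double fontSize) {
        double baseline = cellBottom + Math.abs(descent) / 1000.0 * fontSize;
        return new double[] {baseline, baseline + capHeight / 1000.0 * fontSize};
    }

    public static void main(String[] args) {
        // Illustrative metrics only; real values depend on the font.
        double ascent = 905, descent = -212, capHeight = 716, size = 12;
        System.out.println("cell height = " + cellHeight(ascent, descent, size));
        double[] caps = capExtent(700.0, descent, capHeight, size);
        System.out.println("caps span   = " + (caps[1] - caps[0]));
    }
}
```

With metrics like these, the cell is noticeably taller than the capital letters, which matches the blue-versus-red difference described earlier in the thread.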

emgi (Beginner) - Posted: 29 Aug 12 at 6:36AM
Hi Andrew,
I'm writing a tool that captures and analyses text using graphical areas on rendered pages.
That's why I need this data.
Regards,
Emmanuel

AndrewC (Moderator Group) - Posted: 29 Aug 12 at 11:38AM

I have just realised that the individual character bounding boxes are not easily available in the font files.  We don't need to use the individual character heights when rendering fonts as this is taken care of by the font renderer built in to Windows.  

Every font has a different way of storing this information and it would take some considerable effort to extract and store the required values.  

The character widths are freely available directly from the PDF structure itself.  The character bounding boxes would need to be extracted from each different font type.  This would also slow down the rendering process.

It would not be a quick fix to extract this information and it is very unlikely that I can get the developers to implement this feature at the moment.

Andrew.


Edited by AndrewC - 29 Aug 12 at 2:04PM

emgi (Beginner) - Posted: 29 Aug 12 at 2:12PM
Thank you for your answer.
It would be really useful for my tool.
It is a tool to detect and verify the content of various documents.
To do this, the user defines graphical areas and a list of rules for each area.
 
My other solution is to analyze the rendered image and thereby deduce the character size. However, the processing time may be very long.
 
Regards,
Emmanuel

Edited by emgi - 29 Aug 12 at 2:14PM

AndrewC (Moderator Group) - Posted: 29 Aug 12 at 2:18PM
If it is graphical then I suspect you are rendering the PDF to an image.  You could use this image and the bounding box to extract the word into a smaller image and then analyse the smaller image to find the extent of the whitespace.  You can then adjust the values from QPL by the whitespace values that you have calculated.

Andrew.
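
The whitespace analysis described above might be sketched as follows. It assumes the word box has already been converted to pixel coordinates of the rendered image (top-left origin, inclusive bounds; for a box in points that means scaling by dpi/72 and flipping y if the PDF origin is bottom-left). The class name InkTrim and the brightness threshold of 128 are arbitrary choices for illustration.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of Quick PDF Library.
final class InkTrim {

    // Approximate brightness (0 = black, 255 = white) of a packed RGB pixel.
    static int luma(int rgb) {
        int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
        return (299 * r + 587 * g + 114 * b) / 1000;
    }

    // Scans the pixel box [x0..x1] x [y0..y1] row by row and returns
    // {firstInkRow, lastInkRow}, or null if every pixel is lighter than
    // the threshold.
    static int[] inkRows(BufferedImage img, int x0, int y0, int x1, int y1,
                         int threshold) {
        int first = -1, last = -1;
        for (int y = y0; y <= y1; y++) {
            for (int x = x0; x <= x1; x++) {
                if (luma(img.getRGB(x, y)) < threshold) {
                    if (first < 0) first = y;
                    last = y;
                    break; // one dark pixel is enough to mark this row
                }
            }
        }
        return first < 0 ? null : new int[] {first, last};
    }

    public static void main(String[] args) {
        // Synthetic test image: white page with one dark block as the "word".
        BufferedImage img = new BufferedImage(40, 30, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 30; y++)
            for (int x = 0; x < 40; x++)
                img.setRGB(x, y, 0xFFFFFF);
        for (int y = 12; y <= 19; y++)
            for (int x = 10; x <= 20; x++)
                img.setRGB(x, y, 0x000000);

        int[] rows = inkRows(img, 5, 5, 34, 24, 128);
        System.out.println("ink rows: " + rows[0] + ".." + rows[1]);
    }
}
```

The distance from the box's top edge to the first ink row, and from the last ink row to its bottom edge, gives the whitespace to subtract from the values returned by QPL. Only the pixels inside each word box are scanned, so the cost stays proportional to the text area rather than to the whole page.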

emgi (Beginner) - Posted: 29 Aug 12 at 2:22PM
That is exactly it!