I need help - I can help - Height of the extracted text

Print Page | Close Window

Height of the extracted text

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=2376
Printed Date: 30 Jun 25 at 11:35PM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com

Topic: Height of the extracted text

Posted By: emgi
Subject: Height of the extracted text
Date Posted: 21 Aug 12 at 11:10AM

Is it possible to get the real text bounded box using the text extraction functions ?

The values returned by GetPageText() function are the "maximum" values for the font.

The height of an extracted text determined by the "csv" string is bigger than the rendered text.

thanks for your help

Replies:

Posted By: Ingo
Date Posted: 21 Aug 12 at 2:20PM

Hi emgi!

If you use the extract option "word by word" then the font height should be correct.
Or you should have a look on the x-/y-values for the string-boxes.
Have a look in the online reference here:
http://www.quickpdflibrary.com/help/quickpdf/ExtractFilePageText.php

Cheers and welcome here,
Ingo

Posted By: emgi
Date Posted: 21 Aug 12 at 2:43PM

Hi Ingo,
Thank you for your response.

That's what i do (as we can see in code below)

But, the boxes (in blue) are higher than the rendered words (in red) .

String txt = pdf.GetPageText(4);

Regards

Posted By: Ingo
Date Posted: 21 Aug 12 at 3:15PM

So you should substract a little bit.
Make some tries for matching percentage.
Where's the problem?
If you think it's an error you should post it on the official support pages.
This here is the user-user-forum.
QP is a stable library with many years of development now - i've never had a similar question like yours ;-)

Cheers, Ingo

Posted By: emgi
Date Posted: 21 Aug 12 at 3:43PM

Thank you so.
Sure that QuickPdfLib is stable library i'm using it from long time ago with success !
I don't think that is a bug but i had never do that before.
So, i will do some other tests and post my question on the official support pages.
Best regards,

Emmanuel

Posted By: AndrewC
Date Posted: 29 Aug 12 at 3:11AM

Quick PDF Library returns the full font cell height. The cell height is defined as the Font Ascent + Font Descent. Using these values makes it much easier to group characters and into words and words into lines for the advanced text extraction options.

I am wondering why you need the actual character bounding boxes of each word ?

Andrew.

Posted By: emgi
Date Posted: 29 Aug 12 at 6:36AM

Hi Andrew,

I'm writing a tool to capture and analyse text that uses graphical areas on rendered pages.

That's why i need these data.

Regards,

Emmanuel

Posted By: AndrewC
Date Posted: 29 Aug 12 at 11:38AM

I have just realised that the individual character bounding boxes are not easily available in the font files. We don't need to use the individual character heights when rendering fonts as this is taken care of by the font renderer built in to Windows.

Every font has a different way of storing this information and it would take some considerable effort to extract and store the required values.

The character widths are freely available directly from the PDF structure itself. The character bounding boxes would need to be extracted from each different font type. This would also slow down the rendering process also.

It would not be a quick fix to extract this information and it is very unlikely that I can get the developers to implement this feature at the moment.

Andrew.

Posted By: emgi
Date Posted: 29 Aug 12 at 2:12PM

Thank you for your answer.
It would be really useful for my tool.
It is a tool to detect and verify the content of various documents.
To do this, the user defines graphal areas and a list of rules for each area.

My other solution is to analyze the rendered image and thereby deduce the character size. However, the processing time may be very long.

Regards,
Emmanuel

Posted By: AndrewC
Date Posted: 29 Aug 12 at 2:18PM

If it is graphical then I suspect you are rendering the PDF to an image. You could use this image and the bounding box to extract the word into a smaller image and then analyse the smaller image to find the extent of the whitespace. You can then adjust the values from QPL by the whitespace values that you have calculated.

Andrew.

Posted By: emgi
Date Posted: 29 Aug 12 at 2:22PM

It is quite that !