Print Page | Close Window

Height of the extracted text

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=2376
Printed Date: 28 Sep 24 at 10:49PM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com


Topic: Height of the extracted text
Posted By: emgi
Subject: Height of the extracted text
Date Posted: 21 Aug 12 at 11:10AM
Is it possible to get the real text bounded box using the text extraction functions ?
The values returned by GetPageText() function are the "maximum" values for the font.
The height of an extracted text determined by the "csv" string is bigger than the rendered text.
thanks for your help
 



Replies:
Posted By: Ingo
Date Posted: 21 Aug 12 at 2:20PM
Hi emgi!

If you use the extract option "word by word" then the font height should be correct.
Or you should have a look on the x-/y-values for the string-boxes.
Have a look in the online reference here:
http://www.quickpdflibrary.com/help/quickpdf/ExtractFilePageText.php

Cheers and welcome here,
Ingo



Posted By: emgi
Date Posted: 21 Aug 12 at 2:43PM

Hi Ingo,
Thank you for your response.

That's what i do (as we can see in code below)
But, the boxes (in blue) are higher than the rendered words (in red) .
 
String txt = pdf.GetPageText(4);
 
 
Regards


Posted By: Ingo
Date Posted: 21 Aug 12 at 3:15PM
So you should substract a little bit.
Make some tries for matching percentage.
Where's the problem?
If you think it's an error you should post it on the official support pages.
This here is the user-user-forum.
QP is a stable library with many years of development now - i've never had a similar question like yours ;-)

Cheers, Ingo


Posted By: emgi
Date Posted: 21 Aug 12 at 3:43PM

Thank you so.
Sure that QuickPdfLib is stable library i'm using it from long time ago with success !
I don't think that is a bug but i had never do that before.
So, i will do some other tests and post my question on the official support pages.
Best regards,

Emmanuel

 


Posted By: AndrewC
Date Posted: 29 Aug 12 at 3:11AM
Quick PDF Library returns the full font cell height. The cell height is defined as the Font Ascent + Font Descent.  Using these values makes it much easier to group characters and into words and words into lines for the advanced text extraction options.

I am wondering why you need the actual character bounding boxes of each word ?  

Andrew.


Posted By: emgi
Date Posted: 29 Aug 12 at 6:36AM
Hi Andrew,
I'm writing a tool to capture and analyse text that uses graphical areas on rendered pages.
That's why i need these data.
Regards,
Emmanuel


Posted By: AndrewC
Date Posted: 29 Aug 12 at 11:38AM

I have just realised that the individual character bounding boxes are not easily available in the font files.  We don't need to use the individual character heights when rendering fonts as this is taken care of by the font renderer built in to Windows.  

Every font has a different way of storing this information and it would take some considerable effort to extract and store the required values.  

The character widths are freely available directly from the PDF structure itself.  The character bounding boxes would need to be extracted from each different font type.  This would also slow down the rendering process also.

It would not be a quick fix to extract this information and it is very unlikely that I can get the developers to implement this feature at the moment.

Andrew.


Posted By: emgi
Date Posted: 29 Aug 12 at 2:12PM
Thank you for your answer.
It would be really useful for my tool.
It is a tool to detect and verify the content of various documents.
To do this, the user defines graphal areas and a list of rules for each area.
 
My other solution is to analyze the rendered image and thereby deduce the character size. However, the processing time may be very long.
 
Regards,
Emmanuel


Posted By: AndrewC
Date Posted: 29 Aug 12 at 2:18PM
If it is graphical then I suspect you are rendering the PDF to an image.  You could use this image and the bounding box to extract the word into a smaller image and then analyse the smaller image to find the extent of the whitespace.  You can then adjust the values from QPL by the whitespace values that you have calculated.

Andrew.


Posted By: emgi
Date Posted: 29 Aug 12 at 2:22PM
It is quite that !



Print Page | Close Window

Forum Software by Web Wiz Forums® version 11.01 - http://www.webwizforums.com
Copyright ©2001-2014 Web Wiz Ltd. - http://www.webwiz.co.uk