Debenu Quick PDF Library - PDF SDK Community Forum : Height of the extracted text

Height of the extracted text : That is exactly it!

Wed, 29 Aug 2012 14:22:15 +0000

Author: emgi
Subject: 2376
Posted: 29 Aug 12 at 2:22PM

That is exactly it!

Height of the extracted text : If it is graphical then I suspect...

Wed, 29 Aug 2012 14:18:30 +0000

Author: AndrewC
Subject: 2376
Posted: 29 Aug 12 at 2:18PM

If it is graphical then I suspect you are rendering the PDF to an image. You could use this image and the bounding box to extract the word into a smaller image and then analyse the smaller image to find the extent of the whitespace. You can then adjust the values from QPL by the whitespace values that you have calculated.

Andrew.
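
The whitespace measurement Andrew describes could be sketched roughly as follows. This is only a minimal illustration, not anything provided by QPL: it assumes the rendered page is already available as a 2D list of grayscale values and that the word's bounding box has already been converted to pixel coordinates (the mapping from PDF points to pixels depends on the render resolution and is not shown). The names tight_rows and whitespace_margins are hypothetical.

```python
def tight_rows(pixels, box, threshold=128):
    """Return (first_row, last_row) of the ink inside box, or None if the box is blank.

    pixels: 2D list of grayscale values (0 = black, 255 = white), row 0 at the top.
    box: (left, top, right, bottom) in pixel coordinates, right/bottom exclusive.
    """
    left, top, right, bottom = box
    # A row counts as "ink" if at least one pixel in it is darker than the threshold.
    ink_rows = [
        y for y in range(top, bottom)
        if any(pixels[y][x] < threshold for x in range(left, right))
    ]
    if not ink_rows:
        return None
    return ink_rows[0], ink_rows[-1]


def whitespace_margins(pixels, box, threshold=128):
    """Return (blank rows above the ink, blank rows below the ink) inside box."""
    rows = tight_rows(pixels, box, threshold)
    if rows is None:
        return None
    _, top, _, bottom = box
    first, last = rows
    return first - top, (bottom - 1) - last
```

The two margins, converted back to page units, are the amounts by which the box values from QPL would be adjusted. Since only the pixels inside each word box are scanned, the cost grows with the total box area rather than with the full page size.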

Height of the extracted text : Thank you for your answer.It...

Wed, 29 Aug 2012 14:12:36 +0000

Author: emgi
Subject: 2376
Posted: 29 Aug 12 at 2:12PM

Thank you for your answer.
It would be really useful for my tool.
It is a tool to detect and verify the content of various documents.
To do this, the user defines graphical areas and a list of rules for each area.

My other solution is to analyze the rendered image and thereby deduce the character size. However, the processing time may be very long.

Regards,
Emmanuel

Edited by emgi - 29 Aug 12 at 2:14PM

Height of the extracted text : I have just realised that the...

Wed, 29 Aug 2012 11:38:47 +0000

Author: AndrewC
Subject: 2376
Posted: 29 Aug 12 at 11:38AM

I have just realised that the individual character bounding boxes are not easily available in the font files. We don't need to use the individual character heights when rendering fonts as this is taken care of by the font renderer built in to Windows.

Every font has a different way of storing this information and it would take some considerable effort to extract and store the required values.

The character widths are freely available directly from the PDF structure itself. The character bounding boxes would need to be extracted from each different font type. This would also slow down the rendering process.

It would not be a quick fix to extract this information and it is very unlikely that I can get the developers to implement this feature at the moment.

Andrew.

Edited by AndrewC - 29 Aug 12 at 2:04PM

Height of the extracted text : Hi Andrew,I'm writing...

Wed, 29 Aug 2012 06:36:32 +0000

Author: emgi
Subject: 2376
Posted: 29 Aug 12 at 6:36AM

Hi Andrew,

I'm writing a tool to capture and analyse text that uses graphical areas on rendered pages.

That's why I need this data.

Regards,

Emmanuel

Height of the extracted text : Quick PDF Library returns the...

Wed, 29 Aug 2012 03:11:23 +0000

Author: AndrewC
Subject: 2376
Posted: 29 Aug 12 at 3:11AM

Quick PDF Library returns the full font cell height. The cell height is defined as the Font Ascent + Font Descent. Using these values makes it much easier to group characters into words and words into lines for the advanced text extraction options.

I am wondering why you need the actual character bounding boxes of each word?

Andrew.
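
Because the cell height is Font Ascent + Font Descent, the returned box always reserves room below the baseline for descenders, even when a word has none. As a small illustration of that arithmetic (a sketch only, assuming the font's ascent and descent are already known from somewhere else, which this thread does not cover), the baseline inside a cell box can be located like this. The function name split_cell and its parameters are hypothetical.

```python
def split_cell(bottom, top, ascent, descent):
    """Locate the baseline inside a font cell box.

    bottom, top: y-coordinates of the box edges, with y increasing upwards.
    ascent, descent: positive font metrics in any common unit (e.g. 1/1000 em).
    Returns the y-coordinate of the baseline.
    """
    if ascent <= 0 or descent < 0:
        raise ValueError("ascent must be positive and descent non-negative")
    cell_height = top - bottom
    # The descent share of the cell sits between the bottom edge and the baseline.
    return bottom + cell_height * descent / (ascent + descent)
```

For a box from y = 100 to y = 112 with an assumed ascent of 900 and descent of 300, the baseline falls a quarter of the way up the box. Note that the ascent is usually larger than the height of the capital letters, so the visible glyphs can still be shorter than the part of the box above the baseline.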

Height of the extracted text : Thank you so.Sure that QuickPdfLib...

Tue, 21 Aug 2012 15:43:09 +0000

Author: emgi
Subject: 2376
Posted: 21 Aug 12 at 3:43PM

Thank you.
I agree that QuickPdfLib is a stable library; I have been using it successfully for a long time.
I don't think this is a bug, but I had never done this before.
So I will run some more tests and post my question on the official support pages.
Best regards,

Emmanuel

Edited by emgi - 21 Aug 12 at 4:04PM

Height of the extracted text : So you should substract a little...

Tue, 21 Aug 2012 15:15:48 +0000

Author: Ingo
Subject: 2376
Posted: 21 Aug 12 at 3:15PM

So you should subtract a little bit.
Try a few values to find a matching percentage.
Where's the problem?
If you think it's an error you should post it on the official support pages.
This is the user-to-user forum.
QP is a stable library with many years of development now - I've never seen a question like yours ;-)

Cheers, Ingo
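
The percentage approach suggested here might look something like the sketch below: measure one sample word by hand, derive the fraction of the box to trim at the bottom and at the top, then apply the same fractions to the other boxes. This is an assumption-laden illustration, not a QPL feature; the names calibrate and shrink are hypothetical, and the fractions will differ between fonts and between words with and without descenders.

```python
def calibrate(cell_box, measured_box):
    """Return (bottom_fraction, top_fraction) of the cell height to trim.

    Both boxes are (bottom, top) y-pairs for the same sample word:
    cell_box as reported by the library, measured_box as observed on the page.
    """
    cell_bottom, cell_top = cell_box
    ink_bottom, ink_top = measured_box
    height = cell_top - cell_bottom
    if height <= 0:
        raise ValueError("cell box must have positive height")
    return (ink_bottom - cell_bottom) / height, (cell_top - ink_top) / height


def shrink(cell_box, fractions):
    """Trim a (bottom, top) cell box by the calibrated fractions."""
    bottom, top = cell_box
    bottom_fraction, top_fraction = fractions
    height = top - bottom
    return bottom + height * bottom_fraction, top - height * top_fraction
```

Calibrating once per font and size, rather than once per document, is likely to give a closer match.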

Height of the extracted text : Hi Ingo,Thank you for...

Tue, 21 Aug 2012 14:43:48 +0000

Author: emgi
Subject: 2376
Posted: 21 Aug 12 at 2:43PM

Hi Ingo,
Thank you for your response.

That's what I do (as you can see in the code below).

But the boxes (in blue) are taller than the rendered words (in red).

String txt = pdf.GetPageText(4);

Regards

Edited by emgi - 21 Aug 12 at 2:44PM

Height of the extracted text : Hi emgi!If you use the extract...

Tue, 21 Aug 2012 14:20:16 +0000

Author: Ingo
Subject: 2376
Posted: 21 Aug 12 at 2:20PM

Hi emgi!

If you use the "word by word" extract option then the font height should be correct.
Otherwise, have a look at the x/y values of the string boxes.
See the online reference here:
http://www.quickpdflibrary.com/help/quickpdf/ExtractFilePageText.php

Cheers and welcome here,
Ingo

Edited by Ingo - 21 Aug 12 at 2:21PM
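
To look at the x/y values of the string boxes, the word-by-word output can be parsed into numbers. The sketch below assumes each line is a CSV record in the order font name, colour, size, x1, y1, x2, y2, x3, y3, x4, y4, text; that column order is an assumption and should be checked against the reference page linked above. The function name parse_words is hypothetical.

```python
import csv
import io


def parse_words(csv_text):
    """Parse word records into dicts with the text, font size and box extents.

    Assumed column order (an assumption, check the reference):
    font name, colour, size, x1, y1, x2, y2, x3, y3, x4, y4, text.
    """
    words = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        xs = [float(row[i]) for i in (3, 5, 7, 9)]
        ys = [float(row[i]) for i in (4, 6, 8, 10)]
        words.append({
            "font": row[0],
            "size": float(row[2]),
            "left": min(xs),
            "right": max(xs),
            "bottom": min(ys),
            "top": max(ys),
            "height": max(ys) - min(ys),
            # Rejoin the tail in case unquoted text contained commas.
            "text": ",".join(row[11:]),
        })
    return words
```

Comparing each record's "height" with its "size" shows how much of the box is font cell rather than visible glyph.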