Height of the extracted text

emgi
Beginner
Joined: 21 Aug 12

Posted: 21 Aug 12 at 11:10AM

Is it possible to get the real text bounding box using the text extraction functions?

The values returned by the GetPageText() function are the "maximum" values for the font.

The height of an extracted piece of text, as given by the CSV string, is larger than the rendered text.

Thanks for your help.

Ingo
Moderator Group
Joined: 29 Oct 05

Posted: 21 Aug 12 at 2:20PM

Hi emgi!

If you use the "word by word" extraction option, the font height should be correct.
Otherwise, have a look at the x/y values of the string boxes.
See the online reference here:
http://www.quickpdflibrary.com/help/quickpdf/ExtractFilePageText.php

Cheers and welcome here,
Ingo

Edited by Ingo - 21 Aug 12 at 2:21PM

emgi

Posted: 21 Aug 12 at 2:43PM

Hi Ingo,
Thank you for your response.

That's what I do (as you can see in the code below).

However, the boxes (in blue) are taller than the rendered words (in red).

String txt = pdf.GetPageText(4);

Regards

Edited by emgi - 21 Aug 12 at 2:44PM
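
For readers who want to measure the box heights discussed here, the CSV string can be parsed along the following lines. This is a Python sketch, not the library's own code. It assumes each row has the layout font name, colour, size, X1, Y1, X2, Y2, X3, Y3, X4, Y4, text (the four corners of the box); that layout is an assumption and should be checked against the ExtractFilePageText reference for the version in use. The function name `parse_page_text` and the sample row are made up for illustration.

```python
import csv
import io

def parse_page_text(csv_text):
    """Parse rows assumed to be: font, colour, size, x1, y1, ..., x4, y4, text."""
    boxes = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 12:
            continue  # skip blank or malformed rows
        xs = [float(v) for v in row[3:11:2]]  # x1, x2, x3, x4
        ys = [float(v) for v in row[4:11:2]]  # y1, y2, y3, y4
        boxes.append({
            "text": row[11],
            "font_size": float(row[2]),
            "left": min(xs),
            "right": max(xs),
            "bottom": min(ys),
            "top": max(ys),
            "height": max(ys) - min(ys),
        })
    return boxes

# Made-up sample row, not real library output.
sample = '"Arial","#000000",12,72,700,110.5,700,110.5,713.4,72,713.4,"Hello"\n'
boxes = parse_page_text(sample)
```

Comparing `height` with `font_size` for each row shows how much taller the reported box is than the nominal text size.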

Ingo

Posted: 21 Aug 12 at 3:15PM

So you should subtract a little bit.
Try a few percentages until the boxes match.
Where's the problem?
If you think it's a bug, you should post it on the official support pages.
This is the user-to-user forum.
QP is a stable library with many years of development behind it; I've never had a question like yours before ;-)

Cheers, Ingo
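
The trial-and-error adjustment suggested above could be sketched as follows. This is a Python illustration only: `shrink_box` is a hypothetical helper, and the trim fractions are placeholders that would have to be tuned per font by comparing the result with the rendered page.

```python
def shrink_box(bottom, top, trim_top=0.0, trim_bottom=0.0):
    """Trim fractions of the box height off the top and bottom edges.

    Assumes bottom < top (origin at the bottom-left of the page).
    """
    height = top - bottom
    return bottom + height * trim_bottom, top - height * trim_top

# Illustrative values only: a 13.4-unit box trimmed 20% on top, 15% below.
new_bottom, new_top = shrink_box(700.0, 713.4, trim_top=0.20, trim_bottom=0.15)
```

A single pair of fractions will not fit every word, since words without ascenders or descenders occupy less of the box than words that have them.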

emgi

Posted: 21 Aug 12 at 3:43PM

Thank you.
I agree that Quick PDF Library is a stable library; I have been using it successfully for a long time!
I don't think this is a bug, but I have never done this before.
So I will run some more tests and post my question on the official support pages.
Best regards,

Emmanuel

Edited by emgi - 21 Aug 12 at 4:04PM

AndrewC
Moderator Group
Joined: 08 Dec 10
Location: Geelong, Aust

Posted: 29 Aug 12 at 3:11AM

Quick PDF Library returns the full font cell height. The cell height is defined as the Font Ascent + Font Descent. Using these values makes it much easier to group characters into words and words into lines for the advanced text extraction options.

I am wondering why you need the actual character bounding boxes for each word?

Andrew.
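
The cell-height definition above can be made concrete with a small calculation. The Python sketch below assumes the ascent, descent and cap height are given as positive values in font design units; the numbers are illustrative and not read from any real font file, and both function names are hypothetical.

```python
def cell_height(font_size, ascent, descent, units_per_em=1000):
    """Full font cell height: (ascent + descent) scaled to the font size.

    ascent and descent are positive magnitudes in font design units.
    """
    return font_size * (ascent + descent) / units_per_em

def cap_height_fraction(cap_height, ascent, descent):
    """Share of the cell that an unaccented capital letter occupies."""
    return cap_height / (ascent + descent)

# Illustrative metrics only (not read from any real font file).
size = 12.0
ascent, descent, cap = 900, 200, 700
cell = cell_height(size, ascent, descent)             # 13.2
fraction = cap_height_fraction(cap, ascent, descent)  # about 0.64
```

With these example metrics a capital letter fills under two thirds of the reported box, which is consistent with the boxes appearing taller than the rendered words.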

emgi

Posted: 29 Aug 12 at 6:36AM

Hi Andrew,

I'm writing a tool to capture and analyse text that uses graphical areas on rendered pages.

That's why I need this data.

Regards,

Emmanuel

AndrewC

Posted: 29 Aug 12 at 11:38AM

I have just realised that the individual character bounding boxes are not easily available in the font files. We don't need to use the individual character heights when rendering fonts, as this is taken care of by the font renderer built into Windows.

Every font has a different way of storing this information and it would take some considerable effort to extract and store the required values.

The character widths are freely available directly from the PDF structure itself. The character bounding boxes would need to be extracted from each different font type. This would also slow down the rendering process.

It would not be a quick fix to extract this information and it is very unlikely that I can get the developers to implement this feature at the moment.

Andrew.

Edited by AndrewC - 29 Aug 12 at 2:04PM

emgi

Posted: 29 Aug 12 at 2:12PM

Thank you for your answer.
It would be really useful for my tool.
It is a tool to detect and verify the content of various documents.
To do this, the user defines graphical areas and a list of rules for each area.

My other solution is to analyze the rendered image and thereby deduce the character size. However, the processing time may be very long.

Regards,
Emmanuel

Edited by emgi - 29 Aug 12 at 2:14PM

AndrewC

Posted: 29 Aug 12 at 2:18PM

If it is graphical then I suspect you are rendering the PDF to an image. You could use this image and the bounding box to extract the word into a smaller image and then analyse the smaller image to find the extent of the whitespace. You can then adjust the values from QPL by the whitespace values that you have calculated.

Andrew.
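
The whitespace analysis described above could be sketched like this. It is a minimal Python illustration under several assumptions: the rendered page is already available as rows of grayscale values, the QPL box has already been converted to pixel coordinates, and text is dark on a light background. The name `tight_rows` and the `tolerance` threshold are hypothetical.

```python
def tight_rows(pixels, left, top, right, bottom, white=255, tolerance=10):
    """Return (first_row, last_row) inside the box that contain ink, or None.

    pixels is a list of rows of grayscale values (0 = black, 255 = white);
    the box uses image coordinates with right and bottom exclusive.
    """
    inked = [
        y for y in range(top, bottom)
        if any(pixels[y][x] < white - tolerance for x in range(left, right))
    ]
    if not inked:
        return None
    return inked[0], inked[-1]

# Tiny made-up "rendered page": 6 rows x 4 columns, ink only in rows 2 and 3.
page = [[255] * 4 for _ in range(6)]
page[2][1] = 0
page[3][2] = 40

rows = tight_rows(page, left=0, top=0, right=4, bottom=6)
```

The rows returned mark the top and bottom of the visible glyphs; the gap between them and the original box edges is the whitespace to subtract from the QPL values after converting back to page units. Only the pixels inside each word box are scanned, which keeps the cost well below analysing the whole page image.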

emgi

Posted: 29 Aug 12 at 2:22PM

That is exactly it!
