Extract text next to a Tag field

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=2623
Printed Date: 08 Jan 26 at 9:47PM

Topic: Extract text next to a Tag field

Posted By: chrisreed
Date Posted: 29 Apr 13 at 6:42AM

I am looking to trial QuickPDF to see if it can extract text values from a medical report as follows:

Exam Date: 23/11/2011 DOB: 26/03/1966 MRN: C1234567

Referring Dr: A, Smith Sonographer: G Perry

etc....

The idea is to locate certain predefined Tag Fields in bold and then read the text value next to them.

E.g., I would use some QuickPDF function to search for the tag "Exam Date:" and then read in the text value right next to it (23/11/2011).

Is this something that QuickPDF can do and what function would I call?

Thanks Chris.

Replies:

Posted By: Ingo
Date Posted: 29 Apr 13 at 7:30AM

Hi Chris!

It looks as if the tag text is always in the same place.

In that case you can use the text extraction functions. They offer an additional option to extract text together with position data, so you can pin down exactly where the string you want is located.

Another approach: search the extracted text content of a page for your tags with "Pos" (in Delphi) and take the text that follows each one.

You can use this function for both of my ideas:

http://www.quickpdflibrary.com/help/quickpdf/GetPageText.php

Cheers and welcome here,

Ingo

Posted By: AndrewC
Date Posted: 29 Apr 13 at 7:47AM

Chris,

As Ingo suggests, the easiest option to get working would be to use GetPageText(7) to get the formatted raw text and then do some string searching to find the text you need. QPL has no such concept as "near to" or "to the right of".

Andrew.
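
A minimal sketch of the string-search approach Ingo and Andrew describe, written in Python rather than Delphi. It assumes the page text has already been extracted into a plain string (for example via GetPageText) and that every tag name is known in advance. The function name extract_tagged_values and the sample text are illustrative only and are not part of the library.

```python
import re


def extract_tagged_values(page_text, tags):
    """Return {tag: value} for every known tag found in page_text.

    A value is taken to run from the end of its tag up to the next
    known tag or the end of the line, whichever comes first.
    """
    # Longest tags first, so "Exam Date:" is preferred over a shorter "Date:".
    ordered = sorted(tags, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    pattern = re.compile(
        rf"({alternation})\s*(.*?)\s*(?=(?:{alternation})|$)",
        re.MULTILINE,
    )
    return {m.group(1): m.group(2) for m in pattern.finditer(page_text)}


# Sample text copied from the first post; a real page would come from the extractor.
sample = (
    "Exam Date: 23/11/2011 DOB: 26/03/1966 MRN: C1234567\n"
    "Referring Dr: A, Smith Sonographer: G Perry\n"
)
known_tags = ["Exam Date:", "DOB:", "MRN:", "Referring Dr:", "Sonographer:"]

for tag, value in extract_tagged_values(sample, known_tags).items():
    print(tag, "->", value)
```

Because this works on plain text only, it cannot tell a bold tag from the same words appearing in the body of the report, and if a tag occurs more than once the last occurrence wins.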

Posted By: chrisreed
Date Posted: 29 Apr 13 at 12:23PM

Yes, the header text will always be in the same position, but there are other tags further down in the report which can be anywhere. I was hoping to search on certain key tag names that are also in bold, so I don't accidentally pick up similar text in the main report.

It looks like I can use SetTextExtractionOptions to get the details of the word formats (font, colour, size, etc.), but I wonder whether that also tells me if the text is bold or not?

Anyway thanks for that info. I will install the trial version and give it a go.
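
One possible way to combine position data with a bold check, sketched in Python under several assumptions. The thread does not confirm that the library reports boldness directly, so this sketch guesses it from the font name (many, but not all, bold fonts have "Bold" in their name). The TextRun record, its fields, and the sample values are hypothetical stand-ins for whatever the formatted extraction output actually contains, and the sketch assumes each tag arrives as a single run.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """Hypothetical record for one extracted piece of text."""
    text: str
    font: str   # font name as reported by the extractor (assumed available)
    x: float    # left edge of the run
    y: float    # baseline of the run


def is_bold(run):
    # Heuristic only: relies on the font name containing "bold".
    return "bold" in run.font.lower()


def value_right_of_tag(runs, tag, y_tolerance=2.0):
    """Return the non-bold text to the right of a bold tag on the same line."""
    for label in runs:
        if label.text.strip() != tag or not is_bold(label):
            continue
        # Runs on the same line and to the right of the tag, left to right.
        same_line = sorted(
            (r for r in runs
             if abs(r.y - label.y) <= y_tolerance and r.x > label.x),
            key=lambda r: r.x,
        )
        parts = []
        for r in same_line:
            if is_bold(r):  # the next bold run is taken to be the next tag
                break
            parts.append(r.text.strip())
        return " ".join(parts)
    return None  # no bold run with this exact text was found


# Made-up sample data for illustration.
runs = [
    TextRun("Exam Date:", "Arial-BoldMT", 50, 700),
    TextRun("23/11/2011", "ArialMT", 120, 700),
    TextRun("DOB:", "Arial-BoldMT", 200, 700),
    TextRun("26/03/1966", "ArialMT", 240, 700),
    TextRun("Exam Date:", "ArialMT", 50, 400),  # same words, not bold
    TextRun("unknown", "ArialMT", 120, 400),
]
print(value_right_of_tag(runs, "Exam Date:"))
```

The non-bold "Exam Date:" run further down the page is skipped, which is the behaviour Chris is after for tags that also appear in the main report text.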

Posted By: chrisreed
Date Posted: 30 Apr 13 at 10:04AM

Well I got it partially working but I have come across a few things I don't understand:

If I use the function LoadFromFile to open a PDF, then the HasFontResources function returns "1" (i.e. the PDF document has NOT been scanned in and so has readable text).

If I open the same file using DAOpenFile or DAOpenFileReadOnly, it returns "0". Why is this?

Also, is the general idea to use DA functions only with other DA functions?

i.e. DAOpenFileReadOnly -> DASetTextExtractionOptions -> DAExtractPageText

and LoadFromFile -> SetTextExtractionOptions -> ExtractFilePageText

or can we mix and match between the different functions?

Chris

Posted By: Ingo
Date Posted: 30 Apr 13 at 10:44AM

Hi Chris!

Don't mix DA and non-DA functions ;-)

DA functions need less memory, so they are a good way to avoid memory problems while working on large documents.

Cheers, Ingo

Posted By: chrisreed
Date Posted: 30 Apr 13 at 10:49AM

Thanks for that Ingo - any idea about the HasFontResources problem?

Chris

Posted By: Ingo
Date Posted: 30 Apr 13 at 10:56AM

Hi Chris!

Which problem?

HasFontResources is a non-DA-function - so don't use it with DAOpen...

A DA-function begins with "DA" ;-)

Cheers, Ingo

Posted By: chrisreed
Date Posted: 30 Apr 13 at 11:02AM

Ah, I see! But there doesn't seem to be an equivalent DAHasFontResources function, so how can I determine whether a PDF has been scanned in (i.e. is an image) or has been created (i.e. has text)?

Chris