
Extract text next to a Tag field

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=2623
Printed Date: 01 Jul 24 at 5:19AM


Posted By: chrisreed
Date Posted: 29 Apr 13 at 6:42AM
I am looking to trial QuickPDF to see if it can extract text values from a medical report as follows:
 
Exam Date: 23/11/2011     DOB: 26/03/1966     MRN: C1234567
 
Referring Dr: A, Smith      Sonographer: G Perry
 
etc....
 
The idea is to locate certain predefined Tag Fields in bold and then read the text value next to them.
 
eg. I would use some QuickPDF function to search for the Tag Exam Date: and then read in the text value right next to this (23/11/2011).
 
Is this something that QuickPDF can do and what function would I call?
 
Thanks Chris.



Replies:
Posted By: Ingo
Date Posted: 29 Apr 13 at 7:30AM
Hi Chris!
 
It looks as if the tag text is always in the same place.
In that case you can use the text extraction functions. They offer
an additional option to extract text together with position data, so
you can pin down exactly where the string you want is located.
Another approach: search the extracted text content of a page
for your tags with "pos" (in Delphi) and take the text that
follows each tag.
You can use this function for both ideas:
http://www.quickpdflibrary.com/help/quickpdf/GetPageText.php
 
Cheers and welcome here,
Ingo
 


Posted By: AndrewC
Date Posted: 29 Apr 13 at 7:47AM
Chris,

As Ingo suggests, the easiest option to get working would be to use GetPageText(7) to get the formatted raw text and then do some string searching to find the text you need.  QPL has no such concept as "near to" or "to the right of".

Andrew.
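
A minimal sketch of the string-searching approach suggested above, written in Python rather than Delphi. It assumes the page text has already been obtained from the library (for example via GetPageText) and is available as a plain string; the helper name `extract_tag_values` and the tag list are hypothetical and not part of Quick PDF Library. The value of each tag is taken to be whatever follows it on the same line, up to the next known tag.

```python
import re


def extract_tag_values(page_text, tags):
    """Return {tag: value} for each known tag found in the extracted text.

    Hypothetical helper: the value is the text after "Tag:" up to the
    next known tag on the same line, or to the end of the line.
    """
    # Longest tags first so a longer tag wins over a shorter one it contains.
    ordered = sorted(tags, key=len, reverse=True)
    tag_pattern = re.compile("|".join(re.escape(t) + ":" for t in ordered))

    values = {}
    for line in page_text.splitlines():
        matches = list(tag_pattern.finditer(line))
        for i, match in enumerate(matches):
            # The value ends where the next tag starts (or at end of line).
            end = matches[i + 1].start() if i + 1 < len(matches) else len(line)
            tag = match.group(0)[:-1]  # drop the trailing colon
            # Keep the first occurrence of each tag.
            values.setdefault(tag, line[match.end():end].strip())
    return values


# Sample text taken from the report layout in the original question.
sample = (
    "Exam Date: 23/11/2011     DOB: 26/03/1966     MRN: C1234567\n"
    "\n"
    "Referring Dr: A, Smith      Sonographer: G Perry\n"
)
TAGS = ["Exam Date", "DOB", "MRN", "Referring Dr", "Sonographer"]
result = extract_tag_values(sample, TAGS)
print(result)
```

This only works when the extracted text keeps each tag and its value on the same line, which depends on the extraction option used and on the layout of the PDF.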


Posted By: chrisreed
Date Posted: 29 Apr 13 at 12:23PM
Yes the Header text will always be in the same position, but there are other Tags further down in the report which can be anywhere.  I was hoping to search on certain key tag names that were also in BOLD format so I don't accidentally choose similar text in the main report.
 
It looks like I can use SetTextExtractionOptions to get the details of the word formats (Font, Colour, Size etc..) but I wonder if that also includes whether it is bold or not?
 
Anyway thanks for that info.  I will install the trial version and give it a go.
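
One possible way to approach the bold question, sketched in Python under stated assumptions: text in a PDF is usually bold because of the font it is set in, so a common heuristic is to check whether the font name contains "Bold". The sketch assumes the extraction output has already been parsed into (font_name, text) pairs in reading order; how that output is laid out should be checked against the library documentation. The function name `find_bold_tags` and the sample font names are hypothetical.

```python
def find_bold_tags(fragments, tags):
    """Return {tag: value} for known tags that are set in a bold font.

    fragments: list of (font_name, text) pairs in reading order
               (assumed to be parsed from the extraction output already).
    """
    def is_bold(font_name):
        # Heuristic only: relies on the font name, e.g. "Arial-BoldMT".
        return "bold" in font_name.lower()

    found = {}
    for i, (font_name, text) in enumerate(fragments):
        label = text.strip().rstrip(":")
        if label not in tags or not is_bold(font_name):
            continue  # unknown tag, or same words in normal weight
        value = ""
        if i + 1 < len(fragments):
            next_font, next_text = fragments[i + 1]
            # Treat the following non-bold fragment as the tag's value.
            if not is_bold(next_font):
                value = next_text.strip()
        found[label] = value
    return found


# Made-up fragments for illustration; the non-bold "MRN:" is ignored.
fragments = [
    ("Arial-BoldMT", "Exam Date:"),
    ("ArialMT", "23/11/2011"),
    ("ArialMT", "MRN:"),
    ("ArialMT", "see above"),
    ("Arial-BoldMT", "MRN:"),
    ("ArialMT", "C1234567"),
]
bold_values = find_bold_tags(fragments, {"Exam Date", "MRN"})
print(bold_values)
```

The font-name check will miss bold text whose font name does not say so, so it is worth inspecting the font names in a few real reports first.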
 
 


Posted By: chrisreed
Date Posted: 30 Apr 13 at 10:04AM
Well I got it partially working but I have come across a few things I don't understand:
 
If I use the function LoadFromFile to open a PDF then the HasFontResources function returns "1" (ie. PDF document has NOT been scanned in and so has readable text).
 
If I open the same file using DAOpenFile or DAOpenFileReadOnly it returns "0". Why is this?
 
 
Also is the general idea to use DA Functions only with other DA Functions....
ie. DAOpenFileReadOnly -> DASetTextExtractionOptions -> DAExtractPageText
and LoadFromFile -> SetTextExtractionOptions -> ExtractFilePageText
 
or can we mix and match between the different functions?
 
Chris


Posted By: Ingo
Date Posted: 30 Apr 13 at 10:44AM
Hi Chris!
 
Don't mix DA- and non-DA-functions ;-)
DA-functions need less memory, so they are a
good way to avoid memory problems while working
on large documents.
 
Cheers, Ingo
 


Posted By: chrisreed
Date Posted: 30 Apr 13 at 10:49AM
Thanks for that Ingo - any idea about the HasFontResources problem?
 
Chris


Posted By: Ingo
Date Posted: 30 Apr 13 at 10:56AM
Hi Chris!
 
Which problem?
HasFontResources is a non-DA-function - so don't use it with DAOpen...
A DA-function begins with "DA" ;-)
 
Cheers, Ingo
 


Posted By: chrisreed
Date Posted: 30 Apr 13 at 11:02AM
Ah! I see, but there doesn't seem to be an equivalent DAHasFontResources function, so how can I determine if a PDF has been scanned in (ie. an image) or has been created (ie. has text)?
 
Chris


