
Using ExtractFilePageText with incorrect results

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=1629
Printed Date: 21 Sep 24 at 2:13AM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com


Topic: Using ExtractFilePageText with incorrect results
Posted By: gcaffe
Subject: Using ExtractFilePageText with incorrect results
Date Posted: 04 Nov 10 at 7:28PM
Hi:
 
I have a Delphi 2009 application that calls ExtractFilePageText as follows:

          QP.LoadFromFile(edFilePathPdf.Text); // Load the PDF into memory
          PageCount := QP.PageCount(); // Count the number of pages in the document
          for i := 1 to PageCount do begin // Go through all the pages (1..PageCount)
            // Extract the text of the whole PDF, page by page
            TextOutput := TextOutput + QP.ExtractFilePageText(edFilePathPdf.Text, '', i, 3);
          end;

I write TextOutput to a text file, which looks like this (excerpt only):
"CourierNew",#000000,10.00,169.0700,28.7007,169.0700,820.7007,161.2100,820.7007,161.2100,28.7007,"030EUR3744014877 2 34 04/09/10 74.00 47.44 121.44 DOMESTIC"
"CourierNew",#000000,10.00,177.0700,28.7007,177.0700,820.7007,169.2100,820.7007,169.2100,28.7007,"030EUR3744014878 3 34 04/09/10 74.00 47.44 121.44 DOMESTIC"
"CourierNew",#000000,10.00,185.0700,28.7007,185.0700,820.7007,177.2100,820.7007,177.2100,28.7007,"996EUR3744014889 0 234 14/09/10 36.00 30.52 0.37 0.13 0.02 66 , 37 DOMESTIC "
"CourierNew",#000000,10.00,193.0700,28.7007,193.0700,820.7007,185.2100,820.7007,185.2100,28.7007,"996EUR3744014890 1 234 14/09/10 36.00 30.52 0.37 0.13 0.02 66 , 37 DOMESTIC "

As you can see, once the coordinate information is stripped, the fields are easy to identify by character position. So far so good.

However, running the same Delphi code against another PDF produces different output in TextOutput, as if I had called QP.ExtractFilePageText(edFilePathPdf.Text, '', i, 4), e.g.:

"CourierNew",#000000,10.00,169.0700,28.7007,169.0922,154.6681,161.2322,154.6681,161.2100,28.7007,"680EUR1656635612 June 1934"
"CourierNew",#000000,10.00,169.0700,172.6596,169.0700,220.6596,161.2100,220.6596,161.2100,172.6596,"10/06/1910"
"CourierNew",#000000,10.00,169.0700,244.6393,169.0700,274.6393,161.2100,274.6393,161.2100,244.6393,"94.00"
"CourierNew",#000000,10.00,169.0922,358.6066,169.0922,388.6066,161.2322,388.6066,161.2322,358.6066,"13.13"
"CourierNew",#000000,10.00,169.0922,406.5929,169.0922,430.5929,161.2322,430.5929,161.2322,406.5929,"0.40"
"CourierNew",#000000,10.00,169.0700,460.5776,169.0700,484.5776,161.2100,484.5776,161.2100,460.5776,"0.38"
"CourierNew",#000000,10.00,169.0700,544.5538,169.0700,568.5538,161.2100,568.5538,161.2100,544.5538,"0.07"
"CourierNew",#000000,10.00,169.0700,652.5233,169.0700,688.5233,161.2100,688.5233,161.2100,652.5233,"106.68"
"CourierNew",#000000,10.00,169.0922,754.4946,169.0922,802.4946,161.2322,802.4946,161.2322,754.4946,"DOMESTIC"
"CourierNew",#000000,10.00,177.0699,28.7006,177.0992,154.6680,169.2392,154.6680,169.2099,28.7006,"680EUR1656635630 March 1934"
"CourierNew",#000000,10.00,177.0699,172.6595,177.0699,220.6595,169.2099,220.6595,169.2099,172.6595,"16/10/1910"
"CourierNew",#000000,10.00,177.0699,244.6392,177.0699,274.6392,169.2099,274.6392,169.2099,244.6392,"54.00"
"CourierNew",#000000,10.00,177.0992,358.6065,177.0992,388.6065,169.2392,388.6065,169.2392,358.6065,"35.04"
"CourierNew",#000000,10.00,177.0992,406.5928,177.0992,430.5928,169.2392,430.5928,169.2392,406.5928,"0.40"
"CourierNew",#000000,10.00,177.0699,460.5775,177.0699,484.5775,169.2099,484.5775,169.2099,460.5775,"0.22"
"CourierNew",#000000,10.00,177.0699,544.5537,177.0699,568.5537,169.2099,568.5537,169.2099,544.5537,"0.04"
"CourierNew",#000000,10.00,177.0992,658.5215,177.0992,688.5215,169.2392,688.5215,169.2392,658.5215,"88.78"
"CourierNew",#000000,10.00,177.0992,754.4942,177.0992,802.4942,169.2392,802.4942,169.2392,754.4942,"DOMESTIC"

This does not allow me to identify the fields by position.
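
One possible way to recover fixed-position fields from word-by-word output like the excerpt above is to place each fragment at the character column implied by its start coordinate. The following is only a sketch under stated assumptions, not a Quick PDF Library feature. The TFragment record, the Frag and PlaceFragment helpers, and the three constants are hypothetical names introduced for illustration. It assumes the font is monospaced (Courier New at 10 pt advances 6 pt per character, which matches "DOMESTIC" spanning 48 pt in the excerpt), that fragments arrive grouped by row, and that the first coordinate is constant along a row while the second advances along it, as in this excerpt. The fragment values are copied from the output quoted above.

```pascal
program RebuildRows;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TFragment = record
    RowPos: Double;   // coordinate shared by all fragments on one printed row
    StartPos: Double; // coordinate where the fragment starts along that row
    Text: string;
  end;

const
  CharWidth    = 6.0;     // assumed: 10 pt Courier New advances 6 pt per character
  LeftMargin   = 28.7007; // assumed: smallest start coordinate in the excerpt
  RowTolerance = 0.5;     // assumed: a larger jump in RowPos starts a new row

function Frag(ARowPos, AStartPos: Double; const AText: string): TFragment;
begin
  Result.RowPos := ARowPos;
  Result.StartPos := AStartPos;
  Result.Text := AText;
end;

// Writes F.Text into Line at the column implied by F.StartPos,
// padding Line with spaces first if it is too short.
procedure PlaceFragment(var Line: string; const F: TFragment);
var
  Col, I: Integer;
begin
  Col := Round((F.StartPos - LeftMargin) / CharWidth) + 1; // strings are 1-based
  if Col < 1 then
    Col := 1;
  while Length(Line) < Col + Length(F.Text) - 1 do
    Line := Line + ' ';
  for I := 1 to Length(F.Text) do
    Line[Col + I - 1] := F.Text[I];
end;

var
  Frags: array[0..6] of TFragment;
  Line: string;
  I: Integer;
begin
  // (row coordinate, start coordinate, text) taken from the excerpt
  Frags[0] := Frag(169.0700, 244.6393, '94.00');
  Frags[1] := Frag(169.0922, 358.6066, '13.13');
  Frags[2] := Frag(169.0922, 406.5929, '0.40');
  Frags[3] := Frag(169.0922, 754.4946, 'DOMESTIC');
  Frags[4] := Frag(177.0699, 244.6392, '54.00');
  Frags[5] := Frag(177.0992, 358.6065, '35.04');
  Frags[6] := Frag(177.0992, 754.4942, 'DOMESTIC');

  Line := '';
  for I := Low(Frags) to High(Frags) do
  begin
    // A jump in the row coordinate means the previous row is complete.
    if (I > Low(Frags)) and
       (Abs(Frags[I].RowPos - Frags[I - 1].RowPos) > RowTolerance) then
    begin
      Writeln(Line);
      Line := '';
    end;
    PlaceFragment(Line, Frags[I]);
  end;
  if Line <> '' then
    Writeln(Line);
end.
```

Each rebuilt row is a fixed-width string again, so fields can be cut out by character position as with the first PDF. The tolerance is needed because fragments of the same row differ slightly in their row coordinate (169.0700 versus 169.0922 above).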

From the tests I have done with library versions 0718 and 0721, in both Delphi 2009 and Delphi 2010, I suspect the cause is the PDF itself.

What conditions must a PDF meet for the data to be extracted as in the first case?
Is there another function that always returns the output in the first form?

Please help me with this problem; I have tried everything in my power.

Thank you very much.



Replies:
Posted By: Ingo
Date Posted: 05 Nov 10 at 10:47AM
Hi!

With option "0" you get the PDF content without the leading data (positions, font, ...).
With option "3" you get the content as strings, in the order they were inserted (first in, first out), with positions and so on.
Option "4" works like option "3" but word by word.
You can read what the position data stands for in the reference (x1, y1, x2, y2).
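
As an illustration of that layout, here is a minimal sketch that splits one output line into its fields. It assumes the twelve-field layout visible in the output quoted in this thread (font name, color, size, four coordinate pairs, text) and uses only standard RTL classes, not any Quick PDF Library call. The sample line is copied from a later post in this thread. Text that itself contains a double quote is not handled.

```pascal
program ParseExtractLine;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

const
  // One line copied from the option 4 output quoted in this thread.
  SampleLine = '"Verdana",#000000,9.48,114.8400,750.2098,120.1111,750.2098,' +
               '120.1111,750.2098,114.8400,750.2098,"L"';

var
  Fields: TStringList;
  FS: TFormatSettings;
  I: Integer;
begin
  // Parse numbers with '.' as decimal separator, whatever the system locale is.
  FS.DecimalSeparator := '.';
  FS.ThousandSeparator := #0;

  Fields := TStringList.Create;
  try
    Fields.StrictDelimiter := True; // split on commas only, not on spaces
    Fields.Delimiter := ',';
    Fields.QuoteChar := '"';        // quoted text may itself contain commas
    Fields.DelimitedText := SampleLine;

    if Fields.Count <> 12 then
      Writeln('Unexpected number of fields: ', Fields.Count)
    else
    begin
      Writeln('Font:  ', Fields[0]);
      Writeln('Color: ', Fields[1]);
      Writeln('Size:  ', StrToFloat(Fields[2], FS):0:2);
      // Fields 3..10 hold four coordinate pairs.
      for I := 0 to 3 do
        Writeln('Point ', I + 1, ': ',
          StrToFloat(Fields[3 + 2 * I], FS):0:4, ', ',
          StrToFloat(Fields[4 + 2 * I], FS):0:4);
      Writeln('Text:  "', Fields[11], '"');
    end;
  finally
    Fields.Free;
  end;
end.
```

Setting StrictDelimiter matters here: without it TStringList also splits on spaces, which would break text such as "030EUR3744014877 2 34 ..." into several fields.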
Where is your problem now?

Cheers and welcome here, Ingo



Posted By: gcaffe
Date Posted: 16 Nov 10 at 7:54PM
Hello
Thanks for your reply; I was able to solve the problem.
I have a question: I have a license for Quick PDF Library V7.18 (Single Developer License, Upgrade Protection Standard). Can I upgrade to Library V7.21 for Delphi 2010? I downloaded that version, but when compiling, the compiler needs the file QuickPDF0721.pas, which is not in the installation file.
Thank you
gcaffe


Posted By: Wheeley
Date Posted: 16 Nov 10 at 11:54PM
Did you check the directory <install directory>\DLL\Import\Delphi?

Wheeley


Posted By: Sankara
Date Posted: 21 Apr 11 at 1:04PM
I encountered a problem with the ExtractFilePageText function: it returns incorrect values for some files.
I am using ExtractFilePageText with option '4'.
The results are as follows.
"Arial",#000000,9.96,105.0000,750.2098,107.7591,750.2098,107.7591,750.2098,105.0000,750.2098," "
"Verdana",#000000,9.48,114.8400,750.2098,120.1111,750.2098,120.1111,750.2098,114.8400,750.2098,"L"
"Verdana",#000000,9.48,120.1200,750.2098,125.8367,750.2098,125.8367,750.2098,120.1200,750.2098,"P"
"Verdana",#000000,9.48,125.8800,750.2098,133.2273,750.2098,133.2273,750.2098,125.8800,750.2098,"G"
"Verdana",#000000,9.48,133.2000,750.2098,137.5041,750.2098,137.5041,750.2098,133.2000,750.2098,"/"
"Verdana",#000000,9.48,137.5200,750.2098,144.1373,750.2098,144.1373,750.2098,137.5200,750.2098,"C"
"Verdana",#000000,9.48,144.1200,750.2098,150.1116,750.2098,150.1116,750.2098,144.1200,750.2098,"h"
"Verdana",#000000,9.48,150.1200,750.2098,155.7608,750.2098,155.7608,750.2098,150.1200,750.2098,"e"
"Verdana",#000000,9.48,155.7600,750.2098,164.9750,750.2098,164.9750,750.2098,155.7600,750.2098,"m"
"Verdana",#000000,9.48,165.0000,750.2098,167.5976,750.2098,167.5976,750.2098,165.0000,750.2098,"i"
"Verdana",#000000,9.48,167.6400,750.2098,172.5698,750.2098,172.5698,750.2098,167.6400,750.2098,"c"
"Verdana",#000000,9.48,172.5600,750.2098,178.2483,750.2098,178.2483,750.2098,172.5600,750.2098,"a"
"Verdana",#000000,9.48,178.3200,750.2098,180.9176,750.2098,180.9176,750.2098,178.3200,750.2098,"l"
"Verdana",#000000,9.48,180.8400,750.2098,184.1676,750.2098,184.1676,750.2098,180.8400,750.2098," "
"Verdana",#000000,9.48,184.2000,750.2098,189.1298,750.2098,189.1298,750.2098,184.2000,750.2098,"c"
"Verdana",#000000,9.48,189.2400,750.2098,194.9283,750.2098,194.9283,750.2098,189.2400,750.2098,"a"
"Verdana",#000000,9.48,194.8800,750.2098,198.9186,750.2098,198.9186,750.2098,194.8800,750.2098,"r"
"Verdana",#000000,9.48,198.9600,750.2098,204.8663,750.2098,204.8663,750.2098,198.9600,750.2098,"g"
"Verdana",#000000,9.48,204.8400,750.2098,210.5851,750.2098,210.5851,750.2098,204.8400,750.2098,"o"
"Verdana",#000000,9.48,210.6000,750.2098,213.9276,750.2098,213.9276,750.2098,210.6000,750.2098," "
"Verdana",#000000,9.48,213.9600,750.2098,219.5629,750.2098,219.5629,750.2098,213.9600,750.2098,"v"
"Verdana",#000000,9.48,219.6000,750.2098,225.2883,750.2098,225.2883,750.2098,219.6000,750.2098,"a"
"Verdana",#000000,9.48,225.2400,750.2098,231.1463,750.2098,231.1463,750.2098,225.2400,750.2098,"p"
"Verdana",#000000,9.48,231.1200,750.2098,236.8651,750.2098,236.8651,750.2098,231.1200,750.2098,"o"
"Verdana",#000000,9.48,237.0000,750.2098,242.9916,750.2098,242.9916,750.2098,237.0000,750.2098,"u"
"Verdana",#000000,9.48,243.0000,750.2098,247.0386,750.2098,247.0386,750.2098,243.0000,750.2098,"r"
"Verdana",#000000,9.48,247.0800,750.2098,252.0098,750.2098,252.0098,750.2098,247.0800,750.2098,"s"
"Verdana",#000000,9.48,252.0000,750.2098,255.3276,750.2098,255.3276,750.2098,252.0000,750.2098," "
"Verdana",#000000,9.48,255.3600,750.2098,259.0953,750.2098,259.0953,750.2098,255.3600,750.2098,"t"
"Verdana",#000000,9.48,259.0800,750.2098,264.8251,750.2098,264.8251,750.2098,259.0800,750.2098,"o"
"Verdana",#000000,9.48,264.8400,750.2098,268.1676,750.2098,268.1676,750.2098,264.8400,750.2098," "

If you look at the above results, the problems are the following.
a. The font color is null.
b. The coordinates Y1, Y2, Y3 and Y4 are the same for all extracted content.

I am using these coordinates to highlight the text after searching for a string in the page.

Can anyone help me solve this issue?
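
Until the extraction itself returns usable heights, one possible workaround is to build the highlight rectangle from the minimum and maximum of the four corner points and, when the height collapses to zero as in the output above, fall back to the font size as an approximate height. This is a heuristic sketch only. It assumes the shared Y value is the text baseline and that Y increases upwards, neither of which this thread confirms. The variable names are hypothetical, and the values are copied from the "L" fragment above.

```pascal
program HighlightBox;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Math;

const
  FontSize = 9.48; // third field of the "L" fragment
  // Corner coordinates of the "L" fragment: X1..X4 and Y1..Y4.
  X: array[1..4] of Double = (114.8400, 120.1111, 120.1111, 114.8400);
  Y: array[1..4] of Double = (750.2098, 750.2098, 750.2098, 750.2098);

var
  BoxLeft, BoxRight, BoxBottom, BoxTop: Double;
begin
  // The rectangle that encloses all four corner points.
  BoxLeft   := MinValue(X);
  BoxRight  := MaxValue(X);
  BoxBottom := MinValue(Y);
  BoxTop    := MaxValue(Y);

  // Assumption: if all Y values coincide, treat that value as the baseline
  // and use the font size as an approximate text height.
  if BoxTop - BoxBottom < 0.01 then
    BoxTop := BoxBottom + FontSize;

  Writeln('Left:   ', BoxLeft:0:4);
  Writeln('Right:  ', BoxRight:0:4);
  Writeln('Bottom: ', BoxBottom:0:4);
  Writeln('Top:    ', BoxTop:0:4);
end.
```

The font size overestimates the height of most glyphs, so the highlight may be slightly taller than the text.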


Posted By: AndrewC
Date Posted: 01 Jul 11 at 3:14PM
We have been improving the text extraction routines in versions 7.25 and 7.26, and the text results directly above should now extract correctly with the new versions. The 7.26 beta has more improvements.

Also, it is a free upgrade from any 7.xx version to 7.25, 7.26... The latest version can always be downloaded from http://www.quickpdflibrary.com/products/quickpdf/updates.php

Text extraction with option 0 is faster but much less accurate than extraction with option 3 or 4.

Also, we have added options 5 and 6 in version 7.26, which also return individual character widths.

Andrew.


