Using ExtractFilePageText with incorrect results

gcaffe (Beginner; joined 04 Nov 10; Spain)
Posted: 04 Nov 10 at 7:28PM
Hi,

I have a Delphi 2009 application that calls ExtractFilePageText as follows:

    QP.LoadFromFile(edFilePathPdf.Text);  // Load the PDF into memory
    PageCount := QP.PageCount;            // Count the number of pages in the document
    for i := 1 to PageCount do            // Go through all the pages (1..PageCount, not PageCount + 1)
    begin
      // Extract the text of the whole PDF, page by page
      TextOutput := TextOutput + QP.ExtractFilePageText(edFilePathPdf.Text, '', i, 3);
    end;

I write the contents of TextOutput to a text file, which looks like this (excerpt only):
"CourierNew",#000000,10.00,169.0700,28.7007,169.0700,820.7007,161.2100,820.7007,161.2100,28.7007,"030EUR3744014877 2 34 04/09/10 74.00 47.44 121.44 DOMESTIC"
"CourierNew",#000000,10.00,177.0700,28.7007,177.0700,820.7007,169.2100,820.7007,169.2100,28.7007,"030EUR3744014878 3 34 04/09/10 74.00 47.44 121.44 DOMESTIC"
"CourierNew",#000000,10.00,185.0700,28.7007,185.0700,820.7007,177.2100,820.7007,177.2100,28.7007,"996EUR3744014889 0 234 14/09/10 36.00 30.52 0.37 0.13 0.02 66 , 37 DOMESTIC "
"CourierNew",#000000,10.00,193.0700,28.7007,193.0700,820.7007,185.2100,820.7007,185.2100,28.7007,"996EUR3744014890 1 234 14/09/10 36.00 30.52 0.37 0.13 0.02 66 , 37 DOMESTIC "

As you can see, it is easy to identify the fields by position once all the coordinate information is removed. So far, so good.
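
A minimal sketch of stripping the coordinate data, assuming each extracted line has the comma-separated layout font, color, size, eight corner coordinates, quoted text. Python is used only to illustrate the parsing; `parse_line` is a hypothetical helper, not part of Quick PDF Library, and how the library escapes a double quote inside the text is not covered here.

```python
import csv

def parse_line(line):
    """Split one extracted line into its parts (hypothetical helper)."""
    # Assumed layout: "font",#color,size,<8 coordinates>,"text"
    fields = next(csv.reader([line], skipinitialspace=True))
    if len(fields) != 12:
        raise ValueError("expected 12 comma-separated fields, got %d" % len(fields))
    return {
        "font": fields[0],
        "color": fields[1],
        "size": float(fields[2]),
        "coords": [float(v) for v in fields[3:11]],
        "text": fields[11],
    }

# One line from the excerpt above
sample = ('"CourierNew",#000000,10.00,169.0700,28.7007,169.0700,820.7007,'
          '161.2100,820.7007,161.2100,28.7007,'
          '"030EUR3744014877 2 34 04/09/10 74.00 47.44 121.44 DOMESTIC"')

record = parse_line(sample)
print(record["text"])  # only the text part, coordinates dropped
```

Once only the text remains, the fields can be cut out by character position as described.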

However, running the same code on another PDF gives a different TextOutput, as if I had called QP.ExtractFilePageText(edFilePathPdf.Text, '', i, 4), e.g.:

"CourierNew",#000000,10.00,169.0700,28.7007,169.0922,154.6681,161.2322,154.6681,161.2100,28.7007,"680EUR1656635612 June 1934"
"CourierNew",#000000,10.00,169.0700,172.6596,169.0700,220.6596,161.2100,220.6596,161.2100,172.6596,"10/06/1910"
"CourierNew",#000000,10.00,169.0700,244.6393,169.0700,274.6393,161.2100,274.6393,161.2100,244.6393,"94.00"
"CourierNew",#000000,10.00,169.0922,358.6066,169.0922,388.6066,161.2322,388.6066,161.2322,358.6066,"13.13"
"CourierNew",#000000,10.00,169.0922,406.5929,169.0922,430.5929,161.2322,430.5929,161.2322,406.5929,"0.40"
"CourierNew",#000000,10.00,169.0700,460.5776,169.0700,484.5776,161.2100,484.5776,161.2100,460.5776,"0.38"
"CourierNew",#000000,10.00,169.0700,544.5538,169.0700,568.5538,161.2100,568.5538,161.2100,544.5538,"0.07"
"CourierNew",#000000,10.00,169.0700,652.5233,169.0700,688.5233,161.2100,688.5233,161.2100,652.5233,"106.68"
"CourierNew",#000000,10.00,169.0922,754.4946,169.0922,802.4946,161.2322,802.4946,161.2322,754.4946,"DOMESTIC"
"CourierNew",#000000,10.00,177.0699,28.7006,177.0992,154.6680,169.2392,154.6680,169.2099,28.7006,"680EUR1656635630 March 1934"
"CourierNew",#000000,10.00,177.0699,172.6595,177.0699,220.6595,169.2099,220.6595,169.2099,172.6595,"16/10/1910"
"CourierNew",#000000,10.00,177.0699,244.6392,177.0699,274.6392,169.2099,274.6392,169.2099,244.6392,"54.00"
"CourierNew",#000000,10.00,177.0992,358.6065,177.0992,388.6065,169.2392,388.6065,169.2392,358.6065,"35.04"
"CourierNew",#000000,10.00,177.0992,406.5928,177.0992,430.5928,169.2392,430.5928,169.2392,406.5928,"0.40"
"CourierNew",#000000,10.00,177.0699,460.5775,177.0699,484.5775,169.2099,484.5775,169.2099,460.5775,"0.22"
"CourierNew",#000000,10.00,177.0699,544.5537,177.0699,568.5537,169.2099,568.5537,169.2099,544.5537,"0.04"
"CourierNew",#000000,10.00,177.0992,658.5215,177.0992,688.5215,169.2392,688.5215,169.2392,658.5215,"88.78"
"CourierNew",#000000,10.00,177.0992,754.4942,177.0992,802.4942,169.2392,802.4942,169.2392,754.4942,"DOMESTIC"

This does not let me identify the fields by position.
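
One possible workaround, sketched in Python for illustration only: rebuild fixed-width rows from the separate fragments. The assumptions here are mine, not documented library behaviour: in this excerpt the first coordinate stays (almost) constant along a row, so it is treated as the row position and the second coordinate as the position along the row; fragments whose row positions differ by at most `row_tolerance` belong to the same row; and each Courier New character is about 0.6 times the font size wide. `parse_line` and `rebuild_rows` are hypothetical helper names.

```python
import csv

def parse_line(line):
    """Return (font size, 8 coordinates, text) for one extracted line."""
    fields = next(csv.reader([line], skipinitialspace=True))
    return float(fields[2]), [float(v) for v in fields[3:11]], fields[11]

def rebuild_rows(lines, row_tolerance=1.0, char_width_factor=0.6):
    """Group fragments into rows and place each one at a character column."""
    rows = []  # list of (row position, [(start, size, text), ...]) in input order
    for line in lines:
        size, coords, text = parse_line(line)
        row_pos, start = coords[0], coords[1]  # assumption: see the notes above
        for key, fragments in rows:
            if abs(key - row_pos) <= row_tolerance:
                fragments.append((start, size, text))
                break
        else:
            rows.append((row_pos, [(start, size, text)]))

    # Use the smallest start position as column 0
    left = min(start for _, fragments in rows for start, _, _ in fragments)

    result = []
    for _, fragments in rows:
        row = ""
        for start, size, text in sorted(fragments):
            column = round((start - left) / (size * char_width_factor))
            row = row.ljust(column) + text  # pad with spaces up to the column
        result.append(row)
    return result

# A few fragments taken from the two rows in the excerpt above
fragments = [
    '"CourierNew",#000000,10.00,169.0700,244.6393,169.0700,274.6393,161.2100,274.6393,161.2100,244.6393,"94.00"',
    '"CourierNew",#000000,10.00,169.0922,358.6066,169.0922,388.6066,161.2322,388.6066,161.2322,358.6066,"13.13"',
    '"CourierNew",#000000,10.00,169.0922,406.5929,169.0922,430.5929,161.2322,430.5929,161.2322,406.5929,"0.40"',
    '"CourierNew",#000000,10.00,177.0699,244.6392,177.0699,274.6392,169.2099,274.6392,169.2099,244.6392,"54.00"',
    '"CourierNew",#000000,10.00,177.0992,358.6065,177.0992,388.6065,169.2392,388.6065,169.2392,358.6065,"35.04"',
    '"CourierNew",#000000,10.00,177.0992,406.5928,177.0992,430.5928,169.2392,430.5928,169.2392,406.5928,"0.40"',
]

for row in rebuild_rows(fragments):
    print(repr(row))
```

With a proportional font the fixed character width no longer holds, so this only suits monospaced reports like the one shown.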

From the tests I have done with library versions 0718 and 0721, in both Delphi 2009 and Delphi 2010, I suspect that the cause is the PDF itself.

What conditions does a PDF have to meet so that the data is extracted as in the first case?
Is there any other function that always lets me extract the text as in the first case?

Please help me with this problem; I have tried everything in my power.

Thank you very much.

Ingo (Moderator Group; joined 29 Oct 05)
Posted: 05 Nov 10 at 10:47AM
Hi!

With option "0" you get the PDF content without the leading data such as positions, font, ...
With option "3" you get the content as strings, in the order they were inserted (first in, first out), with positions and so on.
Option "4" works like option "3", but word by word.
The reference explains what the position data stands for (x1, y1, x2, y2).
What exactly is your problem now?

Cheers and welcome here, Ingo


gcaffe (Beginner; joined 04 Nov 10; Spain)
Posted: 16 Nov 10 at 7:54PM
Hello,
Thanks for your reply; I was able to solve the problem.
I have another question. I have a license for Quick PDF Library V7.18 (Single Developer License, Upgrade Protection Standard). Can I upgrade to Library V7.21 for Delphi 2010? I downloaded that library, but when compiling, the compiler needs the file QuickPDF0721.pas, which is not in the installation file.
Thank you,
gcaffe

Wheeley (Senior Member; joined 30 Oct 05; United States)
Posted: 16 Nov 10 at 11:54PM
Did you check the directory <install directory>\DLL\Import\Delphi?

Wheeley

Sankara (Beginner; joined 21 Apr 11; Chennai)
Posted: 21 Apr 11 at 1:04PM
I have run into a problem with the ExtractFilePageText function: it returns incorrect values for some files.
I am calling ExtractFilePageText with option '4'.
These are the results:
"Arial",#000000,9.96,105.0000,750.2098,107.7591,750.2098,107.7591,750.2098,105.0000,750.2098," "
"Verdana",#000000,9.48,114.8400,750.2098,120.1111,750.2098,120.1111,750.2098,114.8400,750.2098,"L"
"Verdana",#000000,9.48,120.1200,750.2098,125.8367,750.2098,125.8367,750.2098,120.1200,750.2098,"P"
"Verdana",#000000,9.48,125.8800,750.2098,133.2273,750.2098,133.2273,750.2098,125.8800,750.2098,"G"
"Verdana",#000000,9.48,133.2000,750.2098,137.5041,750.2098,137.5041,750.2098,133.2000,750.2098,"/"
"Verdana",#000000,9.48,137.5200,750.2098,144.1373,750.2098,144.1373,750.2098,137.5200,750.2098,"C"
"Verdana",#000000,9.48,144.1200,750.2098,150.1116,750.2098,150.1116,750.2098,144.1200,750.2098,"h"
"Verdana",#000000,9.48,150.1200,750.2098,155.7608,750.2098,155.7608,750.2098,150.1200,750.2098,"e"
"Verdana",#000000,9.48,155.7600,750.2098,164.9750,750.2098,164.9750,750.2098,155.7600,750.2098,"m"
"Verdana",#000000,9.48,165.0000,750.2098,167.5976,750.2098,167.5976,750.2098,165.0000,750.2098,"i"
"Verdana",#000000,9.48,167.6400,750.2098,172.5698,750.2098,172.5698,750.2098,167.6400,750.2098,"c"
"Verdana",#000000,9.48,172.5600,750.2098,178.2483,750.2098,178.2483,750.2098,172.5600,750.2098,"a"
"Verdana",#000000,9.48,178.3200,750.2098,180.9176,750.2098,180.9176,750.2098,178.3200,750.2098,"l"
"Verdana",#000000,9.48,180.8400,750.2098,184.1676,750.2098,184.1676,750.2098,180.8400,750.2098," "
"Verdana",#000000,9.48,184.2000,750.2098,189.1298,750.2098,189.1298,750.2098,184.2000,750.2098,"c"
"Verdana",#000000,9.48,189.2400,750.2098,194.9283,750.2098,194.9283,750.2098,189.2400,750.2098,"a"
"Verdana",#000000,9.48,194.8800,750.2098,198.9186,750.2098,198.9186,750.2098,194.8800,750.2098,"r"
"Verdana",#000000,9.48,198.9600,750.2098,204.8663,750.2098,204.8663,750.2098,198.9600,750.2098,"g"
"Verdana",#000000,9.48,204.8400,750.2098,210.5851,750.2098,210.5851,750.2098,204.8400,750.2098,"o"
"Verdana",#000000,9.48,210.6000,750.2098,213.9276,750.2098,213.9276,750.2098,210.6000,750.2098," "
"Verdana",#000000,9.48,213.9600,750.2098,219.5629,750.2098,219.5629,750.2098,213.9600,750.2098,"v"
"Verdana",#000000,9.48,219.6000,750.2098,225.2883,750.2098,225.2883,750.2098,219.6000,750.2098,"a"
"Verdana",#000000,9.48,225.2400,750.2098,231.1463,750.2098,231.1463,750.2098,225.2400,750.2098,"p"
"Verdana",#000000,9.48,231.1200,750.2098,236.8651,750.2098,236.8651,750.2098,231.1200,750.2098,"o"
"Verdana",#000000,9.48,237.0000,750.2098,242.9916,750.2098,242.9916,750.2098,237.0000,750.2098,"u"
"Verdana",#000000,9.48,243.0000,750.2098,247.0386,750.2098,247.0386,750.2098,243.0000,750.2098,"r"
"Verdana",#000000,9.48,247.0800,750.2098,252.0098,750.2098,252.0098,750.2098,247.0800,750.2098,"s"
"Verdana",#000000,9.48,252.0000,750.2098,255.3276,750.2098,255.3276,750.2098,252.0000,750.2098," "
"Verdana",#000000,9.48,255.3600,750.2098,259.0953,750.2098,259.0953,750.2098,255.3600,750.2098,"t"
"Verdana",#000000,9.48,259.0800,750.2098,264.8251,750.2098,264.8251,750.2098,259.0800,750.2098,"o"
"Verdana",#000000,9.48,264.8400,750.2098,268.1676,750.2098,268.1676,750.2098,264.8400,750.2098," "

Looking at the results above, the problems are:
a. The font color is null.
b. The coordinates Y1, Y2, Y3 and Y4 are the same for all extracted content.

I use these coordinates to highlight the text after searching for a string in the page.

Can anyone help me solve this issue?
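
A rough stopgap for the highlighting, sketched in Python for illustration only: join the extracted fragments, find the search string, and take the box from the first and last matching fragments. When the box has zero height, as in the output above, the sketch pads it by the font size. That padding rests on my own assumptions (the Y value is the baseline and Y grows upwards); it is not documented library behaviour. `parse_line` and `find_highlight_box` are hypothetical helper names.

```python
import csv

def parse_line(line):
    """Return (font size, 8 coordinates, text) for one extracted line."""
    fields = next(csv.reader([line], skipinitialspace=True))
    return float(fields[2]), [float(v) for v in fields[3:11]], fields[11]

def find_highlight_box(lines, needle):
    """Return (left, bottom, right, top) for the first match of needle, or None."""
    if not needle:
        return None
    items = [parse_line(line) for line in lines]
    text = "".join(fragment for _, _, fragment in items)
    start = text.find(needle)
    if start < 0:
        return None

    # owner[k] is the index of the fragment that contains character k
    owner = [i for i, (_, _, fragment) in enumerate(items) for _ in fragment]
    first = items[owner[start]]
    last = items[owner[start + len(needle) - 1]]

    xs = first[1][0::2] + last[1][0::2]  # x1..x4 of both fragments
    ys = first[1][1::2] + last[1][1::2]  # y1..y4 of both fragments
    left, right = min(xs), max(xs)
    bottom, top = min(ys), max(ys)
    if top == bottom:
        # Degenerate box: estimate the height from the font size (assumption)
        top = bottom + first[0]
    return left, bottom, right, top

# Five consecutive lines from the output above: "/", "C", "h", "e", "m"
lines = [
    '"Verdana",#000000,9.48,133.2000,750.2098,137.5041,750.2098,137.5041,750.2098,133.2000,750.2098,"/"',
    '"Verdana",#000000,9.48,137.5200,750.2098,144.1373,750.2098,144.1373,750.2098,137.5200,750.2098,"C"',
    '"Verdana",#000000,9.48,144.1200,750.2098,150.1116,750.2098,150.1116,750.2098,144.1200,750.2098,"h"',
    '"Verdana",#000000,9.48,150.1200,750.2098,155.7608,750.2098,155.7608,750.2098,150.1200,750.2098,"e"',
    '"Verdana",#000000,9.48,155.7600,750.2098,164.9750,750.2098,164.9750,750.2098,155.7600,750.2098,"m"',
]

box = find_highlight_box(lines, "Chem")
print(box)
```

If a fragment holds more text than the match, the box covers the whole fragment, so the highlight may be wider than the search string.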

AndrewC (Moderator Group; joined 08 Dec 10; Geelong, Aust)
Posted: 01 Jul 11 at 3:14PM
We have been improving the text extraction routines in versions 7.25 and 7.26, and the text directly above should now extract correctly with the new versions. The 7.26 beta has more improvements.

Also, it is a free upgrade from any 7.xx version to 7.25, 7.26... The latest version can always be downloaded from http://www.quickpdflibrary.com/products/quickpdf/updates.php

Text Extraction with option 0 is faster but much less accurate than extraction with option 3 or 4.  

Also, we have added options 5 and 6 in version 7.26, which also return individual character widths.

Andrew.