Extract text
Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=423
Printed Date: 22 Nov 24 at 7:13PM
Topic: Extract text
Posted By: Quicker
Subject: Extract text
Date Posted: 27 May 06 at 3:08AM
Hello.
Sometimes GetPageText returns nothing (even though the PDF contains text). Why does this happen?
Many thanks.
|
Replies:
Posted By: JanN
Date Posted: 27 May 06 at 6:38AM
I think that depends on the code pages and fonts. QuickPdf cannot handle all of them.
|
Posted By: Ingo
Date Posted: 27 May 06 at 9:08AM
Hi!
Perhaps the PDF files in question are just scans? Scanners produce images... and images contain no text ;-)
Best regards,
Ingo
|
Posted By: Quicker
Date Posted: 27 May 06 at 9:35AM
Ingo wrote:
Hi!
Perhaps the PDF files in question are just scans? Scanners produce images... and images contain no text ;-)
Best regards, Ingo
|
Hi Ingo.
No, the PDF file isn't a scan; I can extract its text using Adobe Acrobat.
|
Posted By: Ingo
Date Posted: 28 May 06 at 8:04AM
Hi!
You can send it to me or put it somewhere online, so that I (or anybody here) can download and test it.
Best regards,
Ingo
|
Posted By: Quicker
Date Posted: 28 May 06 at 10:51AM
Can I put my PDF file here?
|
Posted By: Ingo
Date Posted: 28 May 06 at 1:22PM
Hi!
I don't think so ...
If you don't have any web space, you can send it to me and I'll put it online for everyone.
ingo[dot]schmoekel[at]ewetel[dot]net
Best regards,
Ingo
|
Posted By: Quicker
Date Posted: 28 May 06 at 1:55PM
Ingo, please check your email account.
|
Posted By: Ingo
Date Posted: 28 May 06 at 2:57PM
I haven't received anything so far.
|
Posted By: ukobsa
Date Posted: 29 May 06 at 3:37AM
Hi,
I have the same problem. My test file is a very simple one: I started a new OpenOffice (2.0) document, entered the single word "Test" and exported it to PDF.
Nothing is extracted from this PDF. I have the same problem with PDFs generated by a TeX system.
greetings,
Ulrich
|
Posted By: Quicker
Date Posted: 29 May 06 at 7:31AM
Please check your accounts at ewetel.net and pdf-analyzer.com
|
Posted By: Ingo
Date Posted: 29 May 06 at 7:46AM
Hi Ulrich!
I did the same with Word and PDFCreator, and extraction works:
First LoadFromFile
then SaveToFile //only to make sure that the file is readable with QuickPDF
then LoadFromFile again //the file that was just saved
then DAExtractPageText //with option 3!!!
Best regards,
Ingo
|
Posted By: Ingo
Date Posted: 29 May 06 at 7:58AM
Hi Quicker!
I haven't received any files from you.
Put them somewhere online and I'll have a look.
I think what I wrote to Ulrich will help you, too.
Best regards,
Ingo
|
Posted By: ukobsa
Date Posted: 29 May 06 at 9:50AM
Hi Ingo,
Thanks for your help, but unfortunately it doesn't work. It still cannot extract the word 'Test'. It only extracts the additional information:
"BAAAAA+TimesNewRomanPSMT",#000000,12.00,56.7000,776.6920,77.4240,776.6920,77.4240,784.7920,56.7000,784.7920,""
Also, when I save the file and reload it beforehand, it cannot extract anything at all (that's why I commented those lines out in the code below).
Here's the code I use (based on the code from one of your earlier postings):
FName := 'c:\temp\test4.pdf';
QP := TiSEDQuickPDF.Create;
try
  QP.UnlockKey('');
  dafh := QP.DAOpenFile(FName, '');
  // Save-and-reload step, commented out because nothing is extracted with it:
  //QP.SaveToFile(FName);
  //dafh := QP.DAOpenFile(FName, '');
  x := QP.DAGetPageCount(dafh);
  STR := '';
  AssignFile(cf, FName + '_ex2.txt');
  Rewrite(cf);
  pc := 0;  // pages processed since the library instance was last recreated
  for i := 1 to x do
  begin
    // Look up page i and extract its text with option 3
    dapr := QP.DAFindPage(dafh, i);
    STR := QP.DAExtractPageText(dafh, dapr, 3);
    WriteLn(cf, Trim(STR));
    pc := pc + 1;
    // Every 100 pages: close the file, recreate the instance and reopen
    if (pc = 100) then
    begin
      pc := 0;
      QP.DACloseFile(dafh);
      QP.Free;
      QP := TiSEDQuickPDF.Create;
      QP.UnlockKey('');
      dafh := QP.DAOpenFile(FName, '');
    end;
  end;
  QP.DACloseFile(dafh);
  CloseFile(cf);
finally
  QP.Free;
end;
Do you have any other ideas? From looking at the PDF's code, it seems that QuickPDF has problems with this text, where the single letters are referenced objects (?)
I have emailed my test-PDF to you.
greetings,
Ulrich
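
For anyone who wants to check such output automatically: the record quoted in this post appears to hold 12 comma-separated fields (font name, colour, font size, eight corner coordinates, then the quoted text). That layout is only read off the sample lines in this thread, not taken from the library documentation, so treat it as an assumption. Under that assumption, a minimal sketch in Python (used here because it only post-processes the text file written by the Delphi code; `TextRecord` and `parse_record` are made-up names) that splits a record and shows whether the text field is empty:

```python
import csv
from typing import NamedTuple


class TextRecord(NamedTuple):
    font: str
    colour: str
    size: float
    quad: list  # four (x, y) corner points
    text: str


def parse_record(line: str) -> TextRecord:
    # Assumed layout: font, colour, size, 8 coordinates, quoted text.
    fields = next(csv.reader([line.strip()]))
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    coords = [float(value) for value in fields[3:11]]
    quad = list(zip(coords[0::2], coords[1::2]))
    return TextRecord(fields[0], fields[1], float(fields[2]), quad, fields[11])


# The record from this post: geometry is present, the text field is empty.
record = parse_record(
    '"BAAAAA+TimesNewRomanPSMT",#000000,12.00,56.7000,776.6920,77.4240,'
    '776.6920,77.4240,784.7920,56.7000,784.7920,""'
)
print(record.font, record.size, repr(record.text))
```

A record with a font and a quad but an empty text field suggests that the text run was located on the page but its characters could not be mapped to text.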
|
Posted By: Ingo
Date Posted: 29 May 06 at 3:25PM
Hi Ulrich!
I've already written to you...
One last idea:
What about calling CombineLayers before the extraction?
Best regards,
Ingo
|
Posted By: Quicker
Date Posted: 30 May 06 at 12:56AM
ukobsa wrote:
Hi Ingo,
here's the code I use (based on code of one of your former postings)
greetings, Ulrich |
Hi Ulrich, why did you write QP.Free twice?
|
Posted By: Quicker
Date Posted: 30 May 06 at 12:58AM
Ingo wrote:
Hi Ulrich!
I've already written to you... One last idea: What about calling CombineLayers before the extraction?
Best regards, Ingo
|
Ingo, please post your solution (what you wrote to Ulrich) here...
|
Posted By: Ingo
Date Posted: 30 May 06 at 2:21AM
Hi Quicker!
It's the code here in the thread.
Best regards,
Ingo
|
Posted By: Ingo
Date Posted: 30 May 06 at 2:24AM
"...why did you write QP.Free two times?..."
Hi Quicker!
I did that to prevent memory problems.
Every 100 pages I start over with a fresh instance, so I can extract any document.
Best regards,
Ingo
|
Posted By: tren
Date Posted: 30 May 06 at 2:39AM
Hi There,
I'm having a few issues with GetPageText(4), the option that returns each word and its quad. Several of the "words" still contain spaces, or they repeat parts of themselves. This doesn't happen if I extract whole lines with GetPageText(3).
Here is some example output:
By Line:
"EOFGEO+Palatino-Roman",#000000,12.29,119.3814,705.3093,492.3365,705.3093,492.3365,717.7753,119.3814,717.7753,"nature, and thereby - or so he thought - freedom. Later, Bentham"
By Word:
"EOFGEO+Palatino-Roman",#000000,12.29,119.3814,705.3093,157.6965,705.3093,157.6965,717.7753,119.3814,717.7753,"naturnature,"
"EOFGEO+Palatino-Roman",#000000,12.29,162.4776,705.3093,229.2728,705.3093,229.2728,717.7753,162.4776,717.7753,"and therthereby"
"EOFGEO+Palatino-Roman",#000000,12.29,234.0539,705.3093,240.1997,705.3093,240.1997,717.7753,234.0539,717.7753,"-"
"EOFGEO+Palatino-Roman",#000000,12.29,244.9807,705.3093,256.5469,705.3093,256.5469,717.7753,244.9807,717.7753,"or"
"EOFGEO+Palatino-Roman",#000000,12.29,261.3279,705.3093,273.2506,705.3093,273.2506,717.7753,261.3279,717.7753,"so"
"EOFGEO+Palatino-Roman",#000000,12.29,278.0317,705.3093,291.0730,705.3093,291.0730,717.7753,278.0317,717.7753,"he"
"EOFGEO+Palatino-Roman",#000000,12.29,295.8541,705.3093,339.1324,705.3093,339.1324,717.7753,295.8541,717.7753,"thought"
"EOFGEO+Palatino-Roman",#000000,12.29,343.9135,705.3093,492.3365,705.3093,492.3365,717.7753,343.9135,717.7753,"- frfreedom. LaterLater, Bentham"
Is this a known issue? I'm tempted to do string processing and compare the two outputs but would prefer not to. Any guidance appreciated.
|