Extract text

Quicker (Beginner, joined 27 Apr 06), posted 27 May 06 at 3:08AM:
Hello. Sometimes GetPageText returns nothing, even though the PDF contains text. Why does this happen? Many thanks.

JanN (Senior Member, joined 29 Oct 05, Germany):
I think that depends on the code pages and fonts. QuickPDF cannot handle all of them.

Ingo (Moderator Group, joined 29 Oct 05):
Hi!
Perhaps the PDF files in question are only scans? Scanners save pages as images, and images contain no text ;-)
Best regards, Ingo

Quicker (Beginner):
Hi Ingo. No, the PDF file isn't scanned; I can extract the text using Adobe Acrobat.

Ingo (Moderator Group):
Hi!
You can send it to me or put it somewhere online, so that I (or anybody here) can download and test it.
Best regards, Ingo

Quicker (Beginner):
Can I put my PDF file here?

Ingo (Moderator Group):
Hi!
I don't think so ... If you don't have any online space, you can send it to me and I'll put it online for everyone. ingo[dot]schmoekel[at]ewetel[dot]net
Best regards, Ingo

Quicker (Beginner):
Ingo, please check your email account.

Ingo (Moderator Group):
So far I haven't received anything.

ukobsa (Senior Member, joined 29 May 06, Germany):
Hi,
I have the same problem. My test file is a very simple one: I started a new OpenOffice (2.0) document, entered the single word "Test" and exported it to PDF. Nothing is extracted from this PDF. I have the same problem with PDFs generated by a TeX system.
Greetings, Ulrich

Quicker (Beginner):
Please check your accounts at ewetel.net and pdf-analyzer.com.

Ingo (Moderator Group):
Hi Ulrich!
I've done the same with Word and PDFCreator. Extraction is possible:
First LoadFromFile
then SaveToFile // only to be sure that the file is readable with QuickPDF
again LoadFromFile // the same saved file
then DAExtractPageText // with option 3!!!
Best regards, Ingo

Ingo (Moderator Group):
Hi Quicker!
I didn't receive any files from you. Put them somewhere online and I'll take a look. I think what I wrote to Ulrich will help you, too.
Best regards, Ingo

ukobsa (Senior Member):
Hi Ingo,
thanks for your help, but unfortunately it doesn't work. It still cannot extract the word 'Test'. It only extracts the additional information:

    "BAAAAA+TimesNewRomanPSMT",#000000,12.00,56.7000,776.6920,77.4240,776.6920,77.4240,784.7920,56.7000,784.7920,""

Also, when I save the file and reload it beforehand, it cannot extract anything (that's why I have commented those lines out in the code below).
Here's the code I use (based on the code from one of your earlier postings):

    FName := 'c:\temp\test4.pdf';
    QP := TiSEDQuickPDF.Create;
    try
      QP.UnlockKey('');
      dafh := QP.DAOpenFile(FName, '');
      // save-and-reload step, commented out because it did not help:
      //QP.SaveToFile(FName);
      //dafh := QP.DAOpenFile(FName, '');
      x := QP.DAGetPageCount(dafh);
      STR := '';
      AssignFile(cf, FName + '_ex2.txt');
      Rewrite(cf);
      i1 := 1;
      pc := 0;
      for i := 1 to x do
      begin
        dapr := QP.DAFindPage(dafh, i);
        STR := QP.DAExtractPageText(dafh, dapr, 3);
        WriteLn(cf, Trim(STR));
        pc := pc + 1;
        if (pc = 100) then
        begin
          // every 100 pages: free the library object and reopen the file
          pc := 0;
          QP.DACloseFile(dafh);
          QP.Free;
          QP := TiSEDQuickPDF.Create;
          QP.UnlockKey('');
          dafh := QP.DAOpenFile(FName, '');
        end;
      end;
      QP.DACloseFile(dafh);
      CloseFile(cf);
    finally
      QP.Free;
    end;

Do you have any other ideas? From looking at the code, it seems that QuickPDF has problems with this text, where the single letters are referenced objects (?)
I have emailed my test PDF to you.
Greetings, Ulrich
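
The record quoted in this post looks like a single comma-separated line: a quoted font name, a colour, a font size, eight coordinates and the quoted text, which is empty here. The following is only a rough sketch of how such a line could be split for inspection outside the library. The field names and their meaning are assumptions read off the sample, not taken from the Quick PDF documentation, and `parse_record` and `TextRecord` are hypothetical helpers. Python's `csv` module is used purely for the string handling; nothing here calls Quick PDF.

```python
import csv
from typing import NamedTuple, Tuple


class TextRecord(NamedTuple):
    font: str                 # assumed: font name, e.g. "BAAAAA+TimesNewRomanPSMT"
    color: str                # assumed: colour, e.g. "#000000"
    size: float               # assumed: font size
    quad: Tuple[float, ...]   # eight numbers, assumed to be four (x, y) corners
    text: str                 # the extracted text; empty in the failing case


def parse_record(line: str) -> TextRecord:
    """Split one extraction record into its twelve fields."""
    fields = next(csv.reader([line]))
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    font, color, size, *coords, text = fields
    return TextRecord(font, color, float(size),
                      tuple(float(c) for c in coords), text)


sample = ('"BAAAAA+TimesNewRomanPSMT",#000000,12.00,56.7000,776.6920,'
          '77.4240,776.6920,77.4240,784.7920,56.7000,784.7920,""')
record = parse_record(sample)
if not record.text:
    print(f"font {record.font}: position found, but no text extracted")
```

A check like this at least separates "nothing found on the page" from "a text run was located but its characters could not be decoded", which is the case described in this post.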

Ingo (Moderator Group):
Hi Ulrich!
I've already written to you... One last idea: what about calling CombineLayers before the extraction?
Best regards, Ingo

Quicker (Beginner):
Hi Ulrich,

Quicker (Beginner):
Ingo,

Ingo (Moderator Group):
Hi Quicker!
It's the code here in the thread.
Best regards, Ingo

Ingo (Moderator Group):
"...why did you write QP.Free two times?..."
Hi Quicker!
I did it to prevent memory problems. Every 100 pages I free the object and start again with a new one. That way I can extract any document.
Best regards, Ingo
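
The restart-every-100-pages idea can also be expressed as a split of the page numbers into batches, with one library object per batch. The sketch below only shows the page arithmetic: `page_batches` is a hypothetical helper, the batch size of 100 comes from the thread, and the actual create/extract/free calls are left as a comment because they depend on the library.

```python
from typing import Iterator


def page_batches(page_count: int, batch_size: int = 100) -> Iterator[range]:
    """Yield 1-based page ranges of at most batch_size pages each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(1, page_count + 1, batch_size):
        yield range(start, min(start + batch_size, page_count + 1))


for batch in page_batches(250):
    # A real program would create the library object here, extract the
    # pages in `batch`, and free the object again before the next batch.
    print(f"pages {batch[0]}-{batch[-1]}")
```

Structured this way, the object is freed exactly once per batch, and no extra reopen happens when the page count is an exact multiple of the batch size.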

tren (Beginner, joined 07 Feb 06, Australia):
Hi There,
I'm having a few issues with GetPageText(4), the one that returns each word and its quads. Several of the "words" still contain spaces, or they repeat themselves. This issue doesn't happen if I extract a single line with GetPageText(3). Here is some example output:

By Line:

    "EOFGEO+Palatino-Roman",#000000,12.29,119.3814,705.3093,492.3365,705.3093,492.3365,717.7753,119.3814,717.7753,"nature, and thereby - or so he thought - freedom. Later, Bentham"

By Word:

    "EOFGEO+Palatino-Roman",#000000,12.29,119.3814,705.3093,157.6965,705.3093,157.6965,717.7753,119.3814,717.7753,"naturnature,"
    "EOFGEO+Palatino-Roman",#000000,12.29,162.4776,705.3093,229.2728,705.3093,229.2728,717.7753,162.4776,717.7753,"and therthereby"
    "EOFGEO+Palatino-Roman",#000000,12.29,234.0539,705.3093,240.1997,705.3093,240.1997,717.7753,234.0539,717.7753,"-"
    "EOFGEO+Palatino-Roman",#000000,12.29,244.9807,705.3093,256.5469,705.3093,256.5469,717.7753,244.9807,717.7753,"or"
    "EOFGEO+Palatino-Roman",#000000,12.29,261.3279,705.3093,273.2506,705.3093,273.2506,717.7753,261.3279,717.7753,"so"
    "EOFGEO+Palatino-Roman",#000000,12.29,278.0317,705.3093,291.0730,705.3093,291.0730,717.7753,278.0317,717.7753,"he"
    "EOFGEO+Palatino-Roman",#000000,12.29,295.8541,705.3093,339.1324,705.3093,339.1324,717.7753,295.8541,717.7753,"thought"
    "EOFGEO+Palatino-Roman",#000000,12.29,343.9135,705.3093,492.3365,705.3093,492.3365,717.7753,343.9135,717.7753,"- frfreedom. LaterLater, Bentham"

Is this a known issue? I'm tempted to do string processing and compare the two outputs, but would prefer not to. Any guidance appreciated.
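
If one did go the string-processing route mentioned in this post, a first step could be to flag which by-word records disagree with the by-line text. The sketch below is only that detection step, not a repair, and rests on assumptions: that the text is always the last comma-separated field, that it is quoted the way the samples show, and that words in the line are separated by whitespace. `text_field` and `suspicious_words` are hypothetical helper names; the records are copied from the post.

```python
import csv
from typing import List


def text_field(record_line: str) -> str:
    """Return the last (text) field of one extraction record."""
    return next(csv.reader([record_line]))[-1]


def suspicious_words(line_record: str, word_records: List[str]) -> List[str]:
    """Return by-word texts that are not a whitespace-separated token
    of the by-line text."""
    tokens = set(text_field(line_record).split())
    return [w for w in map(text_field, word_records) if w not in tokens]


line_record = ('"EOFGEO+Palatino-Roman",#000000,12.29,119.3814,705.3093,'
               '492.3365,705.3093,492.3365,717.7753,119.3814,717.7753,'
               '"nature, and thereby - or so he thought - freedom. Later, Bentham"')
word_records = [
    '"EOFGEO+Palatino-Roman",#000000,12.29,119.3814,705.3093,157.6965,'
    '705.3093,157.6965,717.7753,119.3814,717.7753,"naturnature,"',
    '"EOFGEO+Palatino-Roman",#000000,12.29,162.4776,705.3093,229.2728,'
    '705.3093,229.2728,717.7753,162.4776,717.7753,"and therthereby"',
    '"EOFGEO+Palatino-Roman",#000000,12.29,234.0539,705.3093,240.1997,'
    '705.3093,240.1997,717.7753,234.0539,717.7753,"-"',
    '"EOFGEO+Palatino-Roman",#000000,12.29,343.9135,705.3093,492.3365,'
    '705.3093,492.3365,717.7753,343.9135,717.7753,'
    '"- frfreedom. LaterLater, Bentham"',
]

for word in suspicious_words(line_record, word_records):
    print("does not match the line text:", repr(word))
```

On these four records the check flags the three garbled entries and leaves the lone "-" alone. Mapping a flagged entry back to the correct word or words would still need its quad compared against the neighbouring entries, which is not attempted here.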

Copyright © 2017 Debenu. Debenu Quick PDF Library is a PDF SDK. All rights reserved.