Debenu Quick PDF Library - PDF SDK Community Forum Homepage

Extract text

Quicker

Beginner

Joined: 27 Apr 06
Status: Offline
Points: 14

Topic: Extract text
Posted: 27 May 06 at 3:08AM

Hello.

Sometimes GetPageText returns nothing, even though the PDF contains text. Why does this happen?

Many thanks.

JanN

Senior Member

Joined: 29 Oct 05
Location: Germany
Status: Offline
Points: 116

Posted: 27 May 06 at 6:38AM

I think that depends on the code pages and fonts used. QuickPDF cannot handle all of them.

Ingo

Moderator Group

Joined: 29 Oct 05
Status: Offline
Points: 3530

Posted: 27 May 06 at 9:08AM

Hi!

Perhaps the PDF files in question are only scans? Scanners store pages as images, and images contain no text ;-)

Best regards,
Ingo

Quicker

Posted: 27 May 06 at 9:35AM

Ingo wrote:

Hi!

Perhaps the PDF files in question are only scans? Scanners store pages as images, and images contain no text ;-)

Best regards,
Ingo

Hi Ingo.

No, the PDF file isn't scanned; I can extract the text using Adobe Acrobat.

Ingo

Posted: 28 May 06 at 8:04AM

Hi!
You can send it to me or put it online somewhere, so that I (or anybody else here) can download and test it.
Best regards,
Ingo

Quicker

Posted: 28 May 06 at 10:51AM

Can I put my PDF file here?

Ingo

Posted: 28 May 06 at 1:22PM

Hi!

I don't think so ...
If you don't have any web space, you can send it to me and I'll put it online for everyone.
ingo[dot]schmoekel[at]ewetel[dot]net

Best regards,
Ingo

Quicker

Posted: 28 May 06 at 1:55PM

Ingo, please check your email account.

Ingo

Posted: 28 May 06 at 2:57PM

So far I haven't received anything.

ukobsa

Senior Member

Joined: 29 May 06
Location: Germany
Status: Offline
Points: 115

Posted: 29 May 06 at 3:37AM

Hi,

I have the same problem. My test file is very simple: I created a new OpenOffice (2.0) document, typed the single word "Test" and exported it to PDF.
Nothing is extracted from this PDF. I have the same problem with PDFs generated by a TeX system.

greetings,
Ulrich

Quicker

Posted: 29 May 06 at 7:31AM

Please check your accounts at ewetel.net and pdf-analyzer.com.

Ingo

Posted: 29 May 06 at 7:46AM

Hi Ulrich!

I did the same with Word and PDFCreator, and extraction works:
1. LoadFromFile
2. SaveToFile (only to make sure the file is readable by QuickPDF)
3. LoadFromFile again, on the file just saved
4. DAExtractPageText with option 3!

Best regards,
Ingo

Ingo

Posted: 29 May 06 at 7:58AM

Hi Quicker!

I haven't received any files from you.
Put them online somewhere and I'll take a look.
I think what I wrote to Ulrich should help you, too.

Best regards,
Ingo

ukobsa

Posted: 29 May 06 at 9:50AM

Hi Ingo,

Thanks for your help, but unfortunately it doesn't work. It still cannot extract the word 'Test'. It only extracts the additional information:

"BAAAAA+TimesNewRomanPSMT",#000000,12.00,56.7000,776.6920,77.4240,776.6920,77.4240,784.7920,56.7000,784.7920,""

Also, when I save the file and reload it first, nothing at all is extracted (that is why those lines are commented out in the code below).

Here's the code I use (based on the code from one of your earlier postings):

FName := 'c:\temp\test4.pdf';
QP := TiSEDQuickPDF.Create;
try
    QP.UnlockKey('');
    dafh := QP.DAOpenFile(FName, '');
    // Save-and-reload step, commented out because nothing is extracted with it:
    //QP.SaveToFile(FName);
    //dafh := QP.DAOpenFile(FName, '');
    x := QP.DAGetPageCount(dafh);

    AssignFile(cf, FName + '_ex2.txt');
    Rewrite(cf);
    try
      pc := 0;  // pages processed since the last restart

      for i := 1 to x do
      begin
        dapr := QP.DAFindPage(dafh, i);
        STR := QP.DAExtractPageText(dafh, dapr, 3);  // option 3, as suggested above
        WriteLn(cf, Trim(STR));
        pc := pc + 1;
        // Every 100 pages, close the file and recreate the library object
        if (pc = 100) then
        begin
          pc := 0;
          QP.DACloseFile(dafh);
          QP.Free;
          QP := TiSEDQuickPDF.Create;
          QP.UnlockKey('');
          dafh := QP.DAOpenFile(FName, '');
        end;
      end;
      QP.DACloseFile(dafh);
    finally
      CloseFile(cf);  // close the output file even if extraction raises
    end;
finally
    QP.Free;
end;

Do you have any other ideas? From looking at the code of the PDF, it seems that QuickPDF has problems with this text, where the single letters are referenced objects (?)

I have emailed my test-PDF to you.

greetings,
Ulrich
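
For anyone reading a record like the one quoted in this post, it can help to split it into fields to see what was and was not recovered. The following is a minimal sketch in Python, independent of QuickPDF itself. The field layout (font name, colour, size, eight quad coordinates, text) is an assumption inferred from the sample output in this thread, and the names TextRecord and parse_record are made up for illustration.

```python
import csv
from dataclasses import dataclass


@dataclass
class TextRecord:
    font: str
    color: str
    size: float
    quad: tuple  # eight numbers: four (x, y) corner points (assumed layout)
    text: str


def parse_record(line):
    """Parse one comma-separated record of the kind shown in this thread."""
    # csv handles the quoted first and last fields, including embedded commas.
    fields = next(csv.reader([line]))
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    return TextRecord(
        font=fields[0],
        color=fields[1],
        size=float(fields[2]),
        quad=tuple(float(v) for v in fields[3:11]),
        text=fields[11],
    )


# The record from the post above, copied verbatim.
sample = ('"BAAAAA+TimesNewRomanPSMT",#000000,12.00,56.7000,776.6920,'
          '77.4240,776.6920,77.4240,784.7920,56.7000,784.7920,""')
record = parse_record(sample)
```

Parsed this way, the font, size and position are all present and only the text field is empty. That would be consistent with the text run being located but its characters not being decoded, although this is a guess and not something confirmed in this thread.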

Ingo

Posted: 29 May 06 at 3:25PM

Hi Ulrich!

I've already written to you...
One last idea:
What about calling CombineLayers before extraction?

Best regards,
Ingo

Quicker

Posted: 30 May 06 at 12:56AM

ukobsa wrote:

Hi Ingo,

here's the code I use (based on code of one of your former postings)

greetings,
Ulrich

Hi Ulrich,
why did you write QP.Free two times?

Quicker

Posted: 30 May 06 at 12:58AM

Ingo wrote:

Hi Ulrich!

I've already written to you...
One last idea:
What about calling CombineLayers before extraction?

Best regards,
Ingo

Ingo,
please post your solution (what you wrote to Ulrich) here...

Ingo

Posted: 30 May 06 at 2:21AM

Hi Quicker!

It's the code here in the thread.

Best regards,
Ingo

Ingo

Posted: 30 May 06 at 2:24AM

"...why did you write QP.Free two times?..."

Hi Quicker!

I did that to prevent memory problems.
Every 100 pages I start over with a fresh object, so I can extract documents of any size.

Best regards,
Ingo
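
The restart-every-100-pages idea is not specific to QuickPDF. As a rough illustration only, here is the same pattern in Python with a made-up stand-in class. FakeExtractor is hypothetical and does not call any PDF library; the batch size of 100 is taken from the code in this thread.

```python
class FakeExtractor:
    """Hypothetical stand-in for a library object whose memory use grows."""

    created = 0  # counts how many instances have been created

    def __init__(self, path):
        FakeExtractor.created += 1
        self.path = path
        self.closed = False

    def extract(self, page):
        return f"text of page {page}"

    def close(self):
        self.closed = True


def extract_all(path, page_count, batch_size=100, factory=FakeExtractor):
    """Extract every page, recreating the extractor after each full batch."""
    texts = []
    extractor = factory(path)
    try:
        for page in range(1, page_count + 1):
            texts.append(extractor.extract(page))
            # After a full batch, release the old object and start fresh,
            # but only if there are still pages left to process.
            if page % batch_size == 0 and page < page_count:
                extractor.close()
                extractor = factory(path)
    finally:
        # The second release: whichever object is current gets closed here.
        extractor.close()
    return texts
```

This is also why the release appears twice: once inside the loop for the object being replaced, and once at the end for the object that is still alive.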

tren

Beginner

Joined: 07 Feb 06
Location: Australia
Status: Offline
Points: 5

Posted: 30 May 06 at 2:39AM

Hi There,

I'm having a few issues with GetPageText(4), the option that returns each word with its quad. Several of the "words" still contain spaces, or repeat part of their own text. This doesn't happen if I extract line by line with GetPageText(3).

Here is some example output:

By Line:
"EOFGEO+Palatino-Roman",#000000,12.29,119.3814,705.3093,492.3365,705.3093,492.3365,717.7753,119.3814,717.7753,"nature, and thereby - or so he thought - freedom. Later, Bentham"

By Word:
"EOFGEO+Palatino-Roman",#000000,12.29,119.3814,705.3093,157.6965,705.3093,157.6965,717.7753,119.3814,717.7753,"naturnature,"
"EOFGEO+Palatino-Roman",#000000,12.29,162.4776,705.3093,229.2728,705.3093,229.2728,717.7753,162.4776,717.7753,"and therthereby"
"EOFGEO+Palatino-Roman",#000000,12.29,234.0539,705.3093,240.1997,705.3093,240.1997,717.7753,234.0539,717.7753,"-"
"EOFGEO+Palatino-Roman",#000000,12.29,244.9807,705.3093,256.5469,705.3093,256.5469,717.7753,244.9807,717.7753,"or"
"EOFGEO+Palatino-Roman",#000000,12.29,261.3279,705.3093,273.2506,705.3093,273.2506,717.7753,261.3279,717.7753,"so"
"EOFGEO+Palatino-Roman",#000000,12.29,278.0317,705.3093,291.0730,705.3093,291.0730,717.7753,278.0317,717.7753,"he"
"EOFGEO+Palatino-Roman",#000000,12.29,295.8541,705.3093,339.1324,705.3093,339.1324,717.7753,295.8541,717.7753,"thought"
"EOFGEO+Palatino-Roman",#000000,12.29,343.9135,705.3093,492.3365,705.3093,492.3365,717.7753,343.9135,717.7753,"- frfreedom. LaterLater, Bentham"

Is this a known issue? I'm tempted to do string processing and compare the two outputs but would prefer not to. Any guidance appreciated.
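
The comparison of the two outputs mentioned above could be as small as the following sketch (Python, independent of the library). It assumes the extracted text is the last quoted field of each record, as in the samples, and it flags word-level entries that contain a space or do not occur as a whole token in the line-level text. The helper names text_field and suspicious_words are hypothetical.

```python
import csv


def text_field(record_line):
    """Return the last field of one output record (the extracted text)."""
    return next(csv.reader([record_line]))[-1]


def suspicious_words(line_text, word_texts):
    """Return word-level texts that are not clean tokens of the line text."""
    tokens = set(line_text.split())
    return [w for w in word_texts if " " in w or w not in tokens]


# Line-level record copied from the output above.
by_line = ('"EOFGEO+Palatino-Roman",#000000,12.29,119.3814,705.3093,'
           '492.3365,705.3093,492.3365,717.7753,119.3814,717.7753,'
           '"nature, and thereby - or so he thought - freedom. Later, Bentham"')

# Text fields of the word-level records, in the order they were returned.
by_word = ["naturnature,", "and therthereby", "-", "or", "so", "he",
           "thought", "- frfreedom. LaterLater, Bentham"]

flagged = suspicious_words(text_field(by_line), by_word)
```

With the sample data, three of the eight word-level entries are flagged; only those would need any string repair, and the rest could be used as returned.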

Copyright © 2017 Debenu. Debenu Quick PDF Library is a PDF SDK. All rights reserved.