Testing QuickPDF for text extraction performance

pcunite

Posted: 15 Feb 12 at 8:40PM

I am evaluating the QuickPDF library (.dll version) for use in a C++ application. The only functionality I need is text extraction. The entire PDF's text will be placed in memory and then I'll search it for keyword terms.

Is QuickPDF suitable for this type of work, and does it offer good performance?

Ingo

Posted: 15 Feb 12 at 8:46PM

So you didn't read the documentation on the publisher's support pages ;-)

Hi!

You can do the searching in your own programming language,
and QuickPDF can do the text extraction, with several
options to choose from.
As for your performance question: it always depends on ... Try it ;-)
http://www.quickpdflibrary.com/help/quickpdf/Extraction.php

Cheers and welcome here,
Ingo
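
One way to act on the "try it" advice is to time the extraction on a few representative files. The following is only a minimal sketch: `TimingResult`, `TimeExtraction` and the `extract` callback are hypothetical names, not part of QuickPDF, and the callback stands in for whatever routine returns the text of one PDF.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>

// Result of one timed extraction run.
struct TimingResult {
    std::size_t chars = 0;      // number of characters returned
    double milliseconds = 0.0;  // wall-clock time spent in the call
};

// Times a single call to `extract`, a hypothetical stand-in for the
// routine that pulls the text out of one PDF file.
TimingResult TimeExtraction(const std::function<std::wstring()>& extract)
{
    assert(static_cast<bool>(extract));  // the callback must be set

    const auto start = std::chrono::steady_clock::now();
    const std::wstring text = extract();
    const auto stop = std::chrono::steady_clock::now();

    TimingResult r;
    r.chars = text.size();
    r.milliseconds =
        std::chrono::duration<double, std::milli>(stop - start).count();
    return r;
}
```

Running this over the same set of files with each candidate library, and with each extraction option, gives numbers that can be compared directly. A single run is noisy, so repeating each measurement a few times is advisable.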

pcunite

Posted: 15 Feb 12 at 9:00PM

Well, yes I've read some of the materials. I'm looking at about 5 different solutions and wanted my hand held a little :)

I know how to use your library, I just wanted a fuzzy feeling that it is up to the task for my requirements. Some PDF libraries are more for creation or editing ... I just want the text as fast as I can get it. Is QuickPDF optimized for this?

P.S.
I did not find it referenced anywhere, but can the .LIB version work with C++ Builder 2007, or is that only for Visual Studio? The .DLL version is fine, just wondering.

Ingo

Posted: 15 Feb 12 at 9:05PM

Hi!

This library offers over 500 functions for a low price.
Text extraction was already part of the first versions many years ago,
so it should be stable, but it is not specially optimized for
text extraction.
Opinions will differ, so you have to try it yourself.

Cheers, Ingo

pcunite

Posted: 15 Feb 12 at 10:07PM

Thank you for your help. I'm testing the sample function below. Is this the fastest way? I just want to make sure I'm doing all I can. Also, I don't understand DASetTextExtractionOptions ... should I use it to optimize anything?

size_t GetText_PDF(std::wstring & sF, std::wstring & sTxt)
{
    // Open the file read-only using direct access (empty password)
    int FH = QP.DAOpenFileReadOnly(sF, L"");
    if (FH == 0) { return FILE_ERROR_OPEN; }

    // Get page count
    int iPages = QP.DAGetPageCount(FH);

    // Loop over pages (page numbers start at 1)
    for (int i = 1; i <= iPages; i++)
    {
        // Get a page reference to the current page
        int PR = QP.DAFindPage(FH, i);

        // Extract the text from the current page and append it to the buffer
        sTxt += QP.DAExtractPageText(FH, PR, 8);
    }

    // Close file
    QP.DACloseFile(FH);

    return FILE_SUCCESS;
}

Edited by pcunite - 15 Feb 12 at 10:21PM

Ingo

Posted: 15 Feb 12 at 10:16PM

sTxt += QP.DAExtractPageText(FH, PR, 0);

Option 0 should be the fastest.
Whether 0 is useful for you depends on what you want to do with the text.

pcunite

Posted: 15 Feb 12 at 10:23PM

Ingo wrote:

sTxt += QP.DAExtractPageText(FH, PR, 0);
0 should be the fastest.
If 0 is useful for you depends on what you wanna do with the text.

I only want to know if the word "blah" appears in the PDF file. I understand that I can't do this with image-only PDF files ... that is okay. So I load all the text into a buffer and then search it myself for "blah".
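
For a pure "does this word appear" check, the search can run page by page and stop at the first hit instead of buffering the whole document. The following is a minimal sketch under stated assumptions: `ContainsKeyword`, `AnyPageContains` and the `getPageText` callback are hypothetical names, and the callback stands in for the per-page extraction call. The case folding relies on `std::towlower`, which only handles simple one-to-one mappings, and a keyword split across a page boundary would be missed.

```cpp
#include <algorithm>
#include <cassert>
#include <cwctype>
#include <functional>
#include <string>

// Returns true if `keyword` occurs anywhere in `text`, ignoring case.
bool ContainsKeyword(const std::wstring& text, const std::wstring& keyword)
{
    if (keyword.empty()) return false;  // treat an empty keyword as "not found"

    const auto it = std::search(
        text.begin(), text.end(), keyword.begin(), keyword.end(),
        [](wchar_t a, wchar_t b) {
            return std::towlower(static_cast<std::wint_t>(a)) ==
                   std::towlower(static_cast<std::wint_t>(b));
        });
    return it != text.end();
}

// Scans pages 1..pageCount and stops at the first page that contains the
// keyword. `getPageText` is a hypothetical stand-in for the call that
// extracts the text of a single page.
bool AnyPageContains(int pageCount,
                     const std::function<std::wstring(int)>& getPageText,
                     const std::wstring& keyword)
{
    assert(pageCount >= 0);
    for (int page = 1; page <= pageCount; ++page) {
        if (ContainsKeyword(getPageText(page), keyword)) return true;
    }
    return false;
}
```

Stopping early saves the most on large files where the keyword appears near the beginning. Files that do not contain the keyword still cost a full pass.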

AndrewC

Posted: 22 Mar 12 at 10:58AM

With many modern PDF files, options 0, 1 and 2 fail to extract text from some complex PDFs. Option 8 is an improved version of option 0. Options 3, 4, 5, 6 and 7 use the improved extraction logic.

Andrew.
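
Taken together, the two suggestions in this thread (option 0 for speed, option 8 for complex files) could be combined as a fallback. The following is only a sketch: `PageExtractor` and `ExtractWithFallback` are hypothetical names, the callback stands in for a call such as QP.DAExtractPageText(FH, PR, options) on one page, and it assumes that option 0 returns an empty string when it fails, which the thread does not confirm.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-in for extracting one page's text with a given
// extraction option value.
using PageExtractor = std::function<std::wstring(int options)>;

// Tries the option reported as fastest first and falls back to the
// improved variant only when nothing came back.
std::wstring ExtractWithFallback(const PageExtractor& extractPage)
{
    assert(static_cast<bool>(extractPage));  // the callback must be set

    std::wstring text = extractPage(0);  // option 0: reported as fastest
    if (text.empty()) {
        text = extractPage(8);           // option 8: improved version of 0
    }
    return text;
}
```

The trade-off is that a page with no text at all, such as an image-only page, is extracted twice. If most input files are complex, calling option 8 directly may be simpler and just as fast.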
