Testing QuickPDF for text extraction performance
Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: General Discussion
Forum Description: Discussion board for Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=2152
Printed Date: 23 Nov 24 at 4:25AM
Topic: Testing QuickPDF for text extraction performance
Posted By: pcunite
Date Posted: 15 Feb 12 at 8:40PM
I am evaluating the QuickPDF library (.dll version) for use in a C++ application. The only functionality I need is text extraction. The entire PDF's text will be placed in memory and then I'll search it for keyword terms.
Is QuickPDF suitable for this type of work, and does it offer good performance?
|
Replies:
Posted By: Ingo
Date Posted: 15 Feb 12 at 8:46PM
So you didn't read the documents from the original support pages of the publishers ;-)
Hi!
The searching can be done in your programming language, and the text extraction can be done with QuickPDF, which offers several options. As for your performance question: it always depends ... Try it ;-) http://www.quickpdflibrary.com/help/quickpdf/Extraction.php
Cheers and welcome here, Ingo
|
Posted By: pcunite
Date Posted: 15 Feb 12 at 9:00PM
Well, yes I've read some of the materials. I'm looking at about 5 different solutions and wanted my hand held a little :)
I know how to use your library, I just wanted a fuzzy feeling that it is up to the task for my requirements. Some PDF libraries are more for creation or editing ... I just want the text as fast as I can get it. Is QuickPDF optimized for this?
P.S.
I did not find it referenced anywhere, but can the .LIB version work with C++ Builder 2007, or is it only for Visual Studio? The .DLL version is fine, I'm just wondering.
|
Posted By: Ingo
Date Posted: 15 Feb 12 at 9:05PM
Hi!
This library offers over 500 functions for a low price. Text extraction was already in the first versions many years ago, so it should be stable, but it isn't optimized specifically for text extraction. Opinions will differ, so you'll have to try it.
Cheers, Ingo
|
Posted By: pcunite
Date Posted: 15 Feb 12 at 10:07PM
Thank you for your help. I'm testing the sample function below. Is this the fastest way? I just want to make sure I'm doing all I can. Also, I don't understand DASetTextExtractionOptions (http://www.quickpdflibrary.com/help/quickpdf/DASetTextExtractionOptions.php). Should I use it to optimize anything?
size_t GetText_PDF(std::wstring & sF, std::wstring & sTxt)
{
    int FH, PR, iPages;

    // Open file readonly
    FH = QP.DAOpenFileReadOnly(sF, L"");
    if(FH == 0){return FILE_ERROR_OPEN;}

    // Get page count
    iPages = QP.DAGetPageCount(FH);

    // Loop over pages
    for(int i = 1; i <= iPages; i++)
    {
        // Get a page reference to the current page
        PR = QP.DAFindPage(FH, i);

        // Extract the text from the current page
        sTxt += QP.DAExtractPageText(FH, PR, 8);
    }

    // Close file
    QP.DACloseFile(FH);
    return FILE_SUCCESS;
}
|
Posted By: Ingo
Date Posted: 15 Feb 12 at 10:16PM
sTxt += QP.DAExtractPageText(FH, PR, 0);
0 should be the fastest. If 0 is useful for you depends on what you wanna do with the text.
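Since the honest answer to the performance question is "try it", here is a minimal timing sketch for comparing extraction options on your own files. AverageMillis is a hypothetical helper, not part of QuickPDF; it only measures whatever callable you pass in.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper (not a QuickPDF function): runs `work` `repeats` times
// and returns the average wall-clock time per run in milliseconds.
double AverageMillis(const std::function<void()>& work, int repeats)
{
    if (repeats <= 0) return 0.0;

    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (int i = 0; i < repeats; ++i)
    {
        work();
    }
    const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
    return elapsed.count() / repeats;
}
```

You would wrap a call to a function like GetText_PDF from the post above in a lambda, once per option value, and compare the averages. The first run on a file may be slower because of disk caching, so repeat each measurement a few times.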
|
Posted By: pcunite
Date Posted: 15 Feb 12 at 10:23PM
Ingo wrote:
sTxt += QP.DAExtractPageText(FH, PR, 0);
0 should be the fastest. If 0 is useful for you depends on what you wanna do with the text.
I only want to know whether the word "blah" appears in the PDF file. I understand that I can't do this with image-only PDF files ... that is okay. So I load all the strings into a buffer and then search for "blah" myself.
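If the only question is whether a word appears at all, one possible shortcut is to search page by page and stop at the first hit instead of building one big buffer. The sketch below assumes a hypothetical PageTextFn callback that returns the text of a given page; ContainsNoCase and FindFirstPageWith are illustrative names too, not library functions.

```cpp
#include <algorithm>
#include <cwctype>
#include <functional>
#include <string>

// Case-insensitive substring test on wide strings.
bool ContainsNoCase(const std::wstring& haystack, const std::wstring& needle)
{
    if (needle.empty()) return true;
    auto it = std::search(haystack.begin(), haystack.end(),
                          needle.begin(), needle.end(),
                          [](wchar_t a, wchar_t b) {
                              return std::towlower(a) == std::towlower(b);
                          });
    return it != haystack.end();
}

// Hypothetical callback: given a 1-based page number, return that page's text.
using PageTextFn = std::function<std::wstring(int)>;

// Returns the 1-based number of the first page containing `keyword`,
// or 0 if no page contains it. Pages after the first hit are never extracted.
int FindFirstPageWith(const std::wstring& keyword, int pageCount,
                      const PageTextFn& pageText)
{
    for (int page = 1; page <= pageCount; ++page)
    {
        if (ContainsNoCase(pageText(page), keyword)) return page;
    }
    return 0;
}
```

In the function from the earlier post, the callback would wrap the DAFindPage and DAExtractPageText calls for one page. The trade-off is that a term split across a page boundary would be missed, which the single-buffer approach does not suffer from.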
|
Posted By: AndrewC
Date Posted: 22 Mar 12 at 10:58AM
Options 0, 1, and 2 fail to extract text from some complex modern PDFs. Option 8 is an improved version of option 0. Options 3, 4, 5, 6, and 7 use the improved extraction logic.
Andrew.
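One way to combine the two pieces of advice in this thread (0 is fastest, 8 is more robust) is to try the fast option first and fall back only when a page comes back blank. This is just a sketch under that assumption: ExtractWithOptionFn is a hypothetical callback that extracts the current page with a given option value, and a blank result is only a rough signal that extraction failed.

```cpp
#include <algorithm>
#include <cwctype>
#include <functional>
#include <string>
#include <vector>

// Hypothetical callback: extract the current page's text using `option`.
using ExtractWithOptionFn = std::function<std::wstring(int option)>;

// True if the string is empty or contains only whitespace.
bool IsBlank(const std::wstring& s)
{
    return std::all_of(s.begin(), s.end(),
                       [](wchar_t c) { return std::iswspace(c) != 0; });
}

// Tries each option in order and returns the first non-blank result.
// If `usedOption` is given, it receives the option that produced the text,
// or -1 when every option returned blank text.
std::wstring ExtractWithFallback(const ExtractWithOptionFn& extract,
                                 const std::vector<int>& options,
                                 int* usedOption = nullptr)
{
    for (int option : options)
    {
        std::wstring text = extract(option);
        if (!IsBlank(text))
        {
            if (usedOption) *usedOption = option;
            return text;
        }
    }
    if (usedOption) *usedOption = -1;
    return std::wstring();
}
```

Calling it with the options {0, 8} would mirror the suggestion above. Note that an image-only page returns blank text for every option, so such pages cost one extraction per option tried, and an option that returns partial text would not trigger the fallback.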
|