General Discussion - DAExtractPageText problem

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: General Discussion
Forum Description: Discussion board for Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=1667
Printed Date: 13 Mar 26 at 3:07PM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com

Topic: DAExtractPageText problem

Posted By: dpreznik
Subject: DAExtractPageText problem
Date Posted: 03 Dec 10 at 5:53PM

Dear experts,

I am trying to create a C# application that extracts text from a PDF. I am using the DAExtractPageText() method, but the text it returns is distorted: some characters are missing, and blank spaces are inserted here and there within words.

Could you please tell me if it is possible to fix it?

Thank you very much,

Dmitriy

Replies:

Posted By: Paddy
Date Posted: 03 Dec 10 at 8:16PM

Are you using the DLL edition or the ActiveX edition? And also, does your PDF contain any Unicode characters?

Posted By: Ingo
Date Posted: 04 Dec 10 at 9:56AM

Hi Dmitriy!

Try option "0" ... Is the result the same, or is it better?
Generally speaking, text is extracted in the order in which
it was inserted into the page content: first in, first out.
If the first word on a page is "ello", and the author noticed this
after finishing the page and inserted an "H" in front of
the "ello", then during extraction the "H" comes out
at the end of the page content.

With option "4" you can extract word by word with
position data. Using that position data you can
rebuild the actual text rows yourself. Quick PDF has no
built-in support for this.

BTW: A small warning... Don't mix DA functions with
non-DA functions. This won't work ;-)

Cheers and welcome here,
Ingo
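
The row-rebuilding idea described above (option "4" plus position data) can be sketched roughly as follows. This is only an illustration in Python, not Quick PDF code: it assumes the option "4" output has already been parsed into hypothetical (text, x, y) tuples, since the exact output format is not given in this thread, and it assumes the default PDF coordinate system where y grows upward. The name group_into_rows and the y_tolerance value are made up for the example.

```python
from typing import List, Tuple

# Hypothetical pre-parsed form of one extracted word: (text, x_left, y_baseline).
Word = Tuple[str, float, float]


def group_into_rows(words: List[Word], y_tolerance: float = 2.0) -> List[List[str]]:
    """Group words into rows by baseline, then order each row left to right."""
    # Assumption: y grows upward, so descending y walks the page top to bottom.
    ordered = sorted(words, key=lambda w: -w[2])
    rows: List[List[Word]] = []
    for word in ordered:
        # A word joins the current row if its baseline is close to the row's first word.
        if rows and abs(rows[-1][0][2] - word[2]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Within each row, sort by the left edge so reading order no longer
    # depends on the order in which the text was written into the page.
    return [[w[0] for w in sorted(row, key=lambda w: w[1])] for row in rows]


if __name__ == "__main__":
    # The "H" was added last in the page content but sits at the start of row one.
    sample = [
        ("ello", 20.0, 700.0),
        ("world", 60.0, 700.5),
        ("second", 10.0, 680.0),
        ("row", 55.0, 680.0),
        ("H", 10.0, 700.0),
    ]
    for row in group_into_rows(sample):
        print(row)
```

The tolerance is needed because words on the same visual line rarely share exactly the same baseline value; a suitable value probably depends on the font size used in the document.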

Posted By: dpreznik
Date Posted: 06 Dec 10 at 12:27PM

Paddy wrote:

Are you using the DLL edition or the ActiveX edition? And also, does your PDF contain any Unicode characters?

Hi Paddy,

I am using the DLL edition. I am not sure whether my PDF contains Unicode characters.

Posted By: dpreznik
Date Posted: 06 Dec 10 at 12:33PM

Ingo wrote:

Hi Dmitriy!

Try option "0" ... Is the result the same, or is it better?

Hi Ingo,

Thank you for your answer. No, it is not better.

Ingo wrote:

Generally speaking, text is extracted in the order in which
it was inserted into the page content: first in, first out.
If the first word on a page is "ello", and the author noticed this
after finishing the page and inserted an "H" in front of
the "ello", then during extraction the "H" comes out
at the end of the page content.

With option "4" you can extract word by word with
position data. Using that position data you can
rebuild the actual text rows yourself. Quick PDF has no
built-in support for this.

That is probably what happened to me, and I don't think there is a solution for it that I could use.

Ingo wrote:

BTW: A small warning... Don't mix DA functions with
non-DA functions. This won't work ;-)

Thank you very much for the warning.

May I ask one more question?

I found Quick PDF Lite. Does it support extracting images from a PDF document? I tried it, but I don't yet know how to use its methods, which are different from those in the Professional Quick PDF.

I would use it with C++.

Thank you very much.

Dmitriy

Posted By: Ingo
Date Posted: 06 Dec 10 at 8:18PM

Hi Dmitriy!

You can only extract images that you inserted in the same session.

It won't work on other documents.

Cheers, Ingo

Posted By: dpreznik
Date Posted: 06 Dec 10 at 8:20PM

Thank you very much for your answer.

Posted By: Giuseppe
Date Posted: 13 Dec 10 at 11:05AM

Hi, the algorithm is broken. You have to use a workaround: set a delta X and a delta Y and rebuild the words yourself...
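
One possible reading of this workaround, sketched in Python rather than C#: neighbouring text fragments are glued back into one word when the horizontal gap between them is at most delta_x and their baselines differ by at most delta_y. The fragment layout (text, x_left, x_right, y_baseline), the function name merge_fragments, and the threshold values are all assumptions for illustration; none of them come from Quick PDF. The fragments are assumed to be in reading order already.

```python
from typing import List, Tuple

# Hypothetical fragment form: (text, x_left, x_right, y_baseline).
Fragment = Tuple[str, float, float, float]


def merge_fragments(
    fragments: List[Fragment], delta_x: float = 1.0, delta_y: float = 1.0
) -> List[str]:
    """Merge fragments that almost touch on the same baseline into single words.

    Assumes `fragments` is already in reading order.
    """
    words: List[str] = []
    prev_right = 0.0
    prev_y = 0.0
    for text, left, right, y in fragments:
        gap = left - prev_right
        same_line = abs(y - prev_y) <= delta_y
        if words and same_line and -delta_x <= gap <= delta_x:
            # Close enough to the previous fragment: treat it as the same word.
            words[-1] += text
        else:
            # Large gap or a different line: start a new word.
            words.append(text)
        prev_right, prev_y = right, y
    return words


if __name__ == "__main__":
    sample = [
        ("Ex", 10.0, 20.0, 700.0),
        ("tract", 20.3, 45.0, 700.0),  # 0.3 units after "Ex", so same word
        ("text", 50.0, 70.0, 700.0),   # 5 units after "tract", so new word
        ("here", 10.0, 30.0, 685.0),   # different baseline, so new word
    ]
    print(merge_fragments(sample))
```

Thresholds this small only make sense relative to the font size; scaling delta_x by the font size of each fragment would likely be more robust.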

Posted By: billycl
Date Posted: 23 Feb 11 at 7:02PM

DAExtractPageText with Options=4 returns
TQuickPDF0723.AddArcToPath(CenterX, as one word.
I think that currently only the space character is treated as a delimiter.
Would it be possible (in the future) to define more delimiters, such as "(),.:-"?
For comparison, I see this result
tquickpdf0723. addarctopath( centerx,
in software that uses Adobe Acrobat Pro (Acrobat = slow).
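
Until the library offers something like this, the extra splitting could be done as a post-processing step on the extracted words. The following is a minimal Python sketch, not a Quick PDF feature: it breaks each whitespace-separated word after any run of the delimiters "(),.:-" and keeps the delimiter attached to the preceding piece, as in the Acrobat-based result quoted above. The function name split_on_delimiters is made up for the example.

```python
import re
from typing import List

# Extra delimiters suggested in the post; whitespace is handled separately.
DELIMITERS = "(),.:-"

_CLASS = re.escape(DELIMITERS)
# Either a run of non-delimiters followed by any trailing delimiters,
# or a run of delimiters on its own (e.g. at the start of a word).
_TOKEN = re.compile(f"[^{_CLASS}]+[{_CLASS}]*|[{_CLASS}]+")


def split_on_delimiters(text: str) -> List[str]:
    """Split text on whitespace, then break each word after delimiter runs."""
    tokens: List[str] = []
    for chunk in text.split():
        tokens.extend(_TOKEN.findall(chunk))
    return tokens


if __name__ == "__main__":
    print(split_on_delimiters("TQuickPDF0723.AddArcToPath(CenterX,"))
```

Keeping the delimiter with the preceding piece means no characters are lost, so the original string can be restored by concatenating the pieces of each word.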