
DAExtractPageText problem

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: General Discussion
Forum Description: Discussion board for Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=1667
Printed Date: 22 Nov 24 at 8:02PM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com


Topic: DAExtractPageText problem
Posted By: dpreznik
Subject: DAExtractPageText problem
Date Posted: 03 Dec 10 at 5:53PM
Dear experts,
 
I am trying to create an application in C# to extract text from a PDF. I am using the DAExtractPageText() method, but the text it returns is distorted: some characters are missing, and blank spaces are inserted here and there within words.
Could you please tell me if it is possible to fix it?
 
Thank you very much,
 
Dmitriy



Replies:
Posted By: Paddy
Date Posted: 03 Dec 10 at 8:16PM
Are you using the DLL edition or the ActiveX edition? And also, does your PDF contain any Unicode characters?


Posted By: Ingo
Date Posted: 04 Dec 10 at 9:56AM
Hi Dmitriy!

Try option "0". Is the result the same, or is it better?
Generally speaking, extraction returns the text in the order
in which it was inserted into the page: first in, first out.
If the first word on a page is "ello", and you only notice this
once the page is finished and then insert an "H" in front of
the "ello", the "H" will be extracted
at the end of the page content.

With option "4" you can extract word by word with
position data. Using this position data you can
reconstruct the actual text rows yourself. QuickPDF
does not do this for you.

BTW: A small warning... Don't mix DA-functions with
non-DA-functions - this won't work ;-)

Cheers and welcome here,
Ingo
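
The row reconstruction described above for option "4" could be sketched roughly as follows. This is only a minimal Python sketch, not Quick PDF Library code. It assumes the option "4" output has already been parsed into hypothetical (x, y, text) tuples and that y grows upward from the bottom of the page. The names group_into_rows and y_tolerance are illustrative and not part of the library.

```python
from typing import List, Tuple

# Hypothetical, already-parsed word record: (x, y, text).
# Parsing the real DAExtractPageText output is left out on purpose.
Word = Tuple[float, float, str]


def group_into_rows(words: List[Word], y_tolerance: float = 2.0) -> List[str]:
    """Group words into text rows by their y position, top of page first."""
    rows = []  # each entry is [anchor_y, [words in that row]]
    # Assumption: larger y means higher on the page, so sort descending.
    for word in sorted(words, key=lambda w: -w[1]):
        if rows and abs(rows[-1][0] - word[1]) <= y_tolerance:
            rows[-1][1].append(word)
        else:
            rows.append([word[1], [word]])
    # Within each row, order the words left to right and join them.
    return [
        " ".join(text for _, _, text in sorted(row, key=lambda w: w[0]))
        for _, row in rows
    ]


if __name__ == "__main__":
    sample = [
        (50.0, 700.0, "world"),   # extracted first, but sits to the right
        (10.0, 700.5, "Hello"),
        (10.0, 680.0, "second"),
        (60.0, 680.0, "row"),
    ]
    for line in group_into_rows(sample):
        print(line)
```

The tolerance is needed because words on the same visual row rarely share exactly the same y value. A suitable value depends on the font size of the document.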
 


Posted By: dpreznik
Date Posted: 06 Dec 10 at 12:27PM
Originally posted by Paddy:

Are you using the DLL edition or the ActiveX edition? And also, does your PDF contain any Unicode characters?
Hi Paddy,
 
I am using DLL edition. I am not sure if my PDF contains Unicode characters.


Posted By: dpreznik
Date Posted: 06 Dec 10 at 12:33PM
Originally posted by Ingo:

Hi Dmitriy!

Try option "0". Is the result the same, or is it better?
Hi Ingo,
 
Thank you for your answer. No, it is not better.
Originally posted by Ingo:

Generally speaking, extraction returns the text in the order
in which it was inserted into the page: first in, first out.
If the first word on a page is "ello", and you only notice this
once the page is finished and then insert an "H" in front of
the "ello", the "H" will be extracted
at the end of the page content.

With option "4" you can extract word by word with
position data. Using this position data you can
reconstruct the actual text rows yourself. QuickPDF
does not do this for you.
Probably that is what happened to me. And I think there is no solution for it that I could use.
Originally posted by Ingo:

BTW: A small warning... Don't mix DA-functions with
non-DA-functions - this won't work ;-) 
Thank you very much for the warning.
 
May I ask one more question?
I found Quick PDF Lite. Does it support extracting images from a PDF document? I tried it, but I don't yet know how to use its methods, which differ from those in the professional Quick PDF Library.
I would use it with C++.
Thank you very much.
Dmitriy


Posted By: Ingo
Date Posted: 06 Dec 10 at 8:18PM
Hi Dmitriy!
 
You can only extract images that you inserted in the same session.
It will not work on other documents.
 
Cheers, Ingo


Posted By: dpreznik
Date Posted: 06 Dec 10 at 8:20PM
Thank you very much for your answer.


Posted By: Giuseppe
Date Posted: 13 Dec 10 at 11:05AM
Hi, the algorithm is broken, so you must use a workaround: set deltax and deltay and rebuild the words yourself...
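
One possible reading of the deltax/deltay workaround is sketched below. It is a hedged Python sketch rather than library code: fragments are assumed to be already parsed into a hypothetical Fragment record with left and right x edges, and delta_x and delta_y are taken to mean the largest horizontal gap and vertical offset at which two fragments still belong to the same word. All names are illustrative.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """Hypothetical, already-parsed piece of text with its horizontal extent."""
    x_left: float
    x_right: float
    y: float
    text: str


def merge_fragments(fragments: List[Fragment],
                    delta_x: float = 1.5,
                    delta_y: float = 2.0) -> List[Fragment]:
    """Rebuild words by joining fragments that sit close together on one row."""
    # Step 1: bucket fragments into rows whose y values differ by <= delta_y.
    rows: List[List[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: -f.y):
        if rows and abs(rows[-1][0].y - frag.y) <= delta_y:
            rows[-1].append(frag)
        else:
            rows.append([frag])

    # Step 2: inside each row, walk left to right and merge small gaps.
    words: List[Fragment] = []
    for row in rows:
        row.sort(key=lambda f: f.x_left)
        current = row[0]
        for frag in row[1:]:
            if frag.x_left - current.x_right <= delta_x:
                current = Fragment(current.x_left,
                                   max(current.x_right, frag.x_right),
                                   current.y,
                                   current.text + frag.text)
            else:
                words.append(current)
                current = frag
        words.append(current)
    return words


if __name__ == "__main__":
    # "ello" was written first and the "H" was added later, so it arrives last.
    extracted = [
        Fragment(16.0, 40.0, 700.0, "ello"),
        Fragment(46.0, 70.0, 700.0, "world"),
        Fragment(10.0, 15.5, 700.0, "H"),
    ]
    print([w.text for w in merge_fragments(extracted)])
```

Because the merge is driven by position rather than by extraction order, a letter that was inserted late ends up in front of its word again. The thresholds usually need tuning per document, since sensible gaps scale with font size.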


Posted By: billycl
Date Posted: 23 Feb 11 at 7:02PM
DAExtractPageText with Options=4 returns
 TQuickPDF0723.AddArcToPath(CenterX,    as one word.
I think that currently only the space character is treated as a delimiter.
Would it be possible (in the future) to define more delimiters, such as "(),.:-"?
I see this result
 tquickpdf0723.     addarctopath(    centerx,
in software that uses Adobe Acrobat Pro (Acrobat = slow).
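
Until something like this exists in the library, the extra delimiters could be applied as a post-processing step on each extracted word. The following is a minimal Python sketch under that assumption. split_on_delimiters is a hypothetical helper, not a Quick PDF Library function, and it keeps each delimiter attached to the token before it, as in the Acrobat-based result shown above (without lowercasing).

```python
import re
from typing import List


def split_on_delimiters(word: str, delimiters: str = "(),.:-") -> List[str]:
    """Split one extracted word after each run of delimiter characters."""
    cls = re.escape(delimiters)  # make the characters safe inside [...]
    # Either: text followed by any trailing delimiters,
    # or: a run of delimiters with no text in front of it.
    pattern = f"[^{cls}]+[{cls}]*|[{cls}]+"
    return re.findall(pattern, word)


if __name__ == "__main__":
    print(split_on_delimiters("TQuickPDF0723.AddArcToPath(CenterX,"))
```

Note that this only splits the text. If position data is needed for each part, it would have to be estimated separately, for example from the character counts.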




