DAExtractPageText problem

dpreznik (Posted: 03 Dec 10 at 5:53PM)
Dear experts,
 
I am trying to create an application in C# to extract text from a PDF. I am using the DAExtractPageText() method, but the text it returns is distorted: some characters are missing, and blank spaces are inserted here and there within words.
Could you please tell me if it is possible to fix it?
 
Thank you very much,
 
Dmitriy

Paddy (Posted: 03 Dec 10 at 8:16PM)
Are you using the DLL edition or the ActiveX edition? And also, does your PDF contain any Unicode characters?

Ingo (Posted: 04 Dec 10 at 9:56AM)
Hi Dmitriy!

Try option "0" ... Is the result the same, or is it better?
Generally speaking, extraction returns the text in the order
in which it was inserted into the page content: first in, first out.
If the first word on a page is "ello", and the author noticed this
only at the end of the page and then inserted an "H" in front of
the "ello", the "H" will be extracted at the end of the page content.

With option "4" you can extract word by word with
position data. Using this position data you can
rebuild the actual text rows on your own. QuickPDF
has no built-in support for that.
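
A minimal sketch of how text rows could be rebuilt from word positions, written in Python for brevity. It does not call QuickPDF at all: `Word`, `group_into_rows` and the `y_tolerance` value are hypothetical names chosen for illustration, and it assumes the word-by-word output has already been parsed into a text plus an x and a y coordinate, with y growing upwards as is usual in PDF page space.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word (assumed, in page units)
    y: float  # vertical position of the word (assumed, in page units)


def group_into_rows(words: List[Word], y_tolerance: float = 2.0) -> List[str]:
    """Rebuild text rows from words that may arrive in any order."""
    rows: List[List[Word]] = []
    # Visit words from the top of the page to the bottom (larger y first).
    for word in sorted(words, key=lambda w: -w.y):
        # A word joins the current row if its y is close to the row's first word.
        if rows and abs(rows[-1][0].y - word.y) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Inside each row, order the words from left to right.
    return [" ".join(w.text for w in sorted(row, key=lambda w: w.x)) for row in rows]


# Words in "insertion order": "Hello" was added last but belongs first.
words = [
    Word("world", 60.0, 700.0),
    Word("Second", 10.0, 686.0),
    Word("row", 55.0, 686.3),
    Word("Hello", 10.0, 700.2),
]
print(group_into_rows(words))  # ['Hello world', 'Second row']
```

The tolerance is a guess that has to be tuned per document; rotated text or multi-column layouts would need more than this simple top-to-bottom, left-to-right ordering.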

BTW: A small warning... Don't mix DA-functions with
non-DA-functions - this won't work ;-)

Cheers and welcome here,
Ingo
 

dpreznik (Posted: 06 Dec 10 at 12:27PM)
Paddy wrote:
> Are you using the DLL edition or the ActiveX edition? And also, does your PDF contain any Unicode characters?

Hi Paddy,
 
I am using the DLL edition. I am not sure whether my PDF contains Unicode characters.

dpreznik (Posted: 06 Dec 10 at 12:33PM)
Ingo wrote:
> Hi Dmitriy!
>
> Try option "0" ... Is the result the same, or is it better?

Hi Ingo,
 
Thank you for your answer. No, it is not better.

Ingo wrote:
> Generally speaking, extraction returns the text in the order
> in which it was inserted into the page content: first in, first out.
> If the first word on a page is "ello", and the author noticed this
> only at the end of the page and then inserted an "H" in front of
> the "ello", the "H" will be extracted at the end of the page content.
>
> With option "4" you can extract word by word with
> position data. Using this position data you can
> rebuild the actual text rows on your own. QuickPDF
> has no built-in support for that.

That is probably what happened to me, and I do not think there is a solution for it that I could use.

Ingo wrote:
> BTW: A small warning... Don't mix DA-functions with
> non-DA-functions - this won't work ;-)

Thank you very much for the warning.
 
May I ask one more question?
I found Quick PDF Lite. Does it support extracting images from a PDF document? I tried it, but I do not yet know how to use its methods, which differ from those of the professional Quick PDF Library.
I would use it with C++.
Thank you very much.
Dmitriy

Ingo (Posted: 06 Dec 10 at 8:18PM)
Hi Dmitriy!
 
You can only extract images that you inserted yourself in the same session.
It will not work on other documents.
 
Cheers, Ingo

dpreznik (Posted: 06 Dec 10 at 8:20PM)
Thank you very much for your answer.

Giuseppe (Posted: 13 Dec 10 at 11:05AM)
Hi, the algorithm is flawed. You have to use a workaround: set a deltax and a deltay and rebuild the words yourself...
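
One possible reading of this workaround, sketched in Python under stated assumptions: deltax and deltay are taken to be tolerances, where fragments whose vertical positions differ by at most `delta_y` share a row, and neighbours on a row whose horizontal gap is at most `delta_x` are glued into one word. The `Fragment` type, `rebuild_words` and the default values are hypothetical, and the fragments are assumed to have been parsed from the extraction output beforehand.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    text: str
    x_left: float   # left edge of the fragment (assumed)
    x_right: float  # right edge of the fragment (assumed)
    y: float        # vertical position of the fragment (assumed)


def rebuild_words(fragments: List[Fragment],
                  delta_x: float = 1.5,
                  delta_y: float = 2.0) -> List[List[str]]:
    """Return one list of rebuilt words per row, top row first."""
    # Step 1: bucket fragments into rows using the delta_y tolerance.
    rows: List[List[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: -f.y):
        if rows and abs(rows[-1][0].y - frag.y) <= delta_y:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    # Step 2: inside each row, glue neighbours whose gap is at most delta_x.
    result: List[List[str]] = []
    for row in rows:
        row.sort(key=lambda f: f.x_left)
        words = [row[0].text]
        for prev, cur in zip(row, row[1:]):
            if cur.x_left - prev.x_right <= delta_x:
                words[-1] += cur.text   # small gap: same word
            else:
                words.append(cur.text)  # large gap: new word
        result.append(words)
    return result


# Fragments in extraction order; the "H" arrives last, as in the example above.
fragments = [
    Fragment("ello", 16.0, 38.0, 700.0),
    Fragment("wor", 44.0, 60.0, 700.0),
    Fragment("ld", 60.5, 70.0, 700.1),
    Fragment("H", 10.0, 15.8, 700.0),
]
print(rebuild_words(fragments))  # [['Hello', 'world']]
```

Suitable tolerances depend on font size and character spacing, so in practice they would probably be scaled by the text size rather than fixed.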

billycl (Posted: 23 Feb 11 at 7:02PM)
DAExtractPageText with Options=4 returns
 TQuickPDF0723.AddArcToPath(CenterX,    as 1 word.
I think that currently only the space character is treated as a delimiter.
Would it be possible (in the future) to define more delimiters, such as "(),.:-"?
I see this result
 tquickpdf0723.     addarctopath(    centerx,
in software that uses Adobe Acrobat Pro (Acrobat = slow).
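
Until the library offers something like this, the extra splitting could be done after extraction. A minimal sketch in Python, assuming each extracted word is available as a plain string; `split_after_delimiters` is a hypothetical helper, not a QuickPDF function, and it keeps each delimiter attached to the piece before it, as in the Acrobat-based result above.

```python
import re
from typing import List


def split_after_delimiters(token: str, delimiters: str = "(),.:-") -> List[str]:
    """Split a token after every delimiter character, keeping the delimiter."""
    cls = re.escape(delimiters)
    # Either "some non-delimiters followed by one delimiter",
    # or a trailing run of non-delimiters with no delimiter after it.
    pattern = "[^" + cls + "]*[" + cls + "]|[^" + cls + "]+"
    return re.findall(pattern, token)


print(split_after_delimiters("TQuickPDF0723.AddArcToPath(CenterX,"))
# ['TQuickPDF0723.', 'AddArcToPath(', 'CenterX,']
```

Splitting on "." also breaks up decimal numbers such as 3.14, so the delimiter set would need to be chosen to suit the documents being processed.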
