DAExtractPageText problem

dpreznik

Topic: DAExtractPageText problem
Posted: 03 Dec 10 at 5:53PM

Dear experts,

I am trying to write a C# application that extracts text from a PDF. I am using the DAExtractPageText() method, but the text it returns is distorted: some characters are missing, and blank spaces are inserted here and there within words.

Could you please tell me if it is possible to fix it?

Thank you very much,

Dmitriy

Paddy

Posted: 03 Dec 10 at 8:16PM

Are you using the DLL edition or the ActiveX edition? And also, does your PDF contain any Unicode characters?

Ingo

Posted: 04 Dec 10 at 9:56AM

Hi Dmitriy!

Try option "0". Is the result the same, or is it better?
Generally speaking, text is extracted in the order in which
it was inserted into the page content: first in, first out.
Suppose the first word on a page is "ello", and only after
the rest of the page has been written do you notice this
and insert an "H" in front of the "ello". On extraction,
the "H" comes out at the end of the page content.

With option "4" you can extract word by word with
position data. Using this position data you can
rebuild the real text rows on your own. QuickPDF
has no built-in support for this.
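
A minimal sketch of the row rebuilding described above, written in Python for brevity. It assumes the word-by-word output has already been parsed into (text, x, y) tuples with the origin at the bottom-left of the page. That tuple shape, the name `rebuild_rows` and the `row_tolerance` parameter are made up for illustration and are not part of the library.

```python
from typing import List, Tuple

# Hypothetical input shape: (text, x_left, y_baseline) per word, with y
# growing upward. This is an assumption, not the library's output format.
Word = Tuple[str, float, float]


def rebuild_rows(words: List[Word], row_tolerance: float = 2.0) -> List[str]:
    """Group words into rows by y position, then order each row by x."""
    rows: List[List[Word]] = []
    # Visit the words from the top of the page to the bottom.
    for word in sorted(words, key=lambda w: -w[2]):
        # Stay in the current row while the baseline is close to that
        # of the row's first word; otherwise start a new row.
        if rows and abs(rows[-1][0][2] - word[2]) <= row_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Within a row, reading order is left to right.
    return [" ".join(w[0] for w in sorted(row, key=lambda w: w[1]))
            for row in rows]


# Words in insertion order: "Hello" was added last although it sits at
# the start of the first row.
words = [
    ("world", 60.0, 700.0),
    ("second", 10.0, 685.5),
    ("row", 55.0, 686.0),
    ("Hello", 10.0, 700.5),
]
print(rebuild_rows(words))  # ['Hello world', 'second row']
```

The tolerance probably needs tuning per document, since it depends on the font size and on the measurement units in use.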

BTW: A small warning... Don't mix DA-functions with
non-DA-functions - this won't work ;-)

Cheers and welcome here,
Ingo

dpreznik

Posted: 06 Dec 10 at 12:27PM

Paddy wrote:

Are you using the DLL edition or the ActiveX edition? And also, does your PDF contain any Unicode characters?

Hi Paddy,

I am using the DLL edition. I am not sure whether my PDF contains Unicode characters.

dpreznik

Posted: 06 Dec 10 at 12:33PM

Ingo wrote:

Hi Dmitriy!

Try option "0". Is the result the same, or is it better?

Hi Ingo,

Thank you for your answer. No, it is not better.

Ingo wrote:

Generally speaking, text is extracted in the order in which
it was inserted into the page content: first in, first out.
Suppose the first word on a page is "ello", and only after
the rest of the page has been written do you notice this
and insert an "H" in front of the "ello". On extraction,
the "H" comes out at the end of the page content.

With option "4" you can extract word by word with
position data. Using this position data you can
rebuild the real text rows on your own. QuickPDF
has no built-in support for this.

Probably that is what happened to me. And I think there is no solution for it that I could use.

Ingo wrote:

BTW: A small warning... Don't mix DA-functions with
non-DA-functions - this won't work ;-)

Thank you very much for the warning.

May I ask one more question?

I found Quick PDF Lite. Does it support extracting images from a PDF document? I tried it, but I don't yet know how to use its methods, which are different from those of the professional Quick PDF.

I would use it with C++.

Thank you very much.

Dmitriy

Ingo

Posted: 06 Dec 10 at 8:18PM

Hi Dmitriy!

You can only extract images that you inserted yourself in the same session.

It does not work for other documents.

Cheers, Ingo

dpreznik

Posted: 06 Dec 10 at 8:20PM

Thank you very much for your answer.

Giuseppe

Posted: 13 Dec 10 at 11:05AM

Hi, the algorithm is faulty, so you have to use a workaround: set deltax and deltay and rebuild the words yourself...
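
The post does not say how deltax and deltay are meant to be used, so the following Python sketch is only one plausible reading: treat them as distance tolerances and glue neighbouring text fragments back into words when they are close enough. The fragment tuple shape and the function name `remake_words` are hypothetical, and the fragments are assumed to be in reading order already.

```python
from typing import List, Optional, Tuple

# Hypothetical fragment shape: (text, x_left, x_right, y_baseline).
Fragment = Tuple[str, float, float, float]


def remake_words(fragments: List[Fragment],
                 deltax: float, deltay: float) -> List[str]:
    """Glue consecutive fragments into words when they are close together.

    Assumes the fragments are already in reading order.
    """
    words: List[str] = []
    prev: Optional[Fragment] = None
    for frag in fragments:
        text, x_left, _x_right, y = frag
        # Same row: the baselines differ by at most deltay.
        same_row = prev is not None and abs(prev[3] - y) <= deltay
        # Close: the gap to the previous fragment's right edge is small.
        if same_row and (x_left - prev[2]) <= deltax:
            words[-1] += text
        else:
            words.append(text)
        prev = frag
    return words


fragments = [
    ("Ex", 10.0, 22.0, 700.0),
    ("tract", 22.5, 50.0, 700.0),   # 0.5 gap: same word as "Ex"
    ("text", 56.0, 75.0, 700.3),    # 6.0 gap: a new word
    ("next", 10.0, 30.0, 686.0),    # different row: a new word
]
print(remake_words(fragments, deltax=1.5, deltay=2.0))
# ['Extract', 'text', 'next']
```

Choosing deltax smaller than a typical space width, and deltay smaller than the line spacing, removes spaces that were inserted inside words without merging real neighbours.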

billycl

Posted: 23 Feb 11 at 7:02PM

DAExtractPageText with Options=4 returns
TQuickPDF0723.AddArcToPath(CenterX,
as one word. I think that currently only the space character is treated as a delimiter.
Would it be possible (in the future) to define more delimiters, such as "(),.:-"?
I see this result
tquickpdf0723. addarctopath( centerx,
in software that uses Adobe Acrobat Pro (Acrobat = slow).
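
Until the library offers something like this, the extra delimiters could be applied as a post-processing step on each extracted word. The Python sketch below splits after each delimiter and keeps the delimiter attached to the preceding piece, which matches the Acrobat-style result shown above except for the lowercasing. The function name `split_on_delimiters` is made up for this example.

```python
import re
from typing import List


def split_on_delimiters(token: str, delimiters: str = "(),.:-") -> List[str]:
    """Split a token after each run of delimiters, keeping them attached."""
    if not delimiters:
        return [token] if token else []
    cls = re.escape(delimiters)
    # Either a run of ordinary characters followed by any delimiters,
    # or a run of delimiters at the very start of the token.
    pattern = re.compile(rf"[^{cls}]+[{cls}]*|[{cls}]+")
    return pattern.findall(token)


print(split_on_delimiters("TQuickPDF0723.AddArcToPath(CenterX,"))
# ['TQuickPDF0723.', 'AddArcToPath(', 'CenterX,']
```

Note that this only splits the text. If position data is needed for each piece, the bounding box of the original word would have to be divided as well, which this sketch does not attempt.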
