

ExtractFilePageText - Options 0 and 8

mLipok (Senior Member), posted 02 Jul 14 at 12:49PM
In some cases I have an issue like this:

I have a PDF that was scanned and OCRed with FineReader Recognition Server 3.
It contains something like this:

blabla
TEXT1 TEXT2
blablabla

When I use option 0 then I get:
....
....
TEXT1 TEXT2
....
....


When I use option 8 then I get:
....
....
TEXT1
TEXT2
....
....

I need to use option 8 because it gives me all the content.
But I want to get the text on the same line, as with option 0.

Here you can find a description of how to test my examples:
http://www.quickpdf.org/forum/forum_posts.asp?TID=2932&PID=12600&title=drawcapturedpagematrix-matrix-howto#12600

mLipok (Senior Member), posted 02 Jul 14 at 1:04PM
I need this because of this:
http://www.quickpdf.org/forum/extractfilepagetext-strange-behavior_topic2906.html

By the way, option 7 works OK.

So now I have a question:

What is the real difference between options 7 and 8?

I have observed that with option 7 the result keeps the indentation: after writing the output to a text file, for example, the text sits on the right side (with extra spaces on the left), provided that it was positioned that way in the PDF file.

Or does option 8, in some specific cases, give more text than option 7?

AndrewC (Moderator Group), posted 03 Jul 14 at 7:02AM

We would need to see the original PDF file.

The problem is most likely that the two text blocks are using a different font or could have overlapping bounding boxes. FineReader doesn't always output the cleanest text boxes.

Option 0 will only work on some files.  Option 8 extracts all text lines and outputs them one by one.  A line of text is considered to be a group of characters that have the same font, size and colour.  You can tell the library to ignore some of these attributes by using SetTextExtractionOptions.

SetTextExtractionOptions is quite powerful and can be used to solve all sorts of complex PDF issues.  
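
To make the line definition above concrete, here is a minimal Python sketch. It is not the library's code and does not call the library: the `Fragment` type, the `split_into_lines` function and the `ignore` parameter are hypothetical names invented for illustration. It only shows why two fragments with different fonts can end up on separate lines, and how ignoring one attribute joins them again.

```python
from itertools import groupby
from typing import NamedTuple


class Fragment(NamedTuple):
    # Hypothetical data model: one piece of text plus its drawing attributes.
    text: str
    font: str
    size: float
    colour: str


def split_into_lines(fragments, ignore=()):
    """Join consecutive fragments whose compared attributes all match."""
    compared = [a for a in ("font", "size", "colour") if a not in ignore]

    def key(fragment):
        return tuple(getattr(fragment, a) for a in compared)

    # groupby starts a new run every time the key changes.
    return [" ".join(f.text for f in run) for _, run in groupby(fragments, key=key)]


fragments = [
    Fragment("TEXT1", "Arial", 10.0, "black"),
    Fragment("TEXT2", "Times", 10.0, "black"),
]

print(split_into_lines(fragments))                    # ['TEXT1', 'TEXT2']
print(split_into_lines(fragments, ignore=("font",)))  # ['TEXT1 TEXT2']
```

With all three attributes compared, the font change splits the text, which matches the option 8 result described in the first post. Whether the real file behaves this way can only be confirmed from the PDF itself.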

Text extraction, like OCR, is not an exact science. Debenu Quick PDF Library has to make decisions about where words and line breaks are located, which requires characters to be grouped first and then analysed into words and then lines.  We can get it wrong when PDFs use strange logic, fonts without any font information, fonts without a ToUnicode table, overlapping bounding boxes, etc.
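
The grouping step described above (characters, then words, then lines) can be sketched roughly as follows. This is an illustration only, not the algorithm the library uses: the `Glyph` type, the `group_lines` function and the `y_tol` and `gap` thresholds are assumptions made up for the example, and real PDFs need far more care (rotation, overlapping boxes, missing font metrics).

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    # Hypothetical input: one character with its position in PDF units.
    char: str
    x: float      # left edge
    y: float      # baseline
    width: float


def group_lines(glyphs, y_tol=2.0, gap=3.0):
    """Group glyphs into lines by baseline, then into words by horizontal gap."""
    lines = []  # each entry: [baseline of the first glyph in the line, glyphs]
    # PDF y coordinates grow upward, so sort descending to read top to bottom.
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0] - g.y) <= y_tol:
            lines[-1][1].append(g)
        else:
            lines.append([g.y, [g]])

    result = []
    for _, members in lines:
        members.sort(key=lambda g: g.x)  # left to right within the line
        text = members[0].char
        for prev, cur in zip(members, members[1:]):
            # A gap wider than the threshold is treated as a word break.
            if cur.x - (prev.x + prev.width) > gap:
                text += " "
            text += cur.char
        result.append(text)
    return result


glyphs = [
    Glyph("A", 10, 700, 6), Glyph("B", 16, 700, 6),
    Glyph("C", 30, 701, 6), Glyph("D", 36, 701, 6),  # baseline off by 1 unit
    Glyph("E", 10, 680, 6), Glyph("F", 16, 680, 6),
]

print(group_lines(glyphs))  # ['AB CD', 'EF']
```

The tolerance is what lets the slightly offset "CD" join "AB" on one line. Choosing it too small splits lines that belong together, and too large merges separate lines, which is one reason extraction results differ between files.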


Andrew

mLipok (Senior Member), posted 03 Jul 14 at 7:36AM
I can send you this PDF file, but you must send me your public GPG key so that I can encrypt it.

AndrewC (Moderator Group), posted 03 Jul 14 at 7:41AM

Michael,
You can create a support case. It will only be seen by support staff and can be deleted when resolved.

Andrew.

mLipok (Senior Member), posted 03 Jul 14 at 7:52AM
I will, but please understand: I follow security procedures for the protection of personal data. Encrypting PDF files with PGP is standard practice in this case, and I cannot ignore this point of my client's internal rules.