ExtractFilePageText - Options 0 and 8
mLipok
Senior Member Joined: 23 Apr 14 Location: Poland, Zabrze Status: Offline Points: 453
Posted: 02 Jul 14 at 12:49PM
In some cases I have an issue like this: I have a PDF that was scanned and OCRed with FineReader Recognition Server 3.. The page contains something like this: blabla TEXT1 TEXT2 blablabla .... .... TEXT1 TEXT2 .... .... When I use option 8, I get: .... .... TEXT1 TEXT2 .... .... I need to use option 8 because it gives me all of the content, but I want TEXT1 and TEXT2 on the same line, as with option 0.
Here you can find a description of how to test my examples:
http://www.quickpdf.org/forum/forum_posts.asp?TID=2932&PID=12600&title=drawcapturedpagematrix-matrix-howto#12600
mLipok
I need this because of this:
http://www.quickpdf.org/forum/extractfilepagetext-strange-behavior_topic2906.html
By the way, option 7 works OK. So now I have a question: what is the real difference between options 7 and 8? I have observed that with option 7 the result keeps the indentation, so when the output is written to a text file, text that was on the right side of the PDF page also ends up on the right (with extra spaces on the left). Or does option 8, in some specific cases, return more text than option 7?
AndrewC
Moderator Group Joined: 08 Dec 10 Location: Geelong, Aust Status: Offline Points: 841
The problem is most likely that the two text blocks use different fonts or have overlapping bounding boxes. FineReader doesn't always output the cleanest text boxes. Option 0 will only work on some files. Option 8 extracts all text lines and outputs them one by one. A line of text is considered to be a group of characters that have the same font, size and colour. You can have some of these criteria ignored by using SetTextExtractionOptions, which is quite powerful and can be used to solve all sorts of complex PDF issues.

Text extraction, like OCR, is not an exact science. Debenu Quick PDF Library has to make decisions about where words and line breaks are located, which requires characters to be grouped first and then analysed into words and then lines. We can get it wrong when PDFs use strange logic, fonts without any font information, fonts without a ToUnicode table, overlapping bounding boxes, etc.

Andrew
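To illustrate the grouping step described above, here is a minimal Python sketch of one possible workaround: re-joining fragments that sit on the same visual line regardless of font, size or colour. It is not Debenu Quick PDF Library code. The `Fragment` class, the `group_into_lines` function and the `min_overlap` threshold are hypothetical names introduced for this example, and it assumes you can obtain a bounding box for each extracted fragment (check the ExtractFilePageText documentation for output modes that include coordinates). Coordinates are assumed to be in PDF units with y growing upward.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One extracted piece of text with its bounding box (hypothetical type)."""
    text: str
    left: float    # x of the left edge
    top: float     # y of the top edge (y grows upward)
    bottom: float  # y of the bottom edge


def group_into_lines(fragments, min_overlap=0.5):
    """Merge fragments whose vertical extents overlap into one visual line.

    Two fragments share a line when their vertical overlap is at least
    `min_overlap` times the height of the shorter fragment. Font, size and
    colour are deliberately ignored.
    """
    lines = []  # each entry is a list of fragments on the same visual line
    # Visit fragments from the top of the page to the bottom.
    for frag in sorted(fragments, key=lambda f: -f.top):
        for line in lines:
            ref = line[0]  # compare against the first fragment of the line
            overlap = min(ref.top, frag.top) - max(ref.bottom, frag.bottom)
            shorter = min(ref.top - ref.bottom, frag.top - frag.bottom)
            if shorter > 0 and overlap / shorter >= min_overlap:
                line.append(frag)
                break
        else:
            # No existing line overlaps enough, so start a new one.
            lines.append([frag])
    # Within each line, order fragments left to right and join with spaces.
    return [
        " ".join(f.text for f in sorted(line, key=lambda f: f.left))
        for line in lines
    ]
```

With fragments such as TEXT1 (top 680, bottom 670) and TEXT2 (top 681, bottom 669, slightly larger font), the two boxes overlap vertically and are emitted as one line, "TEXT1 TEXT2". The threshold is a judgment call: noisy OCR boxes may need a lower value, while tightly spaced lines may need a higher one.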
mLipok
I can send you this PDF file, but you must first send me your public GPG key so that I can encrypt it.
AndrewC
Michael, you can create a support case. It will only be seen by support staff and can be deleted when resolved. Andrew.
mLipok
I will, but please understand: I apply security procedures for the protection of personal data. Encrypting PDF files with PGP is standard practice in this case, and I cannot ignore my client's internal rules on this point.