Debenu Quick PDF Library - PDF SDK Community Forum Homepage

Is it possible to extract text from this PDF?

bart_bender
Beginner
Joined: 04 Oct 11
Location: Spain
Posted: 15 Nov 11 at 12:45PM
Hello,

I'm trying to extract text from the following PDF document:
Tiff6.pdf -> click to download: http://www.megaupload.com/?d=4WP3ZVO0

The GetPageText method always returns an empty string.

The security flag value for (5 = Content Copying or Extraction) is (6 = Allowed).

Is it possible to extract the text?
If I open the document with Acrobat Reader, I can save the text from the document.

I'm using version 8.12 of the DLL with VB.NET.

Thanks in advance,
best regards


Ingo
Moderator Group
Joined: 29 Oct 05
Posted: 15 Nov 11 at 6:49PM
Hi Bart!

What's the name of the PDF?!
Yes, it's tiff...pdf!
I don't want to wait through the MegaUpload countdown, but I think it's not possible to extract text because the content was a TIFF that was converted to a PDF document (so it's still an image).
That's the main problem. There are OCR tools that can add the text content read from the embedded image to the PDF.

Cheers, Ingo

AndrewC
Moderator Group
Joined: 08 Dec 10
Location: Geelong, Aust
Posted: 16 Nov 11 at 4:31AM
The text extraction is working pretty well on this PDF with 8.13 beta 2. I am getting the correct string results from this PDF.

This file is secured with a master password, but that is not a problem for QPL 8.11. If you were using QPL 7.xx, you would need to call QP.SetAdvancePassword(""); before QP.LoadFromFile().

Can you send me the source code you are using for text extraction?

Andrew.

Ingo
Posted: 16 Nov 11 at 7:10AM
Hi again!
 
Because Andrew said it's working, I used MegaUpload after all
and waited the 45 seconds :-(
And yes, he's right: I can extract the text content, too.
I tried QP 7.26 and it works without SetAdvancePassword.
Just load, decrypt and extract, and it works.
 
Cheers, Ingo
 
bart_bender
Posted: 17 Nov 11 at 12:09PM
Hello again,
Andrew, Ingo, thanks for your help.

Andrew, I'm using version 8.12.

This is a sample of the code that I'm using:

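    ' Requires: Imports System.IO (for MemoryStream and SeekOrigin)

    ' Loads a PDF from a MemoryStream and returns the text of all its pages.
    Private Function pGetPDFContent(ByVal documento As MemoryStream, Optional ByVal Password As String = "") As String
        documento.Seek(0, SeekOrigin.Begin)
        ' Note: GetBuffer returns the stream's whole internal buffer, which can include
        ' unused bytes past Length; ToArray() would return only the bytes written.
        Dim docid As Integer = Qp.LoadFromString(documento.GetBuffer, Password)
        ' Use the ID of the document that is now selected.
        docid = Qp.SelectedDocument
        Return pGetPDFContent(docid)
    End Function

    ' Extracts the text of every page of an already loaded document.
    Private Function pGetPDFContent(ByVal docId As Integer, Optional ByVal CloseDoc As Boolean = True) As String
        Qp.SelectDocument(docId)
        Dim salida As String = ""
        For i = 1 To Qp.PageCount
            Qp.SelectPage(i)
            ' Undo any page rotation before extracting.
            Qp.RotatePage(-Qp.PageRotation)
            ' Append this page's text using extraction option 0.
            salida &= Qp.GetPageText(0)
        Next
        ' Release the document unless the caller wants to keep it open.
        If CloseDoc Then
            Qp.RemoveDocument(docId)
        End If
        Return salida
    End Function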


AndrewC
Posted: 17 Nov 11 at 12:19PM
1. Can you check the value of docid to make sure it is not 0?
2. You shouldn't need the QP.RotatePage call.

I suspect that the document is not being loaded correctly. If the document is loaded correctly, then QP.PageCount should be 121.

Andrew.

bart_bender
Posted: 17 Nov 11 at 1:17PM
Yes, the document loads correctly and the page count is 121, but GetPageText returns "" with or without the rotation.

AndrewC
Posted: 25 Nov 11 at 6:22AM
GetPageText(0) uses a very simple algorithm to extract text and is not suitable for all documents.

Can you change the code to GetPageText(3) and see if the text is extracted correctly? We have just re-added GetPageText(1), which uses the more complex text extraction algorithm but only outputs the raw text strings, similar to option 0. This will be included in the 8.13 final release, which is due very soon.
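
The idea of trying a second extraction option when the first one returns nothing could be sketched roughly as below. This is only an illustration under stated assumptions: ExtractWithFallback and the stand-in extractor are hypothetical names and not part of Quick PDF Library. In real code the delegate would presumably wrap a call such as Qp.GetPageText(m), made after Qp.SelectPage. The stand-in simulates an assumed document for which option 0 yields nothing and option 3 yields text.

```vb
Imports System
Imports System.Collections.Generic

Module TextExtractionFallback

    ' Hypothetical helper (not a Quick PDF Library function): calls the supplied
    ' extractor with each option in order and returns the first non-empty result.
    Public Function ExtractWithFallback(ByVal extract As Func(Of Integer, String),
                                        ByVal options As IEnumerable(Of Integer)) As String
        For Each extractOption As Integer In options
            Dim result As String = extract(extractOption)
            If Not String.IsNullOrEmpty(result) Then
                Return result
            End If
        Next
        ' Every option returned an empty string.
        Return ""
    End Function

    Sub Main()
        ' Stand-in for Qp.GetPageText: an assumed document where option 0
        ' returns nothing and option 3 returns text.
        Dim fakeExtractor As Func(Of Integer, String) =
            Function(m As Integer) If(m = 3, "page text", "")

        ' Try option 0 first, then fall back to option 3.
        Console.WriteLine(ExtractWithFallback(fakeExtractor, New Integer() {0, 3}))
    End Sub

End Module
```

Passing the extractor as a delegate keeps the sketch independent of the library, so the fallback order can be tried without a PDF at hand. If every option returns an empty string, the page may contain only images, in which case OCR would be needed.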

Andrew.

bart_bender
Posted: 25 Nov 11 at 7:17AM
Thanks Andrew

AndrewC
Posted: 15 Feb 12 at 6:21AM
Most professional PDF tools, such as Acrobat Pro, Nitro Professional and Foxit Phantom, have built-in OCR engines that create searchable text based on the OCR results.

Andrew.
