I need help - I can help - Is it possible to extract text from this PDF?

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=2038

Topic: Is it possible to extract text from this PDF?

Posted By: bart_bender
Subject: Is it possible to extract text from this PDF?
Date Posted: 15 Nov 11 at 12:45PM

Hello,

I'm trying to extract text from the following PDF document:
Tiff6.pdf -> click to download: http://www.megaupload.com/?d=4WP3ZVO0

The GetPageText method always returns an empty string.

The security flag value for (5 = Content Copying or Extraction) is (6 = Allowed).

Is it possible to extract the text?
If I open the document with Acrobat Reader, I can save the text from the document.

I'm using version 8.12 of the DLL with VB.NET.

thanks in advance
best regards

Replies:

Posted By: Ingo
Date Posted: 15 Nov 11 at 6:49PM

Hi Bart!

What's the name of the pdf?!
Yes, it's tiff...pdf!
I don't want to sit through the wait before the Megaupload download starts, but I think it's not possible to extract text because the content was a TIFF that was converted to a PDF document (so it's still an image).
That's the main problem. There are OCR tools that can add the text content read from the embedded image to the PDF.

Cheers, Ingo

Posted By: AndrewC
Date Posted: 16 Nov 11 at 4:31AM

The text extraction is working pretty well on this PDF with 8.13 beta 2. I am getting the correct string results from this PDF.

This file is secured with a master password but that is not a problem for QPL 8.11. If you were using QPL 7.xx then you would need to call QP.SetAdvancePassword(""); before QP.LoadFromFile()

Can you send me the source code you are using for text extraction?

Andrew.

Posted By: Ingo
Date Posted: 16 Nov 11 at 7:10AM

Hi again!

Because Andrew said it's working, I used Megaupload and waited the 45 seconds :-(

And yes... he's right, I can extract the text content, too.

I tried QP 7.26 and it works without SetAdvancePassword.

Just load, decrypt and extract, and it works.

Cheers, Ingo

Posted By: bart_bender
Date Posted: 17 Nov 11 at 12:09PM

Hello again,
Andrew, Ingo, thanks for your help.

I'm using version 8.12, Andrew.

This is a sample of the code that I'm using.

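    ' Loads a PDF from a memory stream and returns the text of all its pages.
    Private Function pGetPDFContent(ByVal documento As MemoryStream, Optional ByVal Password As String = "") As String
        documento.Seek(0, SeekOrigin.Begin)
        ' ToArray returns only the bytes written to the stream;
        ' GetBuffer can include unused trailing bytes of the internal buffer.
        Dim docid As Integer = Qp.LoadFromString(documento.ToArray(), Password)
        ' Use the ID of the document that is now selected.
        docid = Qp.SelectedDocument
        Return pGetPDFContent(docid)
    End Function

    ' Concatenates the text of every page of the given document.
    Private Function pGetPDFContent(ByVal docId As Integer, Optional ByVal CloseDoc As Boolean = True) As String
        Qp.SelectDocument(docId)
        Dim salida As String = ""
        For i = 1 To Qp.PageCount
            Qp.SelectPage(i)
            ' Undo any page rotation before extracting.
            Qp.RotatePage(-Qp.PageRotation)
            salida &= Qp.GetPageText(0)
        Next
        ' Optionally release the document once all pages have been read.
        If CloseDoc Then
            Qp.RemoveDocument(docId)
        End If
        Return salida
    End Function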

Posted By: AndrewC
Date Posted: 17 Nov 11 at 12:19PM

1. Can you check the value of docid to make sure it is not 0.

2. You shouldn't need the QP.RotatePage.

I suspect that the document is not being loaded correctly. If the document is loaded correctly then QP.PageCount should be 121.
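
A minimal sketch of these checks, assuming the same Qp instance as in the code posted above. pCheckLoaded is a made-up helper name, and treating a 0 result as "load failed" is an assumption rather than something confirmed in this thread:

```vbnet
' Sketch only: Qp is the library instance from the code posted above.
' pCheckLoaded is a hypothetical helper name.
Private Function pCheckLoaded(ByVal data As Byte(), ByVal password As String) As Boolean
    Dim loadResult As Integer = Qp.LoadFromString(data, password)
    Dim docId As Integer = Qp.SelectedDocument
    Dim pages As Integer = Qp.PageCount
    Console.WriteLine("Load result: " & loadResult & ", document ID: " & docId & ", pages: " & pages)
    ' Assumption: a 0 load result or a 0 document ID means the load failed.
    Return loadResult <> 0 AndAlso docId <> 0 AndAlso pages > 0
End Function
```

For the PDF discussed here, the printed page count would be expected to be 121 if the load succeeded.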

Andrew.

Posted By: bart_bender
Date Posted: 17 Nov 11 at 1:17PM

Yes, the document loads correctly and the page count is 121, but GetPageText returns "" with or without the rotation.

Posted By: AndrewC
Date Posted: 25 Nov 11 at 6:22AM

GetPageText(0) uses a very simple algorithm to extract text and is not suitable for all documents.

Can you change the code to GetPageText(3) and see if the text is extracted correctly? We have just re-added GetPageText(1), which uses the more complex text extraction options but only outputs the raw text strings, similar to option 0. This will be included in the 8.13 final release, which is due very soon.
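
As a rough way to compare the two options on one page, something like the following could be used. It is only a sketch: it assumes the same Qp instance as in the code posted earlier, that a document is already loaded and selected, and pCompareExtractOptions is a made-up helper name.

```vbnet
' Sketch only: Qp is the library instance used in pGetPDFContent,
' and a document must already be loaded and selected.
' pCompareExtractOptions is a hypothetical helper name.
Private Sub pCompareExtractOptions(ByVal pageNumber As Integer)
    Qp.SelectPage(pageNumber)
    ' 0 is the simple algorithm; 3 is the option suggested in this post.
    For Each extractOption As Integer In New Integer() {0, 3}
        Dim pageText As String = Qp.GetPageText(extractOption)
        ' Len treats Nothing as an empty string, so this is safe either way.
        Console.WriteLine("GetPageText(" & extractOption & ") returned " & Len(pageText) & " characters")
    Next
End Sub
```

If option 0 reports 0 characters and option 3 does not, the extraction option is the likely cause rather than the loading step.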

Andrew.

Posted By: bart_bender
Date Posted: 25 Nov 11 at 7:17AM

Thanks Andrew

Posted By: AndrewC
Date Posted: 15 Feb 12 at 6:21AM

Most of the professional PDF tools, such as Acrobat Pro, Nitro Professional and Foxit Phantom, have built-in OCR engines that create searchable text based on the OCR results.

Andrew.