Is it possible to extract text from this PDF?


bart_bender (Beginner, joined 04 Oct 11, Location: Spain), posted 15 Nov 11 at 12:45PM
Hello,

I'm trying to extract text from the following PDF document: Tiff6.pdf (download: http://www.megaupload.com/?d=4WP3ZVO0).

The GetPageText method always returns an empty string. The security flag for item 5 (Content Copying or Extraction) has the value 6 (Allowed).

Is it possible to extract the text? If I open the document in Acrobat Reader, I can save the text from the document.

I'm using version 8.12 of the DLL with VB.NET.

Thanks in advance, and best regards.

Ingo (Moderator Group, joined 29 Oct 05), posted 15 Nov 11 at 6:49PM
Hi Bart! What's the name of the PDF? Yes, it's tiff...pdf! I don't want to sit through the MegaUpload countdown, but I think it's not possible to extract text, because the content was a TIFF that was converted to a PDF document (so it's still an image). That's the main problem. There are OCR tools that can add the text content read from the embedded image to the PDF.

Cheers, Ingo

AndrewC (Moderator Group, joined 08 Dec 10, Location: Geelong, Aust), posted 16 Nov 11 at 4:31AM
The text extraction is working pretty well on this PDF with 8.13 beta 2; I am getting the correct string results. The file is secured with a master password, but that is not a problem for QPL 8.11. If you were using QPL 7.xx, you would need to call QP.SetAdvancePassword(""); before QP.LoadFromFile().

Can you send me the source code you are using for text extraction?

Andrew.

Ingo, posted 16 Nov 11 at 7:10AM
Hi again! Since Andrew said it's working, I used MegaUpload and waited the 45 seconds :-( and yes, he's right: I can extract the text content, too. I tried QP 7.26 and it works without SetAdvancePassword. Just load, decrypt and extract, and it works.

Cheers, Ingo

bart_bender, posted 17 Nov 11 at 12:09PM
Hello again,

Andrew, Ingo, thanks for your help. I'm using version 8.12, Andrew. This is a sample of the code that I'm using:

    ' Loads the PDF from memory, then extracts text from the selected document.
    Private Function pGetPDFContent(ByVal documento As MemoryStream, Optional ByVal Password As String = "") As String
        documento.Seek(0, SeekOrigin.Begin)
        Dim docid As Integer = Qp.LoadFromString(documento.GetBuffer, Password)
        docid = Qp.SelectedDocument
        Return pGetPDFContent(docid)
    End Function

    ' Concatenates the text of every page ("salida" = output).
    Private Function pGetPDFContent(ByVal docId As Integer, Optional ByVal CloseDoc As Boolean = True) As String
        Qp.SelectDocument(docId)
        Dim salida As String = ""
        For i As Integer = 1 To Qp.PageCount
            Qp.SelectPage(i)
            Qp.RotatePage(-Qp.PageRotation)
            salida &= Qp.GetPageText(0)
        Next
        If CloseDoc Then
            Qp.RemoveDocument(docId)
        End If
        Return salida
    End Function

AndrewC, posted 17 Nov 11 at 12:19PM
1. Can you check the value of docid to make sure it is not 0?
2. You shouldn't need the QP.RotatePage call.

I suspect that the document is not being loaded correctly. If the document is loaded correctly, QP.PageCount should be 121.

Andrew.
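The two checks Andrew asks for can be sketched as a small helper. This is only an illustration: DescribeLoad is a hypothetical name, it assumes (as the post implies) that a document ID of 0 means the load failed, and in real code the arguments would come from Qp.SelectedDocument and Qp.PageCount rather than literals.

```vbnet
Imports System

Public Module LoadCheckSketch
    ' Hypothetical helper: interprets the document ID and page count.
    ' Assumption from the thread: a document ID of 0 signals a failed load.
    Public Function DescribeLoad(docId As Integer, pageCount As Integer, expectedPages As Integer) As String
        If docId = 0 Then
            Return "Load failed: document ID is 0"
        End If
        If pageCount <> expectedPages Then
            Return "Loaded, but page count is " & pageCount.ToString() &
                   " instead of " & expectedPages.ToString()
        End If
        Return "Loaded correctly: " & pageCount.ToString() & " pages"
    End Function

    Public Sub Main()
        ' Literal values stand in for Qp.SelectedDocument and Qp.PageCount.
        Console.WriteLine(DescribeLoad(0, 0, 121))
        Console.WriteLine(DescribeLoad(1, 121, 121))
    End Sub
End Module
```

If the helper reports a correct load and GetPageText still returns an empty string, the problem lies in extraction rather than loading.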

bart_bender, posted 17 Nov 11 at 1:17PM
Yes, the document loads correctly and the page count is 121, but GetPageText returns "" with or without the rotation.

AndrewC, posted 25 Nov 11 at 6:22AM
GetPageText(0) uses a very simple algorithm to extract text and is not suitable for all documents. Can you change the code to GetPageText(3) and see whether the text is extracted correctly?

We have just re-added GetPageText(1), which uses the more complex text extraction options but only outputs the raw text strings, similar to option 0. This will be included in the 8.13 final release, which is due very soon.

Andrew.
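One way to act on this advice is to try the simple option first and fall back to another option when a page comes back empty. The sketch below is an assumption-laden illustration, not library code: IPageTextSource and FakeSource are hypothetical stand-ins so that it runs without the DLL, and in a real program the interface members would wrap Qp.SelectPage, Qp.PageCount and Qp.GetPageText. The option values 0 and 3 are the ones named in this thread; the output format of option 3 may differ from option 0.

```vbnet
Imports System
Imports System.Text

' Hypothetical abstraction over the PDF library, so the sketch is self-contained.
Public Interface IPageTextSource
    ReadOnly Property PageCount As Integer
    Function GetPageText(page As Integer, extractOptions As Integer) As String
End Interface

' Stand-in that mimics the behaviour reported in the thread:
' option 0 returns an empty string, any other option returns text.
Public Class FakeSource
    Implements IPageTextSource

    Public ReadOnly Property PageCount As Integer Implements IPageTextSource.PageCount
        Get
            Return 2
        End Get
    End Property

    Public Function GetPageText(page As Integer, extractOptions As Integer) As String _
        Implements IPageTextSource.GetPageText
        If extractOptions = 0 Then Return ""
        Return "page " & page.ToString() & " text"
    End Function
End Class

Public Module ExtractionSketch
    ' For each page, tries the options in order and keeps the first non-empty result.
    Public Function ExtractAll(source As IPageTextSource, ParamArray optionsToTry As Integer()) As String
        Dim output As New StringBuilder()
        For page As Integer = 1 To source.PageCount
            For Each opt As Integer In optionsToTry
                Dim pageText As String = source.GetPageText(page, opt)
                If Not String.IsNullOrEmpty(pageText) Then
                    output.AppendLine(pageText)
                    Exit For
                End If
            Next
        Next
        Return output.ToString()
    End Function

    Public Sub Main()
        ' Option 0 first, then option 3 as the fallback.
        Console.Write(ExtractAll(New FakeSource(), 0, 3))
    End Sub
End Module
```

With the stand-in source, option 0 yields nothing, so each page's text comes from the fallback option. Passing only 0 reproduces the empty result described earlier in the thread.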

bart_bender, posted 25 Nov 11 at 7:17AM
Thanks, Andrew.

AndrewC, posted 15 Feb 12 at 6:21AM
Most of the professional PDF tools, such as Acrobat Pro, Nitro Professional and Foxit Phantom, have built-in OCR engines that create searchable text based on the OCR results.

Andrew.