Debenu Quick PDF Library - PDF SDK Community Forum : ExtractFilePageText

Author: Ingo
Subject: 3492
Posted: 16 Aug 17 at 8:18AM

Hi Reg,

I ran some tests with the PDF.
It was created by BricsCAD, converted from DWG format.
I see the same problems when extracting text.
Perhaps a code page problem?
Rendering works, but a few text parts overlap each other.
BTW: there's a malformed xref table at the end of the file.

Searching Google, I found many community posts about problems with BricsCAD's direct PDF export function.
Another thing: the encoding is Identity-H, which can be a problem, too.
My advice: you won't get proper text extraction from PDF documents from this source. Sorry. Anyway, if you do succeed, please let us know how you did it. Thanks.

Author: REGH
Subject: 3492
Posted: 16 Aug 17 at 6:47AM

http://elcc.se/download/ExtractFilePageText.zip
I would like to use option=3 to get the bounding box coordinates for creating links. Since the text from this option is gibberish, I also tried option=2 and merging the two results, but I can't find an obvious way to match the results with each other to get both bounding box coordinates and readable text...
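
One possible way to attempt the matching described above is to pair rows by position. The sketch below is in Python rather than VB for brevity, and it is only an illustration under assumptions: it takes already-parsed rows, treats the first two numbers of an option=2 row as a position comparable to X1, Y1 of an option=3 row, and uses a hypothetical tolerance of 5 units. The function name `match_by_position` is made up for this example. Keep in mind the advice in this thread that the two options can return slightly different content, so some rows may stay unmatched.

```python
import math

def match_by_position(boxes, texts, tolerance=5.0):
    """Greedily pair each bounding box with the nearest unused readable text.

    boxes: list of corner lists [(x1, y1), (x2, y2), (x3, y3), (x4, y4)]
    texts: list of (x, y, text) tuples
    tolerance: maximum accepted distance (assumed value, tune per document)
    """
    unused = list(texts)
    pairs = []
    for corners in boxes:
        x1, y1 = corners[0]
        # Nearest remaining text by straight-line distance to the first corner.
        best = min(unused,
                   key=lambda t: math.hypot(t[0] - x1, t[1] - y1),
                   default=None)
        if best is not None and math.hypot(best[0] - x1, best[1] - y1) <= tolerance:
            unused.remove(best)          # each text is used at most once
            pairs.append((best[2], corners))
        else:
            pairs.append((None, corners))  # no readable text close enough
    return pairs

# Values taken from the sample rows posted in this thread.
texts = [(67.46, 526.32,
          "ELDU 400V A-Matning, STV AH.N09A, BIOLINJE 1, HE.B34.10.01.11")]
boxes = [[(67.4646, 523.4061), (416.9278, 523.4061),
          (416.9278, 523.4061), (67.4646, 523.4061)]]

for text, corners in match_by_position(boxes, texts):
    print(text, corners)
```

In the sample rows the X values agree closely while the Y values differ by about 2.9, which is why an exact comparison would fail and a tolerance is used. The greedy pairing is simple but not optimal: on pages with many closely spaced texts it can pick the wrong neighbor.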

Author: Ingo
Subject: 3492
Posted: 15 Aug 17 at 7:37PM

Option 2 is like option 3 but a bit more accurate in extracting.

Don't mix the options. The results of each option can differ slightly (otherwise having two options would make no sense), and this can lead to near-duplicate content.

At the top you've used DASetTextExtractionOptions. This works only with the DA functions! Don't mix both types of functions!

Your file hoster wants my email address. It won't get it ;-)

If the TTF font is not a common one and is not embedded, this can lead to bad extraction, too.

Author: REGH
Subject: 3492
Posted: 15 Aug 17 at 5:29PM

Hi Ingo,
The file I'm extracting text from was created from a CAD drawing (with TTF texts).
When I run the same code on a PDF created from MS Word, there is no problem.
Here is my VB code for testing the text extraction:
    QP.UnlockKey (strLicenseKey)
    QP.DASetTextExtractionOptions 12, 0 'Include rotated texts
    QP.DASetTextExtractionOptions 8, 1 'Ignore duplicates
    QP.DASetTextExtractionOptions 5, 1 'Sort

    For iOption = 2 To 3 Step 1
        strTmpText = QP.ExtractFilePageText("C:\Temp\N09A.pdf", "", 1, iOption)

        iOutFileNo = FreeFile
        strOutFileName = "C:\Temp\Option=" & iOption & ".txt"
        Open strOutFileName For Output As #iOutFileNo

        TextArray = Split(strTmpText, vbCr)
        For i = 0 To UBound(TextArray) - 1
            Print #iOutFileNo, TextArray(i)
        Next i

        Close #iOutFileNo
    Next iOption

Each generated text file contains a row that seems to refer to the same text. For Option=2 it looks like this:
67.46,526.32,#000000,1.4,"AAAAAA+ArialNarrow","ELDU 400V A-Matning, STV AH.N09A, BIOLINJE 1, HE.B34.10.01.11"

And for Option=3 the same text is:
"AAAAAA+ArialNarrow",#000000,13.73,67.4646,523.4061,416.9278,523.4061,416.9278,523.4061,67.4646,523.4061," ? ???   ?A ? ? ? ??? ?A?    A ? ???? ??? ??? ??         "

Option=2 gives me readable text, but Option=3 doesn't.
Here's a link to a zip file containing the two text files and the PDF used for the test.
http://www.filehosting.org/file/details/686981/ExtractFilePageText.zip
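
To make the two row layouts above easier to compare, here is a minimal parsing sketch in Python (chosen for brevity, the logic is independent of the library). The option=3 field order follows the list given in the first post of this thread (Font Name, Text Color, Text Size, X1, Y1 ... X4, Y4, Text). The option=2 field meanings are only inferred from the single sample row, so treat them as an assumption. The function names are hypothetical, and the sketch assumes the text field never contains unescaped double quotes.

```python
import csv

def parse_rows(text):
    """Split raw extraction output into lists of CSV fields (quotes honored)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return list(csv.reader(lines))

def parse_option2(fields):
    # Assumed layout, inferred from the sample row: x, y, color, size, font, text
    x, y, color, size, font, text = fields
    return {"x": float(x), "y": float(y), "color": color,
            "size": float(size), "font": font, "text": text}

def parse_option3(fields):
    # Layout as listed in the thread: font, color, size, X1..Y4, text
    coords = [float(v) for v in fields[3:11]]
    corners = list(zip(coords[0::2], coords[1::2]))  # [(X1, Y1), ..., (X4, Y4)]
    return {"font": fields[0], "color": fields[1], "size": float(fields[2]),
            "corners": corners, "text": fields[11]}

# Sample rows copied from this post.
opt2 = ('67.46,526.32,#000000,1.4,"AAAAAA+ArialNarrow",'
        '"ELDU 400V A-Matning, STV AH.N09A, BIOLINJE 1, HE.B34.10.01.11"')
opt3 = ('"AAAAAA+ArialNarrow",#000000,13.73,67.4646,523.4061,416.9278,523.4061,'
        '416.9278,523.4061,67.4646,523.4061," ? ???   ?A ? ? ? ??? ?A?  "')

row2 = parse_option2(parse_rows(opt2)[0])
row3 = parse_option3(parse_rows(opt3)[0])
print(row2["x"], row2["y"], row2["text"])
print(row3["corners"])
```

Using a CSV reader rather than a plain split on commas matters here because the readable text itself contains commas inside the quotes.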

Author: Ingo
Subject: 3492
Posted: 14 Aug 17 at 9:22PM

Hi Reg,

That's strange behavior you're describing.

For me the extract functions are the most stable ones in the library.

What you should do is:

Post the relevant code snippet here, so somebody can perhaps spot problems in your code.

Upload the PDF you're working with to a free file hoster, so we can try our own extractions and see whether the problem is the PDF itself ;-)

Cheers and welcome here,

Ingo

Author: REGH
Subject: 3492
Posted: 14 Aug 17 at 5:16PM

Hi!
I'm using ExtractFilePageText to extract text strings and bounding box coordinates for the texts. But when I use option 3 (Font Name, Text Color, Text Size, X1, Y1, X2, Y2, X3, Y3, X4, Y4, Text) I don't get the text in human-readable format. None of the options that return readable text give me the bounding box coordinates. Is there a way to work around this?

I tried making two extractions (option=2 and option=3), putting the results into two different arrays, and then merging them. But with option 2 some text objects are read twice, which gives me two arrays with different numbers of texts...
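
If the doubled rows carry the same text at (almost) the same position, one option is to filter them out after extraction, before comparing the two arrays. The following Python sketch (Python used for brevity) illustrates that idea under assumptions: rows are already parsed into (x, y, text) tuples, and two rows count as duplicates when their text is identical and their positions agree after rounding to one decimal. The name `drop_duplicates` and the rounding precision are hypothetical choices.

```python
def drop_duplicates(rows, ndigits=1):
    """Keep the first occurrence of each (rounded x, rounded y, text) row."""
    seen = set()
    kept = []
    for x, y, text in rows:
        key = (round(x, ndigits), round(y, ndigits), text)
        if key not in seen:
            seen.add(key)
            kept.append((x, y, text))
    return kept

# Made-up rows for illustration: the second one repeats the first.
rows = [
    (67.46, 526.32, "ELDU 400V"),
    (67.461, 526.32, "ELDU 400V"),
    (120.0, 300.5, "BIOLINJE 1"),
]
print(drop_duplicates(rows))
```

Rounding is a blunt comparison: two positions that sit on either side of a rounding boundary would not be merged, so a distance-based check may be needed for stubborn cases.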
