Text lines assembling in VB6

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: Sample Code
Forum Description: Share Debenu Quick PDF Library sample code with other forum members
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=1656
Printed Date: 22 Nov 24 at 7:09PM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com


Topic: Text lines assembling in VB6
Posted By: alinux
Subject: Text lines assembling in VB6
Date Posted: 26 Nov 10 at 7:40PM
This is a basic sample that assembles text lines from the result of the GetPageText(4) function; the results depend on the quality of the scan and of the OCR process.
For tables, the OCR engine may detect, read and process the content by row or by column, which is outside your control, so you will need an array sort function to sort the page lines array by the y coordinate of each line.
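
Because the result depends on scan and OCR quality, words that sit on the same visual line may come back with slightly different y values, while the sample compares them as exact strings. A minimal sketch of a tolerance comparison follows; the helper name same_line and the default tolerance of 2 page units are assumptions for illustration, not part of the library or of the original sample.

```vb
'Hypothetical helper: returns True when two y coordinates (passed as
'text) differ by no more than tol page units. Val() always reads "."
'as the decimal separator, whatever the regional settings are.
Private Function same_line(ByVal y_a As String, ByVal y_b As String, _
                           Optional ByVal tol As Double = 2) As Boolean
    same_line = (Abs(Val(y_a) - Val(y_b)) <= tol)
End Function
```

If it suits your documents, the test "If dmp_wrd(4) = dmp_pge(0, j) Then" in the function could become "If same_line(dmp_wrd(4), dmp_pge(0, j)) Then". The tolerance should be tuned to the font size and line spacing of the scanned pages.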


Private Function full_lines(get_page_text As String) As String

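'page text lines array (0,N) - y1 or y2 word coordinate, (1,N) - line words
Dim dmp_pge() As String
Dim dmp_lns() As String    'one CSV record per word
Dim dmp_wrd() As String    'fields of the current record
Dim i As Long, j As Long
Dim flag_exist As Boolean

ReDim dmp_pge(1, 0)

'page words array
dmp_lns = Split(get_page_text, vbCrLf)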

For i = 0 To UBound(dmp_lns)
    If dmp_lns(i) <> "" Then

    'word line array
        dmp_wrd = Split(dmp_lns(i), ",")

        flag_exist = False
        For j = UBound(dmp_pge, 2) To 0 Step -1
            If dmp_wrd(4) = dmp_pge(0, j) Then

                'add next word in the same line
                If dmp_pge(1, j) <> "" Then
                    dmp_pge(1, j) = dmp_pge(1, j) & " " & dmp_wrd(UBound(dmp_wrd))
                Else
                    dmp_pge(1, j) = dmp_wrd(UBound(dmp_wrd))
                End If
                flag_exist = True
                Exit For
            End If
            DoEvents
        Next
        If Not flag_exist Then
            If dmp_pge(1, UBound(dmp_pge, 2)) <> "" Then
                ReDim Preserve dmp_pge(1, UBound(dmp_pge, 2) + 1)
            End If

           'add y1 word(line) coordinate & first word of the new line
            dmp_pge(0, UBound(dmp_pge, 2)) = dmp_wrd(4)
            dmp_pge(1, UBound(dmp_pge, 2)) = dmp_wrd(UBound(dmp_wrd))
        End If
    End If
    DoEvents
Next

'For tables a sort function is needed: the OCR engine may read the table
'column by column, so the lines array must be sorted by the y coordinate
'(row 0 of dmp_pge, see its definition above)
'sort_array dmp_pge

For i = 0 To UBound(dmp_pge, 2)
    If full_lines = "" Then full_lines = dmp_pge(1, i) Else full_lines = full_lines & vbCrLf & dmp_pge(1, i)
    DoEvents
Next
full_lines = Replace(full_lines, """", "")

End Function
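
The sort_array routine that the commented-out call refers to is not included in the post. A minimal sketch of what it could look like follows, assuming the (0 To 1, 0 To N) layout of dmp_pge described above. The parameter name descending and the choice of insertion sort are assumptions; insertion sort is used here because it is stable, so lines with equal y keep their original order.

```vb
'Sorts the columns of a (0 To 1, 0 To N) string array by the numeric
'value of row 0 (the y coordinate); row 1 (the line text) moves with it.
'Insertion sort: stable, and fast enough for the lines of one page.
Private Sub sort_array(arr() As String, Optional ByVal descending As Boolean = True)
    Dim i As Long, j As Long
    Dim key_y As String, key_txt As String
    Dim move_it As Boolean

    For i = LBound(arr, 2) + 1 To UBound(arr, 2)
        key_y = arr(0, i)
        key_txt = arr(1, i)
        j = i - 1
        'VB6 does not short-circuit And, so the bound check and the
        'array access are kept in separate statements
        Do While j >= LBound(arr, 2)
            If descending Then
                move_it = (Val(arr(0, j)) < Val(key_y))
            Else
                move_it = (Val(arr(0, j)) > Val(key_y))
            End If
            If Not move_it Then Exit Do
            arr(0, j + 1) = arr(0, j)
            arr(1, j + 1) = arr(1, j)
            j = j - 1
        Loop
        arr(0, j + 1) = key_y
        arr(1, j + 1) = key_txt
    Next
End Sub
```

With the default, "sort_array dmp_pge" puts the largest y first, which gives top-to-bottom order only if the page origin is at the bottom-left corner (an assumption to check against your origin setting); otherwise call "sort_array dmp_pge, False".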



Replies:
Posted By: Ingo
Date Posted: 26 Nov 10 at 8:13PM
Hi Alinux!

Thanks for this.
I think there are many users here looking for this option in QuickPDF.
Your sample could be a starting point for options like "keep
the original layout in txt, too"...
Thanks for sharing with us!

Cheers, Ingo


Posted By: Rowan
Date Posted: 27 Nov 10 at 9:20AM
Great job.


Posted By: alinux
Date Posted: 27 Nov 10 at 11:41AM
Thanks guys.

Cheers,


-------------
alinux


