Text lines assembling in VB6
Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: Sample Code
Forum Description: Share Debenu Quick PDF Library sample code with other forum members
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=1656
Posted By: alinux
Date Posted: 26 Nov 10 at 7:40PM
This is a basic sample that assembles text lines from the result of the GetPageText(4) function; the quality of the output depends on the quality of the scan and of the OCR process. In the case of tables, the OCR engine may detect, "read" and process the table by line or by column, outside of your control, so you will need an array-sorting function to sort the page lines array by the y coordinate of each line.
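To make the input easier to picture, here is a minimal sketch of how one word record from GetPageText(4) can be taken apart. It assumes each record is a comma-separated line whose field at index 4 is the y1 coordinate and whose last field is the word itself, which is how the function in this post reads it. The helper names word_y and word_text and the sample record are invented for illustration and are not part of the library.

```vb
' Hypothetical helper: returns the y1 coordinate (field index 4) of one record.
' Returns "" when the record has fewer than five fields.
Private Function word_y(record As String) As String
    Dim fields() As String
    fields = Split(record, ",")
    If UBound(fields) >= 4 Then word_y = fields(4)
End Function

' Hypothetical helper: returns the word (last field) with double quotes removed.
Private Function word_text(record As String) As String
    Dim fields() As String
    fields = Split(record, ",")
    If UBound(fields) >= 0 Then
        word_text = Replace(fields(UBound(fields)), """", "")
    End If
End Function
```

With the made-up record Arial,#000000,12,72,700,110,700,110,712,72,712,"Hello" the first helper gives 700 and the second gives Hello. A word that itself contains a comma would be split into several fields, so only its last piece would be returned.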
Private Function full_lines(get_page_text As String) As String
    'page text lines array (0,N) - y1 or y2 word coordinate, (1,N) - line words
    Dim dmp_pge() As String
    Dim dmp_lns() As String
    Dim dmp_wrd() As String
    Dim i As Long, j As Long
    Dim flag_exist As Boolean
    ReDim dmp_pge(1, 0)
    'page words array
    dmp_lns = Split(get_page_text, vbCrLf)
    For i = 0 To UBound(dmp_lns)
        If dmp_lns(i) <> "" Then
            'word line array
            dmp_wrd = Split(dmp_lns(i), ",")
            flag_exist = False
            For j = UBound(dmp_pge, 2) To 0 Step -1
                If dmp_wrd(4) = dmp_pge(0, j) Then
                    'add next word in the same line
                    If dmp_pge(1, j) <> "" Then
                        dmp_pge(1, j) = dmp_pge(1, j) & " " & dmp_wrd(UBound(dmp_wrd))
                    Else
                        dmp_pge(1, j) = dmp_wrd(UBound(dmp_wrd))
                    End If
                    flag_exist = True
                    Exit For
                End If
                DoEvents
            Next
            If Not flag_exist Then
                If dmp_pge(1, UBound(dmp_pge, 2)) <> "" Then
                    ReDim Preserve dmp_pge(1, UBound(dmp_pge, 2) + 1)
                End If
                'add y1 word(line) coordinate & first word of the new line
                dmp_pge(0, UBound(dmp_pge, 2)) = dmp_wrd(4)
                dmp_pge(1, UBound(dmp_pge, 2)) = dmp_wrd(UBound(dmp_wrd))
            End If
        End If
        DoEvents
    Next
    'need sort array function in the case of tables (the OCR engine may identify & "read" the table by column
    'so you must sort the lines array by the y coordinate - see dmp_pge definition)
    'sort_array dmp_pge
    For i = 0 To UBound(dmp_pge, 2)
        If full_lines = "" Then
            full_lines = dmp_pge(1, i)
        Else
            full_lines = full_lines & vbCrLf & dmp_pge(1, i)
        End If
        DoEvents
    Next
    'strip the double quotes from the assembled text
    full_lines = Replace(full_lines, """", "")
End Function
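The sort_array routine mentioned in the comment is not included in the post. The following is only a sketch of what it could look like, not the author's version. It assumes the array layout used above ((0,N) holds the y coordinate as text, (1,N) holds the line's words), and it assumes that a larger y means higher on the page, so lines are ordered from the largest y to the smallest. If your origin is at the top of the page, reverse the comparison.

```vb
' Hypothetical sort_array: orders the page lines array by y coordinate, largest first.
' dmp_pge(0, n) = y coordinate as text, dmp_pge(1, n) = words of the line.
Private Sub sort_array(dmp_pge() As String)
    Dim i As Long, j As Long
    Dim tmp_y As String, tmp_txt As String
    'insertion sort: simple, and keeps the original order of lines with equal y
    For i = LBound(dmp_pge, 2) + 1 To UBound(dmp_pge, 2)
        tmp_y = dmp_pge(0, i)
        tmp_txt = dmp_pge(1, i)
        j = i - 1
        Do While j >= LBound(dmp_pge, 2)
            'Val reads the number with "." as the decimal separator
            If Val(dmp_pge(0, j)) >= Val(tmp_y) Then Exit Do
            dmp_pge(0, j + 1) = dmp_pge(0, j)
            dmp_pge(1, j + 1) = dmp_pge(1, j)
            j = j - 1
        Loop
        dmp_pge(0, j + 1) = tmp_y
        dmp_pge(1, j + 1) = tmp_txt
    Next
End Sub
```

To use it, add the Sub to the same module and uncomment the sort_array dmp_pge line in full_lines. The numeric comparison through Val is a deliberate choice: comparing the coordinates as strings would put "99" after "700".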
Replies:
Posted By: Ingo
Date Posted: 26 Nov 10 at 8:13PM
Hi Alinux!
Thanks for this. I think there are many users here looking for this option in QuickPDF. Your sample could be a starting point for options like "keep the original layout in txt, too"... Thanks for sharing with us!
Cheers, Ingo
Posted By: Rowan
Date Posted: 27 Nov 10 at 9:20AM
Posted By: alinux
Date Posted: 27 Nov 10 at 11:41AM
Thanks guys.
Cheers,
------------- alinux