DAExtractPageText losing characters


I am trying to extract the text from a PDF. Most of it works fine, but occasionally letters are missing from the extract. This appears to be because the PDF is using octal codes for the characters.
This is the text which should be produced and is rendered correctly by DARenderPageToString:
Top Line ->  6 fyodor dostoyevsky
Space -> 
Next Line -> flowers in a stuffy city apartment, but because everybody is
Here is the command extract for this same section
BT
0 0 0 1 k
/GS0 gs
/T1_0 1 Tf
8.25 0 0 8.25 262.7389 564.0571 Tm
[(\036)-100(\035)-55(\034)-100(\033)-100(\034)-100(\032)-100( )-100(\033)-100(\034)-100(\031)-100(\030)-82(\034)-45(\035)-100(\027)-100(\026)-100(\031)-100(\025)-100(\035)]TJ
8.5 0 0 8.5 83.3622 564.0571 Tm
(\f)Tj
10.5104 0 0 10.25 83.3622 543.058 Tm
[(\023)10(o)10(w)10(e)10(r)10(s)10( )-125(i)10(n)10( )-126(a)10( )-125(s)10(t)10(u)10(f)10(f)10(y)10( )-125(c)10(i)10(t)10(y)10( )-126(a)10(p)10(a)10(r)10(t)10(m)10(e)10(n)10(t)10(,)47( )-126(b)10(u)10(t)10( )-125(b)10(e)10(c)10(a)10(u)10(s)10(e)10( )-125(e)10(v)10(e)10(r)10(y)10(b)10(o)10(d)10(y)10( )-125(i)10(s )]TJ
DAExtractPageText (option 3) returns 2 lines containing an empty string and a space (or perhaps 2) for the Top Line, and misses out the "fl" from the beginning of the Next Line.
Is there any way I can correct this?
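
To make the octal codes in the extract easier to read, here is a minimal Python sketch that decodes the escapes in PDF literal strings and lists the character codes inside a TJ array. It is independent of the library used above; the function names `decode_pdf_literal` and `tj_codes` are made up for illustration, and the parser only handles simple strings like the ones shown (no nested parentheses).

```python
import re

# Single-character escapes allowed in PDF literal strings.
_SIMPLE_ESCAPES = {
    "n": 0x0A, "r": 0x0D, "t": 0x09, "b": 0x08, "f": 0x0C,
    "(": 0x28, ")": 0x29, "\\": 0x5C,
}


def decode_pdf_literal(body: str) -> list[int]:
    """Return the character codes in a literal string body (outer parentheses removed)."""
    codes = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            codes.append(ord(ch))
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(body):
            break
        octal = re.match(r"[0-7]{1,3}", body[i:])
        if octal:
            # \ddd: one to three octal digits give the byte value
            codes.append(int(octal.group(), 8) & 0xFF)
            i += octal.end()
        elif body[i] in _SIMPLE_ESCAPES:
            codes.append(_SIMPLE_ESCAPES[body[i]])
            i += 1
        elif body[i] == "\n":
            i += 1  # backslash + newline is a line continuation
        else:
            codes.append(ord(body[i]))  # unknown escape: the backslash is dropped
            i += 1
    return codes


def tj_codes(array_src: str) -> list[int]:
    """Collect the character codes from every string in a TJ array, skipping kerning numbers."""
    codes = []
    for body in re.findall(r"\(((?:\\.|[^\\()])*)\)", array_src):
        codes.extend(decode_pdf_literal(body))
    return codes


# Start of the "Next Line" array from the extract above.
print(tj_codes(r"[(\023)10(o)10(w)10(e)10(r)10(s)]TJ"))
```

Run on the extract, this shows that `\023` is character code 19, `(\f)` is code 12, and the Top Line consists of codes 21 to 30 plus a space. Those are all control-character codes, so the octal notation itself is probably not the problem: the font /T1_0 appears to place its glyphs (including the "fl" ligature at code 19) at non-standard positions. If that font has no ToUnicode map or usable encoding, a text extractor has nothing to translate the codes with, even though rendering works because it only needs the glyph shapes. This is an assumption based on the extract alone, not on the library's documented behaviour.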

Mike4ql (Beginner, joined 26 Jul 10), posted 08 Oct 10 at 11:39AM

Mike4ql, posted 12 Oct 10 at 7:15PM
Has nobody else seen this? It seems to be a fundamental flaw preventing anyone from using PDF Quick to extract text from a PDF. I would be grateful for any suggestions.
Mike