comment on text extraction |
ukobsa
Senior Member Joined: 29 May 06 Location: Germany Status: Offline Points: 115 |
Posted: 12 Jul 06 at 6:46AM |
Hi,

just a short comment on text extraction using GetPageText with option 4. I wrote a one-liner in LaTeX:

Test "This, that"

and tried to extract the text from the generated PDF with option 4. This results in the following:

"IUQMMW+CMR10",#000000,9.96,133.7680,705.1921,209.4270,705.1921,209.4270,714.0393,133.7680,714.0393,"TTesTest Test”ThiTest”ThisTest”This, Test”This,thTest”This,that”"

After some debugging I found that GetPageText has problems extracting words when they are not written as a simple string, "(text) Tj", but as an array with individual glyph positioning, "[(t) 83 (ex) 83 (t)] TJ". That was the case in my example:

[(T)83(es)-1(t)-333("Thi)1(s)-1(,)-333(th)1(at")]TJ

So I think that option 3 of GetPageText is the most usable one for now. If someone has a fix for this problem, please let me know.

HTH, Ulrich

Edited by ukobsa
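For readers who have not looked inside a content stream: the operand of TJ is an array that mixes strings and numbers, where each number is a position adjustment in thousandths of a text space unit. Below is a minimal Python sketch that splits such an array into its items. It is only an illustration, not part of the library; the name parse_tj_array is hypothetical, and it assumes literal strings without escapes, nested parentheses or hex strings.

```python
import re
from typing import List, Union

# A literal string "(...)" without escapes or nested parentheses, or a number.
TOKEN = re.compile(r"\((?P<text>[^()\\]*)\)|(?P<num>[-+]?\d*\.?\d+)")


def parse_tj_array(source: str) -> List[Union[str, float]]:
    """Split the array operand of a TJ operator into strings and numbers."""
    start = source.index("[")
    end = source.rindex("]")
    items: List[Union[str, float]] = []
    for match in TOKEN.finditer(source[start + 1:end]):
        if match.group("text") is not None:
            items.append(match.group("text"))        # a run of glyphs
        else:
            items.append(float(match.group("num")))  # a position adjustment
    return items


if __name__ == "__main__":
    line = '[(T)83(es)-1(t)-333("Thi)1(s)-1(,)-333(th)1(at")]TJ'
    for item in parse_tj_array(line):
        print(repr(item))
```

In the example, the three words are spread over eight strings, and the only hint of a word boundary is the two -333 adjustments.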
|
ukobsa
Update:

I have made a small change to the source so that the result is at least as good as with option 3. It cannot be guaranteed that the text is always split into single words, but I can avoid the partially doubled entries.

The remaining problem: how can I determine whether (T)83(es)-1(t)-333("Thi)1(s)-1(,)-333(th)1(at") is one, two or more words?

Is it allowed to post the changes here on this forum?

Best regards, Ulrich
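One possible heuristic for the open question, sketched in Python under stated assumptions: treat a large negative adjustment as a word gap, since a negative number in a TJ array moves the next glyph to the right. The function name split_words and the threshold of -200 are assumptions for illustration only. A robust value would have to depend on the font's actual space width, and the library does not do this as far as this thread shows.

```python
from typing import List, Union


def split_words(tj_items: List[Union[str, float]],
                gap_threshold: float = -200.0) -> List[str]:
    """Group the strings of a TJ array into words.

    A word ends at a literal space or at an adjustment that is at or
    below gap_threshold (an assumed, font-independent cut-off).
    """
    words: List[str] = []
    current = ""
    for item in tj_items:
        if isinstance(item, str):
            for ch in item:
                if ch == " ":
                    if current:
                        words.append(current)
                        current = ""
                else:
                    current += ch
        elif item <= gap_threshold:
            # Large shift to the right: assume a word gap.
            if current:
                words.append(current)
                current = ""
        # Small adjustments are kerning inside a word and are ignored.
    if current:
        words.append(current)
    return words


if __name__ == "__main__":
    example = ["T", 83, "es", -1, "t", -333, '"Thi', 1, "s", -1, ",", -333, "th", 1, 'at"']
    print(split_words(example))
```

With the example array, the two -333 adjustments are the only values below the threshold, so the kerning values 83, 1 and -1 stay inside their words.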
|
DELBEKE
Debenu Quick PDF Library Expert Joined: 31 Oct 05 Location: France Status: Offline Points: 151 |
Please post it, I have been waiting for this for years. As far as I can see, positive values are used at the beginning of a word.
|
|
ukobsa
Ok, here it is:

Description: With the following change, option 4 returns correct text for input like

[(te) 83 (s) -1 (t) -333 (thi) 82(s)]TJ

The original version returns "tetestest thithis", while the new version returns "test this".

Restriction: with text like the above it is not possible to split it into single words, so it works more like option 3.

Search for UKO in the following code (2 lines) and add those lines to your source.

Unit: uPDFRenderer
Method: SubRender
Local SubMethod: ShowText

    procedure ShowText;
    var
      X: Integer;
      C: Char;
      M: TPDFXForm;
      OldM: TPDFXForm;
      DX: Double;
      OI: Integer;
      Text: string;
      W: Double;
      CW: Word;
      WI: Integer;
      DXS: Double;
      MapText: string;
      TestP1: TPDFXFormPoint;
      TestP2: TPDFXFormPoint;
      RealTextSize: Double;
      GC: TPDFGenericCanvas;
      CIDToGIDMap: string;
      CN: string;
      MappedText: string;
      ThisMappedText: string;
      MatrixScale: Double;
      UC: Word;
      UX: Integer;
      UFound: Boolean;
    begin
      MatrixScale := 1000;
      MappedText := '';
      CIDToGIDMap := FFontCol.FoundFontData.CIDToGIDMap;
      if FDestination = rdEPS then
        GC := FEPS
      else
        GC := Picasso;
      GC.BeginPath;
      if GS.TextSize <> 0 then
      begin
        if Assigned(FFontCol.FoundFontData.Rasterizer) then
          FFontCol.FoundFontData.Rasterizer.RenderingMode := GS.TextRenderingMode;
        SetFill(pfUnknown);
        SetPen;
        OI := Operands.Count - ArrayCount;
        SelectFont;
        OldM := CanvasMat;
        try
          CombinePDFXForm(CanvasMat, GS.TM, CanvasMat);
          if FFontCol.FoundFontData.Rasterizer is TPDFType3Rasterizer then
          begin
            MatrixScale := 1;
            CombinePDFXForm(CanvasMat,
              TPDFType3Rasterizer(FFontCol.FoundFontData.Rasterizer).FontMatrix,
              CanvasMat);
            TPDFType3Rasterizer(FFontCol.FoundFontData.Rasterizer).FillColor :=
              GS.FillColor;
          end;
          Mat(M, GS.TextSize / MatrixScale * GS.TextScaling / 100, 0, 0,
            GS.TextSize / MatrixScale, 0, 0);
          CombinePDFXForm(CanvasMat, M, CanvasMat);
          TestP1.X := 0;
          TestP1.Y := 0;
          TestP2.X := 0;
          TestP2.Y := MatrixScale;
          TestP1 := DoPDFXForm(CanvasMat, TestP1);
          TestP2 := DoPDFXForm(CanvasMat, TestP2);
          RealTextSize := Sqrt(Sqr(TestP2.X - TestP1.X) +
            Sqr(TestP2.Y - TestP1.Y));
          DX := 0;
          repeat
            Text := Operands[OI];
            if (Copy(Text, 1, 1) <> '(') and (Copy(Text, 1, 1) <> '<') then
            begin
              DX := DX - TazzToFloat(Text) * MatrixScale / 1000;
            end
            else
            begin
              if (Copy(Text, 1, 1) = '(') then
              begin
                Text := FStructure.DecodeString(Text);
              end
              else if (Copy(Text, 1, 1) = '<') then
              begin
                Text := FStructure.DecodeHex(Text);
              end;
              if FFontCol.FoundFontData.IsComposite then
              begin
                DXS := DX;
                for X := 1 to Length(Text) div 2 do
                begin
                  CW := Ord(Text[X * 2 - 1]) * 256 + Ord(Text[X * 2]);
                  if CIDToGIDMap <> '' then
                  begin
                    if (CW * 2) < Length(CIDToGIDMap) then
                    begin
                      CN := 'GID:' + IntToStr(Ord(CIDToGIDMap[CW * 2 + 1]) * 256 +
                        Ord(CIDToGIDMap[CW * 2 + 2]));
                    end;
                  end
                  else
                    CN := 'GID:' + IntToStr(CW);
                  if Assigned(FFontCol.FoundFontData.CIDWidths) then
                  begin
                    WI := FFontCol.FoundFontData.CIDWidths.IndexOf('CID:' +
                      IntToStr(CW));
                    if WI >= 0 then
                    begin
                      WI := Integer(FFontCol.FoundFontData.CIDWidths.Objects[WI]);
                      {***Th W := WI * GS.TextScaling / 100; ***}
                      W := WI{***Th W ***} + GS.CharSpacing * MatrixScale /
                        Abs(GS.TextSize);
                    end
                    else
                      W := MatrixScale;
                  end
                  else
                    W := MatrixScale;
                  if FDestination <> rdTextFunnel then
                  begin
                    if Assigned(FFontCol.FoundFontData.Rasterizer) then
                      with FFontCol.FoundFontData do
                        Rasterizer.RenderToCanvas(DX,
                          GS.TextRise * MatrixScale / Abs(GS.TextSize),
                          GC, CanvasMat, CN);
                  end;
                  if (FDestination = rdTextFunnel) and FFunnel.SplitWords then
                  begin
                    ThisMappedText := FFontCol.FoundFontData.DisplayCS[Ord(CW)];
                    if (ThisMappedText = ' ') and (MappedText <> '') then
                    begin
                      FFunnel.AddText(CanvasMat, MappedText,
                        HTMLColor(GS.FillColor),
                        FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                        GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                        FFontCol.FoundFontData.Descent,
                        DX - DXS,
                        FFontCol.FoundFontData.Ascent -
                        FFontCol.FoundFontData.Descent);
                      MappedText := '';
                      FFunnel.SetNextMatch(False);
                      DXS := DX + W;
                    end
                    else
                      MappedText := MappedText + ThisMappedText;
                  end;
                  DX := DX + W;
                end;
                FFunnel.SetNextMatch(True);
                if (FDestination = rdTextFunnel) and FFunnel.SplitWords then
                begin
                  FFunnel.AddText(CanvasMat, MappedText,
                    HTMLColor(GS.FillColor),
                    FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                    GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                    FFontCol.FoundFontData.Descent,
                    DX - DXS,
                    FFontCol.FoundFontData.Ascent -
                    FFontCol.FoundFontData.Descent);
                  MappedText := ''; // UKO Reset (necessary for positioned text glyphs)
                end;
                if (FDestination = rdTextFunnel) and (not FFunnel.SplitWords) then
                begin
                  MapText := '';
                  //MapText := MapText + FFontCol.FoundFontData.DisplayCS[Ord(Text[X * 2 - 1]) * 256 + Ord(Text[X * 2])];
                  for X := 1 to Length(Text) div 2 do
                  begin
                    UC := Ord(Text[X * 2 - 1]) * 256 + Ord(Text[X * 2]);
                    UX := 0;
                    UFound := False;
                    while (not UFound) and
                      (UX < Length(FFontCol.FoundFontData.DisplayCS2)) do
                    begin
                      if (UC >= FFontCol.FoundFontData.DisplayCS2[X].StartCode) and
                        (UC <= FFontCol.FoundFontData.DisplayCS2[X].EndCode) then
                      begin
                        UC := FFontCol.FoundFontData.DisplayCS2[X].ResultCode +
                          UC - FFontCol.FoundFontData.DisplayCS2[X].StartCode;
                        UFound := True;
                      end
                      else
                        Inc(UX);
                    end;
                    if UFound then
                      MapText := MapText + WideChar(UC);
                  end;
                  FFunnel.AddText(CanvasMat, MapText,
                    HTMLColor(GS.FillColor),
                    FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                    GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                    FFontCol.FoundFontData.Descent,
                    DX - DXS,
                    FFontCol.FoundFontData.Ascent -
                    FFontCol.FoundFontData.Descent);
                end;
              end
              else
              begin
                DXS := DX;
                for X := 1 to Length(Text) do
                begin
                  C := Text[X];
                  W := (Widths[Ord(C)] * Abs(GS.TextSize)
                    {***Th GS.TextScaling / 100 ***} / Abs(GS.TextSize));
                  if Assigned(FFontCol.FoundFontData.Rasterizer) then
                    W := W / FFontCol.FoundFontData.Rasterizer.FontMatrixScaling;
                  W := W + GS.CharSpacing * MatrixScale / Abs(GS.TextSize);
                  if C = #32 then
                    W := W + GS.WordSpacing * MatrixScale / Abs(GS.TextSize);
                  if FDestination <> rdTextFunnel then
                  begin
                    if Assigned(FFontCol.FoundFontData.Rasterizer) then
                      with FFontCol.FoundFontData do
                        Rasterizer.RenderToCanvas(DX,
                          GS.TextRise * MatrixScale / Abs(GS.TextSize),
                          GC, CanvasMat, Encoding[Ord(C)]);
                  end;
                  if (FDestination = rdTextFunnel) and FFunnel.SplitWords then
                  begin
                    ThisMappedText :=
                      FFontCol.FoundFontData.DisplayCS[Ord(Text[X])];
                    if (ThisMappedText = ' ') and (MappedText <> '') then
                    begin
                      FFunnel.AddText(CanvasMat, MappedText,
                        HTMLColor(GS.FillColor),
                        FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                        GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                        FFontCol.FoundFontData.Descent,
                        DX - DXS,
                        FFontCol.FoundFontData.Ascent -
                        FFontCol.FoundFontData.Descent);
                      MappedText := '';
                      FFunnel.SetNextMatch(False);
                      DXS := DX + W;
                    end
                    else
                      MappedText := MappedText + ThisMappedText;
                  end;
                  DX := DX + W;
                end;
                FFunnel.SetNextMatch(True);
                if (FDestination = rdTextFunnel) and FFunnel.SplitWords then
                begin
                  FFunnel.AddText(CanvasMat, MappedText,
                    HTMLColor(GS.FillColor),
                    FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                    GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                    FFontCol.FoundFontData.Descent,
                    DX - DXS,
                    FFontCol.FoundFontData.Ascent -
                    FFontCol.FoundFontData.Descent);
                  MappedText := ''; // UKO Reset (necessary for positioned text glyphs)
                end;
                if (FDestination = rdTextFunnel) and (not FFunnel.SplitWords) then
                begin
                  MapText := '';
                  for X := 1 to Length(Text) do
                    MapText := MapText +
                      FFontCol.FoundFontData.DisplayCS[Ord(Text[X])];
                  FFunnel.AddText(CanvasMat, MapText,
                    HTMLColor(GS.FillColor),
                    FFontCol.FoundFontData.FontName, RealTextSize, DXS,
                    GS.TextRise * MatrixScale / Abs(GS.TextSize) +
                    FFontCol.FoundFontData.Descent,
                    DX - DXS,
                    FFontCol.FoundFontData.Ascent -
                    FFontCol.FoundFontData.Descent);
                end;
              end;
            end;
            Inc(OI);
          until OI = Operands.Count;
          Mat(M, 1, 0, 0, 1, DX * Abs(GS.TextSize) / MatrixScale, 0);
          CombinePDFXForm(GS.TM, M, GS.TM);
        finally
          CanvasMat := OldM;
        end;
      end;
      ArrayCount := 0;
      if FDestination = rdEPS then
      begin
        if (GS.TextRenderingMode = 0) or (GS.TextRenderingMode = 2) or
          (GS.TextRenderingMode = 4) or (GS.TextRenderingMode = 6) then
        begin
          FEPS.PSSetColor(GS.FillColorEPS);
          FEPS.PSFill(epsFillModeNonZeroWinding);
        end;
        if (GS.TextRenderingMode = 1) or (GS.TextRenderingMode = 2) or
          (GS.TextRenderingMode = 5) or (GS.TextRenderingMode = 6) then
        begin
          FEPS.PSSetColor(GS.StrokeColorEPS);
          FEPS.PSStroke;
        end;
      end
      else
      begin
        if (GS.TextRenderingMode = 1) or (GS.TextRenderingMode = 2) or
          (GS.TextRenderingMode = 5) or (GS.TextRenderingMode = 6) then
          SetPen;
        if (GS.TextRenderingMode = 0) or (GS.TextRenderingMode = 2) or
          (GS.TextRenderingMode = 4) or (GS.TextRenderingMode = 6) then
          SetFill(pfWinding);
        if GS.TextRenderingMode >= 4 then
          Picasso.SetClippingPath(pfWinding);
        case GS.TextRenderingMode of
          0: Picasso.FillPath;
          1: Picasso.StrokePath;
          2: Picasso.StrokeAndFillPath;
          4: Picasso.FillPath;
          5: Picasso.StrokePath;
          6: Picasso.StrokeAndFillPath;
          7: Picasso.NullPath;
        end;
      end;
    end;

So please test it, and if it works OK, can it be included in version 5.15?

Best regards, Ulrich
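To make the effect of the two added lines easier to see without the library source, here is a small Python model of the SplitWords path. It is a simplified sketch, not the library code: emitted_pieces is a hypothetical name, the list it returns stands in for the successive FFunnel.AddText calls, and it assumes that none of the strings contains a space character.

```python
from typing import List, Union


def emitted_pieces(tj_items: List[Union[str, float]],
                   reset_after_flush: bool) -> List[str]:
    """Return the text handed over after each string operand of a TJ array."""
    pieces: List[str] = []
    mapped_text = ""  # plays the role of MappedText in ShowText
    for item in tj_items:
        if not isinstance(item, str):
            continue  # a number only moves the pen position
        mapped_text += item
        pieces.append(mapped_text)  # stands in for FFunnel.AddText(...)
        if reset_after_flush:
            mapped_text = ""  # the added "UKO Reset" line
    return pieces


if __name__ == "__main__":
    example = ["T", 83, "es", -1, "t", -333, '"Thi', 1, "s", -1, ",", -333, "th", 1, 'at"']
    print(emitted_pieces(example, reset_after_flush=False))
    print(emitted_pieces(example, reset_after_flush=True))
```

Without the reset, the first three pieces are "T", "Tes" and "Test", which matches the "TTesTest" at the start of the output reported in the first post. With the reset, each string is handed over exactly once.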
|
DELBEKE
Thank you very much.
Best regards
|
DELBEKE
Good job. It is working fine for me. The function can surely be improved, but it works better than the old version, which was too buggy. Thanks very much, Ukobsa :)
|
Copyright © 2017 Debenu. Debenu Quick PDF Library is a PDF SDK. All rights reserved.