I need help - I can help - Strange characters when using GetPageText

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=1348
Printed Date: 27 Sep 24 at 3:00PM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com

Topic: Strange characters when using GetPageText

Posted By: smithmarkduane
Subject: Strange characters when using GetPageText
Date Posted: 19 Feb 10 at 3:39PM

I have several PDF documents from the same source whose contents I need to scan to extract some basic text information. On some of the documents (not all) I get strange characters. For example (see the value that follows Test Date):

Test Date: 1ï¿½/3/2ï¿½ï¿½9

where Test Date should be 10/3/2009. In fact, in every case where I see the problem, it is where the 0 character should appear. Any ideas on how to resolve this?

Thanks,

Mark

Replies:

Posted By: Ingo
Date Posted: 19 Feb 10 at 3:48PM

Hi!

You should upload a few samples and give us the URL,
so we can test them ourselves.

Cheers and welcome here,
Ingo

Posted By: smithmarkduane
Date Posted: 19 Feb 10 at 4:06PM

Here you go:

https://www.yousendit.com/download/RmNCSlJ4ZEswVW14dnc9PQ

https://www.yousendit.com/download/RmNCSlIzcVhCSm8wTVE9PQ

Thanks,

Mark

Posted By: Ingo
Date Posted: 19 Feb 10 at 8:01PM

Hi Mark!

With my test app I can do preview, text extraction, and so on.
The creation date and modification date are absolutely okay ... no strange characters.
If I open the PDF with Notepad I find the date in this format: 2009-10-03
Using QuickPDF it's the same... I've rebuilt it this way: 03.10.2009
So I think it must be something on your PC?
You can be sure that it has nothing to do with QuickPDF.
Perhaps you could show us the relevant code snippet so we can check it?

Cheers, Ingo

Posted By: smithmarkduane
Date Posted: 19 Feb 10 at 10:13PM

Hi Ingo:

Thanks for looking into this. The date I am referring to is not the Creation or Modification Date metadata of the PDF, but a date in the text content of the document. In the header of the page you will notice 'Test Date: 10/3/2009'. On my system, if I open the document in Notepad/Notepad++ and search for 10/3/2009, no match is returned. The test app I use to demonstrate this result is simply:

...
PDFLibrary := TQuickPDF0717.Create;
UnlockResult := PDFLibrary.UnlockKey('123456789'); // license key
PDFLibrary.LoadFromFile('samp1.pdf');
PDFLibrary.SelectPage(2); // work on page 2
memo1.Lines.Add(PDFLibrary.GetPageText(0)); // add the extracted page text to the memo
...

When I run the above code, a portion of what I see is:

ï¿½2008
Copyrigh
ANX 3.0

ANSAR Medical

Patient: xxxxxxxxx
Weight: 145 lbs Height: 5 ft 6 in Gender: Female Age: 64 DOB: 1/2/1945
ANS Medications:
Other Medications & Symptoms:

Test Date: 1ï¿½/3/2ï¿½ï¿½9 Physician: xxxxxx

No. of Ectopic Beats: ï¿½

Could this be a system font or character set issue? Since I am not familiar with the internal structure of PDFs, I am not sure where to look next.
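
One way to narrow this down is to look at the numeric character codes of the extracted string instead of relying on how a memo or a web page renders them. The following is only a minimal sketch in plain Delphi: DumpChars is a hypothetical helper (it is not part of Quick PDF Library), and the sample string merely stands in for the value returned by GetPageText.

```delphi
program DumpCodePoints;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: returns the code of every character in S as
// four hex digits, separated by spaces, so that unexpected characters
// become visible regardless of how the display control renders them.
function DumpChars(const S: string): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
  begin
    if I > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(S[I]), 4);
  end;
end;

begin
  // Assumption: in the real test app, pass the string returned by
  // PDFLibrary.GetPageText here instead of this literal.
  WriteLn(DumpChars('10/3/2009'));
  // prints: 0031 0030 002F 0033 002F 0032 0030 0030 0039
end.
```

If the positions where a 0 is expected show a code other than 0030, the extracted string really contains a different character, which would suggest looking at how the PDF encodes that glyph rather than at the memo control or the system font.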

Thanks,
Mark

Posted By: Ingo
Date Posted: 19 Feb 10 at 10:19PM

Hi Mark!

It's too strange. You should forget it. All numbers are correct except the ones in the dates...
Try the extraction with a normal string instead of a TMemo... Is the result the same?

Cheers, Ingo

Posted By: smithmarkduane
Date Posted: 19 Feb 10 at 10:39PM

Yes, same result when I extract to a normal string. Do you see this behavior too, or is it just on my system? What do you mean when you say 'You should forget it'? The 'Test Date' in the page header is one of the pieces of information I need to extract.

Thanks,
Mark

Posted By: Ingo
Date Posted: 19 Feb 10 at 10:47PM

Hi Mark!

There are other functions in the library to get the creation date and modification date.
Your problem isn't a common problem. It seems to occur only on your machine. I can't explain it and I can't imagine why. Sorry.

Cheers, Ingo

Posted By: smithmarkduane
Date Posted: 19 Feb 10 at 10:54PM

Hi Ingo:

I just want to be sure I am explaining the problem correctly. I am not interested in the Creation/Modification date of the document. I am interested in extracting text from the document content, where the text happens to be a date string. My question is, if you use the code:
PDFLibrary.SelectPage(2);
memo1.Lines.Add(PDFLibrary.GetPageText(0));
on the sample PDF document I uploaded, do you see the strange characters I reported or do you see 'normal' text? I have now run the sample app on 3 machines (Win XP and Win 7) with the same result.

Thanks again for your help.

Mark

Posted By: Ingo
Date Posted: 19 Feb 10 at 11:07PM

Hi Mark!

I've written it already! I've seen/extracted the "normal" text and numbers - no strange characters! You're the only one having this problem. I'm using GetPageText(3) ... perhaps this makes a difference for you...
Your description of the problem was good and easy to understand. We all understand it, but I'm pretty sure that nobody here can imagine the cause. Sorry.

Cheers, Ingo

Posted By: smithmarkduane
Date Posted: 19 Feb 10 at 11:13PM

Hi:

Thanks. Would it be possible for you to upload your sample app so I can try it and see whether it is my code or my system(s)? The code is so simple that I don't see how it could be the cause, but I am not sure where else to look.

Thanks,
Mark

Posted By: Ingo
Date Posted: 19 Feb 10 at 11:22PM

Hi!

Try my freeware PDF-Analyzer ... ;-)

Cheers, Ingo