Strange characters when using GetPageText

smithmarkduane (Beginner, joined 19 Feb 10) Posted: 19 Feb 10 at 3:39PM
	I have several PDF documents from the same source whose contents I need to scan to extract some basic text information. On some of the documents (not all) I get strange characters. For example (see the value that follows Test Date):
	Test Date: 1ï¿½/3/2ï¿½ï¿½9
	where Test Date should be 10/3/2009. In fact, in every case where I see a problem, it is where the 0 character should appear. Any ideas how to resolve this?
	Thanks, Mark

Ingo (Moderator Group, joined 29 Oct 05) Posted: 19 Feb 10 at 3:48PM
	Hi! Please upload a few sample files and give us the URL so that we can test them ourselves. Cheers and welcome here, Ingo

smithmarkduane (Beginner, joined 19 Feb 10) Posted: 19 Feb 10 at 4:06PM
	Here you go:
	https://www.yousendit.com/download/RmNCSlJ4ZEswVW14dnc9PQ
	https://www.yousendit.com/download/RmNCSlIzcVhCSm8wTVE9PQ
	Thanks, Mark

Ingo (Moderator Group, joined 29 Oct 05) Posted: 19 Feb 10 at 8:01PM
	Hi Mark! With my test app I can do preview, text extraction, and so on. The creation date and modification date are absolutely okay ... no strange characters. If I open the PDF with Notepad I find the date in this format: 2009-10-03. Using QuickPDF it's the same... I've rebuilt it this way: 03.10.2009. So I think it must be something on your PC. You can be sure that it has nothing to do with QuickPDF. Perhaps you want to show us the relevant code snippet so we can check it? Cheers, Ingo
	Edited by Ingo - 19 Feb 10 at 8:02PM

smithmarkduane (Beginner, joined 19 Feb 10) Posted: 19 Feb 10 at 10:13PM
	Hi Ingo: Thanks for looking into this. The date I am referring to is not the Creation or Modification Date metadata of the PDF, but part of the text content of the document. In the header of the page you will notice 'Test Date: 10/3/2009'. On my system, if I open the document in Notepad/Notepad++ and search for 10/3/2009, no match is returned.
	The test app I have to demonstrate this result is simply:
	...
	PDFLibrary := TQuickPDF0717.Create;
	UnlockResult := PDFLibrary.UnlockKey('123456789');
	PDFLibrary.LoadFromFile('samp1.pdf');
	PDFLibrary.SelectPage(2);
	memo1.Lines.Add(PDFLibrary.GetPageText(0));
	...
	When I run the above code, a portion of what I see is:
	ï¿½2008 Copyrigh ANX 3.0 ANSAR Medical Patient: xxxxxxxxx Weight: 145 lbs Height: 5 ft 6 in Gender: Female Age: 64 DOB: 1/2/1945 ANS Medications: Other Medications & Symptoms: Test Date: 1ï¿½/3/2ï¿½ï¿½9 Physician: xxxxxx No. of Ectopic Beats: ï¿½
	Could this be a system font or character set issue? Since I am not familiar with the internal structure of PDFs, I am not sure where to look next.
	Thanks, Mark
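A note on the garbled output above: the three-character sequence "ï¿½" is what the Unicode replacement character U+FFFD looks like when its UTF-8 bytes (EF BF BD) are displayed as Windows-1252. That suggests, though it does not prove, that a character set conversion somewhere between the PDF and the screen could not map the glyph used for "0". The original digit is then already lost before the text is displayed. The Python sketch below only illustrates that byte-level effect. It does not call QuickPDF, and the function names are hypothetical.

```python
# U+FFFD is the Unicode "replacement character" that decoders emit
# when they meet something they cannot map to a real character.
REPLACEMENT = "\ufffd"


def as_seen_in_cp1252(text: str) -> str:
    """Encode text as UTF-8, then misread those bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def undo_misread(text: str) -> str:
    """Reverse the misreading: back to bytes via Windows-1252, then UTF-8."""
    return text.encode("cp1252").decode("utf-8")


# Assumption for illustration: every "0" in 10/3/2009 was replaced by U+FFFD.
damaged = "1" + REPLACEMENT + "/3/2" + REPLACEMENT * 2 + "9"
garbled = as_seen_in_cp1252(damaged)

print(garbled)                        # 1ï¿½/3/2ï¿½ï¿½9
print(undo_misread(garbled) == damaged)  # True
```

Reversing the misreading only recovers the replacement characters, not the zeros, so a fix would have to happen at the extraction step rather than afterwards.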

Ingo (Moderator Group, joined 29 Oct 05) Posted: 19 Feb 10 at 10:19PM
	Hi Mark! It's too strange. You should forget it. All the numbers are correct, only the ones in the date are not... Try extracting into a plain string instead of a TMemo... Same result? Cheers, Ingo

smithmarkduane (Beginner, joined 19 Feb 10) Posted: 19 Feb 10 at 10:39PM
	Yes, same result when I extract to a plain string. Do you see this behavior too, or is it just on my system? What do you mean when you say 'You should forget it'? The 'Test Date' in the page header is one of the pieces of information I need to extract. Thanks, Mark

Ingo (Moderator Group, joined 29 Oct 05) Posted: 19 Feb 10 at 10:47PM
	Hi Mark! There are other functions in the library to get the creation date and modification date. Your problem isn't a common one. It seems to occur only on your machine. I can't explain it and I can't imagine why. Sorry. Cheers, Ingo

smithmarkduane (Beginner, joined 19 Feb 10) Posted: 19 Feb 10 at 10:54PM
	Hi Ingo: I just want to be sure I am explaining the problem correctly. I am not interested in the Creation/Modification date of the document. I am interested in extracting text from the document content, where the text happens to be a date string. My question is: if you use the code
	PDFLibrary.SelectPage(2);
	memo1.Lines.Add(PDFLibrary.GetPageText(0));
	on the sample PDF document I uploaded, do you see the strange characters I reported, or do you see 'normal' text? I have now run the sample app on 3 machines (Win XP and Win 7) with the same result.
	Thanks again for your help. Mark
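To compare results across machines or extraction options, it can help to check the extracted text mechanically rather than by eye. The Python sketch below is one hypothetical way to do that: it assumes the text returned by GetPageText has been saved or copied into a string, and it reports each line containing either the Unicode replacement character or its "ï¿½" rendering. The sample text and its line breaks are assumptions based on the output quoted in this thread.

```python
from typing import List, Tuple

MOJIBAKE = "\u00ef\u00bf\u00bd"  # the three characters "ï¿½"
REPLACEMENT = "\ufffd"           # the Unicode replacement character itself


def affected_lines(extracted: str) -> List[Tuple[int, int, str]]:
    """Return (line number, marker count, line) for each damaged line."""
    hits = []
    for number, line in enumerate(extracted.splitlines(), start=1):
        count = line.count(MOJIBAKE) + line.count(REPLACEMENT)
        if count:
            hits.append((number, count, line))
    return hits


# Assumed sample: fields from the quoted output, one per line.
sample = "\n".join([
    "Age: 64",
    "DOB: 1/2/1945",
    "Test Date: 1" + MOJIBAKE + "/3/2" + MOJIBAKE * 2 + "9",
    "No. of Ectopic Beats: " + MOJIBAKE,
])

for number, count, line in affected_lines(sample):
    print(f"line {number}: {count} unreadable character(s): {line}")
```

Running the same check on the text each machine produces gives a direct answer to whether the output differs between systems.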

Ingo (Moderator Group, joined 29 Oct 05) Posted: 19 Feb 10 at 11:07PM
	Hi Mark! I've written it already! I've seen/extracted the "normal" text and numbers - no strange characters! You're the only one having this problem. I'm using GetPageText(3) ... perhaps this makes a difference for you... You gave a good, easy-to-understand description of the problem. We all understand it, but I'm pretty sure that nobody here can imagine why it happens. Sorry. Cheers, Ingo

smithmarkduane (Beginner, joined 19 Feb 10) Posted: 19 Feb 10 at 11:13PM
	Hi: Thanks. Would it be possible for you to upload your sample app so I can try it and see whether the problem is my code or my system(s)? The code is so simple that I don't see how it could be the cause, but I am not sure where else to look. Thanks, Mark

Ingo (Moderator Group, joined 29 Oct 05) Posted: 19 Feb 10 at 11:22PM
	Hi! Try my freeware PDF-Analyzer ... ;-) Cheers, Ingo