Strange characters when using GetPageText |
smithmarkduane
Beginner Joined: 19 Feb 10 Status: Offline Points: 6 |
Posted: 19 Feb 10 at 3:39PM |
I have several PDF documents from the same source, and I need to scan their contents to extract some basic text information. On some of the documents (not all) I get strange characters. For example (see the value that follows Test Date):
Test Date: 1�/3/2��9, where Test Date should be 10/3/2009. In fact, in every case where I see a problem, it is where the 0 character should appear. Any ideas how to resolve this?
Thanks,
Mark
|
|
Ingo
Moderator Group Joined: 29 Oct 05 Status: Offline Points: 3524 |
Hi!
You should upload a few samples and tell us the URL so we can test it ourselves. Cheers and welcome here, Ingo
|
smithmarkduane
|
Ingo
Hi Mark!
With my test app I can do preview, text extraction, and so on. The creation date and modification date are absolutely okay ... no strange characters. If I open the PDF with Notepad I find the date in this format: 2009-10-03. Using QuickPDF it's the same. I've rebuilt it this way: 03.10.2009. So I think it must be something on your PC? You can be sure that it has nothing to do with QuickPDF. Perhaps you want to show us the relevant code snippet so we can check it?
Cheers, Ingo
Edited by Ingo - 19 Feb 10 at 8:02PM
|
smithmarkduane
Hi Ingo:
Thanks for looking into this. The date I am referring to is not the Creation or Modification Date metadata of the PDF, but text in the content of the document. In the header of the page you will notice 'Test Date: 10/3/2009'. On my system, if I open the document in Notepad/Notepad++ and search for 10/3/2009, no match is returned. The test app I have to demonstrate this result is simply:

...
PDFLibrary := TQuickPDF0717.Create;
UnlockResult := PDFLibrary.UnlockKey('123456789');
PDFLibrary.LoadFromFile('samp1.pdf');
PDFLibrary.SelectPage(2);
memo1.Lines.Add(PDFLibrary.GetPageText(0));
...

When I run the above code, a portion of what I see is:

�2008 Copyrigh ANX 3.0 ANSAR Medical Patient: xxxxxxxxx Weight: 145 lbs Height: 5 ft 6 in Gender: Female Age: 64 DOB: 1/2/1945 ANS Medications: Other Medications & Symptoms: Test Date: 1�/3/2��9 Physician: xxxxxx No. of Ectopic Beats: �

Could this be a system font or character set issue? Since I am not familiar with the internal structure of PDFs, I am not sure where to look next.
Thanks, Mark
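One way to narrow down the font or character-set question above is to look at the actual character codes that come back where the 0 should be. The following is only a minimal sketch and was not posted in the thread: DescribeOddChars is a hypothetical helper, and the hard-coded string (with #1 standing in for the unknown character) is just a placeholder for whatever PDFLibrary.GetPageText(0) returns.

```pascal
program DumpOddChars;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: returns one line per character that is outside
// printable ASCII, giving its 1-based position and its code in hex.
function DescribeOddChars(const S: string): string;
var
  I, Code: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
  begin
    Code := Ord(S[I]);
    // Tab, LF and CR are expected in extracted text, so they are not reported.
    if (Code = 9) or (Code = 10) or (Code = 13) then
      Continue;
    // Report anything below space or above '~'.
    if (Code < 32) or (Code > 126) then
      Result := Result + 'pos ' + IntToStr(I) + ': $' + IntToHex(Code, 4) + sLineBreak;
  end;
end;

begin
  // Placeholder input; #1 is an assumed stand-in for the unknown character.
  // In the real test app this would be the result of PDFLibrary.GetPageText(0).
  Write(DescribeOddChars('Test Date: 1' + #1 + '/3/2' + #1#1 + '9'));
end.
```

If the reported code is the same at every position where a 0 is expected, that would suggest the 0 glyph in the document's font maps to an unexpected character code, rather than a problem on the machine doing the extraction.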
|
Ingo
Hi Mark!
It's too strange. You should forget it. All the numbers are correct, only the ones in the date are not... Try the extraction into a normal string instead of a TMemo... Same result? Cheers, Ingo
|
smithmarkduane
Yes, same result when I extract to a normal string. Do you see this behavior too, or is it just on my system? What do you mean when you say 'You should forget it'? The 'Test Date' in the page header is one of the pieces of information I need to extract.
Thanks, Mark
|
Ingo
Hi Mark!
There are other functions in the library to get the creation date and modification date. Your problem isn't a common one. It seems to occur only on your machine. I can't explain it and I can't imagine why. Sorry. Cheers, Ingo
|
smithmarkduane
Hi Ingo:
I just want to be sure I am explaining the problem correctly. I am not interested in the Creation/Modification date of the document. I am interested in extracting text from the document content, where the text happens to be a date string. My question is: if you use the code

PDFLibrary.SelectPage(2);
memo1.Lines.Add(PDFLibrary.GetPageText(0));

on the sample PDF document I uploaded, do you see the strange characters I reported, or do you see 'normal' text? I have now run the sample app on 3 machines (Win XP and Win 7) with the same result. Thanks again for your help.
Mark
|
Ingo
Hi Mark!
I've written it already! I've seen/extracted the "normal" text and numbers - no strange characters! You're the only one having this problem. I'm using GetPageText(3) ... perhaps this makes a difference for you... You gave a good and easy-to-understand description of the problem. We all understand it, but I'm pretty sure that nobody here can imagine why it happens. Sorry. Cheers, Ingo
|
smithmarkduane
Hi:
Thanks. Would it be possible for you to upload your sample app so I can try it and see whether the cause is my code or my system(s)? The code is so simple that I don't see how it could be the cause, but I'm not sure where else to look. Thanks, Mark
|
Ingo
Hi!
Try my freeware PDF-Analyzer ... ;-) Cheers, Ingo
|
Copyright © 2017 Debenu. Debenu Quick PDF Library is a PDF SDK. All rights reserved.