Extract text from PDF with Layout.
Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=884
Printed Date: 22 Nov 24 at 7:05PM
Topic: Extract text from PDF with Layout.
Posted By: devMan
Date Posted: 02 Apr 08 at 9:48AM
Hi everybody!
I'm working with Visual Basic 6, and my goal right now is to extract the text from a PDF file so I can import it into an Oracle DB. I've found an OCX that gives me the entire text in a string variable, but without the data separated the way it is in the file.
My test file contains values placed in columns, and I need these values separated, by a semicolon for example.
So I would like to know whether your library can do this?
Thanks in advance.
P.S.: I can send you a test file if you need it to understand (I don't know if my explanation is clear...).
|
Replies:
Posted By: chicks
Date Posted: 02 Apr 08 at 11:52AM
Your best bet is probably pdftohtml (http://pdftohtml.sourceforge.net/). Its XML output option provides positional information. You can then apply an XSL transform to get the data into your final format. It has worked well for me in the past.
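As a rough illustration of the idea above, here is a minimal Python sketch that does the row-building step in code instead of an XSL transform. It assumes the XML contains `<text>` elements carrying `top` and `left` attributes; the sample document, its values, and the function name `rows_from_pdf2xml` are invented for illustration, so check them against the XML your own pdftohtml version produces.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample shaped like positional XML output (values are invented).
SAMPLE_XML = """<pdf2xml>
  <page number="1" height="1188" width="918">
    <text top="100" left="50" width="40" height="12" font="0">Name</text>
    <text top="100" left="200" width="30" height="12" font="0">Qty</text>
    <text top="120" left="200" width="10" height="12" font="0">3</text>
    <text top="120" left="50" width="45" height="12" font="0">Widget</text>
  </page>
</pdf2xml>"""


def rows_from_pdf2xml(xml_text, sep=";"):
    """Group <text> elements by their 'top' value and join each row left to right."""
    root = ET.fromstring(xml_text)
    rows = defaultdict(list)
    for node in root.iter("text"):
        top = int(node.get("top"))
        left = int(node.get("left"))
        # itertext() also collects text nested inside child elements such as <b>.
        rows[top].append((left, "".join(node.itertext()).strip()))
    # Rows are ordered by 'top', cells within a row by 'left'.
    return [sep.join(text for _, text in sorted(cells))
            for _, cells in sorted(rows.items())]


if __name__ == "__main__":
    for line in rows_from_pdf2xml(SAMPLE_XML):
        print(line)
```

Grouping on the exact `top` value only works when cells in a row share the same baseline; real files may need a small tolerance.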
|
Posted By: devMan
Date Posted: 03 Apr 08 at 2:39AM
Hi,
Thank you for your answer!
My goal is to have the content of the PDF in my app, in a variable, so I can process it. If I can, I'd prefer not to use file conversion.
But I'll keep your solution as a last resort.
|
Posted By: Ingo
Date Posted: 03 Apr 08 at 3:00AM
Hi! I'm wondering... Perhaps I don't understand, but... why not use the text-extraction functions from QuickPDF? They work page by page, so you can get the text content of each page. With option 3 you can get the individual text strings from each page with additional data such as position on the page, font, color, ... I can't imagine that you need more ;-) Best regards, Ingo
|
Posted By: devMan
Date Posted: 03 Apr 08 at 3:32AM
Hi Ingo,
Yes, my question is just whether QuickPDF can extract the text from a PDF that has columns and return the text separated according to the PDF layout...
I've checked the iSEDQuickPDF 5.11 Reference Guide.pdf and found 3 functions: GetPageLayout, SetPageLayout, ExtractFilePageContent.
I've seen that there are VB6 code examples. I hope that one of these helps me.
|
Posted By: Ingo
Date Posted: 03 Apr 08 at 5:19AM
Hi!
GetPageMode and GetPageLayout only retrieve how the document is displayed when it is opened... This won't help you.
For text extraction you can use these functions: DAExtractPageText, ExtractFilePageText, GetPageText.
With option "3" you'll get CSV strings with position data (in pixels) and much more. With this data you can rebuild your PDF layout as a text file.
Best regards, Ingo
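To make the "rebuild the layout from CSV strings" step concrete, here is a minimal Python sketch (not VB6, and not part of the library). It assumes each record has 12 fields in the shape that appears later in this thread: font, color, size, four (x, y) corner pairs, text. The sample values, the helper names `parse_records` and `to_delimited`, and the choice to read the first corner as the row position and to sort larger y first (page origin at the bottom left) are all assumptions to verify against your own output.

```python
import csv
from collections import defaultdict

# Hypothetical records: font, color, size, 4 corner pairs (x, y), text.
# All values are invented for illustration.
SAMPLE = '''"Helvetica",#000000,4.00,48.9,1107.1,75.8,1107.1,75.8,1111.9,48.9,1111.9,"ACME"
"Helvetica",#000000,4.00,104.2,1107.1,112.3,1107.1,112.3,1111.9,104.2,1111.9,"12"
"Helvetica",#000000,4.00,48.9,1102.0,78.7,1102.0,78.7,1106.8,48.9,1106.8,"Widget"
"Helvetica",#000000,4.00,104.2,1102.0,112.3,1102.0,112.3,1106.8,104.2,1106.8,"7"
'''


def parse_records(raw):
    """Parse CSV lines into dicts; blank lines and short records are skipped."""
    records = []
    for fields in csv.reader(raw.splitlines()):
        if len(fields) < 12:
            continue
        coords = [float(v) for v in fields[3:11]]
        records.append({
            "font": fields[0],
            "color": fields[1],
            "size": float(fields[2]),
            "x": coords[0],  # first corner, assumed to mark the start of the text
            "y": coords[1],
            "text": fields[11],
        })
    return records


def to_delimited(records, sep=";", y_tol=2.0):
    """Bucket records into rows by y, then join each row's cells left to right."""
    rows = defaultdict(list)
    for rec in records:
        rows[round(rec["y"] / y_tol)].append(rec)
    lines = []
    # Assumption: larger y is higher on the page, so sort rows in descending order.
    for key in sorted(rows, reverse=True):
        cells = sorted(rows[key], key=lambda r: r["x"])
        lines.append(sep.join(c["text"] for c in cells))
    return lines


if __name__ == "__main__":
    for line in to_delimited(parse_records(SAMPLE)):
        print(line)
```

The `y_tol` bucket is a simple heuristic: two cells on the same visual row can still land in different buckets if their y values straddle a bucket boundary, so tune it to the row spacing of the document.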
|
Posted By: devMan
Date Posted: 03 Apr 08 at 8:54AM
Hi again ;)
Thank you very much for your help, Ingo! I'm trying the methods that you sent me.
But would it be possible for me to upload my test file to show you exactly what I need?
Thank you again for your help!
Edit: I've tried your function (DAExtractPageText with DAOpenFile and DAFindPage) to get the PDF text, with option 3, and I get a string like this:
"GVUIQW+PoynterAgateTwo-Comp",#000000,4.00,104.2188,1112.2983,112.3148,1112.2983,112.3148,1117.0743,104.2188,1117.0743,""
"GVUIQW+PoynterAgateTwo-Comp",#000000,4.00,118.3948,1112.2983,149.1708,1112.2983,149.1708,1117.0743,118.3948,1117.0743," "
"GVUIQW+PoynterAgateTwo-Comp",#000000,4.00,48.9348,1107.1483,75.7988,1107.1483,75.7988,1111.9243,48.9348,1111.9243,""
"GVUIQW+PoynterAgateTwo-Comp",#000000,4.00,91.1228,1107.1483,97.5708,1107.1483,97.5708,1111.9243,91.1228,1111.9243,""
"GVUIQW+PoynterAgateTwo-Comp",#000000,4.00,104.2148,1107.1483,112.3108,1107.1483,112.3108,1111.9243,104.2148,1111.9243,""
"GVUIQW+PoynterAgateTwo-Comp",#000000,4.00,118.3908,1107.1483,126.4868,1107.1483,126.4868,1111.9243,118.3908,1111.9243,""
"GVUIQW+PoynterAgateTwo-Comp",#000000,4.00,130.7188,1107.1483,149.1668,1107.1483,149.1668,1111.9243,130.7188,1111.9243," "
"GVUIQW+PoynterAgateTwo-Comp",#000000,4.00,48.9348,1101.9983,78.7468,1101.9983,78.7468,1106.7743,48.9348,1106.7743,""
But all the text fields (at the end of each record) contain either spaces or nothing, and none of the values from my file...
Can you help me?
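A quick way to confirm this symptom on a whole page is to count how many records have a blank text field. The Python sketch below uses two records copied from the output above; the function name `text_stats` is hypothetical, and the check assumes the text is always the last of 12 comma-separated fields.

```python
import csv

# Two records copied from the extraction output quoted in this post.
RAW = '''"GVUIQW+PoynterAgateTwo-Comp",#000000,4.00,104.2188,1112.2983,112.3148,1112.2983,112.3148,1117.0743,104.2188,1117.0743,""
"GVUIQW+PoynterAgateTwo-Comp",#000000,4.00,118.3948,1112.2983,149.1708,1112.2983,149.1708,1117.0743,118.3948,1117.0743," "
'''


def text_stats(raw):
    """Return (total_records, blank_records); blank means empty or whitespace-only text."""
    total = blank = 0
    for fields in csv.reader(raw.splitlines()):
        if len(fields) < 12:
            continue  # skip blank lines or records in an unexpected shape
        total += 1
        if not fields[-1].strip():
            blank += 1
    return total, blank


if __name__ == "__main__":
    total, blank = text_stats(RAW)
    print(f"{blank} of {total} records have no visible text")
```

If every record has a bounding box but blank text, that may point to a problem with the file itself rather than with the code that parses the records.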
|
Posted By: Ingo
Date Posted: 03 Apr 08 at 9:33AM
ingo [dot] schmoekel [at] ewetel [dot] net
|
Posted By: devMan
Date Posted: 03 Apr 08 at 9:39AM
Posted By: Ingo
Date Posted: 03 Apr 08 at 10:10AM
Hi!
I get the same result... No content! How was the PDF created? Is it only scanned? Anyway, I've tested more than one function; QP can't extract the text in this case. Sorry.
Best regards, Ingo
|
Posted By: devMan
Date Posted: 04 Apr 08 at 1:59AM
Hello,
Oh dear... my test PDF is part of another PDF file, and I think the person who split the file created a bad file... (With another OCX, I get back strings and numbers that are not displayed in the file...)
OK, I'll try with a new test file!
Edit: I've taken a new file to test, and now QuickPDF works very well!! It's exactly what I'm looking for!! It parses each part of my PDF into fields with option 3 of the DAExtractPageText() method! And with the positions of the fields, I'll be able to select an area in the file...
So I think my company will buy a license for your ActiveX!
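Selecting an area by field position, as described above, could look roughly like the following Python sketch. It assumes the 12-field record shape seen earlier in the thread (font, color, size, four corner pairs, text); the sample values and the function name `records_in_area` are invented for illustration.

```python
import csv

# Hypothetical records: font, color, size, 4 corner pairs (x, y), text.
SAMPLE = '''"Helvetica",#000000,4.00,48.9,1107.1,75.8,1107.1,75.8,1111.9,48.9,1111.9,"ACME"
"Helvetica",#000000,4.00,104.2,1107.1,112.3,1107.1,112.3,1111.9,104.2,1111.9,"12"
"Helvetica",#000000,4.00,48.9,1102.0,78.7,1102.0,78.7,1106.8,48.9,1106.8,"Widget"
"Helvetica",#000000,4.00,104.2,1102.0,112.3,1102.0,112.3,1106.8,104.2,1106.8,"7"
'''


def records_in_area(raw, left, bottom, right, top):
    """Return the text of every record whose bounding box lies fully inside the rectangle."""
    hits = []
    for fields in csv.reader(raw.splitlines()):
        if len(fields) < 12:
            continue
        coords = [float(v) for v in fields[3:11]]
        xs, ys = coords[0::2], coords[1::2]
        # min/max over all four corners avoids relying on a particular corner order.
        inside_x = left <= min(xs) and max(xs) <= right
        inside_y = bottom <= min(ys) and max(ys) <= top
        if inside_x and inside_y:
            hits.append(fields[11])
    return hits


if __name__ == "__main__":
    # Pick out the second column of the hypothetical table.
    print(records_in_area(SAMPLE, left=100, bottom=1100, right=120, top=1115))
```

Requiring the whole box to be inside the rectangle is the strict choice; testing for overlap instead would also catch fields that only partly fall in the area.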
|
Posted By: devMan
Date Posted: 04 Apr 08 at 4:59AM
Can you tell me where I can find the conditions for purchasing a license and all other information about QuickZip, please?
|
Posted By: Ingo
Date Posted: 04 Apr 08 at 5:53AM
I don't know where you can get "QuickZip" :-) Perhaps you mean "QuickPDF" ;-)
Have a look here: http://www.quickpdf.org/forum/forum_posts.asp?TID=698
Best regards, Ingo
|
Posted By: devMan
Date Posted: 04 Apr 08 at 7:45AM
Oops
|
Posted By: devMan
Date Posted: 04 Apr 08 at 7:55AM
Last question (normally):
We are a company of 120 users, and in my team there are 5 developers. Do we need to buy 1 license for everyone, or more?
And after that, a last technical question: if we scan a paper document with a standard scanner, is it possible that QuickPDF won't extract the text correctly? Do you have any recommendations?
Thank you for all your support!
|
Posted By: Ingo
Date Posted: 04 Apr 08 at 8:29AM
Normally a scanned text ends up as an image... Only a few scanners can scan in an OCR mode... In that case you can do text extraction, too.
For your company you need one Enterprise-license...
Best regards, Ingo
|
Posted By: devMan
Date Posted: 04 Apr 08 at 9:47AM
And how much does this Enterprise license cost?
|
Posted By: Ingo
Date Posted: 04 Apr 08 at 11:35AM
Sorry... The correct version name is "Site License". It comes with source. Once you have it, you can send me the invoice or one of the smallest files from the source package, and then you'll get a password for the source section to get the latest version.
http://www.shareit.com/product.html?productid=143148
Please keep in mind: we're doing this here because we like to help... we get nothing and we want nothing... one for all and all for one ;-)
We have nothing to do with the iSED team. It still sells the old version 5.11. Here, many talented "pdf-artists" have pushed the version number up to 6.02... and that's not the end.
Best regards,
Ingo
|
Posted By: devMan
Date Posted: 07 Apr 08 at 3:09AM
Ingo wrote:
Please keep in mind: we're doing this here because we like to help... we get nothing and we want nothing... one for all and all for one ;-) We have nothing to do with the iSED team. It still sells the old version 5.11. Here, many talented "pdf-artists" have pushed the version number up to 6.02... and that's not the end. |
OK. If I've understood correctly, the iSED team sells only the license for v5.11 of QuickPDF, and you and your team are developing the new versions.
If that's the case, and we want to use your (better) version of QuickPDF, do we have to pay anything? Or must we buy a license for version 5.11 on the iSED site?
|
Posted By: Ingo
Date Posted: 07 Apr 08 at 4:06AM
Hi! Buy an iSED site license... and send me a copy of the invoice as a PDF or one of the smallest source files. Then you'll get access to our latest version... and you won't have to pay anything for it. Best regards, Ingo
|
Posted By: devMan
Date Posted: 08 Apr 08 at 1:42AM
|