How to maintain a searchable PDF File? |
dky
Beginner Joined: 27 Oct 20 Status: Offline Points: 12 |
Posted: 28 Oct 20 at 3:28AM |
Hi,
I have a searchable PDF file converted from MS Word. I need to export some pages as TIFF image files, then put the TIFF images back into the corresponding pages of the searchable PDF file. How can I do this? Could you please provide sample code? I'm developing in Delphi 7. C# is OK too.
|
|
Ingo
Moderator Group Joined: 29 Oct 05 Status: Offline Points: 3524 |
Hi Mike, you want to export single pages from your PDF in TIFF format (PDF page -> TIFF). As a second step, you want to replace those pages in your PDF with a new TIFF (TIFF -> PDF page). This is possible with QuickPDF, but the replaced pages won't be searchable anymore. I think this might be your show stopper? Cheers and welcome here, Ingo |
|
Cheers,
Ingo |
|
tfrost
Senior Member Joined: 06 Sep 10 Location: UK Status: Offline Points: 437 |
If I scan a document with my Canon scanner, which includes an OCR function, it creates a picture of the document, which is displayed as an image in a PDF viewer. You can see that it is an image because it looks very slightly fuzzy (it is not a very high resolution scanner). But cleverly, Canon also invisibly places the OCR text below the image, so that it appears you can select and search the image. The scanned image then appears to be fully searchable. In theory you could implement the same trick with QuickPDF on a TIFF file which you have placed on the page, but you would need both to develop or incorporate an OCR facility AND the means to place the OCR output text invisibly (white on white) on a layer under the image, in exactly the right position and size, so that you can draw a box to highlight the search result. In the end it would be much cheaper and much quicker to purchase an inexpensive scanner with an OCR feature and scan each page on which you have inserted a TIFF.
The searchable PDF you save from Office is completely different. It only contains the text, and the text is rendered onto the viewed or printed page directly, so it is always searchable. Once you have destroyed all this by rendering it to TIFF, it cannot be unscrambled without using OCR. |
|
dky
Beginner Joined: 27 Oct 20 Status: Offline Points: 12 |
Dear Ingo,
Thanks for your help. You are right: I want to replace pages in my PDF with a new TIFF (TIFF -> PDF page), and those pages won't be searchable anymore. Could you please provide sample code to export single pages from my PDF in TIFF format (PDF page -> TIFF) and to replace those pages in my PDF with a new TIFF (TIFF -> PDF page)? I'm developing in Delphi 7.
|
|
dky
Beginner Joined: 27 Oct 20 Status: Offline Points: 12 |
Dear tfrost,
Thanks for your help. I need to export some pages as TIFF images and then replace them with new TIFF images. It's okay for me that these imported pages are no longer searchable, as long as the other pages are still searchable.
|
|
tfrost
Senior Member Joined: 06 Sep 10 Location: UK Status: Offline Points: 437 |
To export, look at this function:
https://www.debenu.com/docs/pdf_library_reference/RenderPageToFile.php
To import, see:
https://www.debenu.com/docs/pdf_library_reference/AddImageFromFile.php
There are many similar functions shown in the reference guide, and there are examples of using them in the scripts that you can find in DebenuPDFLibrartyDemo.exe.
|
|
Ingo
Moderator Group Joined: 29 Oct 05 Status: Offline Points: 3524 |
Hi,
I think you'll manage RenderPageToFile (PDF to image format) on your own. Here's a sample going the other way round (AddImageFromFile): https://www.debenu.com/kb/add-images-pdf-programmatically/ |
|
Cheers,
Ingo |
|
dky
Beginner Joined: 27 Oct 20 Status: Offline Points: 12 |
Hi, thanks for your help. I can render a page to TIFF now, but I have run into one problem: one Traditional Chinese font is broken. There are several fonts in the PDF page, and all of them render normally except one font named "標楷體", which is broken. Do you know why? Should I set any parameters before using RenderPageToFile()? I'm using RenderPageToFile(600, 1, 10, 'C:\test.tiff')
Edited by dky - 03 Nov 20 at 5:45AM |
|
dky
Beginner Joined: 27 Oct 20 Status: Offline Points: 12 |
I can't post the tiff image screen capture here.
Edited by dky - 03 Nov 20 at 5:48AM |
|
Ingo
Moderator Group Joined: 29 Oct 05 Status: Offline Points: 3524 |
Hi Mike,
it seems to me that this Chinese font doesn't exist on your PC? Normal standard fonts can be rendered by QuickPDF, but not this unusual font. If the font isn't installed on your PC, rendering will probably fail. To avoid problems like this, creators normally embed special fonts into the PDF. Perhaps this wasn't done, or the file is detached now, or there was an error while embedding. To look into this problem more deeply, you should upload the PDF to a free file hoster and post the link here. |
|
Cheers,
Ingo |
|
dky
Beginner Joined: 27 Oct 20 Status: Offline Points: 12 |
Hi Ingo,
Thanks again for your help. The Chinese font does exist on my PC, because it is a popular font here, and I can view the PDF normally in Acrobat Reader. I encountered the same problem with another PDF SDK named Spire.Pdf; they then released another version that includes a Spire.FontType.dll, and once I set the documentTextIsAsiaFont parameter to true in my program, it was okay. The files listed below are the test files; you can download them to test.
1. WordToPDF_Standard.pdf: the original searchable PDF generated by MS Word 2010.
2. WordToPDF_Standard_Export0001.tif: the TIFF image extracted by the RenderPageToFile() function. One Chinese font is broken.
3. Test_01PDF_OriginalPDF.jpg: a screen capture of the original searchable PDF generated by MS Word 2010, viewed in Acrobat Reader DC.
4. Test_02Tiff_MarkBrokenFont.jpg: a screen capture of the broken TIFF, with the broken font marked by red rectangles.
Edited by dky - 04 Nov 20 at 1:22AM |
|
tfrost
Senior Member Joined: 06 Sep 10 Location: UK Status: Offline Points: 437 |
Thanks for uploading these files. I can reproduce your font problem with QPDF 18.11. We have over the years reported several problems with CJK fonts in QPDF and though the rendering is much improved, it is not perfect in the default renderer.
I recommend that you try a different renderer, such as the PDFIUM renderer, which is supplied with Quick PDF. You can use the SelectRenderer function to try it: read the documentation for this function for details. When I render your PDF with PDFIUM the font issues do not occur, but our application uses direct calls to the PDFIUM DLL, not via Quick PDF. With PDFIUM the font looks exactly like your PDF, but I do not read Chinese so I cannot be 100% certain. In our application we work mainly with faxable TIF at 200dpi so all your smaller fonts are a bit fuzzy in the TIF here, though the characters look correct. Our applications can use either the standard renderer or PDFIUM; note that the latter may not work in a multi-threaded application.
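As a rough illustration of why text in a 200 dpi TIF looks fuzzier than in the 600 dpi render used earlier in this thread, here is a small Python sketch. It is not Quick PDF code and calls no PDF library; it only does the point-to-pixel arithmetic (1 PDF point = 1/72 inch). The A4 page size of about 595 x 842 points and the 10 pt text size are assumed example values, not taken from the uploaded files, and the function names are made up for this example.

```python
POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def points_to_pixels(points, dpi):
    """Convert a length in PDF points to whole pixels at the given DPI."""
    return round(points * dpi / POINTS_PER_INCH)


def page_pixel_size(width_pt, height_pt, dpi):
    """Return (width, height) in pixels of a page rendered at the given DPI."""
    return points_to_pixels(width_pt, dpi), points_to_pixels(height_pt, dpi)


# Assumed example values: an A4 page (about 595 x 842 points) and 10 pt text.
A4_WIDTH_PT, A4_HEIGHT_PT = 595, 842

for dpi in (200, 600):
    width_px, height_px = page_pixel_size(A4_WIDTH_PT, A4_HEIGHT_PT, dpi)
    text_px = points_to_pixels(10, dpi)
    print(f"{dpi} dpi: page {width_px} x {height_px} px, 10 pt text about {text_px} px tall")
```

Under these assumptions, 10 pt text gets roughly a third of the pixel height at 200 dpi that it gets at 600 dpi, which is consistent with small fonts looking soft in a faxable TIF while the characters themselves are still correct.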
|
|
Ingo
Moderator Group Joined: 29 Oct 05 Status: Offline Points: 3524 |
I agree with tfrost.
From a distance it looks okay; up close it looks a bit inaccurate and blurry, but not really wrong. Then again, I can't read Chinese ;-) |
|
Cheers,
Ingo |
|
dky
Beginner Joined: 27 Oct 20 Status: Offline Points: 12 |
Hi tfrost & Ingo,
Thanks to you both. I can now render a page to TIFF with normal Chinese fonts, but another problem has appeared. Every page of the original PDF was searchable. After I delete a page from the PDF or insert a page into it and then save the PDF, the pages after the deleted or inserted page are no longer searchable. The pages before it are still searchable. Am I missing anything before deleting or inserting a page?
The test files are at the URL below; you can download them to test.
https://drive.google.com/drive/folders/1bZfm0ELUe8U_djisLaAAesPE8XpCmMW6?usp=sharing
1. WordToPDF_Standard.pdf: the original searchable PDF generated by MS Word 2010. Every page is searchable.
2. WordToPDF_Standard_SaveInsertPage.pdf: the PDF after I inserted one blank page at page 2. Pages after page 2 are no longer searchable; pages before page 2 are still searchable.
3. WordToPDF_Standard_SaveDeletePage.pdf: the PDF after I deleted page 2. Pages after page 2 are no longer searchable; pages before page 2 are still searchable. |
|
Ingo
Moderator Group Joined: 29 Oct 05 Status: Offline Points: 3524 |
Hi, we already covered this... The leading pages weren't touched, so they still have the parallel text content behind the image. The whole extracted page was turned into an IMAGE (without the content behind it), and this lone IMAGE became the new PDF page. You aren't doing anything wrong; that's just how it is. |
|
Cheers,
Ingo |
|
dky
Beginner Joined: 27 Oct 20 Status: Offline Points: 12 |
Hi Ingo,
My problem is not about extracting images. I just insert one blank page into the PDF file, or delete one existing page from it, and then save the PDF file. The pages before the inserted or deleted page are still searchable, but the pages after it are unsearchable. It seems as if the page contents were destroyed by inserting or deleting the page.
|
|
tfrost
Senior Member Joined: 06 Sep 10 Location: UK Status: Offline Points: 437 |
Something seems wrong with your testing method, because your two files insertpage and deletepage are fully searchable here.
For the insertpage version, in PDF Tools Pro I could still extract the text on all pages, shown as being on the correct page. In Acrobat Reader DC I could highlight and copy text on page 3 (formerly page 2), for example as 客戶基本資料表. And if I paste this into the find dialog, Acrobat finds it in the text and places a highlight over it, as expected. The same applies with deletepage, where I can copy text at the top of page 2 (formerly 3) as for example 因發生第 and also paste this into the find dialog and successfully find it. In the case of this document I found it a little harder to select the exact characters - I sometimes got an extra character in the selection. But it works, basically. The example copied glyphs above appear OK in the preview in my browser here, but I guess they may not show correctly in all browsers.
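To keep the "formerly page N" bookkeeping straight when comparing the test files, here is a minimal Python sketch of how 1-based page numbers shift after a single insert or delete. It is plain arithmetic, not Quick PDF code, and the function names are hypothetical.

```python
from typing import Optional


def page_after_insert(old_page: int, inserted_at: int) -> int:
    """New 1-based number of an existing page after one page is inserted at `inserted_at`."""
    return old_page + 1 if old_page >= inserted_at else old_page


def page_after_delete(old_page: int, deleted: int) -> Optional[int]:
    """New 1-based number after page `deleted` is removed; None for the deleted page itself."""
    if old_page == deleted:
        return None
    return old_page - 1 if old_page > deleted else old_page


# Blank page inserted at position 2: the former page 2 is now page 3.
print(page_after_insert(2, inserted_at=2))
# Page 2 deleted: the former page 3 is now page 2.
print(page_after_delete(3, deleted=2))
```

Pages in front of the edit keep their numbers, which is why only the pages from the insert or delete position onward need to be rechecked under their new numbers.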
|
|
dky
Beginner Joined: 27 Oct 20 Status: Offline Points: 12 |
Hi tfrost,
Thanks again for your help. I can't search the PDF after inserting one page at page 2 when I use Acrobat Reader DC Chinese Edition (version 2020.013.20064). When I search for a keyword, it only searches page 1 and loops; it does not continue automatically to the pages after page 2, so I have to jump to the page myself. Strangely, it works normally in the DC English Edition on my customer's PC. The URL is listed below, and the file is named "WordToPDFA3_InsertPage.pdf".
There is one more strange problem. I can open a PDF and save it to another PDF file, and that works normally. But when I open a PDF file provided by my customer and save it to another PDF, exactly as I do with my own test file, the saved PDF is broken and cannot be opened in Acrobat Reader, which shows that the PDF file is broken. The broken file is named "TestSave_Bad.pdf", and the normal original file is named "TestSave_Original.pdf". The saved PDF file is also bigger than the original, even if I delete pages or just open and save.
Finally, can I convert a PDF from color pages to black-and-white pages to reduce the PDF file size? Thank you very much.
|
|
tfrost
Senior Member Joined: 06 Sep 10 Location: UK Status: Offline Points: 437 |
Since I do not have and could not use a Chinese Acrobat version, I cannot help with differences in how an English and Chinese Acrobat process your PDF. And I suggest that this and also the 'bad PDF' issue are matters you should raise with Foxit support.
|
|
WalterWiggos
Beginner Joined: 05 Mar 23 Status: Offline Points: 1 |
To extract specific pages from a PDF and save them as TIFF image files, you can use a PDF library such as iTextSharp. Once you have the TIFF files, you can use a library like LibTiff to manipulate them as needed. To replace specific pages in your searchable PDF file with the TIFF images, you can use a PDF library to insert the images into the PDF at the desired page location. I won't have time to keep up with modern technologies, as everything is developing too fast. Recently, our company has integrated a document scanner from https://smartengines.com/. This scanner is very effective and, most importantly, reduces the workload of employees.
Edited by WalterWiggos - 07 Mar 23 at 9:13PM |
|
Copyright © 2017 Debenu. Debenu Quick PDF Library is a PDF SDK. All rights reserved.