
Exported images >> original file size?

Printed From: Debenu Quick PDF Library - PDF SDK Community Forum
Category: For Users of the Library
Forum Name: I need help - I can help
Forum Description: Problems and solutions while programming with the Debenu Quick PDF Library and Debenu PDF Viewer SDK
URL: http://www.quickpdf.org/forum/forum_posts.asp?TID=2151
Printed Date: 28 Sep 24 at 10:52PM
Software Version: Web Wiz Forums 11.01 - http://www.webwizforums.com


Topic: Exported images >> original file size?
Posted By: Dave
Subject: Exported images >> original file size?
Date Posted: 15 Feb 12 at 6:27PM
Hi all,

C# user here (I'm using the dll).
I've had a good look around and I can't see anything that describes this problem. If there is, feel free to point me in the right direction!

My PDFs should only have one image on each page (the scanner vendor's app. makes image-only PDFs) and I need to extract the image in order to make some changes to it. It gets written to a new PDF much later in the process.

I'm using SaveImageDataToFile but my test PDF, a two-page file of 43KB, is exporting 11MB images per page.

Interestingly, I created my test PDF with the same library... I know my original source image was a svelte 27KB G4 TIFF!

Is there a way of exporting the image using anything like the same compression the PDF format itself must use? 
My alternative is to extract the resolution and dimensional data and use some third-party library (tifflib port for C#, anyone?) to compress my images into a more manageable (network-friendly) size.

Any pointers or ideas most welcome.

Thanks!



Replies:
Posted By: Ingo
Date Posted: 15 Feb 12 at 7:38PM
Hi Dave!

The original images were inserted into the
PDF page with their original image properties,
like a snapshot. Extracting the image with the
original properties builds the original image
again, so maybe that is why it's bigger?
You should use the RenderPage functions
and work with the dpi values... this should
result in smaller files.

Cheers and welcome here,
Ingo



Posted By: Dave
Date Posted: 15 Feb 12 at 8:29PM
Hi Ingo and thanks for the welcome!

Good thinking - I didn't even see the GDI+ functions in there - but I've tried them and I'm still getting files that are well in excess of the original PDF sizes. My guess is that the GDI engine is using LZW compression for TIFF (and I don't blame it: without knowing the color depth, it's the safest 'small' option).
The GDI engine is also clipping the page slightly - printing margins, perhaps?

Hmm... the search goes on ;) 


Posted By: edvoigt
Date Posted: 15 Feb 12 at 8:49PM
Hi Dave,

at 43KB your PDF is indeed rather small, so the images inside are (I guess) highly compressible and probably high resolution too.

For easy rendering, look at this thread and use the right box (usually CropBox or TrimBox):
http://www.quickpdf.org/forum/size-reduction-and-dpi_topic2146.html

The rendering idea is no solution if you need the images in their original dimensions.

Cheers,
Werner


Posted By: Dave
Date Posted: 15 Feb 12 at 9:04PM
My test image is black and white - most of the guys here will scan in 1-bit - so the images are small. But: I will have to allow for the occasional 32-bit color image as well.
I wonder whether the scanning app. makes JPG files when it is asked for color. Now that would make my life easy! I'll check!
Thanks for your comment, Werner. I think you're right that rendering won't work!

Best, (MfG Werner)
Dave 


Posted By: edvoigt
Date Posted: 16 Feb 12 at 9:56AM
Hi,

if we calculate without compression and header information, one pixel in a b/w image uses one bit, but in a full-colour format every pixel takes four bytes (32 bits). That is an uncompressed factor of 32, and since the 43KB file holds compressed 1-bit data, the real ratio is larger still (43KB to 11MB is roughly 256). So it's clear where the size comes from.
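
To put rough numbers on the uncompressed sizes, here is a minimal C# sketch. The page dimensions (A4 at 200 dpi, about 1654 x 2339 pixels) are an assumption for illustration only, not figures taken from the test PDF, and RawImageSize / Bytes are hypothetical helper names.

```csharp
using System;

public static class RawImageSize
{
    // Bytes needed for an uncompressed image with no header.
    // Each row is padded up to a whole number of bytes.
    public static long Bytes(int widthPx, int heightPx, int bitsPerPixel)
    {
        long bytesPerRow = ((long)widthPx * bitsPerPixel + 7) / 8;
        return bytesPerRow * heightPx;
    }
}
```

With the assumed dimensions, RawImageSize.Bytes(1654, 2339, bpp) gives 484,173 bytes at 1 bit per pixel, 11,606,118 at 24 bits and 15,474,824 at 32 bits. The step from 1-bit to 32-bit alone is therefore a factor of 32; anything beyond that comes from the compression of the 1-bit original.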

But the question is why.

You should aim your tests at finding out what the library knows about the image inside. You can ask for the image type, resolution and size. I guess that QuickPDF thinks the embedded image is a JPG, but that is only a guess...

Maybe the scanning app. is drawing the image into the PDF and wrongly declaring it as color?

You may figure something out if you make a list of all the image properties QuickPDF reports. You can then compare that list with the known data (if you have it) of the source image you used to build your test PDF.

No real help, but a step?

Werner


Posted By: Dave
Date Posted: 16 Feb 12 at 2:55PM
Hi Werner - good advice! My background is in document scanning and I can confirm your maths...this is exactly what is happening!
However, this is an interesting problem because I KNOW the image in the PDF is a TIFF - I put it there! ;)

Consider this code:

            qp.UnlockKey("<license>");
            bool _error = false;
            int _id = 0;
            const string TifFile = "C:\\1.tif";
            const string PDFFile = "C:\\1.pdf";
            const string NewImageFile = "C:\\image.tif";
            int _dpiX = 0;
            int _dpiY = 0;

            _id = qp.NewDocument();
            if (_id == 0)
            {
                _error = true;
                return;
            }

            // now, add the image from the temp. location
            _id = qp.AddImageFromFile(TifFile, 1);
            if (_id == 0)
            {
                _error = true;
                return;
            }

            // select this as the current image
            qp.SelectImage(_id);

            // Draw image on the current page
            _dpiX = qp.ImageHorizontalResolution();
            if (_dpiX == 0) _dpiX = 72;
            _dpiY = qp.ImageVerticalResolution();
            if (_dpiY == 0) _dpiY = 72;

            // check the original pagesize

            double ImageWidthInPoints = (double)qp.ImageWidth() / _dpiX * 72.0;
            double ImageHeightInPoints = (double)qp.ImageHeight() / _dpiY * 72.0;

            qp.SetPageDimensions(ImageWidthInPoints, ImageHeightInPoints);
            qp.SetOrigin(1);
            qp.DrawImage(0, 0, ImageWidthInPoints, ImageHeightInPoints);

            if (qp.SaveToFile(PDFFile) != 1)
            {
                _error = true;
                return;
            }

            /* 
             * Now we have a PDF with the TIF in it. 
             *   The resolution is correct so we know QPDF is reading the image correctly
             *   The file exists on the disk and is a little larger than the source TIF. That's
             *   okay because we expect an overhead from the PDF wrapper
             * 
             * Okay, so now reverse the process. Let's extract the same file and see what
             *   happens.
             *   
             * We can make some assumptions: the PDF only has one image, only one page. 
             *   This makes the selection logic easy. In the real world,
             *   we would be passing parameters that change these values.
             */
            
            int _DocRef = qp.LoadFromFile(PDFFile, "");
            if (_DocRef == 0)
            {
                _error = true;
                return;
            }

            if (qp.SelectPage(1) == 0)
            {
                _error = true;
                return;
            }

            int imageList = qp.GetPageImageList(0);
            if (imageList == 0)
            {
                _error = true;
                return;
            }
            int ImageListCounter = qp.GetImageListCount(imageList);
            int FindImages = qp.FindImages();

            // for reasons best known to PDF, my file has 37 items in the FindImages list...
            // so, let's check them *all* for resolution and hope one matches the '200'
            //  we know our original TIF had...
            int p = 0;
            int[] _set = new int[FindImages];
            for (int j = 0; j < FindImages; j++)
            {
                p = qp.GetImageID(j + 1);
                if (p > 0)
                {
                    _set[j] = p;
                }
            }

            // column 0: result of SelectImage, column 1: horizontal resolution
            // (sized from FindImages instead of a hard-coded count)
            int[,] imageids = new int[FindImages, 2];
            for (int j = 0; j < FindImages; j++)
            {
                imageids[j, 0] = qp.SelectImage(_set[j]);
                imageids[j, 1] = qp.ImageHorizontalResolution();
            }
                        
            int ImageItem = qp.GetImageListItemIntProperty(imageList, 1, 400);
            // now we can read (to file) the first image on the current page

            if (qp.SaveImageListItemDataToFile(imageList, 1, 0, NewImageFile) == 0)
            {
                _error = true;
                return;
            }

Now, if you run this you'll find that I don't get a HorizontalResolution in any one of the 37 image entries! So: where the hell's my image gone?? ;)

I'm confused: the PDF is 40KB... so it MUST have a well-compressed copy of my TIF in there. So why the heck can't I get it out in that format?

Best,
Dave


Posted By: edvoigt
Date Posted: 17 Feb 12 at 8:47AM
Hi Dave,

I did my own test, beginning with a scan to get a pure b/w TIFF. After that I coded only the following (Delphi, but easy to read, I think):

  QP.SetOrigin(0);                 // Bottomleft
  iid := QP.AddImageFromFile('File0001.tif', 0); // type -1 brings 332KB!
  QP.SelectImage(iid);
  Memo1.Lines.Add(Format('type=%d',[QP.ImageType])); // type=3=tiff
  Memo1.Lines.Add(Format('h=%d',[QP.ImageHorizontalResolution])); // 96dpi, ok

  QP.DrawImage(25, 250, w, h);
  QP.SaveToFile('FileWithTiff.pdf');

This takes my tif, shows me some properties and saves a pdf. The PDF-size corresponds to the tif.
Inside it looks good:
/Subtype /Image
/Width 258
/Height 438
/ColorSpace /DeviceGray
/BitsPerComponent 1

And now the export of our image:

  QP.LoadFromFile('FileWithTiff.pdf', '');
  QP.SelectPage(1);
  lid := QP.GetPageImageList(0);
  Memo1.Lines.Add(Format('n=%d',[QP.GetImageListCount(lid)])); // one image found
  QP.GetImageListItemIntProperty(lid, 1, 400); // reports a 2 = BMP
  QP.SaveImageListItemDataToFile(lid, 1, 0, 'File0001export.tif');

The extraction gives a difference in size of 15830-14652=1178. Why?
A look inside makes it clearer. The original file starts with the two bytes 'II' (TIFF), while the saved image starts with 'BM': a bitmap, not a TIFF.
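
The same check of the leading bytes can be done in code instead of a hex viewer. This is only a sketch: ImageSniffer, Sniff and SniffFile are hypothetical names, nothing here is part of QuickPDF, and it only recognises a handful of common signatures.

```csharp
using System;
using System.IO;

public static class ImageSniffer
{
    // Guess the container format from the first bytes of the data.
    public static string Sniff(byte[] head)
    {
        if (head == null || head.Length < 2) return "unknown";
        if (head[0] == (byte)'B' && head[1] == (byte)'M') return "BMP";
        if (head[0] == 0xFF && head[1] == 0xD8) return "JPEG";
        if (head.Length >= 4)
        {
            // TIFF: byte-order mark followed by the number 42 in that byte order
            if (head[0] == (byte)'I' && head[1] == (byte)'I' && head[2] == 42 && head[3] == 0)
                return "TIFF (little-endian)";
            if (head[0] == (byte)'M' && head[1] == (byte)'M' && head[2] == 0 && head[3] == 42)
                return "TIFF (big-endian)";
            if (head[0] == 0x89 && head[1] == (byte)'P' && head[2] == (byte)'N' && head[3] == (byte)'G')
                return "PNG";
        }
        return "unknown";
    }

    // Reads the first four bytes of a file and classifies them.
    public static string SniffFile(string path)
    {
        byte[] head = new byte[4];
        using (FileStream fs = File.OpenRead(path))
        {
            int n = fs.Read(head, 0, head.Length);
            Array.Resize(ref head, n);
        }
        return Sniff(head);
    }
}
```

Running SniffFile on the source file and on the exported file shows directly whether the export kept the original container or came back as a bitmap.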

So we are on the right track.

The open question is why QPL detects the input image as TIFF but reports it as BMP once it is inside the PDF. In the description above you see only these two hints:

/ColorSpace /DeviceGray
/BitsPerComponent 1

From these alone it is a matter of interpretation; all we really know is that it is a pure b/w image. The stream data gives me no hint for a sure decision between TIFF and BMP, because I'm not familiar enough with the internals.


Werner


Posted By: Dave
Date Posted: 17 Feb 12 at 1:55PM
Thanks Werner - yes, I can read Delphi!

So it's not my environment then ;)
Okay, I'll raise this as a question to support now and let them know this thread exists.

Thanks again for your help; it was really useful to know I am not doing something wrong!

Best regards,
Dave



Posted By: samb
Date Posted: 22 Feb 12 at 5:45PM
FYI LibTiff .net port http://bitmiracle.com/libtiff/

Confirmed in my environment too with 8.14b5. Also tried with TIFF LZW, PNG, GIF, and JPG files and only the JPG was returned in its original format.  Going to streams instead of files doesn't help either.

I think the issue has to do with the way PDF files handle images, as edvoigt was getting at.  PDF files store image data, but not image files.  If you open your created PDF with Notepad, you can see your image data after the "stream" keyword.  Notice that the image data is missing the typical file type header bytes "II" that signify a TIFF and just starts with the image contents.  The PDF does store the compression algorithm "/Filter /CCITTFaxDecode" so it knows how to interpret the data for rendering, but it doesn't know or care that it was a "TIF" image.

So, unfortunately, returning image data is a bit more complicated than just pulling the data from the stream section, and QuickPDF obviously doesn't handle all cases as expected.
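
One general technique for getting a file back out of such a stream is to put a minimal TIFF header in front of the raw CCITT data. The sketch below shows only that wrapping step and rests on several assumptions: the stream really is CCITT Group 4 (K = -1 in /DecodeParms), it is a single strip, and BlackIs1 is false. CcittTiffWrapper is a hypothetical helper, not a QuickPDF function; how to obtain the raw stream bytes, width and height from QuickPDF is not shown, and none of this has been tested against QuickPDF output.

```csharp
using System;
using System.IO;

public static class CcittTiffWrapper
{
    // Builds a single-strip, little-endian TIFF around raw CCITT Group 4 data.
    public static byte[] Wrap(byte[] g4Data, int width, int height)
    {
        const ushort Short = 3, Long = 4;   // TIFF field types
        const int entryCount = 8;
        const int ifdOffset = 8;            // IFD directly follows the 8-byte header
        // header + entry count + entries + next-IFD pointer
        int dataOffset = ifdOffset + 2 + entryCount * 12 + 4;

        using (var ms = new MemoryStream())
        using (var w = new BinaryWriter(ms)) // BinaryWriter is always little-endian
        {
            w.Write((byte)'I'); w.Write((byte)'I');   // byte order
            w.Write((ushort)42);                      // TIFF magic number
            w.Write((uint)ifdOffset);

            w.Write((ushort)entryCount);              // tags must be in ascending order
            WriteEntry(w, 256, Long, (uint)width);          // ImageWidth
            WriteEntry(w, 257, Long, (uint)height);         // ImageLength
            WriteEntry(w, 258, Short, 1);                   // BitsPerSample
            WriteEntry(w, 259, Short, 4);                   // Compression = CCITT Group 4
            WriteEntry(w, 262, Short, 0);                   // Photometric = WhiteIsZero (assumed)
            WriteEntry(w, 273, Long, (uint)dataOffset);     // StripOffsets
            WriteEntry(w, 278, Long, (uint)height);         // RowsPerStrip = whole image
            WriteEntry(w, 279, Long, (uint)g4Data.Length);  // StripByteCounts
            w.Write((uint)0);                         // no further IFD

            w.Write(g4Data);
            w.Flush();
            return ms.ToArray();
        }
    }

    private static void WriteEntry(BinaryWriter w, ushort tag, ushort type, uint value)
    {
        w.Write(tag);
        w.Write(type);
        w.Write((uint)1);   // one value
        w.Write(value);     // stored inline in the 4-byte value field
    }
}
```

The result is 110 bytes of header plus the untouched stream data, so it stays close to the size of the data in the PDF. If the image comes out inverted, PhotometricInterpretation 1 instead of 0 is the first thing to try.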

There is one other issue that you may run into with TIFs.  One of the properties of a TIF is RowsPerStrip.  Say you have an image 100 pixels tall and RowsPerStrip set to 10.  If you add that image to a PDF (with QuickPDF at least, not sure if this is a universal problem), it will actually add 10 images of 10 pixels high each.  If you go to retrieve those images, you will have to retrieve all 10 strips and merge them together (with LibTiff). 
I believe that the default behavior in GDI+ for windows XP is to set the rowsperstrip to the full image height, but in Windows 7 the default behavior is to set it to some set value (25?).
The only other way around this is to set the rowsperstrip property to the full height of the image.  And of course, you can't do this directly with GDI+.  LibTiff.Net has an example of doing it though.
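
As a small illustration of the strip arithmetic only (not of the merging itself, which would need LibTiff.Net or similar), this C# sketch computes how many strips an image splits into and which rows each strip covers. TiffStrips, Count and Layout are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class TiffStrips
{
    // Number of strips = ceil(imageHeight / rowsPerStrip).
    public static int Count(int imageHeight, int rowsPerStrip)
    {
        if (rowsPerStrip <= 0) throw new ArgumentOutOfRangeException(nameof(rowsPerStrip));
        return (imageHeight + rowsPerStrip - 1) / rowsPerStrip;
    }

    // First row and row count of every strip; the last strip may be shorter.
    public static List<(int FirstRow, int Rows)> Layout(int imageHeight, int rowsPerStrip)
    {
        var strips = new List<(int FirstRow, int Rows)>();
        int count = Count(imageHeight, rowsPerStrip);
        for (int i = 0; i < count; i++)
        {
            int first = i * rowsPerStrip;
            strips.Add((first, Math.Min(rowsPerStrip, imageHeight - first)));
        }
        return strips;
    }
}
```

For the example above, TiffStrips.Count(100, 10) is 10, and with RowsPerStrip equal to the image height it is 1, which is why setting that property avoids the split.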

GDI+ is also going to give you headaches trying to edit those images (standard Graphics operations won't work against images with indexed pixel types such as black and white).

So, because of the quirks with PDF, QuickPDF and GDI+, I've found it easier to only give and retrieve Bitmap images to QuickPDF and just let it handle compression (it appears to use Flate for black and white, which is not as small as G4, but not terrible).  With .Net it's easy enough to compress a bitmap to send it across the network, then turn it back into a bitmap before dumping it in the next PDF.  This will obviously add some processing time, but at least it works. 


// needs: using System.Drawing; using System.Drawing.Imaging; using System.IO;
byte[] compressedimagedata;
byte[] imagedata = QuickPDF.SaveImageDataToString(...);
using (MemoryStream imagedatastream = new MemoryStream(imagedata))
using (Image image = Image.FromStream(imagedatastream))
using (MemoryStream compressedimagedatastream = new MemoryStream())
{
     // PNG is lossless, so the bitmap can be restored exactly on the other side
     image.Save(compressedimagedatastream, ImageFormat.Png);
     compressedimagedata = compressedimagedatastream.ToArray();
}




