Hi, I'm trying to extract text from a specific area in a large number of PDF files. My first approach is to loop over the files, open each one, select the page and extract the text with GetPageText:
//Code to initialize dll reference DPDF
int i = 0;
int mode = 7;
List<string> foundlines = new List<string>();
for (; i < pdffiles.Length; i++)
{
    if (DPDF.LoadFromFile(pdffiles[i], "") != 0)
    {
        if (DPDF.SelectPage(1) != 0) //I'm always searching in the first page
        {
            DPDF.SetMeasurementUnits(1); //Millimeters
            DPDF.SetOrigin(1); //Left-Top margin
            //field contains extraction area data
            if (DPDF.SetTextExtractionArea(field.Left, field.Top, field.Width, field.Height) == 1)
            {
                foundlines.Add(DPDF.GetPageText(mode).ToString().Trim());
            }
            DPDF.RemoveDocument(DPDF.SelectedDocument());
        }
        else
        {
            errormessage = "SelectPage: " + pdffiles[i];
            break;
        }
    }
    else
    {
        errormessage = "LoadFromFile: " + pdffiles[i];
        break;
    }
} //Extraction cycle ends here
if (string.IsNullOrEmpty(errormessage))
{
    if (foundlines != null && foundlines.Count > 0)
    {
        File.WriteAllLines(@"C:\resultlines.txt", foundlines.ToArray());
        result = true;
    }
}
It works fine, but it's not very fast, and it uses a lot of memory. Worried by these results, I decided to try ExtractFilePageText, to keep CPU and memory usage low. So I changed the loop above as follows:
int i = 0;
int mode = 7;
List<string> foundlines = new List<string>();
DPDF.SetMeasurementUnits(1); //Millimeters
DPDF.SetOrigin(1); //Left-Top margin
for (; i < pdffiles.Length; i++)
{
    //field contains extraction area data
    if (DPDF.DASetTextExtractionArea(field.Left, field.Top, field.Width, field.Height) == 1)
    {
        foundlines.Add(DPDF.ExtractFilePageText(pdffiles[i], "", 1, mode).ToString().Trim());
    }
} //Extraction cycle ends here
if (foundlines != null && foundlines.Count > 0)
{
    File.WriteAllLines(@"C:\resultlines.txt", foundlines.ToArray());
    result = true;
}
This does not find anything. There is a simple explanation for this: the documentation (http://www.debenu.com/docs/pdf_library_reference/DASetTextExtractionArea.php) says that DASetTextExtractionArea is relative to the bottom-left corner of the page, and it does not mention any way to make SetOrigin (or SetMeasurementUnits) affect this function.
Is there no way to do this? Can ExtractFilePageText only be used with the default origin?
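If there is no such setting, one possible workaround would be to convert the area by hand before calling DASetTextExtractionArea. The following is only a minimal sketch of that arithmetic, under assumptions I have not verified: that the function expects points (1/72 inch) measured from the bottom-left corner, that its Top parameter is the distance from the bottom of the page to the top edge of the area, and that the page height is known in advance (297 mm for A4 in the usage line). AreaConverter and ToBottomLeftPoints are hypothetical names, not part of the library.

```csharp
using System;

// Hypothetical helper, not part of the PDF library.
public static class AreaConverter
{
    // 1 inch = 25.4 mm = 72 points
    private const double PointsPerMm = 72.0 / 25.4;

    // Input: area in millimeters, measured from the top-left corner of the page.
    // Output: { left, top, width, height } in points, with "top" measured
    // from the bottom of the page (assumed bottom-left origin).
    public static double[] ToBottomLeftPoints(
        double leftMm, double topMm, double widthMm, double heightMm,
        double pageHeightMm)
    {
        double left = leftMm * PointsPerMm;
        // Flip the vertical axis: distance from the bottom to the top edge of the area.
        double top = (pageHeightMm - topMm) * PointsPerMm;
        double width = widthMm * PointsPerMm;
        double height = heightMm * PointsPerMm;
        return new[] { left, top, width, height };
    }
}
```

With this, the loop would call something like `double[] a = AreaConverter.ToBottomLeftPoints(field.Left, field.Top, field.Width, field.Height, 297);` and pass `a[0], a[1], a[2], a[3]` to DASetTextExtractionArea. Pages of a different size would need their actual height.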
Thank you.