How can I turn photos of paper documents into a scanned document? is related, but not the same, as I'm talking about pdf files. I mean something like a "virtual scanner" for which the input would be a photo-based pdf or a collection of photos and the output a "normal" scanned document. (Also, the Scantailor tool recommended there - also here - seems to lack a Linux version now.) The processing of images seems complicated in the answers under the linked question, especially because it involves processing each image separately: given that my pdf has hundreds of pages, the solution I expect is not one of processing/editing images, but simply of "scanning" digital photos and documents the way real pages are scanned.

This is not about OCR and not about converting image to text. To clarify what I mean I will post a few examples. There are pdf files based on text, not images - text files (let's say docx or odt) exported to pdf. What I'm interested in is the difference between scanned text pages that look too much like images and scanned text pages that look like digitized text. The first are formed of images that look like pictures taken of book pages: such copies can hardly be re-printed on paper, as the background will be printed too. The second are what one would expect from scanned text, and can be printed.

OCR is not the problem here: the picture-like pdf may already be OCR-processed and its text searchable, and still look like a collection of page photos. As noticed in a comment, I am looking for a software solution that automatically cleans up pictures of a document, much like Google Scan on a smartphone. As said in another comment, the problem seems to be, at least to some extent, that of converting the greyscale (scanned/image) text to black-and-white. What I want is the clear black-on-white look of the "scanned" pdf and the removal of all the "real" details (especially shadows) that are normal in a photo but should be absent in a printed page.

As a direct solution on the PDF (no manual image extraction), use ocrmypdf to restore OCR (as mentioned at the end of the complementary part of this answer). I have noticed that ocrmypdf -h shows an option which sounds like exactly what is asked:

--remove-background  Attempt to remove background from gray or color pages, setting it to white

The initial pdf already had OCR, which gives an error unless one of the following options is used:

-f, --force-ocr  Rasterize any text or vector objects on each page, apply OCR, and save the rastered output (this rewrites the PDF)
-s, --skip-text  Skip OCR on any pages that already contain text, but include the page in the final output; useful for PDFs that contain a mix of images, text pages, and/or previously OCRed pages

Applying each of them separately to one of my large files with hundreds of pages that already had OCR crashed the process. The best solution seems to me to first print the initial file to pdf (which removes the OCR layer), and then run ocrmypdf input.pdf output.pdf -l <language> --remove-background -v. For English, the -l option is not needed. The resulting pdf is larger than the input (because of the --remove-background option): reduce the size as said below.

About Scan Tailor, as a complement to the main answer: even its icon illustrates that it is intended exactly for what is asked here. To use Scan Tailor with pdfs, first extract all pdf pages as image files, because this tool doesn't process pdf directly and needs images. Master PDF Editor can do this, but on my machine it crashes after extracting about 80 images; it can still be used by setting a new batch/range of pages to be extracted. What I prefer after a few trials is a reliable albeit slower CLI method, with a command like pdftoppm MY_PDF.pdf NAME -tiff, as said here. Other values can be used instead of tiff (which gives tif files), for example png or jpeg. See here a set of Dolphin service menu actions for the various extraction options.
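Not part of the original post, but a minimal shell sketch of the two routes just described may help. It assumes input.pdf, output.pdf and the page prefix are placeholder names, that poppler-utils (for pdftoppm) and ocrmypdf are installed, and that the installed ocrmypdf version still offers --remove-background:

```bash
# Sketch under the assumptions stated above; all file names are placeholders.

# Route 1: extract every page as an image (e.g. for Scan Tailor, which
# needs image files and cannot open a pdf directly). -tiff can be swapped
# for -png or -jpeg; output files are named page-<n>.tif.
pdftoppm -tiff input.pdf page

# Route 2: direct pdf-to-pdf, re-running OCR while whitening the background.
# If the pdf still carries an old OCR text layer, add --force-ocr or
# --skip-text (or print it to pdf first to strip the layer); add
# "-l <language>" for non-English text and -v for more verbose output.
ocrmypdf --remove-background input.pdf output.pdf
```

On a file with hundreds of pages the pdftoppm step is slow, but unlike the GUI extraction it does not give up partway through.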
I have an important pdf where I need to extract the source image, as losslessly as possible (e.g. as png). For some reason, the source image seems to be made out of image stripes: pdfimages -list shows the info about the stripes, and pdfimages -png name.pdf out- gives me 227 single images. So far as I checked them, all images are 5 px high, so from the 227 single images I should be getting one single image of 1604 x 1135 px instead. Is there a way to get one single image directly? What Yoder wrote below was also my own thought on the issue, namely that the pdf was indeed created by splitting the original image into 227 stripes. And in conclusion, if that is so (pdfimages -list says it is so), is there a way to automatically create one single image out of the stripes, e.g. by using graphicsmagick?
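A possible way to do that last step, as a hedged sketch rather than anything from the original thread: GraphicsMagick's -append operator stacks images vertically, so stripes extracted with a zero-padded root name can be glued back together in order. The names name.pdf, out and single.png are placeholders:

```bash
# Extract the stripes losslessly; pdfimages zero-pads the numbers
# (out-000.png, out-001.png, ...), so a shell glob lists them in order.
pdfimages -png name.pdf out

# Stack all stripes vertically, top stripe first, into one image.
gm convert out-*.png -append single.png

# Sanity check: the result should be about 1604 x 1135 px
# (227 stripes of 5 px each).
gm identify single.png
```

ImageMagick's convert accepts the same -append syntax if GraphicsMagick is not installed.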