Sunday, February 21, 2016

Converting a Scanned PDF to a text file

After a little fiddling, I seem to be able to convert a PDF that is a scan of a book into plain text. Here is the approach on Ubuntu.

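1. Put your .pdf in a folder and navigate to that folder on the command line.
2. Convert each page to a PNG image (pdftoppm is part of the poppler-utils package):
pdftoppm -png [filename].pdf [prefix]
3. Next, install gocr:
sudo apt-get install gocr
4. Finally, run gocr on each image (the quotes keep filenames with spaces from breaking the loop):
for i in *.png; do gocr -i "$i" -o "$i.txt"; done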

You'll end up with one .txt file per page.

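Now you can concatenate all the files into one. Matching *.png.txt instead of *.txt keeps the combined file out of its own input if you rerun the command, and > replaces the old output rather than appending to it.

cat *.png.txt > [new_file].txt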
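
If you do this often, the steps above can be wrapped in one shell function. This is only a rough sketch: the name scan_to_text, the stt_ variable names, and the temporary working directory are assumptions added for illustration, and it assumes pdftoppm and gocr are already installed.

```shell
# scan_to_text PDF OUTPUT
# Hypothetical helper: page images -> OCR -> one combined text file.
scan_to_text() {
    stt_pdf=$1
    stt_out=$2
    stt_dir=$(mktemp -d) || return 1

    # One PNG per page, written as page-<number>.png in the temp directory.
    pdftoppm -png "$stt_pdf" "$stt_dir/page" || { rm -rf "$stt_dir"; return 1; }

    # Start with an empty output file, then append each page's text in glob order.
    : > "$stt_out"
    for stt_img in "$stt_dir"/page*.png; do
        [ -e "$stt_img" ] || continue   # no pages were produced
        gocr -i "$stt_img" -o "$stt_img.txt" && cat "$stt_img.txt" >> "$stt_out"
    done

    rm -rf "$stt_dir"
}
```

You would call it as: scan_to_text book.pdf book.txt. Working in a temporary directory keeps the page images and per-page text files out of the folder that holds your PDF.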

I won't claim that the text files are pretty, but you can massage them until you end up with a clean text file that you can use with a speed-reading app.
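
As a starting point for that massaging, here is a minimal sketch of a cleanup filter. The name tidy_ocr and the three rules it applies (strip trailing whitespace, squeeze runs of blank lines, rejoin words hyphenated across a line break) are assumptions about what typical OCR output needs, not something gocr prescribes. It will also join lines that end in a genuine hyphen, so check the result.

```shell
# tidy_ocr: hypothetical filter, reads OCR text on stdin and writes cleaned text to stdout.
tidy_ocr() {
    awk '
        { sub(/[ \t]+$/, "") }                          # drop trailing whitespace
        /^$/ { if (!blank) print ""; blank = 1; next }  # keep one blank line per run
        { blank = 0 }
        /-$/ { sub(/-$/, ""); printf "%s", $0; next }   # rejoin a word split across lines
        { print }
    '
}
```

Usage would look like: tidy_ocr < [new_file].txt > [clean_file].txt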

If you build something that cleans these up, share it below.

Alternatively, if the PDF is not a scanned image but already contains real text, a simple pdftotext command should work.
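
If you are not sure which kind of PDF you have, one way to check is to see whether pdftotext finds any letters or digits in it. The sketch below assumes that test is good enough for your files; has_text_layer is a made-up name, and passing - as the output file tells pdftotext to write to standard output.

```shell
# has_text_layer PDF
# Hypothetical check: succeeds if pdftotext extracts at least one letter or digit.
has_text_layer() {
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}
```

For example: if has_text_layer book.pdf; then pdftotext book.pdf book.txt; else echo "looks like a scan, use the OCR steps"; fi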