Text in PDF

From PDF to Text, a challenging problem by Marginalia.

Extracting text information from PDFs is a significantly bigger challenge than it might seem. The crux of the problem is that the file format isn’t a text format at all, but a graphical format.

It doesn’t have text in the way you might think of it, but more of a mapping of glyphs to coordinates on “paper”. These glyphs may be rotated, overlap, and appear out of order, with very little semantic information attached to them.
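To make the "glyphs at coordinates" idea concrete, here is a minimal sketch of how an extractor might recover reading order. The `Glyph` class, the `reconstruct_text` function, the tolerance values, and the sample coordinates are all hypothetical and made up for illustration; real PDFs add rotation, overlapping glyphs, and font encodings that this ignores.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge, in points
    y: float      # baseline, in points (PDF origin is bottom-left)
    width: float  # advance width, in points


def reconstruct_text(glyphs, line_tol=2.0, space_gap=1.5):
    """Group glyphs into lines by baseline, then order each line left to right.

    line_tol and space_gap are assumed thresholds, not values from any spec.
    """
    lines = []  # list of (baseline_y, [glyphs on that line])
    # Visit glyphs top to bottom: a larger y is higher on the page.
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0] - g.y) <= line_tol:
            lines[-1][1].append(g)
        else:
            lines.append((g.y, [g]))

    out = []
    for _, row in lines:
        row.sort(key=lambda g: g.x)
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            # Treat a horizontal gap wider than space_gap as a word break.
            if cur.x - (prev.x + prev.width) > space_gap:
                text += " "
            text += cur.char
        out.append(text)
    return "\n".join(out)


# Hypothetical page content, deliberately stored out of reading order.
glyphs = [
    Glyph("x", 80.0, 688.0, 5.0),
    Glyph("P", 86.0, 700.0, 6.0),
    Glyph("t", 72.0, 688.4, 3.0),
    Glyph("i", 78.0, 700.0, 3.0),
    Glyph("F", 98.0, 700.0, 6.0),
    Glyph("e", 75.0, 688.0, 5.0),
    Glyph("H", 72.0, 700.0, 6.0),
    Glyph("t", 85.0, 688.4, 3.0),
    Glyph("D", 92.0, 700.0, 6.0),
]

stored_order = "".join(g.char for g in glyphs)
recovered = reconstruct_text(glyphs)
print(stored_order)
print(recovered)
```

Note that the file never contains a space character here: the word break has to be guessed from the gap between glyphs, which is one reason extraction is only ever a heuristic.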

You should probably be in awe at the fact that you can open a PDF file in your favorite viewer (or browser), hit ctrl+f, and search for text.

Today I learned that text in a PDF is just a mapping of glyphs to coordinates on “paper”.
