There's an open-source tool named OCRmyPDF that claims to do what you're trying to do (see https://github.com/fritz-hh/OCRmyPDF). As far as I understand, it relies on standard GNU/Linux software and produces a searchable PDF file, which to my understanding implies that the text is also extractable. I haven't used the tool myself, but its source code might give you some hints.
-- Regards, jvp.
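
P.S. For what it's worth, here is an untested sketch of how one might drive the tool from Python. It assumes the tool is installed as an `ocrmypdf` command on the PATH and accepts `-l` for the OCR language followed by the input and output files, as current releases of the project appear to. The helper names `build_ocr_command` and `make_searchable` are made up for illustration and are not part of OCRmyPDF.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(src, dst, language="eng"):
    """Return the argument list for an OCRmyPDF run (nothing is executed here)."""
    # Assumption: the executable is called "ocrmypdf" and "-l" selects the
    # OCR language; "eng" is only a placeholder default.
    return ["ocrmypdf", "-l", language, str(src), str(dst)]


def make_searchable(src, dst, language="eng"):
    """Try to OCR src into dst; return True on success, False otherwise."""
    # Bail out early if the tool is not installed or the input is missing.
    if shutil.which("ocrmypdf") is None:
        return False
    if not Path(src).is_file():
        return False
    result = subprocess.run(build_ocr_command(src, dst, language), check=False)
    return result.returncode == 0


if __name__ == "__main__":
    # "scan.pdf" and "scan_searchable.pdf" are placeholder file names.
    ok = make_searchable("scan.pdf", "scan_searchable.pdf")
    print("done" if ok else "OCR was not performed")
```

If the run succeeds, opening the output file and trying to select or search the text should show whether the text layer is really there.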