
Re: How to convert a XML file of US patent into a plain text file on a Linux platform?



Hello,

Henry Chang, on Sun, 23 Oct 2022 20:12:45 -0400, wrote:
> I have successfully converted a PDF file of a US patent into .png, then into .txt
> by using pdftoppm and tesseract.

pdftoppm may re-rasterize the pages. Better to use pdfimages, which will just
extract the images from the PDF unmodified.
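
As a rough sketch of that pipeline (not a tested recipe), the following
Python wrapper runs pdfimages and then tesseract on each extracted image.
It assumes both tools are installed and on PATH, and that the PDF really
consists of scanned page images. The function names and the "page" prefix
are made up for illustration.

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, prefix):
    """Build the pdfimages call that extracts embedded images as PNG."""
    return ["pdfimages", "-png", str(pdf_path), str(prefix)]


def tesseract_command(image_path):
    """Build the tesseract call; it writes <output base>.txt next to the image."""
    image_path = Path(image_path)
    return ["tesseract", str(image_path), str(image_path.with_suffix(""))]


def pdf_to_text(pdf_path, work_dir):
    """Extract images from the PDF, OCR each one, return the .txt paths."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    # pdfimages names its output <prefix>-NNN.png
    subprocess.run(pdfimages_command(pdf_path, work_dir / "page"), check=True)
    for image in sorted(work_dir.glob("page-*.png")):
        subprocess.run(tesseract_command(image), check=True)
    return sorted(work_dir.glob("page-*.txt"))
```

Calling pdf_to_text("patent.pdf", "out") would leave one .txt file per
extracted image in out/, where patent.pdf is a placeholder file name.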

> I found that the USPTO provides plain text in .xml files.
> 
> From the USPTO website, we downloaded an XML full-text data file, ipg221011.xml. This
> file contains lots of XML files of U.S. patent data. How can I convert this
> .xml file into plain text files of US patents?

That XML file doesn't actually seem to contain the patent text.
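
One way to check is to dump whatever text the file does hold. The sketch
below assumes the file is, as described in the question, several XML
documents concatenated together, each starting with its own "<?xml"
declaration on a new line; I have not verified that layout. The function
names are hypothetical, and no USPTO element names are assumed: it simply
collects all text nodes of each document.

```python
import xml.etree.ElementTree as ET

XML_DECL = "<?xml"


def split_documents(lines):
    """Yield each embedded XML document as one string.

    A new document is assumed to start at every line beginning with "<?xml".
    """
    chunk = []
    for line in lines:
        if line.lstrip().startswith(XML_DECL) and chunk:
            text = "".join(chunk)
            if text.strip():  # skip blank material before the first document
                yield text
            chunk = []
        chunk.append(line)
    text = "".join(chunk)
    if text.strip():
        yield text


def document_text(xml_string):
    """Return all text content of one document with whitespace collapsed."""
    # Parse from bytes so the encoding declaration is handled by the parser.
    root = ET.fromstring(xml_string.strip().encode("utf-8"))
    return " ".join(" ".join(root.itertext()).split())
```

Opening ipg221011.xml with encoding="utf-8", passing the file object to
split_documents, and printing document_text for the first few documents
should show quickly whether the description and claims are in there.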

Samuel

