Bug#504222: RFP: htmlcxx -- htmlcxx is a simple non-validating html parser library for C++



Package: wnpp
Severity: wishlist


* Package name    : htmlcxx
  Version         : 0.83
  Upstream Author : Davi Reis <davi.reis@gmail.com>
* URL             : http://htmlcxx.sourceforge.net/
* License         : LGPL
  Programming Lang: C++
  Description     : htmlcxx is a simple non-validating html parser library for C++

htmlcxx is a simple non-validating CSS1 and HTML parser for C++. Although
there are several other HTML parsers available, htmlcxx has some
characteristics that make it unique:

    * STL-like navigation of the DOM tree, using the excellent tree.hh
      library by Kasper Peeters
    * It is possible to reproduce the original document exactly, character
      by character, from the parse tree
    * Bundled CSS parser
    * Optional parsing of attributes
    * C++ code that looks like C++ (not so true anymore)
    * Offsets of tags/elements in the original document are stored in the
      nodes of the DOM tree

The parsing policy of htmlcxx was designed to mimic the behavior of Mozilla
Firefox (http://www.mozilla.org), so you should expect parse trees similar
to those created by Firefox. Unlike Firefox, however, htmlcxx does not
insert content that is not present in your HTML. Therefore, serializing the
DOM tree gives exactly the same bytes contained in the original HTML
document.
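
To illustrate why stored offsets make a byte-exact round trip possible, here
is a minimal self-contained sketch. It does not use htmlcxx and is not how
htmlcxx is implemented; the names Span, split_spans and serialize are
hypothetical and exist only for this example. The sketch cuts a document
into tag and text spans, records each span's offset and length, and rebuilds
the document from those records alone.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical node record for illustration only (not htmlcxx's node type).
struct Span {
    std::size_t offset;  // start position in the original document
    std::size_t length;  // number of bytes covered by this span
    bool is_tag;         // true for "<...>" runs, false for plain text
};

// Cut the input into consecutive tag/text spans. Nothing is dropped,
// normalized, or inserted, so the spans tile the whole input.
std::vector<Span> split_spans(const std::string& html) {
    std::vector<Span> spans;
    std::size_t pos = 0;
    while (pos < html.size()) {
        std::size_t end = html.size();
        bool is_tag = false;
        if (html[pos] == '<') {
            std::size_t close = html.find('>', pos);
            if (close != std::string::npos) {
                end = close + 1;
                is_tag = true;
            }
            // An unterminated '<' falls through: the rest is kept as text.
        } else {
            std::size_t next = html.find('<', pos);
            if (next != std::string::npos) end = next;
        }
        spans.push_back({pos, end - pos, is_tag});
        pos = end;
    }
    return spans;
}

// Rebuild the document by copying each span's bytes from the original.
std::string serialize(const std::string& html, const std::vector<Span>& spans) {
    std::string out;
    for (const Span& s : spans) {
        assert(s.offset + s.length <= html.size());
        out.append(html, s.offset, s.length);
    }
    return out;
}
```

Calling serialize(doc, split_spans(doc)) returns a string equal to doc, even
for malformed input such as an unclosed <b>, because every byte of the input
belongs to exactly one span.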

-- 
http://syx.googlecode.com - Smalltalk YX
http://lethalman.blogspot.com - Thoughts about computer technologies
http://www.debian.org - The Universal Operating System
