Jul 8th, 2009, category: Python

They are not valid, they suck, but we need them!

It's easy to parse valid XML or XHTML, but hard when the markup is not valid. Unfortunately, a lot of web pages are invalid and we still need to parse them. libxml2 provides HTML parser options to deal with this problem.

The options

  • HTML_PARSE_RECOVER = 1 // relaxed parsing
  • HTML_PARSE_NOERROR = 32 // suppress error reports
  • HTML_PARSE_NOWARNING = 64 // suppress warning reports
  • HTML_PARSE_PEDANTIC = 128 // pedantic error reporting
  • HTML_PARSE_NOBLANKS = 256 // remove blank nodes
  • HTML_PARSE_NONET = 2048 // forbid network access
  • HTML_PARSE_COMPACT = 65536 // compact small text nodes
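Each option is a distinct bit, so several of them can be combined into one integer. The sketch below mirrors the values from the list in a plain Python enum so it runs without libxml2 installed. The class name HtmlParseOption is made up for this illustration and is not part of the libxml2 bindings.

```python
from enum import IntFlag


class HtmlParseOption(IntFlag):
    """Hypothetical mirror of the option values listed above."""
    RECOVER = 1
    NOERROR = 32
    NOWARNING = 64
    PEDANTIC = 128
    NOBLANKS = 256
    NONET = 2048
    COMPACT = 65536


# Bitwise OR sets each bit exactly once, even if a flag is repeated.
lenient = (HtmlParseOption.RECOVER
           | HtmlParseOption.NOERROR
           | HtmlParseOption.NOWARNING)

print(int(lenient))                        # 97
print(HtmlParseOption.NOERROR in lenient)  # True
print(HtmlParseOption.NONET in lenient)    # False
```

Adding the numbers gives the same result as long as no flag is listed twice, but bitwise OR stays correct even then.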

When the HTML cannot be parsed

Below is an example setup:

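import libxml2

# Combine the flags with bitwise OR: recover from bad markup and stay quiet.
options = (libxml2.HTML_PARSE_RECOVER |
           libxml2.HTML_PARSE_NOERROR |
           libxml2.HTML_PARSE_NOWARNING)

class Parser(object):
    def __init__(self, doc, url='', options=options):
        self.doc = None  # keeps __del__ safe if parsing raises
        self.doc = libxml2.htmlReadDoc(doc, url, 'utf-8', options)

    def __del__(self):
        # libxml2 documents must be freed explicitly.
        if self.doc is not None:
            self.doc.freeDoc()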
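Once the document is parsed, the recovered tree can be queried like any other. Here is a minimal sketch that pulls link targets out of broken markup with XPath. The function name extract_links and the sample HTML are made up for this illustration, the numeric option values are the ones from the list above, and the import is guarded because the libxml2 bindings may not be installed.

```python
try:
    import libxml2  # third-party bindings; may be missing on your system
except ImportError:
    libxml2 = None

# RECOVER | NOERROR | NOWARNING, using the values from the options list.
OPTIONS = 1 | 32 | 64


def extract_links(html, url=''):
    """Return the href values of all <a> elements (hypothetical helper)."""
    if libxml2 is None:
        raise RuntimeError('libxml2 Python bindings are not installed')
    doc = libxml2.htmlReadDoc(html, url, 'utf-8', OPTIONS)
    try:
        # Copy the strings out before the document is freed.
        return [attr.content for attr in doc.xpathEval('//a/@href')]
    finally:
        doc.freeDoc()


if __name__ == '__main__' and libxml2 is not None:
    # Unclosed <a> and <b> tags; the relaxed parser should still build a tree.
    print(extract_links('<p><a href="/x">unclosed<b>tag'))
```

Freeing the document in a finally block releases the memory even when the XPath query fails.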

Try the code above. Good luck.

Comments

# Siebzehn und Vier Regeln Nov 27th, 2009

HTML is not a programming language; it is a markup language. We use HTML for design purposes, and I do web design in HTML myself. I think it is a good, easy design tool that anyone can learn, although it takes a lot of hard work. I will try the example you gave above at home and try to get a solution as soon as possible.

# insurance Dec 25th, 2009

I tried the code given above at home but did not get any good result, so please give me some more explanation of this code so that I can run it successfully.
