Jul 8th, 2009, category: Python
They are not valid, they suck but we need them!
It's easy to parse valid XML or XHTML, but hard when the markup is not valid. Unfortunately, a lot of web pages are not valid, and we still need to parse them. libxml2 provides parser options to solve this problem.
The options
- HTML_PARSE_RECOVER = 1 // relaxed parsing
- HTML_PARSE_NOERROR = 32 // suppress error reports
- HTML_PARSE_NOWARNING = 64 // suppress warning reports
- HTML_PARSE_PEDANTIC = 128 // pedantic error reporting
- HTML_PARSE_NOBLANKS = 256 // remove blank nodes
- HTML_PARSE_NONET = 2048 // forbid network access
- HTML_PARSE_COMPACT = 65536 // compact small text nodes
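Each option is a single bit, so several options can be combined into one integer. Here is a minimal sketch of that idea in plain Python. The values are copied from the list above; the dictionary and the helper names `combine` and `describe` are made up for this illustration and are not part of libxml2.

```python
# Flag values copied from the option list above. Each one is a distinct bit.
HTML_PARSE_OPTIONS = {
    "HTML_PARSE_RECOVER": 1,
    "HTML_PARSE_NOERROR": 32,
    "HTML_PARSE_NOWARNING": 64,
    "HTML_PARSE_PEDANTIC": 128,
    "HTML_PARSE_NOBLANKS": 256,
    "HTML_PARSE_NONET": 2048,
    "HTML_PARSE_COMPACT": 65536,
}


def combine(*names):
    """Hypothetical helper: OR the named flags into a single option mask."""
    mask = 0
    for name in names:
        mask |= HTML_PARSE_OPTIONS[name]
    return mask


def describe(mask):
    """Hypothetical helper: list the flag names that are set in a mask."""
    return [name for name, bit in HTML_PARSE_OPTIONS.items() if mask & bit]


# Relaxed parsing with errors and warnings silenced: 1 | 32 | 64.
options = combine("HTML_PARSE_RECOVER",
                  "HTML_PARSE_NOERROR",
                  "HTML_PARSE_NOWARNING")
print(options)            # 97
print(describe(options))
```

Because no two flags share a bit, adding them gives the same number as OR-ing them, but OR stays correct even if a flag is listed twice.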
When I cannot parse some HTML
Below is an example setup:
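import libxml2

# Combine the flags with bitwise OR: recover from errors, report nothing.
options = (libxml2.HTML_PARSE_RECOVER
           | libxml2.HTML_PARSE_NOERROR
           | libxml2.HTML_PARSE_NOWARNING)

class Parser(object):
    def __init__(self, doc, url='', options=options):
        # doc is the HTML source as a string; url is its base URL.
        self.doc = libxml2.htmlReadDoc(doc, url, 'utf-8', options)

    def __del__(self):
        # The parsed document must be freed explicitly.
        self.doc.freeDoc()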
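To see what the relaxed options buy you, here is a small usage sketch, assuming the libxml2 Python bindings are installed. The function name `paragraph_texts` and the sample markup are invented for this example; the import guard is only there so the snippet still loads on a machine without the bindings.

```python
try:
    import libxml2  # the Python bindings that ship with libxml2
except ImportError:
    libxml2 = None  # assumption: let the sketch load without the bindings


def paragraph_texts(html, url=''):
    """Hypothetical helper: return the text of every <p> in messy HTML."""
    options = (libxml2.HTML_PARSE_RECOVER
               | libxml2.HTML_PARSE_NOERROR
               | libxml2.HTML_PARSE_NOWARNING)
    doc = libxml2.htmlReadDoc(html, url, 'utf-8', options)
    try:
        # Query the recovered tree with XPath and copy the text out
        # before the document is freed.
        return [node.content for node in doc.xpathEval('//p')]
    finally:
        doc.freeDoc()


# Invalid markup: neither <p> is closed and </html> is missing.
BROKEN_HTML = '<html><body><p>first<p>second</body>'

if libxml2 is not None:
    print(paragraph_texts(BROKEN_HTML))
```

The text is copied into a plain list inside the `try` block because the nodes returned by `xpathEval` belong to the document and should not be used after `freeDoc()`.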
Try the code above. Good luck.
Comments
# Siebzehn und Vier Regeln Nov 27th, 2009
HTML is not a programming language; it is a markup language. We use HTML for design purposes. I do web design in HTML myself. I think it is a very good and easy design tool that anyone can learn, though it requires a lot of hard work. I will try the example you gave above at home and try to find the solution as soon as possible.
# insurance Dec 25th, 2009
I tried the code given above at home but did not get any good result, so please give me some more explanation of this code so that I can run it successfully.