What is the problem
I want to get only the text from any chunk of HTML. I stumbled on this approach after spending some time looking for a way to do that. The solutions out there use regular expressions, which work in some cases but not all. Another solution is to keep a list of HTML tags to remove and iterate over the HTML nodes using some library. This works fine, but I don't want to configure which HTML tags to remove, and it is expensive in resource usage for me, especially for large HTML.
What is XPath
From http://en.wikipedia.org/wiki/XPath:
XPath, the XML Path Language, is a query language for selecting nodes from an XML document. In addition, XPath may be used to compute values (e.g., strings, numbers, or Boolean values) from the content of an XML document. XPath was defined by the World Wide Web Consortium (W3C).
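To make "selecting nodes" a bit more concrete, here is a minimal sketch using Python's standard-library ElementTree. Keep in mind that ElementTree implements only a limited subset of XPath (for instance, it has no text() step), and that the sample document and variable names below are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
SAMPLE = """<library>
  <book year="2001"><title>First</title></book>
  <book year="2010"><title>Second</title></book>
</library>"""

root = ET.fromstring(SAMPLE)

# Path expression: every <title> element anywhere below the root.
titles = [t.text for t in root.findall(".//title")]

# Path expression with a predicate: only books whose year attribute is "2010".
recent = [b.find("title").text for b in root.findall(".//book[@year='2010']")]

print(titles)  # ['First', 'Second']
print(recent)  # ['Second']
```

A full XPath engine such as the one in libxml2 accepts the same kind of path expressions, plus node tests like text() and functions that compute values.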
How it works
The solution is very simple: select all the text nodes from the HTML, that's all. I will use libxml2 as an example. Let's see it in action by getting all the text from google.com.
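    import libxml2
    from urllib import urlopen  # Python 2 standard library

    # Download the page; parseFile() expects a file name, so parse the string instead.
    html = urlopen('http://google.com').read()
    # Use the HTML parser, since real pages are rarely well-formed XML.
    doc = libxml2.htmlParseDoc(html, None)
    # xpathEval() returns node objects, so join their text content.
    texts = ''.join(node.content for node in doc.xpathEval("//text()"))
    doc.freeDoc()  # libxml2 documents must be freed explicitly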
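That's all! No regular expressions, no iterating over the HTML nodes, and no configuration. Very powerful, right? One thing to keep in mind: //text() also matches the text inside script and style elements, which you can exclude with a predicate such as //text()[not(ancestor::script or ancestor::style)].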
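If you want to see what selecting every text node gives you without installing libxml2, here is a rough standard-library sketch. It is not the libxml2 approach itself: it assumes the chunk is well-formed XML (real-world HTML often is not), it uses ElementTree's itertext() in place of the //text() query, and the function name text_only_wellformed is just something I made up for this example.

```python
import xml.etree.ElementTree as ET


def text_only_wellformed(chunk):
    """Concatenate every text node of a well-formed XML/XHTML chunk."""
    # Wrap the chunk in a dummy root so several top-level elements still parse.
    root = ET.fromstring("<root>%s</root>" % chunk)
    # itertext() yields all text below the root in document order,
    # which is the same set of strings that //text() would select.
    return "".join(root.itertext())


print(text_only_wellformed("<p>Hello <b>world</b>!</p><p> Bye.</p>"))
# prints: Hello world! Bye.
```

For messy HTML this will raise a parse error, which is exactly why the libxml2 HTML parser is the better fit for the real thing.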
Turning this into a Django template filter
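    import libxml2
    from django.template import Library
    from django.template.defaultfilters import stringfilter

    register = Library()

    @register.filter
    @stringfilter
    def text_only(html):
        # Parse the string itself (parseFile() expects a file name) with the HTML parser.
        doc = libxml2.htmlParseDoc(html.encode('utf-8'), 'utf-8')
        try:
            # xpathEval() returns node objects, so join their text content.
            return ''.join(node.content for node in doc.xpathEval("//text()"))
        finally:
            doc.freeDoc()  # libxml2 documents must be freed explicitly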
Save this as util_tags.py in the templatetags package of one of your apps. Then, in your template:
    {% load util_tags %}
    ...
    {{ post.body|text_only }}
    ...
You are done!