Diffstat (limited to 'docs/clean.rst')
-rw-r--r-- | docs/clean.rst | 122 |
1 file changed, 122 insertions, 0 deletions
diff --git a/docs/clean.rst b/docs/clean.rst
new file mode 100644
index 0000000..a31dc89
--- /dev/null
+++ b/docs/clean.rst
@@ -0,0 +1,122 @@
+.. _clean-chapter:
+.. highlightlang:: python
+
+==================
+``bleach.clean()``
+==================
+
+``clean()`` is Bleach's HTML sanitization method::
+
+    def clean(text, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES,
+              styles=ALLOWED_STYLES, strip=False, strip_comments=True):
+        """Clean an HTML fragment and return it."""
+
+Given a fragment of HTML, Bleach will parse it according to the HTML5 parsing
+algorithm and sanitize any disallowed tags or attributes. This algorithm also
+takes care of things like unclosed and (some) misnested tags.
+
+.. note::
+   You may pass in a ``string`` or a ``unicode`` object, but Bleach will
+   always return ``unicode``.
+
+
+Tag Whitelist
+=============
+
+The ``tags`` kwarg is a whitelist of allowed HTML tags. It should be a list,
+tuple, or other iterable. Any other HTML tags will be escaped or stripped from
+the text. Its default value is a relatively conservative list found in
+``bleach.ALLOWED_TAGS``.
+
+
+Attribute Whitelist
+===================
+
+The ``attributes`` kwarg is a whitelist of attributes. It can be a list, in
+which case the attributes are allowed for any tag, or a dictionary, in which
+case the keys are tag names (or a wildcard: ``*`` for all tags) and the values
+are lists of allowed attributes.
+
+For example::
+
+    attrs = {
+        '*': ['class'],
+        'a': ['href', 'rel'],
+        'img': ['src', 'alt'],
+    }
+
+In this case, ``class`` is allowed on any allowed element (from the ``tags``
+argument), ``<a>`` tags are allowed to have ``href`` and ``rel`` attributes,
+and so on.
+
+The default value is also a conservative dict found in
+``bleach.ALLOWED_ATTRIBUTES``.
+
+
+Callable Filters
+----------------
+
+You can also use a callable (instead of a list) in the ``attributes`` kwarg. If
+the callable returns ``True``, the attribute is allowed. Otherwise, it is
+stripped. For example::
+
+    def filter_src(name, value):
+        if name in ('alt', 'height', 'width'):
+            return True
+        if name == 'src':
+            p = urlparse(value)
+            return (not p.netloc) or p.netloc == 'mydomain.com'
+        return False
+
+    attrs = {
+        'img': filter_src,
+    }
+
+
+Styles Whitelist
+================
+
+If you allow the ``style`` attribute, you will also need to whitelist styles
+users are allowed to set, for example ``color`` and ``background-color``.
+
+The default value is an empty list, i.e., the ``style`` attribute will be
+allowed but no values will be.
+
+For example, to allow users to set the color and font-weight of text::
+
+    attrs = {
+        '*': ['style'],
+    }
+    tags = ['p', 'em', 'strong']
+    styles = ['color', 'font-weight']
+    cleaned_text = bleach.clean(text, tags, attrs, styles)
+
+
+Stripping Markup
+================
+
+By default, Bleach *escapes* disallowed or invalid markup. For example::
+
+    >>> bleach.clean('<span>is not allowed</span>')
+    u'&lt;span&gt;is not allowed&lt;/span&gt;'
+
+If you would rather Bleach stripped this markup entirely, you can pass
+``strip=True``::
+
+    >>> bleach.clean('<span>is not allowed</span>', strip=True)
+    u'is not allowed'
+
+
+Stripping Comments
+==================
+
+By default, Bleach will strip out HTML comments. To disable this behavior, set
+``strip_comments=False``::
+
+    >>> html = 'my<!-- commented --> html'
+
+    >>> bleach.clean(html)
+    u'my html'
+
+    >>> bleach.clean(html, strip_comments=False)
+    u'my<!-- commented --> html'
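Note on the ``filter_src`` example in the Callable Filters section: as written it calls ``urlparse`` without importing it. The following is a minimal standalone sketch of the same filter, not part of the patch. The version-tolerant import is an assumption added here so the snippet runs on both Python 2 (the era of this patch) and Python 3, and ``mydomain.com`` is the placeholder domain from the docs.

```python
# Assumption: a version-tolerant import, added for this sketch only.
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def filter_src(name, value):
    """Return True if attribute `name` with `value` may stay on an <img> tag."""
    # These attributes are accepted whatever their value.
    if name in ('alt', 'height', 'width'):
        return True
    if name == 'src':
        # Accept URLs with no host part, or URLs hosted on our own domain.
        p = urlparse(value)
        return (not p.netloc) or p.netloc == 'mydomain.com'
    # Every other attribute is rejected.
    return False


# Passed as the ``attributes`` kwarg, as in the docs above.
attrs = {
    'img': filter_src,
}
```

The check only looks at the host part of the URL, so any value with an empty ``netloc`` passes, including scheme-only values such as ``javascript:...``. A stricter filter would probably also inspect ``p.scheme``; whether that is needed depends on what other URL checks the sanitizer applies.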