aboutsummaryrefslogtreecommitdiff
path: root/docs/clean.rst
blob: a31dc89eb7fc6d06364a8300036adee024950d83 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
.. _clean-chapter:
.. highlightlang:: python

==================
``bleach.clean()``
==================

``clean()`` is Bleach's HTML sanitization method::

    def clean(text, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES,
              styles=ALLOWED_STYLES, strip=False, strip_comments=True):
        """Clean an HTML fragment and return it."""

Given a fragment of HTML, Bleach will parse it according to the HTML5 parsing
algorithm and sanitize any disallowed tags or attributes. This algorithm also
takes care of things like unclosed and (some) misnested tags.

.. note::
   You may pass in a ``string`` or a ``unicode`` object, but Bleach will
   always return ``unicode``.


Tag Whitelist
=============

The ``tags`` kwarg is a whitelist of allowed HTML tags. It should be a list,
tuple, or other iterable. Any other HTML tags will be escaped or stripped from
the text.  Its default value is a relatively conservative list found in
``bleach.ALLOWED_TAGS``.


Attribute Whitelist
===================

The ``attributes`` kwarg is a whitelist of attributes. It can be a list, in
which case the attributes are allowed for any tag, or a dictionary, in which
case the keys are tag names (or a wildcard: ``*`` for all tags) and the values
are lists of allowed attributes.

For example::

    attrs = {
        '*': ['class'],
        'a': ['href', 'rel'],
        'img': ['src', 'alt'],
    }

In this case, ``class`` is allowed on any allowed element (from the ``tags``
argument), ``<a>`` tags are allowed to have ``href`` and ``rel`` attributes,
and so on.

The default value is also a conservative dict found in
``bleach.ALLOWED_ATTRIBUTES``.


Callable Filters
----------------

You can also use a callable (instead of a list) in the ``attributes`` kwarg. If
the callable returns ``True``, the attribute is allowed. Otherwise, it is
stripped. For example::

    def filter_src(name, value):
        if name in ('alt', 'height', 'width'):
            return True
        if name == 'src':
            p = urlparse(value)
            return (not p.netloc) or p.netloc == 'mydomain.com'
        return False

    attrs = {
        'img': filter_src,
    }


Styles Whitelist
================

If you allow the ``style`` attribute, you will also need to whitelist styles
users are allowed to set, for example ``color`` and ``background-color``.

The default value is an empty list, i.e., the ``style`` attribute will be
allowed but no values will be.

For example, to allow users to set the color and font-weight of text::

    attrs = {
        '*': 'style'
    }
    tags = ['p', 'em', 'strong']
    styles = ['color', 'font-weight']
    cleaned_text = bleach.clean(text, tags, attrs, styles)


Stripping Markup
================

By default, Bleach *escapes* disallowed or invalid markup. For example::

    >>> bleach.clean('<span>is not allowed</span>')
    u'&lt;span&gt;is not allowed&lt;/span&gt;

If you would rather Bleach stripped this markup entirely, you can pass
``strip=True``::

    >>> bleach.clean('<span>is not allowed</span>', strip=True)
    u'is not allowed'


Stripping Comments
==================

By default, Bleach will strip out HTML comments. To disable this behavior, set
``strip_comments=False``::

    >>> html = 'my<!-- commented --> html'

    >>> bleach.clean(html)
    u'my html'

    >>> bleach.clean(html, strip_comments=False)
    u'my<!-- commented --> html'