.. _linkify-chapter:
.. highlight:: python

====================
``bleach.linkify()``
====================

``linkify()`` searches text for links, URLs, and email addresses and lets you
control how and when those links are rendered::

    def linkify(text, callbacks=DEFAULT_CALLBACKS, skip_pre=False,
                parse_email=False, tokenizer=HTMLSanitizer):
        """Convert URL-like strings in an HTML fragment to links."""

``linkify()`` works by building a document tree rather than matching against
raw text, so it never rewrites URLs inside attribute values. It can also modify
the attributes of existing ``<a>`` tags and skip ``<pre>`` sections.

By default, ``linkify()`` also performs some sanitization, allowing only a set
of "safe" tags. Because it uses the HTML5 parsing algorithm, it also handles
malformed markup such as unclosed tags.

.. note::
   You may pass a ``str`` or ``unicode`` object, but Bleach always returns
   ``unicode``.


Callbacks
=========

The second argument to ``linkify()`` is a list or other iterable of callback
functions. These callbacks can modify links that exist and links that are being
created, or remove them completely.

Each callback will get the following arguments::

    def my_callback(attrs, new=False):

The ``attrs`` argument is a dict of attributes of the ``<a>`` tag. The ``new``
argument is a boolean indicating if the link is new (e.g. an email address or
URL found in the text) or already existed (e.g. an ``<a>`` tag found in the
text). The ``attrs`` dict also contains a ``_text`` key, which is the innerText
of the ``<a>`` tag.

The callback must return either a dict of attributes (including ``_text``) or
``None``. The returned dict is passed to the next callback in the list. If any
callback returns ``None``, the result depends on the kind of link: a new link
is not created and the original text is left in place, while an existing link
is removed and its innerText is left in place.

The default value, ``DEFAULT_CALLBACKS``, simply adds ``rel="nofollow"``. See
``bleach.callbacks`` for the included callback functions.
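
The chaining rules above can be sketched in a few lines of Python. This is an
illustration only, not Bleach's actual implementation: ``apply_callbacks``,
``add_title`` and ``reject`` are hypothetical names used for demonstration.

```python
def apply_callbacks(attrs, callbacks, new):
    """Run attrs through each callback in order (illustrative helper)."""
    for callback in callbacks:
        attrs = callback(attrs, new)
        if attrs is None:
            # A callback rejected the link, so stop and report no link.
            return None
    return attrs


def set_nofollow(attrs, new=False):
    attrs['rel'] = 'nofollow'
    return attrs


def add_title(attrs, new=False):
    # Hypothetical callback: label links that were created from plain text.
    if new:
        attrs['title'] = 'Automatically linked'
    return attrs


def reject(attrs, new=False):
    # Hypothetical callback: refuse every link.
    return None


link = {'href': 'http://example.com', '_text': 'example.com'}
result = apply_callbacks(dict(link), [set_nofollow, add_title], new=True)
rejected = apply_callbacks(dict(link), [reject, set_nofollow], new=True)
```

In this sketch ``result`` carries both ``rel`` and ``title`` because each
callback receives the dict returned by the previous one, and ``rejected`` is
``None`` because the first callback refused the link.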


Setting Attributes
------------------

For example, to set ``rel="nofollow"`` on all links found in the text, a simple
(and included) callback might be::

    def set_nofollow(attrs, new=False):
        attrs['rel'] = 'nofollow'
        return attrs

This would overwrite the value of the ``rel`` attribute if it was set.
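
If overwriting is not what you want, a callback can merge ``nofollow`` into
whatever ``rel`` value is already present. The following is a minimal sketch
written for Python 3; ``add_nofollow`` is a hypothetical name, not one of the
included callbacks.

```python
def add_nofollow(attrs, new=False):
    """Add 'nofollow' to rel while keeping any existing rel values."""
    rel_values = attrs.get('rel', '').split()
    # rel tokens are compared case-insensitively to avoid duplicates.
    if 'nofollow' not in [value.lower() for value in rel_values]:
        rel_values.append('nofollow')
    attrs['rel'] = ' '.join(rel_values)
    return attrs
```

With this version, a link that already has ``rel="external"`` ends up with
``rel="external nofollow"``.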

You could also make external links open in a new tab, or set a class::

    from urlparse import urlparse

    def set_target(attrs, new=False):
        p = urlparse(attrs['href'])
        if p.netloc not in ['my-domain.com', 'other-domain.com']:
            attrs['target'] = '_blank'
            attrs['class'] = 'external'
        else:
            attrs.pop('target', None)
        return attrs


Removing Attributes
-------------------

You can easily remove attributes you don't want to allow, even on existing
links (``<a>`` tags) in the text. (See also :ref:`clean() <clean-chapter>` for
sanitizing attributes.)

::

    def allowed_attributes(attrs, new=False):
        """Only allow href, target, rel and title."""
        # Keep '_text' too: callbacks must return it along with the attributes.
        allowed = ['href', 'target', 'rel', 'title', '_text']
        return dict((k, v) for k, v in attrs.items() if k in allowed)

Or you could remove a specific attribute, if it exists::

    def remove_title1(attrs, new=False):
        attrs.pop('title', None)
        return attrs

    def remove_title2(attrs, new=False):
        if 'title' in attrs:
            del attrs['title']
        return attrs


Altering Attributes
-------------------

You can alter and overwrite attributes, including the link text (via the
``_text`` key). For example, you can limit the length of the text inside an
``<a>`` tag, or pass outgoing links through a warning page.

::

    def shorten_url(attrs, new=False):
        """Shorten overly-long URLs in the text."""
        if not new:  # Only looking at newly-created links.
            return attrs
        # _text will be the same as the URL for new links.
        text = attrs['_text']
        if len(text) > 25:
            attrs['_text'] = text[0:22] + '...'
        return attrs
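
Because a callback is just a function taking ``attrs`` and ``new``, the length
limit does not have to be hard-coded. The sketch below builds the callback
with a closure; ``make_shortener`` and its parameters are hypothetical names
chosen for this example.

```python
def make_shortener(max_length=25, suffix='...'):
    """Build a callback that truncates the text of new links."""
    def shorten(attrs, new=False):
        if not new:  # Leave the text of existing <a> tags alone.
            return attrs
        text = attrs['_text']
        if len(text) > max_length:
            # Trim so that text plus suffix is exactly max_length long.
            attrs['_text'] = text[:max_length - len(suffix)] + suffix
        return attrs
    return shorten


shorten_to_20 = make_shortener(max_length=20)
```

The returned function can then be placed in the callbacks list like any other
callback. Only ``_text`` changes; ``href`` still points at the full URL.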

::

    from urllib import quote
    from urlparse import urlparse

    def outgoing_bouncer(attrs, new=False):
        """Send outgoing links through a bouncer."""
        p = urlparse(attrs['href'])
        if p.netloc not in ['my-domain.com', 'www.my-domain.com', '']:
            bouncer = 'http://outgoing.my-domain.com/?destination=%s'
            attrs['href'] = bouncer % quote(attrs['href'])
        return attrs
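
On Python 3, ``urlparse`` and ``quote`` both live in ``urllib.parse``. The
following is a sketch of the same bouncer callback for Python 3, with two
deliberate changes that are assumptions of this example rather than part of
the original: it passes ``safe=''`` so that ``/`` is escaped as well, and it
tolerates ``<a>`` tags that have no ``href``. ``INTERNAL_HOSTS`` and
``BOUNCER`` are names introduced here for readability.

```python
from urllib.parse import quote, urlparse

INTERNAL_HOSTS = ('my-domain.com', 'www.my-domain.com', '')
BOUNCER = 'http://outgoing.my-domain.com/?destination=%s'


def outgoing_bouncer(attrs, new=False):
    """Send outgoing links through a bouncer."""
    href = attrs.get('href', '')
    if urlparse(href).netloc not in INTERNAL_HOSTS:
        # safe='' escapes '/' too, so the URL is a single query value.
        attrs['href'] = BOUNCER % quote(href, safe='')
    return attrs
```

Relative links have an empty ``netloc``, so they count as internal and are
left unchanged.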


Preventing Links
----------------

A slightly more complex example is inspired by Crate_, where strings like
``models.py`` are common and would otherwise be linkified. ``.py`` is the ccTLD
for Paraguay, so ``example.py`` may be a legitimate URL, but on a site
dedicated to Python packages, odds are it is not. In this case, Crate_ could
write the following callback::

    def dont_linkify_python(attrs, new=False):
        if not new:  # This is an existing <a> tag, leave it be.
            return attrs

        # If the TLD is '.py', make sure it starts with http: or https:.
        # Check _text, the text as written, since href may have had a
        # scheme added to it.
        text = attrs['_text']
        if text.endswith('.py') and not text.startswith(('http:', 'https:')):
            # This looks like a Python file, not a URL. Don't make a link.
            return None

        # Everything checks out, keep going to the next callback.
        return attrs


Removing Links
--------------

If you want to remove certain links, even if they are written in the text with
``<a>`` tags, you can still return ``None``::

    def remove_mailto(attrs, new=False):
        """Remove any mailto: links."""
        if attrs['href'].startswith('mailto:'):
            return None
        return attrs


``skip_pre``
============

``<pre>`` tags are often special, literal sections. If you don't want to create
any new links within a ``<pre>`` section, pass ``skip_pre=True``.

.. note::
   Though new links will not be created, existing links created with ``<a>``
   tags will still be passed through all the callbacks.


``parse_email``
===============

By default, ``linkify()`` does not create ``mailto:`` links for email
addresses, but if you pass ``parse_email=True``, it will. ``mailto:`` links
will go through exactly the same set of callbacks as all other links, whether
they are newly created or already in the text, so be careful when writing
callbacks that may need to behave differently if the protocol is ``mailto:``.
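
As an illustration of a callback that behaves differently for ``mailto:``
links, the sketch below (Python 3) opens web links in a new tab but returns
email links unchanged. ``set_target_except_mailto`` is a hypothetical name
used only for this example.

```python
from urllib.parse import urlparse


def set_target_except_mailto(attrs, new=False):
    """Open web links in a new tab, but leave mailto: links untouched."""
    href = attrs.get('href', '')
    # urlparse() normalizes the scheme to lower case.
    if urlparse(href).scheme == 'mailto':
        return attrs
    attrs['target'] = '_blank'
    return attrs
```

Checking the parsed scheme rather than the raw string prefix means that
``MAILTO:`` links are recognized as well.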


``tokenizer``
=============

``linkify()`` uses the ``html5lib.sanitizer.HTMLSanitizer`` tokenizer by
default. This has the effect of scrubbing some tags and attributes. To use a
more lenient, or totally different, tokenizer, you can specify the tokenizer
class here. (See the implementation of :ref:`clean() <clean-chapter>` for an
example of building a custom tokenizer.)

::

    from bleach import linkify
    from html5lib.tokenizer import HTMLTokenizer

    linked_text = linkify(text, tokenizer=HTMLTokenizer)


.. _Crate: https://crate.io/