 .gitignore                     |   8
 .travis.yml                    |   6
 CHANGES                        |  87
 CONTRIBUTORS                   |  26
 README.rst                     | 141
 bleach/__init__.py             | 230
 bleach/callbacks.py            |  15
 bleach/sanitizer.py            |   6
 bleach/tests/test_basics.py    |   5
 bleach/tests/test_css.py       |  16
 bleach/tests/test_delinkify.py | 109
 bleach/tests/test_links.py     | 200
 bleach/tests/test_security.py  |   4
 docs/Makefile                  | 153
 docs/clean.rst                 | 122
 docs/conf.py                   | 242
 docs/goals.rst                 |  76
 docs/index.rst                 |  69
 docs/linkify.rst               | 212
 docs/make.bat                  | 190
 requirements.txt               |   4
 setup.py                       |   4
 22 files changed, 1510 insertions(+), 415 deletions(-)
diff --git a/.gitignore b/.gitignore
deleted file mode 100644
index 6714ae6..0000000
--- a/.gitignore
+++ /dev/null
@@ -1,8 +0,0 @@
-*.pyo
-*.pyc
-pip-log.txt
-.coverage
-dist
-*.egg-info
-.noseids
-build
diff --git a/.travis.yml b/.travis.yml
deleted file mode 100644
index e767f15..0000000
--- a/.travis.yml
+++ /dev/null
@@ -1,6 +0,0 @@
-language: python
-python:
- - "2.6"
- - "2.7"
-install: pip install -Ur requirements.txt --use-mirrors
-script: nosetests
diff --git a/CHANGES b/CHANGES
new file mode 100644
index 0000000..d9bad9c
--- /dev/null
+++ b/CHANGES
@@ -0,0 +1,87 @@
+Bleach Changes
+==============
+
+Version 1.2.1
+-------------
+
+- clean() no longer considers "feed:" an acceptable protocol due to
+ inconsistencies in browser behavior.
+
+
+Version 1.2
+-----------
+
+- linkify() has changed considerably. Many keyword arguments have been
+ replaced with a single callbacks list. Please see the documentation
+ for more information.
+- Bleach will no longer consider unacceptable protocols when linkifying.
+- linkify() now takes a tokenizer argument that allows it to skip
+ sanitization.
+- delinkify() is gone.
+- Removed exception handling from _render. clean() and linkify() may now
+ throw.
+- linkify() correctly ignores case for protocols and domain names.
+- linkify() correctly handles markup within an <a> tag.
+
+
+Version 1.1.3
+-------------
+
+- Fix parsing bare URLs when parse_email=True.
+
+
+Version 1.1.2
+-------------
+
+- Fix hang in style attribute sanitizer. (#61)
+- Allow '/' in style attribute values.
+
+
+Version 1.1.1
+-------------
+
+- Fix tokenizer for html5lib 0.9.5.
+
+
+Version 1.1.0
+-------------
+
+- linkify() now understands port numbers. (#38)
+- Documented character encoding behavior. (#41)
+- Add an optional target argument to linkify().
+- Add delinkify() method. (#45)
+- Support subdomain whitelist for delinkify(). (#47, #48)
+
+
+Version 1.0.4
+-------------
+
+- Switch to SemVer git tags.
+- Make linkify() smarter about trailing punctuation. (#30)
+- Pass exc_info to logger during rendering issues.
+- Add wildcard key for attributes. (#19)
+- Make linkify() use the HTMLSanitizer tokenizer. (#36)
+- Fix URLs wrapped in parentheses. (#23)
+- Make linkify() UTF-8 safe. (#33)
+
+
+Version 1.0.3
+-------------
+
+- linkify() works with 3rd level domains. (#24)
+- clean() supports vendor prefixes in style values. (#31, #32)
+- Fix linkify() email escaping.
+
+
+Version 1.0.2
+-------------
+
+- linkify() supports email addresses.
+- clean() supports callables in attributes filter.
+
+
+Version 1.0.1
+-------------
+
+- linkify() doesn't drop trailing slashes. (#21)
+- linkify() won't linkify 'libgl.so.1'. (#22)
diff --git a/CONTRIBUTORS b/CONTRIBUTORS
index f612983..f014916 100644
--- a/CONTRIBUTORS
+++ b/CONTRIBUTORS
@@ -1 +1,25 @@
-See https://github.com/jsocol/bleach/contributors
+Bleach is written and maintained by James Socol and various contributors
+within and without the Mozilla Corporation and Foundation.
+
+Lead Developer:
+
+- James Socol <james@mozilla.com>
+
+Contributors:
+
+- Jeff Balogh
+- Ricky Rosario
+- Chris Beaven
+- Luis Nell
+
+Patches:
+
+- Les Orchard
+- Paul Craciunoiu
+- Sébastien Fievet
+- TimothyFitz
+- Adrian "ThiefMaster"
+- Adam Lofts
+- Anton Kovalyov
+- Mark Paschal
+- Alex Ehlke
diff --git a/README.rst b/README.rst
index 08dfc50..093edc1 100644
--- a/README.rst
+++ b/README.rst
@@ -16,8 +16,8 @@ Because it relies on html5lib_, Bleach is as good as modern browsers at dealing
with weird, quirky HTML fragments. And *any* of Bleach's methods will fix
unbalanced or mis-nested tags.
-The version on `github <http://github.com/jsocol/bleach>`_ is the most
-up-to-date and contains the latest bug fixes.
+The version on GitHub_ is the most up-to-date and contains the latest bug
+fixes. You can find full documentation on `ReadTheDocs`_.
Basic Use
@@ -33,9 +33,6 @@ The simplest way to use Bleach is::
>>> bleach.linkify('an http://example.com url')
    u'an <a href="http://example.com" rel="nofollow">http://example.com</a> url'
- >>> bleach.delinkify('a <a href="http://ex.mp">link</a>')
- u'a link'
-
*NB*: Bleach always returns a ``unicode`` object, whether you give it a
bytestring or a ``unicode`` object, but Bleach does not attempt to detect
incoming character encodings, and will assume UTF-8. If you are using a
@@ -43,117 +40,27 @@ different character encoding, you should convert from a bytestring to
``unicode`` before passing the text to Bleach.
-Customizing Bleach
-==================
-
-``clean()``, ``linkify()`` and ``delinkify()`` can take several optional
-keyword arguments to customize their behavior.
-
-
-``clean()``
------------
-
-``bleach.clean()`` is the primary tool in Bleach. It uses html5lib_ to parse a
-document fragment into a tree and does the sanitization during tokenizing,
-which is incredibly powerful and has several advantages over regular
-expression-based sanitization.
-
-``tags``
- A whitelist of HTML tags. Must be a list. Defaults to
- ``bleach.ALLOWED_TAGS``.
-``attributes``
- A whitelist of HTML attributes. Either a list, in which case all attributes
- are allowed on all elements, or a dict, with tag names as keys and lists of
- allowed attributes as values ('*' is a wildcard key to allow an attribute on
- any tag). Or it is possible to pass a callable instead of a list that accepts
- name and value of attribute and returns True of False. Defaults to
- ``bleach.ALLOWED_ATTRIBUTES``.
-``styles``
- A whitelist of allowed CSS properties within a ``style`` attribute. (Note
- that ``style`` attributes are not allowed by default.) Must be a list.
- Defaults to ``[]``.
-``strip``
- Strip disallowed HTML instead of escaping it. A boolean. Defaults to
- ``False``.
-``strip_comments``
- Strip HTML comments. A boolean. Defaults to ``True``.
-
-
-``linkify()``
--------------
-
-``bleach.linkify()`` turns things that look like URLs or (optionally) email
-addresses and turns them into links. It does this smartly, only looking in text
-nodes, and never within ``<a>`` tags.
-
-There are options that affect output, and some of these are also applied to
-links already found in the text. These are designed to allow you to set
-attributes like ``rel="nofollow"`` or ``target``, or push outgoing links
-through a redirection URL, and do this to links already in the text, as well.
-
-``nofollow``
- Add ``rel="nofollow"`` to non-relative links (both created by ``linkify()``
- and those already present in the text). Defaults to ``True``.
-``filter_url``
- A callable through which the ``href`` attribute of links (both created by
- ``linkify()`` and already present in the text) will be passed. Must accept a
- single argument and return a string.
-``filter_text``
- A callable through which the text of links (only those created by
- ``linkify``) will be passed. Must accept a single argument and return a
- string.
-``skip_pre``
- Do not create new links inside ``<pre>`` sections. Still follows
- ``nofollow``. Defaults to ``False``.
-``parse_email``
- Linkify email addresses with ``mailto:``. Defaults to ``False``.
-``target``
- Set a ``target`` attribute on links. Like ``nofollow``, if ``target`` is not
- ``None``, will set the attribute on links already in the text, as well.
- Defaults to ``None``.
-
-
-``delinkify()``
----------------
-
-``bleach.delinkify()`` is basically the opposite of ``linkify()``. It strips
-links out of text except, optionally, relative links, or links to domains
-you've whitelisted.
-
-``allow_domains``
- Allow links to the domains in this list. Set to ``None`` or an empty list to
- disallow all non-relative domains. See below for wildcards. Defaults to
- ``None``.
-``allow_relative``
- Allow relative links (i.e. those with no hostname). Defaults to ``False``.
-
-
-Wildcards
-^^^^^^^^^
-
-To allow links to a domain and its subdomains, ``allow_domains`` accepts two
-types of wildcard arguments in domains:
-
-``*``
- Allow a single level of subdomain. This can be anywhere in the hostname, even
- the TLD. This allows you to, for example, allow links to ``example.*``.
- ``*.example.com`` will match both ``foo.example.com`` and ``example.com``.
- ::
- >>> delinkify('<a href="http://foo.ex.mp">bar</a>', \
- ... allow_domains=['*.ex.*'])
- u'<a href="http://foo.ex.mp">bar</a>'
- >>> delinkify('<a href="http://ex.mp">bar</a>', allow_domains=['*.ex.mp'])
- u'<a href="http://ex.mp">bar</a>
-``**``
- To allow any number of *preceding* subdomains, you can start a hostname with
- ``**``. Note that unlike ``*``, ``**`` may only appear once, and only at the
- beginning of a hostname.
- ::
- >>> delinkify('<a href="http://a.b.ex.mp">t</a>', \
- ... allow_domains=['**.ex.mp'])
- u'<a href="http://a.b.ex.mp">t</a>'
- If ``**`` appears anywhere but the beginning of a hostname, ``delinkify``
- will throw ``bleach.ValidationError`` (which is a ``ValueError`` subclass,
- for easy catching).
+Installation
+------------
+
+Bleach is available on PyPI_, so you can install it with ``pip``::
+
+ $ pip install bleach
+
+Or with ``easy_install``::
+
+ $ easy_install bleach
+
+Or by cloning the repo from GitHub_::
+
+ $ git clone git://github.com/jsocol/bleach.git
+
+Then install it by running::
+
+ $ python setup.py install
+
.. _html5lib: http://code.google.com/p/html5lib/
+.. _GitHub: https://github.com/jsocol/bleach
+.. _ReadTheDocs: http://bleach.readthedocs.org/
+.. _PyPI: http://pypi.python.org/pypi/bleach
diff --git a/bleach/__init__.py b/bleach/__init__.py
index bc8e49c..af75d0f 100644
--- a/bleach/__init__.py
+++ b/bleach/__init__.py
@@ -1,19 +1,18 @@
-import itertools
import logging
import re
import sys
-import urlparse
import html5lib
from html5lib.sanitizer import HTMLSanitizer
from html5lib.serializer.htmlserializer import HTMLSerializer
-from encoding import force_unicode
-from sanitizer import BleachSanitizer
+from . import callbacks as linkify_callbacks
+from .encoding import force_unicode
+from .sanitizer import BleachSanitizer
-VERSION = (1, 1, 5)
-__version__ = '.'.join(map(str, VERSION))
+VERSION = (1, 2, 1)
+__version__ = '1.2.1'
__all__ = ['clean', 'linkify']
@@ -56,18 +55,21 @@ TLDS = """ac ad ae aero af ag ai al am an ao aq ar arpa as asia at au aw ax az
tv tw tz ua ug uk us uy uz va vc ve vg vi vn vu wf ws xn ye yt yu za zm
zw""".split()
+PROTOCOLS = HTMLSanitizer.acceptable_protocols
+
TLDS.reverse()
url_re = re.compile(
r"""\(* # Match any opening parentheses.
- \b(?<![@.])(?:\w[\w-]*:/{0,3}(?:(?:\w+:)?\w+@)?)? # http://
+ \b(?<![@.])(?:(?:%s):/{0,3}(?:(?:\w+:)?\w+@)?)? # http://
([\w-]+\.)+(?:%s)(?:\:\d+)?(?!\.\w)\b # xx.yy.tld(:##)?
(?:[/?][^\s\{\}\|\\\^\[\]`<>"]*)?
# /path/zz (excluding "unsafe" chars from RFC 1738,
# except for # and ~, which happen in practice)
- """ % u'|'.join(TLDS), re.VERBOSE | re.UNICODE)
+ """ % (u'|'.join(PROTOCOLS), u'|'.join(TLDS)),
+ re.IGNORECASE | re.VERBOSE | re.UNICODE)
-proto_re = re.compile(r'^[\w-]+:/{0,3}')
+proto_re = re.compile(r'^[\w-]+:/{0,3}', re.IGNORECASE)
punct_re = re.compile(r'([\.,]+)$')
@@ -83,7 +85,10 @@ email_re = re.compile(
NODE_TEXT = 4 # The numeric ID of a text node in simpletree.
-identity = lambda x: x # The identity function.
+DEFAULT_CALLBACKS = [linkify_callbacks.nofollow]
+
+PY_26 = (sys.version_info < (2, 7))
+RECURSION_EXCEPTION = RuntimeError if not PY_26 else AttributeError
def clean(text, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES,
@@ -93,8 +98,6 @@ def clean(text, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES,
return u''
text = force_unicode(text)
- if text.startswith(u'<!--'):
- text = u' ' + text
class s(BleachSanitizer):
allowed_elements = tags
@@ -105,32 +108,17 @@ def clean(text, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES,
parser = html5lib.HTMLParser(tokenizer=s)
- return _render(parser.parseFragment(text)).strip()
+ return _render(parser.parseFragment(text))
-def linkify(text, nofollow=True, target=None, filter_url=identity,
- filter_text=identity, skip_pre=False, parse_email=False,
- tokenizer=HTMLSanitizer):
+def linkify(text, callbacks=DEFAULT_CALLBACKS, skip_pre=False,
+ parse_email=False, tokenizer=HTMLSanitizer):
"""Convert URL-like strings in an HTML fragment to links.
linkify() converts strings that look like URLs or domain names in a
blob of text that may be an HTML fragment to links, while preserving
(a) links already in the string, (b) urls found in attributes, and
(c) email addresses.
-
- If the nofollow argument is True (the default) then rel="nofollow"
- will be added to links created by linkify() as well as links already
- found in the text.
-
- The target argument will optionally add a target attribute with the
- given value to links created by linkify() as well as links already
- found in the text.
-
- linkify() uses up to two filters on each link. For links created by
- linkify(), the href attribute is passed through filter_url()
- and the text of the link is passed through filter_text(). For links
- already found in the document, the href attribute is passed through
- filter_url(), but the text is untouched.
"""
text = force_unicode(text)
@@ -141,16 +129,16 @@ def linkify(text, nofollow=True, target=None, filter_url=identity,
forest = parser.parseFragment(text)
- if nofollow:
- rel = u'rel="nofollow"'
- else:
- rel = u''
-
def replace_nodes(tree, new_frag, node):
new_tree = parser.parseFragment(new_frag)
for n in new_tree.childNodes:
+            # Prevent us from re-parsing new links as existing links.
+ if n.name == 'a':
+ n._seen = True
tree.insertBefore(n, node)
tree.removeChild(node)
+        # Return the net change in the number of child nodes.
+ return len(new_tree.childNodes) - 1
def strip_wrapping_parentheses(fragment):
"""Strips wrapping parentheses.
@@ -194,34 +182,77 @@ def linkify(text, nofollow=True, target=None, filter_url=identity,
return fragment, opening_parentheses, closing_parentheses
+ def apply_callbacks(attrs, new):
+ for cb in callbacks:
+ attrs = cb(attrs, new)
+ if attrs is None:
+ return None
+ return attrs
+
def linkify_nodes(tree, parse_text=True):
- for node in tree.childNodes:
+ # I know this isn't Pythonic, but we're sometimes mutating
+ # tree.childNodes, which ends up breaking the loop and causing us to
+ # reparse code.
+ children = len(tree.childNodes)
+ current = 0 # A pointer to the "current" node.
+ while current < children:
+ node = tree.childNodes[current]
if node.type == NODE_TEXT and parse_text:
- new_frag = node.toxml()
+ new_frag = _render(node)
+ # Look for email addresses?
if parse_email:
new_frag = re.sub(email_re, email_repl, new_frag)
- if new_frag != node.toxml():
- replace_nodes(tree, new_frag, node)
+ if new_frag != _render(node):
+ adj = replace_nodes(tree, new_frag, node)
+ children += adj
+ current += adj
linkify_nodes(tree)
continue
new_frag = re.sub(url_re, link_repl, new_frag)
- replace_nodes(tree, new_frag, node)
- elif node.name == 'a':
+ if new_frag != _render(node):
+ adj = replace_nodes(tree, new_frag, node)
+ children += adj
+ current += adj
+ elif node.name == 'a' and not getattr(node, '_seen', False):
if 'href' in node.attributes:
- if nofollow:
- node.attributes['rel'] = 'nofollow'
- if target is not None:
- node.attributes['target'] = target
- href = node.attributes['href']
- node.attributes['href'] = filter_url(href)
+ attrs = node.attributes
+ _text = attrs['_text'] = ''.join(c.toxml() for
+ c in node.childNodes)
+ attrs = apply_callbacks(attrs, False)
+ if attrs is not None:
+ text = force_unicode(attrs.pop('_text'))
+ node.attributes = attrs
+ for n in reversed(node.childNodes):
+ node.removeChild(n)
+ text = parser.parseFragment(text)
+ for n in text.childNodes:
+ node.appendChild(n)
+ node._seen = True
+ else:
+ replace_nodes(tree, _text, node)
elif skip_pre and node.name == 'pre':
linkify_nodes(node, False)
- else:
+ elif not getattr(node, '_seen', False):
linkify_nodes(node)
+ current += 1
def email_repl(match):
- repl = u'<a href="mailto:%(mail)s">%(mail)s</a>'
- return repl % {'mail': match.group(0).replace('"', '&quot;')}
+ addr = match.group(0).replace('"', '&quot;')
+ link = {
+ '_text': addr,
+ 'href': 'mailto:%s' % addr,
+ }
+ link = apply_callbacks(link, True)
+
+ if link is None:
+ return addr
+
+ _href = link.pop('href')
+ _text = link.pop('_text')
+
+ repl = '<a href="%s" %s>%s</a>'
+ attribs = ' '.join('%s="%s"' % (k, v) for k, v in link.items())
+ return repl % (_href, attribs, _text)
def link_repl(match):
url = match.group(0)
@@ -240,98 +271,41 @@ def linkify(text, nofollow=True, target=None, filter_url=identity,
else:
href = u''.join([u'http://', url])
- repl = u'%s<a href="%s" %s>%s</a>%s%s'
-
- attribs = [rel]
- if target is not None:
- attribs.append('target="%s"' % target)
-
- return repl % ('(' * open_brackets,
- filter_url(href), ' '.join(attribs), filter_text(url),
- end, ')' * close_brackets)
+ link = {
+ '_text': url,
+ 'href': href,
+ }
- linkify_nodes(forest)
+ link = apply_callbacks(link, True)
- return _render(forest)
+ if link is None:
+ return url
+ _text = link.pop('_text')
+ _href = link.pop('href')
-def delinkify(text, allow_domains=None, allow_relative=False):
- """Remove links from text, except those allowed to stay."""
- text = force_unicode(text)
- if not text:
- return u''
+ repl = u'%s<a href="%s" %s>%s</a>%s%s'
+ attribs = ' '.join('%s="%s"' % (k, v) for k, v in link.items())
- parser = html5lib.HTMLParser(tokenizer=HTMLSanitizer)
- forest = parser.parseFragment(text)
+ return repl % ('(' * open_brackets,
+ _href, attribs, _text, end,
+ ')' * close_brackets)
- if allow_domains is None:
- allow_domains = []
- elif isinstance(allow_domains, basestring):
- allow_domains = [allow_domains]
+ try:
+ linkify_nodes(forest)
+ except (RECURSION_EXCEPTION), e:
+ # If we hit the max recursion depth, just return what we've got.
+ log.exception('Probable recursion error: %r' % e)
- def delinkify_nodes(tree):
- """Remove <a> tags and replace them with their contents."""
- for node in tree.childNodes:
- if node.name == 'a':
- if 'href' not in node.attributes:
- continue
- parts = urlparse.urlparse(node.attributes['href'])
- host = parts.hostname
- if any(_domain_match(host, d) for d in allow_domains):
- continue
- if host is None and allow_relative:
- continue
- # Replace the node with its children.
- # You can't nest <a> tags, and html5lib takes care of that
- # for us in the tree-building step.
- for n in node.childNodes:
- tree.insertBefore(n, node)
- tree.removeChild(node)
- elif node.type != NODE_TEXT: # Don't try to delinkify text.
- delinkify_nodes(node)
-
- delinkify_nodes(forest)
return _render(forest)
-def _domain_match(test, compare):
- test = test.lower()
- compare = compare.lower()
- if '*' not in compare:
- return test == compare
- c = compare.split('.')[::-1]
- if '**' in c and (c.count('**') > 1 or not compare.startswith('**')):
- raise ValidationError(
- 'Only 1 ** is allowed, and must start the domain.')
- t = test.split('.')[::-1]
- z = itertools.izip_longest(c, t)
- for c, t in z:
- if c == t:
- continue
- elif c == '*':
- continue
- elif c == '**':
- return True
- return False
- # Got all the way through and everything matched.
- return True
-
-
-class ValidationError(ValueError):
- pass
-
-
def _render(tree):
"""Try rendering as HTML, then XML, then give up."""
try:
return force_unicode(_serialize(tree))
- except Exception, e:
- log.error('HTML: %r' % e, exc_info=sys.exc_info())
- try:
- return force_unicode(tree.toxml())
- except Exception, e:
- log.error('XML: %r' % e, exc_info=sys.exc_info())
- return u''
+ except AssertionError: # The treewalker throws this sometimes.
+ return force_unicode(tree.toxml())
def _serialize(domtree):
diff --git a/bleach/callbacks.py b/bleach/callbacks.py
new file mode 100644
index 0000000..cc4682d
--- /dev/null
+++ b/bleach/callbacks.py
@@ -0,0 +1,15 @@
+"""A set of basic callbacks for bleach.linkify."""
+
+
+def nofollow(attrs, new=False):
+ if attrs['href'].startswith('mailto:'):
+ return attrs
+ attrs['rel'] = 'nofollow'
+ return attrs
+
+
+def target_blank(attrs, new=False):
+ if attrs['href'].startswith('mailto:'):
+ return attrs
+ attrs['target'] = '_blank'
+ return attrs
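
The two callbacks above are the stock ones Bleach ships; a minimal usage
sketch (the input string is illustrative):

    from bleach import linkify
    from bleach.callbacks import nofollow, target_blank

    # nofollow is already the default; target_blank additionally opens
    # links in a new tab. Both skip mailto: links.
    print linkify('see http://example.com', callbacks=[nofollow, target_blank])
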
diff --git a/bleach/sanitizer.py b/bleach/sanitizer.py
index 677287e..4640012 100644
--- a/bleach/sanitizer.py
+++ b/bleach/sanitizer.py
@@ -6,6 +6,10 @@ from html5lib.sanitizer import HTMLSanitizerMixin
from html5lib.tokenizer import HTMLTokenizer
+PROTOS = HTMLSanitizerMixin.acceptable_protocols
+PROTOS.remove('feed')
+
+
class BleachSanitizerMixin(HTMLSanitizerMixin):
"""Mixin to replace sanitize_token() and sanitize_css()."""
@@ -108,7 +112,7 @@ class BleachSanitizerMixin(HTMLSanitizerMixin):
# TODO: Make sure this does what it's meant to - I *think* it wants to
# validate style attribute contents.
parts = style.split(';')
- gauntlet = re.compile("""^([-/:,#%.'\sa-zA-Z0-9!]|\w-\w|'[\s\w]+'\s*"""
+ gauntlet = re.compile("""^([-/:,#%.'"\sa-zA-Z0-9!]|\w-\w|'[\s\w]+'\s*"""
"""|"[\s\w]+"|\([\d,%\.\s]+\))*$""")
for part in parts:
if not gauntlet.match(part):
diff --git a/bleach/tests/test_basics.py b/bleach/tests/test_basics.py
index 60be11d..9eca687 100644
--- a/bleach/tests/test_basics.py
+++ b/bleach/tests/test_basics.py
@@ -8,6 +8,10 @@ def test_empty():
eq_('', bleach.clean(''))
+def test_nbsp():
+ eq_(u'\xa0test string\xa0', bleach.clean('&nbsp;test string&nbsp;'))
+
+
def test_comments_only():
comment = '<!-- this is a comment -->'
open_comment = '<!-- this is an open comment'
@@ -91,7 +95,6 @@ def test_serializer():
def test_no_href_links():
s = u'<a name="anchor">x</a>'
eq_(s, bleach.linkify(s))
- eq_(s, bleach.linkify(s, nofollow=False))
def test_weird_strings():
diff --git a/bleach/tests/test_css.py b/bleach/tests/test_css.py
index fdb3f65..588c8ce 100644
--- a/bleach/tests/test_css.py
+++ b/bleach/tests/test_css.py
@@ -22,13 +22,21 @@ def test_allowed_css():
('cursor: -moz-grab;', 'cursor: -moz-grab;', ['cursor']),
('color: hsl(30,100%,50%);', 'color: hsl(30,100%,50%);', ['color']),
('color: rgba(255,0,0,0.4);', 'color: rgba(255,0,0,0.4);', ['color']),
- ("text-overflow: ',' ellipsis;", "text-overflow: ',' ellipsis;", ['text-overflow']),
+ ("text-overflow: ',' ellipsis;", "text-overflow: ',' ellipsis;",
+ ['text-overflow']),
+ ('text-overflow: "," ellipsis;', 'text-overflow: "," ellipsis;',
+ ['text-overflow']),
+ ('font-family: "Arial";', 'font-family: "Arial";', ['font-family']),
)
- p = '<p style="%s">bar</p>'
+ p_single = '<p style="%s">bar</p>'
+ p_double = "<p style='%s'>bar</p>"
- def check(input, output, styles):
- eq_(p % output, clean(p % input, styles=styles))
+ def check(i, o, s):
+ if '"' in i:
+ eq_(p_double % o, clean(p_double % i, styles=s))
+ else:
+ eq_(p_single % o, clean(p_single % i, styles=s))
for i, o, s in tests:
yield check, i, o, s
diff --git a/bleach/tests/test_delinkify.py b/bleach/tests/test_delinkify.py
deleted file mode 100644
index f216d2f..0000000
--- a/bleach/tests/test_delinkify.py
+++ /dev/null
@@ -1,109 +0,0 @@
-from nose.tools import eq_
-
-import bleach
-
-
-def test_delinkify():
- eq_('test', bleach.delinkify('<a href="http://ex.mp">test</a>'))
- eq_('footestbar',
- bleach.delinkify('foo<a href="http://ex.mp">test</a>bar'))
-
-
-def test_whitelist():
- html = '<a href="http://ex.mp">test</a>'
- eq_(html, bleach.delinkify(html, allow_domains=['ex.mp']))
- eq_('test', bleach.delinkify(html, allow_domains=['ex2.mp']))
- # Allow a single domain as a special case.
- eq_(html, bleach.delinkify(html, allow_domains='ex.mp'))
-
-
-def test_nested_a():
- html = '<a href="http://ex.mp">test<a href="http://foo.bar">test</a></a>'
- eq_('testtest', bleach.delinkify(html))
- eq_('<a href="http://ex.mp">test</a>test',
- bleach.delinkify(html, allow_domains=['ex.mp']))
-
-
-def test_nested_tag():
- html = '<a href="http://ex.mp">test<span>test</span></a>'
- eq_('test<span>test</span>', bleach.delinkify(html))
-
-
-def test_a_name():
- """Don't screw with non-link <a> tags."""
- html = '<a name="foo">bar</a>'
- eq_(html, bleach.delinkify(html))
-
-
-def test_relative():
- """Relative links are optionally OK."""
- html = 'some <a href="/foo/bar">link</a>'
- eq_('some link', bleach.delinkify(html))
- eq_(html, bleach.delinkify(html, allow_relative=True))
-
-
-def test_protocol_relative():
- """Protocol-relative links aren't relative."""
- html = 'bad <a href="//ex.mp">link</a>'
- expect = 'bad link'
- eq_(expect, bleach.delinkify(html))
- eq_(expect, bleach.delinkify(html, allow_relative=True))
- eq_(html, bleach.delinkify(html, allow_domains='ex.mp'))
-
-
-def test_domain_match():
- tests = (
- ('ex.mp', 'ex.mp', True),
- ('ex.mp', '*.ex.mp', True),
- ('test.ex.mp', '*.ex.mp', True),
- ('test.ex.mp', 'ex.mp', False),
- ('test.test.ex.mp', '*.ex.mp', False),
- ('test.test.ex.mp', '**.ex.mp', True),
- ('wrong.mp', 'ex.mp', False),
- ('wrong.mp', '*.ex.mp', False),
- ('really.wrong.mp', 'ex.mp', False),
- ('really.wrong.mp', '*.ex.mp', False),
- ('really.very.wrong.mp', '*.ex.mp', False),
- ('EX.mp', 'ex.mp', True), # Domains are case-insensitive.
- ('ex.mp', 'an.ex.mp', False),
- ('ex.mp', '*.an.ex.mp', False),
- ('an.ex.am.pl', 'an.*.am.pl', True),
- ('a.ex.am.pl', 'an.*.am.pl', False),
- ('ex.am.pl', 'an.*.am.pl', False),
- )
-
- def _check(t, c, v):
- eq_(v, bleach._domain_match(t, c))
-
- for t, c, v in tests:
- yield _check, t, c, v
-
-
-def test_double_star():
- assert bleach._domain_match('ex.mp', '**.ex.mp')
- try:
- bleach._domain_match('ex.mp', 'an.**.ex.mp')
- except bleach.ValidationError:
- pass
- else:
- assert False, '_domain_match should not accept an.**.ex.mp'
-
-
-def test_allow_subdomains():
- domains = ('ex.mp', '*.exa.mp', 'an.exam.pl', '*.my.examp.le')
- html = (
- ('<a href="http://an.ex.mp">bad</a>', 'bad'),
- ('<a href="http://exa.mp">good</a>', None),
- ('<a href="http://an.exa.mp">good</a>', None),
- ('<a href="http://an.exam.pl">good</a>', None),
- ('<a href="http://another.exam.pl">bad</a>', 'bad'),
- ('<a href="http://a.bad.examp.le">bad</a>', 'bad'),
- ('<a href="http://a.very.bad.examp.le">bad</a>', 'bad'),
- )
-
- def _check(html, text):
- output = bleach.delinkify(html, allow_domains=domains)
- eq_(html if text is None else text, output)
-
- for t, o in html:
- yield _check, t, o
diff --git a/bleach/tests/test_links.py b/bleach/tests/test_links.py
index 7caf006..ac593c4 100644
--- a/bleach/tests/test_links.py
+++ b/bleach/tests/test_links.py
@@ -3,11 +3,9 @@ import urllib
from html5lib.tokenizer import HTMLTokenizer
from nose.tools import eq_
-from bleach import linkify, url_re
+from bleach import linkify, url_re, DEFAULT_CALLBACKS as DC
-def filter_url(url):
- return u'http://bouncer/?u=%s' % urllib.quote_plus(url)
def test_url_re():
@@ -45,38 +43,139 @@ def test_trailing_slash():
def test_mangle_link():
+ """We can muck with the href attribute of the link."""
+ def filter_url(attrs, new=False):
+ attrs['href'] = (u'http://bouncer/?u=%s' %
+ urllib.quote_plus(attrs['href']))
+ return attrs
+
eq_('<a href="http://bouncer/?u=http%3A%2F%2Fexample.com" rel="nofollow">'
'http://example.com</a>',
- linkify('http://example.com', filter_url=filter_url))
+ linkify('http://example.com', DC + [filter_url]))
+
+
+def test_mangle_text():
+ """We can muck with the inner text of a link."""
+
+ def ft(attrs, new=False):
+ attrs['_text'] = 'bar'
+ return attrs
+
+ eq_('<a href="http://ex.mp">bar</a> <a href="http://ex.mp/foo">bar</a>',
+ linkify('http://ex.mp <a href="http://ex.mp/foo">foo</a>', [ft]))
def test_email_link():
- eq_('a james@example.com mailto',
- linkify('a james@example.com mailto'))
- eq_('a james@example.com.au mailto',
- linkify('a james@example.com.au mailto'))
- eq_('a <a href="mailto:james@example.com" rel="nofollow">'
- 'james@example.com</a> mailto',
- linkify('a james@example.com mailto', parse_email=True))
- eq_('aussie <a href="mailto:james@example.com.au" rel="nofollow">'
- 'james@example.com.au</a> mailto',
- linkify('aussie james@example.com.au mailto', parse_email=True))
- eq_('email to <a href="james@example.com" rel="nofollow">'
- 'james@example.com</a>',
- linkify('email to <a href="james@example.com">'
- 'james@example.com</a>', parse_email=True))
+ tests = (
+ ('a james@example.com mailto', False, 'a james@example.com mailto'),
+ ('a james@example.com.au mailto', False,
+ 'a james@example.com.au mailto'),
+ ('a <a href="mailto:james@example.com">james@example.com</a> mailto',
+ True, 'a james@example.com mailto'),
+ ('aussie <a href="mailto:james@example.com.au">'
+ 'james@example.com.au</a> mailto', True,
+ 'aussie james@example.com.au mailto'),
+ # This is kind of a pathological case. I guess we do our best here.
+ ('email to <a href="james@example.com" rel="nofollow">'
+ 'james@example.com</a>', True,
+ 'email to <a href="james@example.com">james@example.com</a>'),
+ )
+
+ def _check(o, p, i):
+ eq_(o, linkify(i, parse_email=p))
+
+ for (o, p, i) in tests:
+ yield _check, o, p, i
def test_email_link_escaping():
- eq_('''<a href='mailto:"james"@example.com' rel="nofollow">'''
- '''"james"@example.com</a>''',
- linkify('"james"@example.com', parse_email=True))
- eq_('''<a href="mailto:&quot;j'ames&quot;@example.com" rel="nofollow">'''
- '''"j'ames"@example.com</a>''',
- linkify('"j\'ames"@example.com', parse_email=True))
- eq_('''<a href='mailto:"ja>mes"@example.com' rel="nofollow">'''
- '''"ja&gt;mes"@example.com</a>''',
- linkify('"ja>mes"@example.com', parse_email=True))
+ tests = (
+ ('''<a href='mailto:"james"@example.com'>'''
+ '''"james"@example.com</a>''',
+ '"james"@example.com'),
+ ('''<a href="mailto:&quot;j'ames&quot;@example.com">'''
+ '''"j'ames"@example.com</a>''',
+ '"j\'ames"@example.com'),
+ ('''<a href='mailto:"ja>mes"@example.com'>'''
+ '''"ja&gt;mes"@example.com</a>''',
+ '"ja>mes"@example.com'),
+ )
+
+ def _check(o, i):
+ eq_(o, linkify(i, parse_email=True))
+
+ for (o, i) in tests:
+ yield _check, o, i
+
+
+def test_prevent_links():
+ """Returning None from any callback should remove links or prevent them
+ from being created."""
+
+ def no_new_links(attrs, new=False):
+ if new:
+ return None
+ return attrs
+
+ def no_old_links(attrs, new=False):
+ if not new:
+ return None
+ return attrs
+
+ def noop(attrs, new=False):
+ return attrs
+
+ in_text = 'a ex.mp <a href="http://example.com">example</a>'
+ out_text = 'a <a href="http://ex.mp">ex.mp</a> example'
+ tests = (
+ ([noop], ('a <a href="http://ex.mp">ex.mp</a> '
+ '<a href="http://example.com">example</a>'), 'noop'),
+ ([no_new_links, noop], in_text, 'no new, noop'),
+ ([noop, no_new_links], in_text, 'noop, no new'),
+ ([no_old_links, noop], out_text, 'no old, noop'),
+ ([noop, no_old_links], out_text, 'noop, no old'),
+ ([no_old_links, no_new_links], 'a ex.mp example', 'no links'),
+ )
+
+ def _check(cb, o, msg):
+ eq_(o, linkify(in_text, cb), msg)
+
+ for (cb, o, msg) in tests:
+ yield _check, cb, o, msg
+
+
+def test_set_attrs():
+ """We can set random attributes on links."""
+
+ def set_attr(attrs, new=False):
+ attrs['rev'] = 'canonical'
+ return attrs
+
+ eq_('<a href="http://ex.mp" rev="canonical">ex.mp</a>',
+ linkify('ex.mp', [set_attr]))
+
+
+def test_only_proto_links():
+ """Only create links if there's a protocol."""
+ def only_proto(attrs, new=False):
+ if new and not attrs['_text'].startswith(('http:', 'https:')):
+ return None
+ return attrs
+
+ in_text = 'a ex.mp http://ex.mp <a href="/foo">bar</a>'
+ out_text = ('a ex.mp <a href="http://ex.mp">http://ex.mp</a> '
+ '<a href="/foo">bar</a>')
+ eq_(out_text, linkify(in_text, [only_proto]))
+
+
+def test_stop_email():
+ """Returning None should prevent a link from being created."""
+ def no_email(attrs, new=False):
+ if attrs['href'].startswith('mailto:'):
+ return None
+ return attrs
+ text = 'do not link james@example.com'
+ eq_(text, linkify(text, parse_email=True, callbacks=[no_email]))
def test_tlds():
@@ -98,7 +197,7 @@ def test_escaping():
def test_nofollow_off():
eq_('<a href="http://example.com">example.com</a>',
- linkify(u'example.com', nofollow=False))
+ linkify(u'example.com', []))
def test_link_in_html():
@@ -297,16 +396,45 @@ def test_ports():
yield check, test, output
-def test_target():
- eq_('<a href="http://example.com" rel="nofollow" '
- 'target="_blank">example.com</a>',
- linkify(u'example.com', target='_blank'))
- eq_('<a href="http://example.com" target="_blank">example.com</a>',
- linkify(u'example.com', target='_blank', nofollow=False))
-
-
def test_tokenizer():
"""Linkify doesn't always have to sanitize."""
raw = '<em>test<x></x></em>'
eq_('<em>test&lt;x&gt;&lt;/x&gt;</em>', linkify(raw))
eq_(raw, linkify(raw, tokenizer=HTMLTokenizer))
+
+
+def test_ignore_bad_protocols():
+ eq_('foohttp://bar',
+ linkify('foohttp://bar'))
+ eq_('foohttp://<a href="http://exampl.com" rel="nofollow">exampl.com</a>',
+ linkify('foohttp://exampl.com'))
+
+
+def test_max_recursion_depth():
+ """If we hit the max recursion depth, just return the string."""
+ test = '<em>' * 2000 + 'foo' + '</em>' * 2000
+ eq_(test, linkify(test))
+
+
+def test_link_emails_and_urls():
+ """parse_email=True shouldn't prevent URLs from getting linkified."""
+ output = ('<a href="http://example.com" rel="nofollow">'
+ 'http://example.com</a> <a href="mailto:person@example.com">'
+ 'person@example.com</a>')
+ eq_(output, linkify('http://example.com person@example.com',
+ parse_email=True))
+
+
+def test_links_case_insensitive():
+ """Protocols and domain names are case insensitive."""
+ expect = ('<a href="HTTP://EXAMPLE.COM" rel="nofollow">'
+ 'HTTP://EXAMPLE.COM</a>')
+ eq_(expect, linkify('HTTP://EXAMPLE.COM'))
+
+
+def test_elements_inside_links():
+ eq_(u'<a href="#" rel="nofollow">hello<br></a>',
+ linkify('<a href="#">hello<br></a>'))
+
+ eq_(u'<a href="#" rel="nofollow"><strong>bold</strong> hello<br></a>',
+ linkify('<a href="#"><strong>bold</strong> hello<br></a>'))
diff --git a/bleach/tests/test_security.py b/bleach/tests/test_security.py
index 9e9bb7b..6c2b33f 100644
--- a/bleach/tests/test_security.py
+++ b/bleach/tests/test_security.py
@@ -106,3 +106,7 @@ def test_poster_attribute():
eq_(expect, clean(test, tags=tags, attributes=attrs))
ok = '<video poster="/foo.png"></video>'
eq_(ok, clean(ok, tags=tags, attributes=attrs))
+
+
+def test_feed_protocol():
+ eq_('<a>foo</a>', clean('<a href="feed:file:///tmp/foo">foo</a>'))
diff --git a/docs/Makefile b/docs/Makefile
new file mode 100644
index 0000000..81ad9f9
--- /dev/null
+++ b/docs/Makefile
@@ -0,0 +1,153 @@
+# Makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line.
+SPHINXOPTS =
+SPHINXBUILD = sphinx-build
+PAPER =
+BUILDDIR = _build
+
+# Internal variables.
+PAPEROPT_a4 = -D latex_paper_size=a4
+PAPEROPT_letter = -D latex_paper_size=letter
+ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
+# the i18n builder cannot share the environment and doctrees with the others
+I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
+
+.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext
+
+help:
+ @echo "Please use \`make <target>' where <target> is one of"
+ @echo " html to make standalone HTML files"
+ @echo " dirhtml to make HTML files named index.html in directories"
+ @echo " singlehtml to make a single large HTML file"
+ @echo " pickle to make pickle files"
+ @echo " json to make JSON files"
+ @echo " htmlhelp to make HTML files and a HTML help project"
+ @echo " qthelp to make HTML files and a qthelp project"
+ @echo " devhelp to make HTML files and a Devhelp project"
+ @echo " epub to make an epub"
+ @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
+ @echo " latexpdf to make LaTeX files and run them through pdflatex"
+ @echo " text to make text files"
+ @echo " man to make manual pages"
+ @echo " texinfo to make Texinfo files"
+ @echo " info to make Texinfo files and run them through makeinfo"
+ @echo " gettext to make PO message catalogs"
+ @echo " changes to make an overview of all changed/added/deprecated items"
+ @echo " linkcheck to check all external links for integrity"
+ @echo " doctest to run all doctests embedded in the documentation (if enabled)"
+
+clean:
+ -rm -rf $(BUILDDIR)/*
+
+html:
+ $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
+ @echo
+ @echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
+
+dirhtml:
+ $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
+ @echo
+ @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."
+
+singlehtml:
+ $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
+ @echo
+ @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."
+
+pickle:
+ $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
+ @echo
+ @echo "Build finished; now you can process the pickle files."
+
+json:
+ $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
+ @echo
+ @echo "Build finished; now you can process the JSON files."
+
+htmlhelp:
+ $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
+ @echo
+ @echo "Build finished; now you can run HTML Help Workshop with the" \
+ ".hhp project file in $(BUILDDIR)/htmlhelp."
+
+qthelp:
+ $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
+ @echo
+ @echo "Build finished; now you can run "qcollectiongenerator" with the" \
+ ".qhcp project file in $(BUILDDIR)/qthelp, like this:"
+ @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/Bleach.qhcp"
+ @echo "To view the help file:"
+ @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/Bleach.qhc"
+
+devhelp:
+ $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
+ @echo
+ @echo "Build finished."
+ @echo "To view the help file:"
+ @echo "# mkdir -p $$HOME/.local/share/devhelp/Bleach"
+ @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/Bleach"
+ @echo "# devhelp"
+
+epub:
+ $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
+ @echo
+ @echo "Build finished. The epub file is in $(BUILDDIR)/epub."
+
+latex:
+ $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
+ @echo
+ @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
+ @echo "Run \`make' in that directory to run these through (pdf)latex" \
+ "(use \`make latexpdf' here to do that automatically)."
+
+latexpdf:
+ $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
+ @echo "Running LaTeX files through pdflatex..."
+ $(MAKE) -C $(BUILDDIR)/latex all-pdf
+ @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
+
+text:
+ $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
+ @echo
+ @echo "Build finished. The text files are in $(BUILDDIR)/text."
+
+man:
+ $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
+ @echo
+ @echo "Build finished. The manual pages are in $(BUILDDIR)/man."
+
+texinfo:
+ $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
+ @echo
+ @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo."
+ @echo "Run \`make' in that directory to run these through makeinfo" \
+ "(use \`make info' here to do that automatically)."
+
+info:
+ $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
+ @echo "Running Texinfo files through makeinfo..."
+ make -C $(BUILDDIR)/texinfo info
+ @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo."
+
+gettext:
+ $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale
+ @echo
+ @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale."
+
+changes:
+ $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
+ @echo
+ @echo "The overview file is in $(BUILDDIR)/changes."
+
+linkcheck:
+ $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
+ @echo
+ @echo "Link check complete; look for any errors in the above output " \
+ "or in $(BUILDDIR)/linkcheck/output.txt."
+
+doctest:
+ $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
+ @echo "Testing of doctests in the sources finished, look at the " \
+ "results in $(BUILDDIR)/doctest/output.txt."
diff --git a/docs/clean.rst b/docs/clean.rst
new file mode 100644
index 0000000..a31dc89
--- /dev/null
+++ b/docs/clean.rst
@@ -0,0 +1,122 @@
+.. _clean-chapter:
+.. highlightlang:: python
+
+==================
+``bleach.clean()``
+==================
+
+``clean()`` is Bleach's HTML sanitization method::
+
+ def clean(text, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES,
+ styles=ALLOWED_STYLES, strip=False, strip_comments=True):
+ """Clean an HTML fragment and return it."""
+
+Given a fragment of HTML, Bleach will parse it according to the HTML5 parsing
+algorithm and sanitize any disallowed tags or attributes. This algorithm also
+takes care of things like unclosed and (some) misnested tags.
+
+.. note::
+ You may pass in a ``string`` or a ``unicode`` object, but Bleach will
+ always return ``unicode``.
+
+
+Tag Whitelist
+=============
+
+The ``tags`` kwarg is a whitelist of allowed HTML tags. It should be a list,
+tuple, or other iterable. Any other HTML tags will be escaped or stripped from
+the text. Its default value is a relatively conservative list found in
+``bleach.ALLOWED_TAGS``.
+
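+As a minimal sketch (the input string here is illustrative)::
+
+    >>> import bleach
+    >>> bleach.clean('an <b>allowed</b> and a <span>disallowed</span> tag',
+    ...              tags=['b'])
+    u'an <b>allowed</b> and a &lt;span&gt;disallowed&lt;/span&gt; tag'
+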
+
+Attribute Whitelist
+===================
+
+The ``attributes`` kwarg is a whitelist of attributes. It can be a list, in
+which case the attributes are allowed for any tag, or a dictionary, in which
+case the keys are tag names (or a wildcard: ``*`` for all tags) and the values
+are lists of allowed attributes.
+
+For example::
+
+ attrs = {
+ '*': ['class'],
+ 'a': ['href', 'rel'],
+ 'img': ['src', 'alt'],
+ }
+
+In this case, ``class`` is allowed on any allowed element (from the ``tags``
+argument), ``<a>`` tags are allowed to have ``href`` and ``rel`` attributes,
+and so on.
+
+The default value is also a conservative dict found in
+``bleach.ALLOWED_ATTRIBUTES``.
+
+
+Callable Filters
+----------------
+
+You can also use a callable (instead of a list) in the ``attributes`` kwarg. If
+the callable returns ``True``, the attribute is allowed. Otherwise, it is
+stripped. For example::
+
+    from urlparse import urlparse
+
+    def filter_src(name, value):
+ if name in ('alt', 'height', 'width'):
+ return True
+ if name == 'src':
+ p = urlparse(value)
+ return (not p.netloc) or p.netloc == 'mydomain.com'
+ return False
+
+ attrs = {
+ 'img': filter_src,
+ }
+
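+Hooked up to ``clean()``, a usage sketch (the ``img``-only tag list is
+illustrative)::
+
+    cleaned = bleach.clean(text, tags=['img'], attributes=attrs)
+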
+
+Styles Whitelist
+================
+
+If you allow the ``style`` attribute, you will also need to whitelist styles
+users are allowed to set, for example ``color`` and ``background-color``.
+
+The default value is an empty list, i.e., the ``style`` attribute will be
+allowed but no property values will be.
+
+For example, to allow users to set the color and font-weight of text::
+
+ attrs = {
+        '*': ['style']
+ }
+ tags = ['p', 'em', 'strong']
+ styles = ['color', 'font-weight']
+ cleaned_text = bleach.clean(text, tags, attrs, styles)
+
+
+Stripping Markup
+================
+
+By default, Bleach *escapes* disallowed or invalid markup. For example::
+
+ >>> bleach.clean('<span>is not allowed</span>')
+    u'&lt;span&gt;is not allowed&lt;/span&gt;'
+
+If you would rather Bleach stripped this markup entirely, you can pass
+``strip=True``::
+
+ >>> bleach.clean('<span>is not allowed</span>', strip=True)
+ u'is not allowed'
+
+
+Stripping Comments
+==================
+
+By default, Bleach will strip out HTML comments. To disable this behavior, set
+``strip_comments=False``::
+
+ >>> html = 'my<!-- commented --> html'
+
+ >>> bleach.clean(html)
+ u'my html'
+
+ >>> bleach.clean(html, strip_comments=False)
+ u'my<!-- commented --> html'
diff --git a/docs/conf.py b/docs/conf.py
new file mode 100644
index 0000000..a63aedf
--- /dev/null
+++ b/docs/conf.py
@@ -0,0 +1,242 @@
+# -*- coding: utf-8 -*-
+#
+# Bleach documentation build configuration file, created by
+# sphinx-quickstart on Fri May 11 21:11:39 2012.
+#
+# This file is execfile()d with the current directory set to its containing dir.
+#
+# Note that not all possible configuration values are present in this
+# autogenerated file.
+#
+# All configuration values have a default; values that are commented out
+# serve to show the default.
+
+import sys, os
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#sys.path.insert(0, os.path.abspath('.'))
+
+# -- General configuration -----------------------------------------------------
+
+# If your documentation needs a minimal Sphinx version, state it here.
+#needs_sphinx = '1.0'
+
+# Add any Sphinx extension module names here, as strings. They can be extensions
+# coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
+extensions = ['sphinx.ext.autodoc', 'sphinx.ext.pngmath', 'sphinx.ext.viewcode']
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['_templates']
+
+# The suffix of source filenames.
+source_suffix = '.rst'
+
+# The encoding of source files.
+#source_encoding = 'utf-8-sig'
+
+# The master toctree document.
+master_doc = 'index'
+
+# General information about the project.
+project = u'Bleach'
+copyright = u'2012, James Socol'
+
+# The version info for the project you're documenting, acts as replacement for
+# |version| and |release|, also used in various other places throughout the
+# built documents.
+#
+# The short X.Y version.
+version = '1.2'
+# The full version, including alpha/beta/rc tags.
+release = '1.2.0'
+
+# The language for content autogenerated by Sphinx. Refer to documentation
+# for a list of supported languages.
+#language = None
+
+# There are two options for replacing |today|: either, you set today to some
+# non-false value, then it is used:
+#today = ''
+# Else, today_fmt is used as the format for a strftime call.
+#today_fmt = '%B %d, %Y'
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+exclude_patterns = ['_build']
+
+# The reST default role (used for this markup: `text`) to use for all documents.
+#default_role = None
+
+# If true, '()' will be appended to :func: etc. cross-reference text.
+#add_function_parentheses = True
+
+# If true, the current module name will be prepended to all description
+# unit titles (such as .. function::).
+#add_module_names = True
+
+# If true, sectionauthor and moduleauthor directives will be shown in the
+# output. They are ignored by default.
+#show_authors = False
+
+# The name of the Pygments (syntax highlighting) style to use.
+pygments_style = 'sphinx'
+
+# A list of ignored prefixes for module index sorting.
+#modindex_common_prefix = []
+
+
+# -- Options for HTML output ---------------------------------------------------
+
+# The theme to use for HTML and HTML Help pages. See the documentation for
+# a list of builtin themes.
+html_theme = 'default'
+
+# Theme options are theme-specific and customize the look and feel of a theme
+# further. For a list of options available for each theme, see the
+# documentation.
+#html_theme_options = {}
+
+# Add any paths that contain custom themes here, relative to this directory.
+#html_theme_path = []
+
+# The name for this set of Sphinx documents. If None, it defaults to
+# "<project> v<release> documentation".
+#html_title = None
+
+# A shorter title for the navigation bar. Default is the same as html_title.
+#html_short_title = None
+
+# The name of an image file (relative to this directory) to place at the top
+# of the sidebar.
+#html_logo = None
+
+# The name of an image file (within the static path) to use as favicon of the
+# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
+# pixels large.
+#html_favicon = None
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['_static']
+
+# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
+# using the given strftime format.
+#html_last_updated_fmt = '%b %d, %Y'
+
+# If true, SmartyPants will be used to convert quotes and dashes to
+# typographically correct entities.
+#html_use_smartypants = True
+
+# Custom sidebar templates, maps document names to template names.
+#html_sidebars = {}
+
+# Additional templates that should be rendered to pages, maps page names to
+# template names.
+#html_additional_pages = {}
+
+# If false, no module index is generated.
+#html_domain_indices = True
+
+# If false, no index is generated.
+#html_use_index = True
+
+# If true, the index is split into individual pages for each letter.
+#html_split_index = False
+
+# If true, links to the reST sources are added to the pages.
+#html_show_sourcelink = True
+
+# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
+#html_show_sphinx = True
+
+# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
+#html_show_copyright = True
+
+# If true, an OpenSearch description file will be output, and all pages will
+# contain a <link> tag referring to it. The value of this option must be the
+# base URL from which the finished HTML is served.
+#html_use_opensearch = ''
+
+# This is the file name suffix for HTML files (e.g. ".xhtml").
+#html_file_suffix = None
+
+# Output file base name for HTML help builder.
+htmlhelp_basename = 'Bleachdoc'
+
+
+# -- Options for LaTeX output --------------------------------------------------
+
+latex_elements = {
+# The paper size ('letterpaper' or 'a4paper').
+#'papersize': 'letterpaper',
+
+# The font size ('10pt', '11pt' or '12pt').
+#'pointsize': '10pt',
+
+# Additional stuff for the LaTeX preamble.
+#'preamble': '',
+}
+
+# Grouping the document tree into LaTeX files. List of tuples
+# (source start file, target name, title, author, documentclass [howto/manual]).
+latex_documents = [
+ ('index', 'Bleach.tex', u'Bleach Documentation',
+ u'James Socol', 'manual'),
+]
+
+# The name of an image file (relative to this directory) to place at the top of
+# the title page.
+#latex_logo = None
+
+# For "manual" documents, if this is true, then toplevel headings are parts,
+# not chapters.
+#latex_use_parts = False
+
+# If true, show page references after internal links.
+#latex_show_pagerefs = False
+
+# If true, show URL addresses after external links.
+#latex_show_urls = False
+
+# Documents to append as an appendix to all manuals.
+#latex_appendices = []
+
+# If false, no module index is generated.
+#latex_domain_indices = True
+
+
+# -- Options for manual page output --------------------------------------------
+
+# One entry per manual page. List of tuples
+# (source start file, name, description, authors, manual section).
+man_pages = [
+ ('index', 'bleach', u'Bleach Documentation',
+ [u'James Socol'], 1)
+]
+
+# If true, show URL addresses after external links.
+#man_show_urls = False
+
+
+# -- Options for Texinfo output ------------------------------------------------
+
+# Grouping the document tree into Texinfo files. List of tuples
+# (source start file, target name, title, author,
+# dir menu entry, description, category)
+texinfo_documents = [
+ ('index', 'Bleach', u'Bleach Documentation',
+ u'James Socol', 'Bleach', 'One line description of project.',
+ 'Miscellaneous'),
+]
+
+# Documents to append as an appendix to all manuals.
+#texinfo_appendices = []
+
+# If false, no module index is generated.
+#texinfo_domain_indices = True
+
+# How to display URL addresses: 'footnote', 'no', or 'inline'.
+#texinfo_show_urls = 'footnote'
diff --git a/docs/goals.rst b/docs/goals.rst
new file mode 100644
index 0000000..5477f9c
--- /dev/null
+++ b/docs/goals.rst
@@ -0,0 +1,76 @@
+===============
+Goals of Bleach
+===============
+
+This document lists the goals and non-goals of Bleach. My hope is that by
+focusing on these goals and explicitly listing the non-goals, the project will
+evolve in a stronger direction.
+
+
+Goals
+=====
+
+
+Whitelisting
+------------
+
+Bleach should always take a whitelist-based approach to allowing any kind of
+content or markup. Blacklisting is error-prone and not future proof.
+
+For example, you should have to opt-in to allowing the ``onclick`` attribute,
+not blacklist all the other ``on*`` attributes. Future versions of HTML may add
+new event handlers, like ``ontouch``, that old blacklists would not prevent.
+
+
+Sanitizing Input
+----------------
+
+The primary goal of Bleach is to sanitize user input that is allowed to contain
+*some* HTML as markup and is to be included in the content of a larger page.
+Examples might include:
+
+* User comments on a blog.
+
+* "Bio" sections of a user profile.
+
+* Descriptions of a product or application.
+
+These examples, and others, are traditionally prone to security issues like XSS
+or other script injection, or annoying issues like unclosed tags and invalid
+markup. Bleach will take a proactive, whitelist-only approach to allowing HTML
+content, and will use the HTML5 parsing algorithm to handle invalid markup.
+
+See the :ref:`chapter on clean() <clean-chapter>` for more info.
+
+
+Safely Creating Links
+---------------------
+
+The secondary goal of Bleach is to provide a mechanism for finding or altering
+links (``<a>`` tags with ``href`` attributes, or things that look like URLs or
+email addresses) in text.
+
+While Bleach itself will always operate on a whitelist-based security model,
+the :ref:`linkify() method <linkify-chapter>` is flexible enough to allow the
+creation, alteration, and removal of links based on an extremely wide range of
+use cases.
+
+
+Non-Goals
+=========
+
+Bleach is designed to work with fragments of HTML by untrusted users. Some
+non-goal use cases include:
+
+* **Sanitizing complete HTML documents.** Once you're creating whole documents,
+ you have to allow so many tags that a blacklist approach (e.g. forbidding
+ ``<script>`` or ``<object>``) may be more appropriate.
+
+* **Cleaning up after trusted users.** Bleach is powerful but it is not fast.
+ If you trust your users, trust them and don't rely on Bleach to clean up
+ their mess.
+
+* **Allowing arbitrary styling.** There are a number of interesting CSS
+ properties that can do dangerous things, like Opera's ``-o-link``. Painful as
+ it is, if you want your users to be able to change nearly anything in a
+ ``style`` attribute, you should have to opt into this.
diff --git a/docs/index.rst b/docs/index.rst
new file mode 100644
index 0000000..0929e53
--- /dev/null
+++ b/docs/index.rst
@@ -0,0 +1,69 @@
+.. Bleach documentation master file, created by
+ sphinx-quickstart on Fri May 11 21:11:39 2012.
+ You can adapt this file completely to your liking, but it should at least
+ contain the root `toctree` directive.
+
+Welcome to Bleach's documentation!
+==================================
+
+Bleach is a whitelist-based HTML sanitization and text linkification library.
+It is designed to take untrusted user input that may contain *some* HTML.
+
+Because Bleach uses html5lib_ to parse document fragments the same way browsers
+do, it is extremely resilient to unknown attacks, much more so than
+regular-expression-based sanitizers.
+
+Bleach's ``linkify`` function is highly configurable and can be used to find,
+edit, and filter links most other auto-linkers can't.
+
+The version of Bleach on GitHub_ is always the most up-to-date, and the
+``master`` branch should always work.
+
+.. warning::
+
+ Bleach is currently incompatible with html5lib 1.0b and any versions below
+ 0.9.5.
+
+
+Installing Bleach
+=================
+
+Bleach is available on PyPI_, so you can install it with ``pip``::
+
+ $ pip install bleach
+
+Or with ``easy_install``::
+
+ $ easy_install bleach
+
+Or by cloning the repo from GitHub_::
+
+ $ git clone git://github.com/jsocol/bleach.git
+
+Then install it by running::
+
+ $ python setup.py install
+
+
+Contents:
+=========
+
+.. toctree::
+ :maxdepth: 2
+
+ clean
+ linkify
+ goals
+
+
+
+Indices and tables
+==================
+
+* :ref:`genindex`
+* :ref:`modindex`
+* :ref:`search`
+
+.. _html5lib: http://code.google.com/p/html5lib/
+.. _GitHub: https://github.com/jsocol/bleach
+.. _PyPI: http://pypi.python.org/pypi/bleach
diff --git a/docs/linkify.rst b/docs/linkify.rst
new file mode 100644
index 0000000..42de69c
--- /dev/null
+++ b/docs/linkify.rst
@@ -0,0 +1,212 @@
+.. _linkify-chapter:
+.. highlightlang:: python
+
+====================
+``bleach.linkify()``
+====================
+
+``linkify()`` searches text for links, URLs, and email addresses and lets you
+control how and when those links are rendered::
+
+ def linkify(text, callbacks=DEFAULT_CALLBACKS, skip_pre=False,
+ parse_email=False, tokenizer=HTMLSanitizer):
+ """Convert URL-like strings in an HTML fragment to links.
+
+``linkify()`` works by building a document tree, so it's guaranteed never to do
+weird things to URLs in attribute values, can modify the value of attributes on
+``<a>`` tags, and can even do things like skip ``<pre>`` sections.
+
+By default, ``linkify()`` will perform some sanitization, only allowing a set
+of "safe" tags. Because it uses the HTML5 parsing algorithm, it will always
+handle things like unclosed tags.
+
+.. note::
+ You may pass a ``string`` or ``unicode`` object, but Bleach will always
+ return ``unicode``.
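+
+A minimal call with the default callbacks looks like this (a sketch; the exact
+serialization of the output, such as attribute order, may vary)::
+
+    import bleach
+
+    linked = bleach.linkify(u'visit http://example.com now')
+    # u'visit <a href="http://example.com" rel="nofollow">http://example.com</a> now'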
+
+
+Callbacks
+=========
+
+The second argument to ``linkify()`` is a list or other iterable of callback
+functions. These callbacks can modify links that exist and links that are being
+created, or remove them completely.
+
+Each callback will get the following arguments::
+
+ def my_callback(attrs, new=False):
+
+The ``attrs`` argument is a dict of attributes of the ``<a>`` tag. The ``new``
+argument is a boolean indicating whether the link is new (e.g. an email address
+or URL found in the text) or already existed (e.g. an ``<a>`` tag found in the
+text). The ``attrs`` dict also contains a ``_text`` key, which is the innerText
+of the ``<a>`` tag.
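+
+For a bare URL found in the text, the ``attrs`` dict might look like this
+(illustrative values)::
+
+    attrs = {
+        'href': 'http://example.com',
+        '_text': 'http://example.com',
+    }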
+
+The callback must return a dict of attributes (including ``_text``) or
+``None``. The returned dict of attributes is passed to the next callback in the
+list. If any callback returns ``None``, a new link will not be created and the
+original text will be left in place; an existing link will be removed, leaving
+its original innerText in place.
+
+The default callback list simply adds ``rel="nofollow"``. See
+``bleach.callbacks`` for the included callback functions.
+
+
+Setting Attributes
+------------------
+
+For example, to set ``rel="nofollow"`` on all links found in the text, a simple
+(and included) callback might be::
+
+ def set_nofollow(attrs, new=False):
+ attrs['rel'] = 'nofollow'
+ return attrs
+
+This would overwrite the value of the ``rel`` attribute if it was already set.
+
+You could also make external links open in a new tab, or set a class::
+
+    from urlparse import urlparse
+
+    def set_target(attrs, new=False):
+        p = urlparse(attrs['href'])
+        if p.netloc not in ['my-domain.com', 'other-domain.com']:
+            # External link: open it in a new tab and mark it for styling.
+            attrs['target'] = '_blank'
+            attrs['class'] = 'external'
+        else:
+            # Internal link: make sure it has no target.
+            attrs.pop('target', None)
+        return attrs
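+
+Callbacks run in order on each link, so the two callbacks above can be combined
+in a single call (a sketch, assuming both functions are in scope)::
+
+    linked = linkify(text, callbacks=[set_nofollow, set_target])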
+
+
+Removing Attributes
+-------------------
+
+You can easily remove attributes you don't want to allow, even on existing
+links (``<a>`` tags) in the text. (See also :ref:`clean() <clean-chapter>` for
+sanitizing attributes.)
+
+::
+
+ def allowed_attributes(attrs, new=False):
+ """Only allow href, target, rel and title."""
+ allowed = ['href', 'target', 'rel', 'title']
+ return dict((k, v) for k, v in attrs.items() if k in allowed)
+
+Or you could remove a specific attribute, if it exists::
+
+ def remove_title1(attrs, new=False):
+ attrs.pop('title', None)
+ return attrs
+
+ def remove_title2(attrs, new=False):
+ if 'title' in attrs:
+ del attrs['title']
+ return attrs
+
+
+Altering Attributes
+-------------------
+
+You can alter and overwrite attributes, including the link text (via the
+``_text`` key), to do things like pass outgoing links through a warning page or
+limit the length of text inside an ``<a>`` tag.
+
+::
+
+ def shorten_url(attrs, new=False):
+ """Shorten overly-long URLs in the text."""
+ if not new: # Only looking at newly-created links.
+ return attrs
+ # _text will be the same as the URL for new links.
+ text = attrs['_text']
+ if len(text) > 25:
+ attrs['_text'] = text[0:22] + '...'
+ return attrs
+
+::
+
+    from urllib import quote
+ from urlparse import urlparse
+
+ def outgoing_bouncer(attrs, new=False):
+ """Send outgoing links through a bouncer."""
+ p = urlparse(attrs['href'])
+ if p.netloc not in ['my-domain.com', 'www.my-domain.com', '']:
+ bouncer = 'http://outgoing.my-domain.com/?destination=%s'
+ attrs['href'] = bouncer % quote(attrs['href'])
+ return attrs
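+
+With that callback, an external URL would be rewritten along these lines
+(illustrative output; ``quote`` leaves ``/`` unescaped by default)::
+
+    linked = linkify('see http://example.com', callbacks=[outgoing_bouncer])
+    # u'see <a href="http://outgoing.my-domain.com/?destination=http%3A//example.com">http://example.com</a>'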
+
+
+Preventing Links
+----------------
+
+A slightly more complex example is inspired by Crate_, where strings like
+``models.py`` are often found and mistakenly linkified. ``.py`` is the ccTLD
+for Paraguay, so ``example.py`` may be a legitimate URL, but on a site
+dedicated to Python packages, odds are it is not. In this case, Crate_ could
+write the following callback::
+
+ def dont_linkify_python(attrs, new=False):
+ if not new: # This is an existing <a> tag, leave it be.
+ return attrs
+
+ # If the TLD is '.py', make sure it starts with http: or https:
+ href = attrs['href']
+ if href.endswith('.py') and not href.startswith(('http:', 'https:')):
+ # This looks like a Python file, not a URL. Don't make a link.
+ return None
+
+ # Everything checks out, keep going to the next callback.
+ return attrs
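+
+With this callback in place, bare module names stay as plain text while real
+URLs are still linkified (illustrative)::
+
+    linked = linkify('see models.py', callbacks=[dont_linkify_python])
+    # u'see models.py'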
+
+
+Removing Links
+--------------
+
+If you want to remove certain links, even if they are written in the text with
+``<a>`` tags, you can still return ``None``::
+
+ def remove_mailto(attrs, new=False):
+ """Remove any mailto: links."""
+ if attrs['href'].startswith('mailto:'):
+ return None
+ return attrs
+
+
+``skip_pre``
+============
+
+``<pre>`` tags often contain special, literal sections. If you don't want to
+create any new links within a ``<pre>`` section, pass ``skip_pre=True``.
+
+.. note::
+ Though new links will not be created, existing links created with ``<a>``
+ tags will still be passed through all the callbacks.
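+
+A sketch of ``skip_pre`` in use::
+
+    text = 'inline http://example.com <pre>literal http://example.com</pre>'
+    linked = linkify(text, skip_pre=True)
+    # Only the URL outside the <pre> section becomes a link.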
+
+
+``parse_email``
+===============
+
+By default, ``linkify()`` does not create ``mailto:`` links for email
+addresses, but if you pass ``parse_email=True``, it will. ``mailto:`` links
+will go through exactly the same set of callbacks as all other links, whether
+they are newly created or already in the text, so be careful when writing
+callbacks that may need to behave differently if the protocol is ``mailto:``.
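+
+For example, with an empty callback list, so no extra attributes are added (a
+sketch)::
+
+    linked = linkify('reach me at me@example.com', parse_email=True,
+                     callbacks=[])
+    # u'reach me at <a href="mailto:me@example.com">me@example.com</a>'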
+
+
+``tokenizer``
+=============
+
+``linkify()`` uses the ``html5lib.sanitizer.HTMLSanitizer`` tokenizer by
+default. This has the effect of scrubbing some tags and attributes. To use a
+more lenient, or totally different, tokenizer, you can specify the tokenizer
+class here. (See the implementation of :ref:`clean() <clean-chapter>` for an
+example of building a custom tokenizer.)
+
+::
+
+    from bleach import linkify
+    from html5lib.tokenizer import HTMLTokenizer
+
+    linked_text = linkify(text, tokenizer=HTMLTokenizer)
+
+
+.. _Crate: https://crate.io/
diff --git a/docs/make.bat b/docs/make.bat
new file mode 100644
index 0000000..84c919b
--- /dev/null
+++ b/docs/make.bat
@@ -0,0 +1,190 @@
+@ECHO OFF
+
+REM Command file for Sphinx documentation
+
+if "%SPHINXBUILD%" == "" (
+ set SPHINXBUILD=sphinx-build
+)
+set BUILDDIR=_build
+set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% .
+set I18NSPHINXOPTS=%SPHINXOPTS% .
+if NOT "%PAPER%" == "" (
+ set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS%
+ set I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS%
+)
+
+if "%1" == "" goto help
+
+if "%1" == "help" (
+ :help
+ echo.Please use `make ^<target^>` where ^<target^> is one of
+ echo. html to make standalone HTML files
+ echo. dirhtml to make HTML files named index.html in directories
+ echo. singlehtml to make a single large HTML file
+ echo. pickle to make pickle files
+ echo. json to make JSON files
+ echo. htmlhelp to make HTML files and a HTML help project
+ echo. qthelp to make HTML files and a qthelp project
+ echo. devhelp to make HTML files and a Devhelp project
+ echo. epub to make an epub
+ echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter
+ echo. text to make text files
+ echo. man to make manual pages
+ echo. texinfo to make Texinfo files
+ echo. gettext to make PO message catalogs
+ echo. changes to make an overview over all changed/added/deprecated items
+ echo. linkcheck to check all external links for integrity
+ echo. doctest to run all doctests embedded in the documentation if enabled
+ goto end
+)
+
+if "%1" == "clean" (
+ for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i
+ del /q /s %BUILDDIR%\*
+ goto end
+)
+
+if "%1" == "html" (
+ %SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html
+ if errorlevel 1 exit /b 1
+ echo.
+ echo.Build finished. The HTML pages are in %BUILDDIR%/html.
+ goto end
+)
+
+if "%1" == "dirhtml" (
+ %SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml
+ if errorlevel 1 exit /b 1
+ echo.
+ echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml.
+ goto end
+)
+
+if "%1" == "singlehtml" (
+ %SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml
+ if errorlevel 1 exit /b 1
+ echo.
+ echo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml.
+ goto end
+)
+
+if "%1" == "pickle" (
+ %SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle
+ if errorlevel 1 exit /b 1
+ echo.
+ echo.Build finished; now you can process the pickle files.
+ goto end
+)
+
+if "%1" == "json" (
+ %SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json
+ if errorlevel 1 exit /b 1
+ echo.
+ echo.Build finished; now you can process the JSON files.
+ goto end
+)
+
+if "%1" == "htmlhelp" (
+ %SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp
+ if errorlevel 1 exit /b 1
+ echo.
+ echo.Build finished; now you can run HTML Help Workshop with the ^
+.hhp project file in %BUILDDIR%/htmlhelp.
+ goto end
+)
+
+if "%1" == "qthelp" (
+ %SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp
+ if errorlevel 1 exit /b 1
+ echo.
+ echo.Build finished; now you can run "qcollectiongenerator" with the ^
+.qhcp project file in %BUILDDIR%/qthelp, like this:
+ echo.^> qcollectiongenerator %BUILDDIR%\qthelp\Bleach.qhcp
+ echo.To view the help file:
+ echo.^> assistant -collectionFile %BUILDDIR%\qthelp\Bleach.qhc
+ goto end
+)
+
+if "%1" == "devhelp" (
+ %SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp
+ if errorlevel 1 exit /b 1
+ echo.
+ echo.Build finished.
+ goto end
+)
+
+if "%1" == "epub" (
+ %SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub
+ if errorlevel 1 exit /b 1
+ echo.
+ echo.Build finished. The epub file is in %BUILDDIR%/epub.
+ goto end
+)
+
+if "%1" == "latex" (
+ %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
+ if errorlevel 1 exit /b 1
+ echo.
+ echo.Build finished; the LaTeX files are in %BUILDDIR%/latex.
+ goto end
+)
+
+if "%1" == "text" (
+ %SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text
+ if errorlevel 1 exit /b 1
+ echo.
+ echo.Build finished. The text files are in %BUILDDIR%/text.
+ goto end
+)
+
+if "%1" == "man" (
+ %SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man
+ if errorlevel 1 exit /b 1
+ echo.
+ echo.Build finished. The manual pages are in %BUILDDIR%/man.
+ goto end
+)
+
+if "%1" == "texinfo" (
+ %SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo
+ if errorlevel 1 exit /b 1
+ echo.
+ echo.Build finished. The Texinfo files are in %BUILDDIR%/texinfo.
+ goto end
+)
+
+if "%1" == "gettext" (
+ %SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale
+ if errorlevel 1 exit /b 1
+ echo.
+ echo.Build finished. The message catalogs are in %BUILDDIR%/locale.
+ goto end
+)
+
+if "%1" == "changes" (
+ %SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes
+ if errorlevel 1 exit /b 1
+ echo.
+ echo.The overview file is in %BUILDDIR%/changes.
+ goto end
+)
+
+if "%1" == "linkcheck" (
+ %SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck
+ if errorlevel 1 exit /b 1
+ echo.
+ echo.Link check complete; look for any errors in the above output ^
+or in %BUILDDIR%/linkcheck/output.txt.
+ goto end
+)
+
+if "%1" == "doctest" (
+ %SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest
+ if errorlevel 1 exit /b 1
+ echo.
+ echo.Testing of doctests in the sources finished, look at the ^
+results in %BUILDDIR%/doctest/output.txt.
+ goto end
+)
+
+:end
diff --git a/requirements.txt b/requirements.txt
index c525a9e..1500a14 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,3 +1,3 @@
# These are the requirements to run the test suite.
-nose
-html5lib
+nose==1.3.0
+html5lib==0.9.5
diff --git a/setup.py b/setup.py
index 115d811..e48c3f7 100644
--- a/setup.py
+++ b/setup.py
@@ -2,7 +2,7 @@ from setuptools import setup, find_packages
setup(
name='bleach',
- version='1.1.5',
+ version='1.2.2',
description='An easy whitelist-based HTML-sanitizing tool.',
long_description=open('README.rst').read(),
author='James Socol',
@@ -13,7 +13,7 @@ setup(
include_package_data=True,
package_data={'': ['README.rst']},
zip_safe=False,
- install_requires=['html5lib>=0.95'],
+ install_requires=['html5lib==0.95'],
classifiers=[
'Development Status :: 4 - Beta',
'Environment :: Web Environment',