diff --git a/docs/clean.rst b/docs/clean.rst
new file mode 100644
index 0000000..a31dc89
--- /dev/null
+++ b/docs/clean.rst
@@ -0,0 +1,122 @@
+.. _clean-chapter:
+.. highlightlang:: python
+``clean()`` is Bleach's HTML sanitization method::
+ def clean(text, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES,
+           styles=ALLOWED_STYLES, strip=False, strip_comments=True):
+     """Clean an HTML fragment and return it."""
+Given a fragment of HTML, Bleach will parse it according to the HTML5 parsing
+algorithm and sanitize any disallowed tags or attributes. This algorithm also
+takes care of things like unclosed and (some) misnested tags.
+.. note::
+ You may pass in a ``str`` or a ``unicode`` object, but Bleach will
+ always return ``unicode``.
+Tag Whitelist
+The ``tags`` kwarg is a whitelist of allowed HTML tags. It should be a list,
+tuple, or other iterable. Any other HTML tags will be escaped or stripped from
+the text. Its default value, ``ALLOWED_TAGS``, is a relatively conservative list.
+Attribute Whitelist
+The ``attributes`` kwarg is a whitelist of attributes. It can be a list, in
+which case the attributes are allowed for any tag, or a dictionary, in which
+case the keys are tag names (or a wildcard: ``*`` for all tags) and the values
+are lists of allowed attributes.
+For example::
+ attrs = {
+     '*': ['class'],
+     'a': ['href', 'rel'],
+     'img': ['src', 'alt'],
+ }
+In this case, ``class`` is allowed on any allowed element (from the ``tags``
+argument), ``<a>`` tags are allowed to have ``href`` and ``rel`` attributes,
+and so on.
+The default value, ``ALLOWED_ATTRIBUTES``, is also a conservative dict.
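+The lookup rules described above can be sketched in a few lines. This is an
+illustration only, not Bleach's implementation: ``is_attribute_allowed`` is a
+hypothetical helper, and it only handles list values (not callables).

```python
def is_attribute_allowed(tag, name, allowed):
    """Hypothetical helper: decide whether `name` is allowed on `tag`."""
    if isinstance(allowed, dict):
        # Wildcard entries apply to every tag; tag-specific entries add to them.
        names = list(allowed.get('*', [])) + list(allowed.get(tag, []))
        return name in names
    # A plain list (or other iterable) applies to any tag.
    return name in allowed


attrs = {
    '*': ['class'],
    'a': ['href', 'rel'],
    'img': ['src', 'alt'],
}

print(is_attribute_allowed('a', 'href', attrs))     # True
print(is_attribute_allowed('p', 'class', attrs))    # True
print(is_attribute_allowed('p', 'href', attrs))     # False
print(is_attribute_allowed('img', 'src', ['src']))  # True
```

+The sketch assumes the tag itself has already passed the ``tags`` whitelist.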
+Callable Filters
+You can also use a callable (instead of a list) in the ``attributes`` kwarg. If
+the callable returns ``True``, the attribute is allowed. Otherwise, it is
+stripped. For example::
+ from urlparse import urlparse
+ def filter_src(name, value):
+     if name in ('alt', 'height', 'width'):
+         return True
+     if name == 'src':
+         p = urlparse(value)
+         # Allow relative URLs and URLs on our own domain only.
+         return (not p.netloc) or p.netloc == 'mydomain.com'
+     return False
+ attrs = {
+     'img': filter_src,
+ }
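+To see what such a filter decides, it can be called directly, without Bleach.
+The sketch below is the same function written for Python 3 (where
+``urlparse`` lives in ``urllib.parse``); the sample attribute values are made
+up for illustration.

```python
from urllib.parse import urlparse  # Python 3 location of urlparse


def filter_src(name, value):
    if name in ('alt', 'height', 'width'):
        return True
    if name == 'src':
        p = urlparse(value)
        # Allow relative URLs and URLs on our own domain only.
        return (not p.netloc) or p.netloc == 'mydomain.com'
    return False


print(filter_src('alt', 'A photo'))                      # True
print(filter_src('src', '/static/cat.png'))              # True (relative URL)
print(filter_src('src', 'http://mydomain.com/cat.png'))  # True
print(filter_src('src', 'http://evil.example/cat.png'))  # False
print(filter_src('onerror', 'alert(1)'))                 # False
```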
+Styles Whitelist
+If you allow the ``style`` attribute, you will also need to whitelist styles
+users are allowed to set, for example ``color`` and ``background-color``.
+The default value is an empty list, i.e., the ``style`` attribute will be
+allowed but no values will be.
+For example, to allow users to set the color and font-weight of text::
+ attrs = {
+     '*': ['style']
+ }
+ tags = ['p', 'em', 'strong']
+ styles = ['color', 'font-weight']
+ cleaned_text = bleach.clean(text, tags, attrs, styles)
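+The effect of a styles whitelist can be pictured with a deliberately naive
+sketch. ``filter_style`` is a hypothetical helper, not part of Bleach: it
+splits on ``;`` and ``:`` and does not validate values, so it is not a safe
+CSS sanitizer and should not be used as one.

```python
def filter_style(style, allowed_styles):
    """Hypothetical helper: keep only whitelisted CSS declarations."""
    kept = []
    for declaration in style.split(';'):
        if ':' not in declaration:
            continue
        prop, value = declaration.split(':', 1)
        prop = prop.strip().lower()
        if prop in allowed_styles:
            kept.append('%s: %s;' % (prop, value.strip()))
    return ' '.join(kept)


style = 'color: red; position: fixed; font-weight: bold'
print(filter_style(style, ['color', 'font-weight']))
# color: red; font-weight: bold;

# With an empty whitelist nothing survives, which matches the default
# described above: the attribute is allowed but no values are.
print(repr(filter_style(style, [])))  # ''
```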
+Stripping Markup
+By default, Bleach *escapes* disallowed or invalid markup. For example::
+ >>> bleach.clean('<span>is not allowed</span>')
+ u'&lt;span&gt;is not allowed&lt;/span&gt;'
+If you would rather Bleach stripped this markup entirely, you can pass
+``strip=True``::
+ >>> bleach.clean('<span>is not allowed</span>', strip=True)
+ u'is not allowed'
+Stripping Comments
+By default, Bleach will strip out HTML comments. To disable this behavior, set
+``strip_comments=False``::
+ >>> html = 'my<!-- commented --> html'
+ >>> bleach.clean(html)
+ u'my html'
+ >>> bleach.clean(html, strip_comments=False)
+ u'my<!-- commented --> html'
diff --git a/docs/goals.rst b/docs/goals.rst
new file mode 100644
index 0000000..5477f9c
--- /dev/null
+++ b/docs/goals.rst
@@ -0,0 +1,76 @@
+Goals of Bleach
+This document lists the goals and non-goals of Bleach. My hope is that by
+focusing on these goals and explicitly listing the non-goals, the project will
+evolve in a stronger direction.
+Bleach should always take a whitelist-based approach to allowing any kind of
+content or markup. Blacklisting is error-prone and not future-proof.
+For example, you should have to opt in to allowing the ``onclick`` attribute,
+not blacklist all the other ``on*`` attributes. Future versions of HTML may add
+new event handlers, like ``ontouch``, that old blacklists would not prevent.
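+The difference can be shown with two toy checks. The attribute sets below are
+invented for illustration and are not Bleach's defaults; ``ontouch`` stands in
+for a handler introduced after the blacklist was written.

```python
# Event-handler attributes known when the (hypothetical) blacklist was written.
BLACKLIST = {'onclick', 'onload', 'onerror', 'onmouseover'}
# Attributes the site explicitly opts in to.
WHITELIST = {'href', 'title', 'onclick'}


def allowed_by_blacklist(name):
    return name not in BLACKLIST


def allowed_by_whitelist(name):
    return name in WHITELIST


print(allowed_by_blacklist('ontouch'))  # True: slips through the old blacklist
print(allowed_by_whitelist('ontouch'))  # False: never opted in
print(allowed_by_whitelist('onclick'))  # True: explicitly opted in
```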
+Sanitizing Input
+The primary goal of Bleach is to sanitize user input that is allowed to contain
+*some* HTML as markup and is to be included in the content of a larger page.
+Examples might include:
+* User comments on a blog.
+* "Bio" sections of a user profile.
+* Descriptions of a product or application.
+These examples, and others, are traditionally prone to security issues like XSS
+or other script injection, or annoying issues like unclosed tags and invalid
+markup. Bleach will take a proactive, whitelist-only approach to allowing HTML
+content, and will use the HTML5 parsing algorithm to handle invalid markup.
+See the :ref:`chapter on clean() <clean-chapter>` for more info.
+Safely Creating Links
+The secondary goal of Bleach is to provide a mechanism for finding or altering
+links (``<a>`` tags with ``href`` attributes, or things that look like URLs or
+email addresses) in text.
+While Bleach itself will always operate on a whitelist-based security model,
+the :ref:`linkify() method <linkify-chapter>` is flexible enough to allow the
+creation, alteration, and removal of links based on an extremely wide range of
+use cases.
+Bleach is designed to work with fragments of HTML by untrusted users. Some
+non-goal use cases include:
+* **Sanitizing complete HTML documents.** Once you're creating whole documents,
+ you have to allow so many tags that a blacklist approach (e.g. forbidding
+ ``<script>`` or ``<object>``) may be more appropriate.
+* **Cleaning up after trusted users.** Bleach is powerful but it is not fast.
+ If you trust your users, trust them and don't rely on Bleach to clean up
+ their mess.
+* **Allowing arbitrary styling.** There are a number of interesting CSS
+ properties that can do dangerous things, like Opera's ``-o-link``. Painful as
+ it is, if you want your users to be able to change nearly anything in a
+ ``style`` attribute, you should have to opt into this.
diff --git a/docs/index.rst b/docs/index.rst
new file mode 100644
index 0000000..0929e53
--- /dev/null
+++ b/docs/index.rst
@@ -0,0 +1,69 @@
+.. Bleach documentation master file, created by
+ sphinx-quickstart on Fri May 11 21:11:39 2012.
+ You can adapt this file completely to your liking, but it should at least
+ contain the root `toctree` directive.
+Welcome to Bleach's documentation!
+Bleach is a whitelist-based HTML sanitization and text linkification library.
+It is designed to take untrusted user input with *some* HTML.
+Because Bleach uses html5lib_ to parse document fragments the same way browsers
+do, it is extremely resilient to unknown attacks, much more so than
+regular-expression-based sanitizers.
+Bleach's ``linkify`` function is highly configurable and can be used to find,
+edit, and filter links most other auto-linkers can't.
+The version of Bleach on GitHub_ is always the most up-to-date, and the
+``master`` branch should always work.
+.. warning::
+ Bleach is currently incompatible with html5lib 1.0b and any versions below
+ 0.9.5.
+Installing Bleach
+Bleach is available on PyPI_, so you can install it with ``pip``::
+ $ pip install bleach
+Or with ``easy_install``::
+ $ easy_install bleach
+Or by cloning the repo from GitHub_::
+ $ git clone git://github.com/jsocol/bleach.git
+Then install it by running::
+ $ python setup.py install
+.. toctree::
+ :maxdepth: 2
+ clean
+ linkify
+ goals
+Indices and tables
+* :ref:`genindex`
+* :ref:`modindex`
+* :ref:`search`
+.. _html5lib: http://code.google.com/p/html5lib/
+.. _GitHub: https://github.com/jsocol/bleach
+.. _PyPI: http://pypi.python.org/pypi/bleach
diff --git a/docs/linkify.rst b/docs/linkify.rst
new file mode 100644
index 0000000..42de69c
--- /dev/null
+++ b/docs/linkify.rst
@@ -0,0 +1,212 @@
+.. _linkify-chapter:
+.. highlightlang:: python
+``linkify()`` searches text for links, URLs, and email addresses and lets you
+control how and when those links are rendered::
+ def linkify(text, callbacks=DEFAULT_CALLBACKS, skip_pre=False,
+             parse_email=False, tokenizer=HTMLSanitizer):
+     """Convert URL-like strings in an HTML fragment to links."""
+``linkify()`` works by building a document tree, so it's guaranteed never to do
+weird things to URLs in attribute values, can modify the value of attributes on
+``<a>`` tags, and can even do things like skip ``<pre>`` sections.
+By default, ``linkify()`` will perform some sanitization, only allowing a set
+of "safe" tags. Because it uses the HTML5 parsing algorithm, it will always
+handle things like unclosed tags.
+.. note::
+ You may pass a ``str`` or ``unicode`` object, but Bleach will always
+ return ``unicode``.
+The second argument to ``linkify()`` is a list or other iterable of callback
+functions. These callbacks can modify links that exist and links that are being
+created, or remove them completely.
+Each callback will get the following arguments::
+ def my_callback(attrs, new=False):
+The ``attrs`` argument is a dict of attributes of the ``<a>`` tag. The ``new``
+argument is a boolean indicating if the link is new (e.g. an email address or
+URL found in the text) or already existed (e.g. an ``<a>`` tag found in the
+text). The ``attrs`` dict also contains a ``_text`` key, which is the innerText
+of the ``<a>`` tag.
+The callback must return a dict of attributes (including ``_text``) or
+``None``. The new dict of attributes will be passed to the next callback in the
+list. If any callback returns ``None``, a new link will not be created and the
+original text is left in place; an existing link will be removed and its
+innerText left in place.
+The default callbacks simply add ``rel="nofollow"``. See ``bleach.callbacks``
+for some included callback functions.
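+The chaining rules above can be sketched without Bleach. ``run_callbacks`` is
+a hypothetical helper written for this illustration, not Bleach's internal
+code; the two callbacks are small examples of the callback signature.

```python
def run_callbacks(attrs, new, callbacks):
    """Hypothetical helper mirroring the chaining rules described above."""
    for callback in callbacks:
        # Each callback receives the dict returned by the previous one.
        attrs = callback(attrs, new)
        if attrs is None:
            # Any callback can veto the link.
            return None
    return attrs


def set_nofollow(attrs, new=False):
    attrs['rel'] = 'nofollow'
    return attrs


def remove_mailto(attrs, new=False):
    if attrs['href'].startswith('mailto:'):
        return None
    return attrs


callbacks = [set_nofollow, remove_mailto]

link = {'href': 'http://example.com', '_text': 'example.com'}
print(run_callbacks(link, True, callbacks))

mail = {'href': 'mailto:me@example.com', '_text': 'me@example.com'}
print(run_callbacks(mail, True, callbacks))  # None
```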
+Setting Attributes
+For example, to set ``rel="nofollow"`` on all links found in the text, a simple
+(and included) callback might be::
+ def set_nofollow(attrs, new=False):
+     attrs['rel'] = 'nofollow'
+     return attrs
+This would overwrite the value of the ``rel`` attribute if it was set.
+You could also make external links open in a new tab, or set a class::
+ from urlparse import urlparse
+ def set_target(attrs, new=False):
+     p = urlparse(attrs['href'])
+     if p.netloc not in ['my-domain.com', 'other-domain.com']:
+         attrs['target'] = '_blank'
+         attrs['class'] = 'external'
+     else:
+         attrs.pop('target', None)
+     return attrs
+Removing Attributes
+You can easily remove attributes you don't want to allow, even on existing
+links (``<a>`` tags) in the text. (See also :ref:`clean() <clean-chapter>` for
+sanitizing attributes.) For example::
+ def allowed_attributes(attrs, new=False):
+     """Only allow href, target, rel and title."""
+     allowed = ['href', 'target', 'rel', 'title']
+     return dict((k, v) for k, v in attrs.items() if k in allowed)
+Or you could remove a specific attribute, if it exists::
+ def remove_title1(attrs, new=False):
+     attrs.pop('title', None)
+     return attrs
+ def remove_title2(attrs, new=False):
+     if 'title' in attrs:
+         del attrs['title']
+     return attrs
+Altering Attributes
+You can alter and overwrite attributes, including the link text, via the
+``_text`` key, to, for example, pass outgoing links through a warning page, or
+limit the length of text inside an ``<a>`` tag::
+ def shorten_url(attrs, new=False):
+     """Shorten overly-long URLs in the text."""
+     if not new:  # Only looking at newly-created links.
+         return attrs
+     # _text will be the same as the URL for new links.
+     text = attrs['_text']
+     if len(text) > 25:
+         attrs['_text'] = text[0:22] + '...'
+     return attrs
+ from urllib2 import quote
+ from urlparse import urlparse
+ def outgoing_bouncer(attrs, new=False):
+     """Send outgoing links through a bouncer."""
+     p = urlparse(attrs['href'])
+     if p.netloc not in ['my-domain.com', 'www.my-domain.com', '']:
+         bouncer = 'http://outgoing.my-domain.com/?destination=%s'
+         attrs['href'] = bouncer % quote(attrs['href'])
+     return attrs
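+Both callbacks can be exercised on plain dicts to check what they produce.
+The sketch below repeats them for Python 3 (``quote`` and ``urlparse`` come
+from ``urllib.parse`` there); the sample URL is made up for illustration.

```python
from urllib.parse import quote, urlparse  # Python 3 locations


def shorten_url(attrs, new=False):
    """Shorten overly-long URLs in the text."""
    if not new:  # Only looking at newly-created links.
        return attrs
    text = attrs['_text']
    if len(text) > 25:
        attrs['_text'] = text[0:22] + '...'
    return attrs


def outgoing_bouncer(attrs, new=False):
    """Send outgoing links through a bouncer."""
    p = urlparse(attrs['href'])
    if p.netloc not in ['my-domain.com', 'www.my-domain.com', '']:
        bouncer = 'http://outgoing.my-domain.com/?destination=%s'
        attrs['href'] = bouncer % quote(attrs['href'])
    return attrs


url = 'http://example.com/a/very/long/path'
link = shorten_url({'href': url, '_text': url}, new=True)
print(link['_text'])  # http://example.com/a/v...

link = outgoing_bouncer(link, new=True)
print(link['href'])
# http://outgoing.my-domain.com/?destination=http%3A//example.com/a/very/long/path
```

+Note that ``quote`` leaves ``/`` unescaped by default, so only the ``:`` in
+the scheme is percent-encoded here.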
+Preventing Links
+A slightly more complex example is inspired by Crate_, where strings like
+``models.py`` are often found, and linkified. ``.py`` is the ccTLD for
+Paraguay, so ``example.py`` may be a legitimate URL, but in the case of a site
+dedicated to Python packages, odds are it is not. In this case, Crate_ could
+write the following callback::
+ def dont_linkify_python(attrs, new=False):
+     if not new:  # This is an existing <a> tag, leave it be.
+         return attrs
+     # If the TLD is '.py', make sure it starts with http: or https:
+     href = attrs['href']
+     if href.endswith('.py') and not href.startswith(('http:', 'https:')):
+         # This looks like a Python file, not a URL. Don't make a link.
+         return None
+     # Everything checks out, keep going to the next callback.
+     return attrs
+Removing Links
+If you want to remove certain links, even if they are written in the text with
+``<a>`` tags, you can still return ``None``::
+ def remove_mailto(attrs, new=False):
+     """Remove any mailto: links."""
+     if attrs['href'].startswith('mailto:'):
+         return None
+     return attrs
+``<pre>`` tags are often special, literal sections. If you don't want to create
+any new links within a ``<pre>`` section, pass ``skip_pre=True``.
+.. note::
+ Though new links will not be created, existing links created with ``<a>``
+ tags will still be passed through all the callbacks.
+By default, ``linkify()`` does not create ``mailto:`` links for email
+addresses, but if you pass ``parse_email=True``, it will. ``mailto:`` links
+will go through exactly the same set of callbacks as all other links, whether
+they are newly created or already in the text, so be careful when writing
+callbacks that may need to behave differently if the protocol is ``mailto:``.
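+As a sketch of why this matters: ``urlparse`` reports an empty ``netloc`` for
+``mailto:`` URLs, so a host-based callback would treat every email link as
+external. One possible guard, written for Python 3 with made-up domain names,
+is to return early for that protocol.

```python
from urllib.parse import urlparse  # Python 3 location of urlparse


def set_target(attrs, new=False):
    # mailto: links have no host to compare, so leave them untouched.
    if attrs['href'].startswith('mailto:'):
        return attrs
    p = urlparse(attrs['href'])
    if p.netloc not in ['my-domain.com', 'other-domain.com']:
        attrs['target'] = '_blank'
        attrs['class'] = 'external'
    else:
        attrs.pop('target', None)
    return attrs


print(urlparse('mailto:me@example.com').netloc == '')      # True
print(set_target({'href': 'mailto:me@example.com'}))       # unchanged
print(set_target({'href': 'http://example.org/'})['target'])  # _blank
```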
+``linkify()`` uses the ``html5lib.sanitizer.HTMLSanitizer`` tokenizer by
+default. This has the effect of scrubbing some tags and attributes. To use a
+more lenient, or totally different, tokenizer, you can specify the tokenizer
+class here. (See the implementation of :ref:`clean() <clean-chapter>` for an
+example of building a custom tokenizer.) For example::
+ from bleach import linkify
+ from html5lib.tokenizer import HTMLTokenizer
+ linked_text = linkify(text, tokenizer=HTMLTokenizer)
+.. _Crate: https://crate.io/
