I'm experimenting with using Ikiwiki as a feed aggregator.

The Planet Ubuntu RSS 2.0 feed (<http://planet.ubuntu.com/rss20.xml>) as of today
includes an author whose name contains the character u-with-umlaut. In HTML 4.0,
this character is written as the entity `&uuml;`. Ikiwiki 2.47 running on Debian
etch does not seem to understand that entity, and decides not to un-escape any
markup in the feed. This makes the feed hard to read.

The following is the test input:

    <rss version="2.0">
    <channel>
            <title>testfeed</title>
            <link>http://example.com/</link>
            <language>en</language>
            <description>example</description>
    <item>
            <title>&uuml;</title>
            <guid>http://example.com</guid>
            <link>http://example.com</link>
            <description>foo</description>
            <pubDate>Tue, 27 May 2008 22:42:42 +0000</pubDate>
    </item>
    </channel>
    </rss>

When I feed this to ikiwiki, it complains:
"processed ok at 2008-05-29 09:44:14 (invalid UTF-8 stripped from feed) (feed entities escaped)"

Note also that the test input contains only pure ASCII, no UTF-8 at all.

If I remove the ampersand in the title, ikiwiki has no problem. However, the entity is
valid HTML, so it would be good for ikiwiki to understand it. At a minimum, stripping
the offending entity but un-escaping the rest seems like a reasonable thing to do,
unless that has security implications.
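
For illustration, here is a minimal Python sketch of the underlying problem and of one possible workaround. It is not ikiwiki's code (ikiwiki is written in Perl), and the helper names `html_entities_to_numeric` and `item_title` are made up for this example. The sketch relies on the fact that XML itself predefines only `amp`, `lt`, `gt`, `quot` and `apos`, so a strict XML parser treats `&uuml;` as an undefined entity unless a DTD declares it. Rewriting HTML named entities as numeric character references before parsing is one way to keep the rest of the feed intact; whether that is appropriate for ikiwiki is an open question.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities every XML parser knows without a DTD.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def html_entities_to_numeric(text):
    """Rewrite HTML named entities (e.g. &uuml;) as numeric references.

    Entities predefined by XML and names unknown to HTML are left alone.
    """
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)


# Cut-down version of the test feed from this report (pure ASCII).
FEED = """<rss version="2.0"><channel><title>testfeed</title>
<item><title>&uuml;</title></item></channel></rss>"""


def item_title(feed_text):
    """Parse the feed strictly and return the first item's title."""
    root = ET.fromstring(feed_text)
    return root.find("./channel/item/title").text


# A strict parse of the raw feed fails on the undefined entity.
try:
    item_title(FEED)
    strict_parse_ok = True
except ET.ParseError:
    strict_parse_ok = False

# After rewriting the entity, the same parser accepts the feed.
fixed_title = item_title(html_entities_to_numeric(FEED))
```

Here `strict_parse_ok` ends up false, while `fixed_title` holds the single character u-with-umlaut. Unknown names such as `&bogus;` are deliberately left untouched, so a feed containing them would still fail to parse and would need a separate fallback.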

> I tested on unstable, and ikiwiki handled that sample rss fine,
> generating a `ü.html`. --[[Joey]]

>> I confirm that it works with ikiwiki 2.50, at least partially. The HTML output is
>> OK, but the aggregate plugin still reports this:
>>
>>     processed ok at 2008-07-01 21:24:29 (invalid UTF-8 stripped from feed) (feed entities escaped)
>>
>> I hope that's just a minor blemish. --liw

>>> Sounds like this is [[done]] --[[Joey]]