Hinnerk Altenburg

Web Developer in Hamburg, Germany

Strip all HTML tags with Perl like PHP’s strip_tags() does

with 4 comments

The Perl regular expression (regexp/regex) equivalent to PHP’s strip_tags() is:

while ($string =~ s/<\S[^<>]*(?:>|$)//gs) {};

Please note that it also denotes an opening “<” (followed by a non-whitespace character) as a tag and strips all characters behind, even it is not closed by a “>”. This is the same behavior as PHP’s strip_tags().

Update: This regexp is only satisfying my test against PHP 4.x, but 5.x is pretty smarter when it comes to edge cases. It will be a challenge to build a Perl equivalent as all the different approaches in CPAN also fail the test.

Update 2010-07-07: I’m currently porting strip_tags() from the C source code of PHP 5.3.2 to a CPAN Module. Stay tuned.

Update 2011-05-25: Today I finally uploaded my Perl port to CPAN: http://search.cpan.org/~hinnerk/HTML-StripTags-1.00/
New home of this module is http://www.hinnerk-altenburg.de/perl-strip_tags/

Written by Hinnerk

Dezember 23rd, 2009 at 2:30 pm

Posted in English

Tagged with , , , ,