Archive for the ‘OpenSource’ tag
Perl: Handle malformed UTF-8 strings with Encode::encode
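After finding the error message “Malformed UTF-8 character (fatal)” in my log files, I wanted to handle it properly, without letting the process die and without throwing away the whole string.
After some research on Google I came up with the following solution. It first tries to encode the whole string strictly and, only if that fails, encodes it character by character and drops just the characters that cannot be encoded: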
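    use Encode ();

    sub encode_utf_8 {
        my ($string) = @_;    # list assignment; "my $string = @_" would store the argument count

        # LEAVE_SRC keeps $string intact; with CHECK set, Encode may otherwise modify it in place
        my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

        my $utf8_encoded = '';
        eval { $utf8_encoded = Encode::encode('UTF-8', $string, $check); };
        if ($@) {
            # sanitize malformed UTF-8: encode each character on its own, skip the ones that fail
            $utf8_encoded = '';
            foreach my $char (split(//, $string)) {
                my $utf_8_char = eval { Encode::encode('UTF-8', $char, $check) };
                next unless defined $utf_8_char;    # "or next" would also drop the character "0"
                $utf8_encoded .= $utf_8_char;
            }
        }
        return $utf8_encoded;
    }

See also: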
http://perldoc.perl.org/Encode.html#Handling-Malformed-Data
http://www.perlmonks.org/?node_id=839519
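If the offending characters do not have to be dropped, the CHECK argument described on the perldoc page above offers a gentler route. The following is only a minimal sketch, not part of the original solution: the sample string with a lone surrogate is a made-up input, chosen because strict UTF-8 encoding is expected to reject it, and the variable names are hypothetical. It contrasts the strict mode (which croaks) with the default mode (which inserts a substitution character instead of dying).

```perl
use strict;
use warnings;
no warnings 'utf8';    # the sample string deliberately contains an illegal code point
use Encode ();

# Hypothetical sample input: "a", a lone surrogate, "b"
my $string = "a" . chr(0xD800) . "b";

# Strict mode croaks on the surrogate, so it has to be wrapped in eval.
# LEAVE_SRC asks Encode not to modify $string in place.
my $strict = eval {
    Encode::encode('UTF-8', $string, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
my $strict_failed = defined $strict ? 0 : 1;

# Default mode (no CHECK argument) does not croak; the character that
# cannot be encoded is replaced by a substitution character.
my $lenient = Encode::encode('UTF-8', $string);

printf "strict failed: %s\n", $strict_failed ? 'yes' : 'no';
printf "lenient bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $lenient;
```

Which variant fits depends on the data: dropping characters keeps the output free of placeholders, while substitution leaves a visible marker where something was lost.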
Set a custom HTTP User-Agent in Perl with WWW::Mechanize
This is how you can dynamically set a custom HTTP User-Agent for your Perl requests, either to fake a device or browser for testing purposes or to get a device-specific version of a website.
WWW::Mechanize accepts a custom user-agent string in its constructor. After construction, its own $mech->agent_alias() method offers only a choice of 6 pre-defined basic user-agents.
The following code demonstrates how to dynamically change the user-agent on a Mechanize object.
    use WWW::Mechanize;

    my $initial_user_agent = 'Mozilla/5.0 (Linux; U; Android 2.2; de-de; HTC Desire HD 1.18.161.2 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1';
    my @user_agents = (
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; nl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13',
        'Mozilla/5.0 (iPad; U; CPU iPhone OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7D11',
        'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
    );

    # Set an initial custom header with the constructor
    my $mech = WWW::Mechanize->new( agent => $initial_user_agent );

    # get a page and print current URI (WWW::Mechanize follows redirections)
    $mech->get( 'http://www.facebook.com' );
    printf( "User-Agent %s\n redirects to: %s\n\n", $initial_user_agent, $mech->uri() );

    foreach my $http_user_agent (@user_agents) {
        # dynamically set custom HTTP User-agents
        $mech->add_header( 'User-agent' => $http_user_agent );

        $mech->get( 'http://www.facebook.com' );
        printf( "User-Agent %s\n redirects to: %s\n\n", $http_user_agent, $mech->uri() );
    }

    # $ perl ./mechanize-user-agent.pl
    # User-Agent Mozilla/5.0 (Linux; U; Android 2.2; de-de; HTC Desire HD 1.18.161.2 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
    # redirects to: http://m.facebook.com/?w2m&refsrc=http%3A%2F%2Fwww.facebook.com%2F&_rdr
    #
    # User-Agent Mozilla/5.0 (Windows; U; Windows NT 6.1; nl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13
    # redirects to: http://www.facebook.com
    #
    # User-Agent Mozilla/5.0 (iPad; U; CPU iPhone OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7D11
    # redirects to: http://m.facebook.com/?w2m&refsrc=http%3A%2F%2Fwww.facebook.com%2F&_rdr
    #
    # User-Agent Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5
    # redirects to: http://m.facebook.com/?w2m&refsrc=http%3A%2F%2Fwww.facebook.com%2F&_rdr

Strip all HTML tags with Perl like PHP’s strip_tags() does
The Perl regular expression (regexp/regex) equivalent to PHP’s strip_tags() is:
while ($string =~ s/<\S[^<>]*(?:>|$)//gs) {};
Please note that it also treats an opening “<” (followed by a non-whitespace character) as a tag and strips everything after it up to the end of the string, even if it is never closed by a “>”. This is the same behavior as PHP’s strip_tags().
Update: This regexp only satisfies my tests against PHP 4.x; PHP 5.x is considerably smarter when it comes to edge cases. Building a Perl equivalent will be a challenge, as all the different approaches on CPAN also fail the test.
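To make the behavior of the regexp easier to see, here is a small sketch that wraps it in a helper and runs it on a few inputs. The function name strip_tags_simple and the sample strings are made up for illustration. The last sample shows why the substitution sits in a while loop: removing an inner tag can expose an outer one that only matches on the next pass.

```perl
use strict;
use warnings;

# Hypothetical helper name; the regexp itself is the one shown above.
sub strip_tags_simple {
    my ($string) = @_;
    # Repeat until a pass removes nothing more.
    while ($string =~ s/<\S[^<>]*(?:>|$)//gs) {}
    return $string;
}

my @samples = (
    '<p>Hello <b>world</b></p>',       # ordinary tags
    '1 < 2 and <b>bold</b>',           # "<" followed by whitespace is kept
    'text <unclosed tag here',         # unclosed tag is stripped to the end
    '<scr<b>ipt>alert</scr<b>ipt>',    # needs a second pass
);

print "[", strip_tags_simple($_), "]\n" for @samples;
# [Hello world]
# [1 < 2 and bold]
# [text ]
# [alert]
```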
Update 2010-07-07: I’m currently porting strip_tags() from the C source code of PHP 5.3.2 to a CPAN Module. Stay tuned.
Update 2011-05-25: Today I finally uploaded my Perl port to CPAN: http://search.cpan.org/~hinnerk/HTML-StripTags-1.00/
The new home of this module is http://www.hinnerk-altenburg.de/perl-strip_tags/
OpenSource Perl Website Intrusion Detection System PerlIDS (CGI::IDS) released
Today, we at epublica have officially released the result of my work over the last months: a Perl port of PHPIDS, a tool for detecting Cross-Site-Scripting (XSS), Cross-Site-Request-Forgery (CSRF), SQL-Injections (SQLI), Local-File-Inclusions (LFI) etc. in website requests.
The tool is released on CPAN as the Perl module CGI::IDS (“PerlIDS”) under the OpenSource GNU Lesser General Public License (LGPL).
Two TYPO3 OpenSource extensions published
I am now the author of two TYPO3 extensions published in TER (TYPO3 Extension Repository). These extensions are frontend plugins that add functionality to the mm_forum extension.
exinit_latesttopics displays the latest forum topics in a box; exinit_pollwidget displays an AJAX box for forum polls, which makes voting possible on any page.
Meet me at FOSS.in/2005 Open Source Conference in Bangalore
The FOSS.in/2005 Free & Open Source Software Conference takes place from November 29 to December 2 at Bangalore Palace. I will be attending several lectures.
You’re welcome to join me if you are in Bangalore.
