Archive for the ‘English’ Category
Perl: Handle malformed UTF-8 strings with Encode::encode
After finding the error message “Malformed UTF-8 character (fatal)” in my log files, I tried to handle this properly, without letting the process die or throwing away the whole string.
After some research on Google I came up with the following solution:
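use Encode ();

sub encode_utf_8 {
    my ($string) = @_;    # list assignment; "my $string = @_" would only store the argument count

    my $utf8_encoded = '';
    # LEAVE_SRC keeps $string untouched, so the fallback below sees the original input
    eval {
        $utf8_encoded = Encode::encode('UTF-8', $string, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    if ($@) {
        # sanitize malformed UTF-8: encode character by character and skip the bad ones
        $utf8_encoded = '';
        my @chars = split(//, $string);
        foreach my $char (@chars) {
            my $utf_8_char = eval { Encode::encode('UTF-8', $char, Encode::FB_CROAK | Encode::LEAVE_SRC) };
            # test for definedness so that a valid "0" is not dropped
            next unless defined $utf_8_char;
            $utf8_encoded .= $utf_8_char;
        }
    }
    return $utf8_encoded;
}

See also: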
http://perldoc.perl.org/Encode.html#Handling-Malformed-Data
http://www.perlmonks.org/?node_id=839519
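To see the kind of input that takes the fallback path, here is a minimal sketch. It assumes a lone surrogate (U+D800) as the offending character, which strict UTF-8 rejects; the sample string and variable names are illustrative only and are not taken from my log data.

```perl
use strict;
use warnings;
no warnings 'utf8';    # the sample deliberately contains an illegal code point
use Encode ();

# Hypothetical sample: "a", a lone surrogate, "b"
my $bad = 'a' . chr(0xD800) . 'b';

# Strict encoding croaks on the surrogate; eval turns that into undef.
# LEAVE_SRC keeps $bad unchanged even though a CHECK value is set.
my $strict = eval {
    Encode::encode('UTF-8', $bad, Encode::FB_CROAK | Encode::LEAVE_SRC);
};
my $error = $@;

# Character-wise fallback: keep only the characters that encode cleanly.
my $clean = '';
foreach my $char (split //, $bad) {
    my $octets = eval {
        Encode::encode('UTF-8', $char, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    $clean .= $octets if defined $octets;
}

print defined $strict ? "strict encode succeeded\n" : "strict encode failed: $error";
print "sanitized: $clean\n";
```

The character-wise pass is slow on long strings, which is why it is only worth running after the strict attempt on the whole string has failed.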
Set a custom HTTP User-Agent in Perl with WWW::Mechanize
This is how you can dynamically set a custom HTTP User-Agent for your Perl requests, to fake a device or browser for testing purposes or to get a device-specific version of a website.
WWW::Mechanize supports setting a custom user-agent in the constructor; after that, it only offers a choice of 6 pre-defined basic user-agents ( $mech->agent_alias() ).
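As a rough sketch of the two options just described: the agent string 'MyTestAgent/0.1' is a made-up placeholder, and the alias is picked from whatever known_agent_aliases() reports, since the available alias names depend on the installed WWW::Mechanize version. No request is sent here.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Option 1: custom User-Agent string via the constructor (placeholder value)
my $mech = WWW::Mechanize->new( agent => 'MyTestAgent/0.1' );
my $custom_agent = $mech->agent;

# Option 2: switch to one of the pre-defined aliases afterwards
my @aliases = $mech->known_agent_aliases();
$mech->agent_alias( $aliases[0] );
my $alias_agent = $mech->agent;

print "custom: $custom_agent\n";
print "aliases: ", join( ', ', @aliases ), "\n";
print "after agent_alias('$aliases[0]'): $alias_agent\n";
```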
The following code demonstrates how to dynamically change the user-agent on a Mechanize object.
use WWW::Mechanize;

my $initial_user_agent = 'Mozilla/5.0 (Linux; U; Android 2.2; de-de; HTC Desire HD 1.18.161.2 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1';
my @user_agents = (
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; nl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13',
    'Mozilla/5.0 (iPad; U; CPU iPhone OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7D11',
    'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
);

# Set an initial custom header with the constructor
my $mech = WWW::Mechanize->new( agent => $initial_user_agent );

# get a page and print current URI (WWW::Mechanize follows redirections)
$mech->get( 'http://www.facebook.com' );
printf( "User-Agent %s\n redirects to: %s\n\n", $initial_user_agent, $mech->uri() );

foreach my $http_user_agent (@user_agents) {
    # dynamically set custom HTTP User-agents
    $mech->add_header( 'User-agent' => $http_user_agent );

    $mech->get( 'http://www.facebook.com' );
    printf( "User-Agent %s\n redirects to: %s\n\n", $http_user_agent, $mech->uri() );
}
# $ perl ./mechanize-user-agent.pl
# User-Agent Mozilla/5.0 (Linux; U; Android 2.2; de-de; HTC Desire HD 1.18.161.2 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
# redirects to: http://m.facebook.com/?w2m&refsrc=http%3A%2F%2Fwww.facebook.com%2F&_rdr
#
# User-Agent Mozilla/5.0 (Windows; U; Windows NT 6.1; nl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13
# redirects to: http://www.facebook.com
#
# User-Agent Mozilla/5.0 (iPad; U; CPU iPhone OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7D11
# redirects to: http://m.facebook.com/?w2m&refsrc=http%3A%2F%2Fwww.facebook.com%2F&_rdr
#
# User-Agent Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5
# redirects to: http://m.facebook.com/?w2m&refsrc=http%3A%2F%2Fwww.facebook.com%2F&_rdr

Strip all HTML tags with Perl like PHP’s strip_tags() does
The Perl regular expression (regexp/regex) equivalent to PHP’s strip_tags() is:
while ($string =~ s/<\S[^<>]*(?:>|$)//gs) {};
Please note that it also treats an opening “<” (followed by a non-whitespace character) as the start of a tag and strips all characters after it, even if it is not closed by a “>”. This is the same behavior as PHP’s strip_tags().
Update: This regexp only passes my tests against PHP 4.x; PHP 5.x is considerably smarter when it comes to edge cases. Building a Perl equivalent will be a challenge, as all the different approaches on CPAN also fail the test.
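A small sketch may make this behavior easier to check. The wrapper name strip_tags_regex and the sample strings are assumptions for illustration; only the substitution itself comes from the line above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution shown above.
sub strip_tags_regex {
    my ($string) = @_;
    # Repeat until nothing is left to strip, so that tags uncovered
    # by an earlier pass (e.g. "<b<i>>") are removed as well.
    while ($string =~ s/<\S[^<>]*(?:>|$)//gs) {}
    return $string;
}

my @samples = (
    'Hello <b>world</b>!',    # ordinary tags
    'price < 10',             # "<" followed by whitespace is kept
    'text <unclosed tag',     # unclosed tag is stripped to the end
    '<b<i>>bold',             # needs a second pass
);

printf "%-22s => '%s'\n", "'$_'", strip_tags_regex($_) for @samples;
```

The while loop matters for the last sample: the first pass can only remove the inner "<i>", and the remaining "<b>" is caught on the second pass.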
Update 2010-07-07: I’m currently porting strip_tags() from the C source code of PHP 5.3.2 to a CPAN Module. Stay tuned.
Update 2011-05-25: Today I finally uploaded my Perl port to CPAN: http://search.cpan.org/~hinnerk/HTML-StripTags-1.00/
New home of this module is http://www.hinnerk-altenburg.de/perl-strip_tags/
Moved from epublica GmbH to XING AG
As of February 1st, I have moved together with epublica’s entire XING.com core development team to XING AG itself, and I am now developing the platform in-house as a XING employee.
OpenSource Perl Website Intrusion Detection System PerlIDS (CGI::IDS) released
Today, we at epublica officially released my work of the last few months: a Perl port of PHPIDS, a tool for detecting Cross-Site-Scripting (XSS), Cross-Site-Request-Forgery (CSRF), SQL-Injections (SQLI), Local-File-Inclusions (LFI) etc. in website requests.
The tool is released on CPAN.org as the Perl module CGI::IDS (“PerlIDS”) under the open-source GNU Lesser General Public License (LGPL).
Relaunch of Derix Glasstudios website finally online
The relaunch of the corporate website of Derix Glasstudios, Taunusstein/Germany and Derix Art Glass Consultants, Portland/USA is now finally online!
I designed and developed it back in 2005, and I am happy to see it online now! The website is developed in PHP/MySQL with a custom-made admin interface.
Derix Glasstudios was founded in 1866 and today makes art glass for prominent projects all over the world.
[Update] The website is now available in Russian and Spanish, too.
[Update] I am now also doing search engine optimization and Google AdWords campaigns for them.
[Update] The redesign is now online, with a new color theme, a lightbox project viewer and an AJAX project preview built with Prototype JS and Scriptaculous.
Two TYPO3 OpenSource extensions published
I am now the author of two TYPO3 extensions published in TER (TYPO3 Extension Repository). These extensions are frontend plugins that add functionality to the mm_forum extension.
exinit_latesttopics displays the latest forum topics in a box; exinit_pollwidget displays an AJAX box for forum polls, which makes voting possible on any page.
My New Jobs since May 2008
Since May, I have been employed by epublica GmbH, Hamburg, doing Perl development, mainly for the XING web platform. Have a look at their brand new office in the heart of the city, upstairs from XING.
I am also working as a freelancer for the TYPO3 agency EXINIT GmbH & Co. KG, Hamburg, doing TYPO3 extension development in PHP.
Hinnerk Altenburg a.k.a. Hinnerk Voss
I am married now and have taken my wife’s family name! From now on, my name is no longer Hinnerk Voss but Hinnerk Altenburg.
Further Research on Brachytherapy at University of Marburg
The Prostate Center of the University of Marburg has released a description of their research topics, one of which is follow-up research related to my diploma thesis.
