Hinnerk Altenburg

Web Developer in Hamburg, Germany

Archive for the ‘English’ Category

Perl: Handle malformed UTF-8 strings with Encode::encode

without comments

After finding the error message “Malformed UTF-8 character (fatal)” in my log files, I wanted to handle it properly, without letting the process die and without throwing away the whole string.
After some research on Google I came up with the following solution:

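use Encode;

sub encode_utf_8 {
    # first argument: the character string to encode
    my ($string) = @_;

    my $utf8_encoded = '';
    eval {
        # LEAVE_SRC keeps Encode from modifying $string in place
        $utf8_encoded = Encode::encode('UTF-8', $string, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    if ($@) {
        # sanitize malformed UTF-8: encode character by character
        # and skip every character that cannot be encoded
        $utf8_encoded = '';
        my @chars = split(//, $string);
        foreach my $char (@chars) {
            my $utf_8_char = eval { Encode::encode('UTF-8', $char, Encode::FB_CROAK) };
            # test for definedness so that the valid character "0" is not skipped
            next unless defined $utf_8_char;
            $utf8_encoded .= $utf_8_char;
        }
    }
    return $utf8_encoded;
}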
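
To see the per-character fallback in action, here is a minimal, self-contained sketch. The helper name encode_utf8_lenient and the sample input are my own assumptions for illustration: the input contains a lone surrogate (U+D800), which strict UTF-8 refuses to encode, plus a "0" character to show that valid characters survive.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper for this demo: encode to strict UTF-8 and
# drop every character that cannot be encoded.
sub encode_utf8_lenient {
    my ($string) = @_;

    # Fast path: try to encode the whole string at once.
    my $encoded = eval {
        Encode::encode('UTF-8', $string, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $encoded if defined $encoded;

    # Slow path: encode character by character, skipping failures.
    $encoded = '';
    foreach my $char (split //, $string) {
        my $bytes = eval {
            Encode::encode('UTF-8', $char, Encode::FB_CROAK | Encode::LEAVE_SRC);
        };
        next unless defined $bytes;
        $encoded .= $bytes;
    }
    return $encoded;
}

# Sample input (assumption): "Gr", U+00FC, a lone surrogate, "n 0"
my $input  = "Gr" . chr(0xFC) . chr(0xD800) . "n 0";
my $output = encode_utf8_lenient($input);

# Print the resulting bytes as hex
print join(' ', map { sprintf '%02X', ord } split //, $output), "\n";
# prints: 47 72 C3 BC 6E 20 30
```

The surrogate is dropped, the umlaut becomes the two bytes C3 BC, and the trailing "0" is kept.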

See also:
http://perldoc.perl.org/Encode.html#Handling-Malformed-Data
http://www.perlmonks.org/?node_id=839519

Written by Hinnerk

August 31st, 2011 at 4:58 pm

Posted in English


Set a custom HTTP User-Agent in Perl with WWW::Mechanize

with 2 comments

This is how you can dynamically set a custom HTTP User-Agent for your Perl requests to fake a device or browser, either for testing purposes or to get a device-specific version of a website.
WWW::Mechanize accepts a custom user agent in its constructor; beyond that, its own interface only offers a choice of six predefined basic user agents via $mech->agent_alias().

The following code demonstrates how to dynamically change the user-agent on a Mechanize object.

use WWW::Mechanize;

my $initial_user_agent = 'Mozilla/5.0 (Linux; U; Android 2.2; de-de; HTC Desire HD 1.18.161.2 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1';
my @user_agents = (
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; nl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13',
    'Mozilla/5.0 (iPad; U; CPU iPhone OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7D11',
    'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
);

# Set an initial custom User-Agent header with the constructor
my $mech = WWW::Mechanize->new( agent => $initial_user_agent );

# get a page and print current URI (WWW::Mechanize follows redirections)
$mech->get( 'http://www.facebook.com' );
print sprintf( "User-Agent %s\n redirects to: %s\n\n", $initial_user_agent, $mech->uri() );

foreach my $http_user_agent (@user_agents) {
    # dynamically set custom HTTP User-agents
    $mech->add_header( 'User-agent' => $http_user_agent);

    $mech->get( 'http://www.facebook.com' );
    print sprintf( "User-Agent %s\n redirects to: %s\n\n", $http_user_agent, $mech->uri() );
}

# $ perl ./mechanize-user-agent.pl
# User-Agent Mozilla/5.0 (Linux; U; Android 2.2; de-de; HTC Desire HD 1.18.161.2 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
# redirects to: http://m.facebook.com/?w2m&refsrc=http%3A%2F%2Fwww.facebook.com%2F&_rdr
#
# User-Agent Mozilla/5.0 (Windows; U; Windows NT 6.1; nl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13
# redirects to: http://www.facebook.com
#
# User-Agent Mozilla/5.0 (iPad; U; CPU iPhone OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7D11
# redirects to: http://m.facebook.com/?w2m&refsrc=http%3A%2F%2Fwww.facebook.com%2F&_rdr
#
# User-Agent Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5
# redirects to: http://m.facebook.com/?w2m&refsrc=http%3A%2F%2Fwww.facebook.com%2F&_rdr
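
As a possible alternative to add_header(), here is a small offline sketch. It assumes that WWW::Mechanize, being a subclass of LWP::UserAgent, inherits the agent() accessor, which sets the User-Agent string when called with an argument and returns the current one otherwise. The agent strings are placeholders chosen for the demo, and no request is sent.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Placeholder agent strings (assumptions for this demo)
my $mech = WWW::Mechanize->new( agent => 'InitialAgent/1.0' );
print "initial: ", $mech->agent(), "\n";

# agent() comes from LWP::UserAgent and replaces the default User-Agent
$mech->agent('CustomAgent/2.0');
print "changed: ", $mech->agent(), "\n";

# agent_alias() picks one of the predefined user agents
$mech->agent_alias('Windows Mozilla');
print "alias:   ", $mech->agent(), "\n";
```

Unlike add_header(), this changes the agent stored on the object itself, so $mech->agent() always reports what will be sent.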

Written by Hinnerk

June 29th, 2011 at 9:21 pm

Posted in English


Strip all HTML tags with Perl like PHP’s strip_tags() does

with 4 comments

The Perl regular expression (regexp/regex) equivalent to PHP’s strip_tags() is:

while ($string =~ s/<\S[^<>]*(?:>|$)//gs) {};

Please note that it also treats an opening “<” followed by a non-whitespace character as a tag and strips all characters that follow (up to the next “<” or the end of the string), even if it is not closed by a “>”. This is the same behavior as PHP’s strip_tags().
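
A few sample inputs make the behavior easier to see. This is only a sketch: the wrapper name strip_tags_regex and the test strings are my own choices for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regular expression from this post
sub strip_tags_regex {
    my ($string) = @_;
    # Repeat until nothing matches, so tags uncovered by a previous pass are removed too
    while ($string =~ s/<\S[^<>]*(?:>|$)//gs) {}
    return $string;
}

# Regular tags are removed
print strip_tags_regex('Hello <b>world</b>!'), "\n";      # prints: Hello world!

# "<" followed by whitespace is not a tag and stays untouched
print strip_tags_regex('1 < 2'), "\n";                    # prints: 1 < 2

# An unclosed tag swallows the rest of the string
print '[', strip_tags_regex('a <b unclosed text'), "]\n"; # prints: [a ]

# The loop needs a second pass here: first "<b>" goes, then "<a>"
print strip_tags_regex('<a<b>>text'), "\n";               # prints: text
```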

Update: This regexp only passes my tests against PHP 4.x; PHP 5.x is considerably smarter when it comes to edge cases. Building a Perl equivalent will be a challenge, as all the different approaches on CPAN also fail the test.

Update 2010-07-07: I’m currently porting strip_tags() from the C source code of PHP 5.3.2 to a CPAN Module. Stay tuned.

Update 2011-05-25: Today I finally uploaded my Perl port to CPAN: http://search.cpan.org/~hinnerk/HTML-StripTags-1.00/
New home of this module is http://www.hinnerk-altenburg.de/perl-strip_tags/

Written by Hinnerk

December 23rd, 2009 at 2:30 pm

Posted in English


Moved from epublica GmbH to XING AG

without comments

As of February 1st, I have moved with epublica’s entire XING.com core development team to XING AG itself, where I now develop the platform in-house as a XING employee.

Written by Hinnerk

February 6th, 2009 at 12:16 pm

Posted in English


OpenSource Perl Website Intrusion Detection System PerlIDS (CGI::IDS) released

with one comment

Today, we at epublica officially released my work of the last months: a Perl port of PHPIDS, a tool for detecting cross-site scripting (XSS), cross-site request forgery (CSRF), SQL injection (SQLi), local file inclusion (LFI) etc. in website requests.
The tool is released on CPAN as the Perl module CGI::IDS (“PerlIDS”) under the open-source GNU Lesser General Public License (LGPL).


Written by Hinnerk

November 6th, 2008 at 1:36 pm

Posted in English


Relaunch of Derix Glasstudios website finally online

without comments

The relaunch of the corporate website of Derix Glasstudios, Taunusstein/Germany and Derix Art Glass Consultants, Portland/USA is now finally online!

I designed and developed it back in 2005, and I am happy to see it online now! The website is built with PHP/MySQL and has a custom-made admin interface.

Derix Glasstudios was founded in 1866 and today makes art glass for prominent projects all over the world.

[Update] The website is now available in Russian and Spanish, too.

[Update] I am now also doing search engine optimization and Google AdWords campaigns for them.

[Update] The redesign is now online with a new color theme, a lightbox project viewer, and an AJAX project preview using Prototype JS and Scriptaculous.

Written by Hinnerk

July 16th, 2008 at 11:15 pm

Two TYPO3 OpenSource extensions published

with 2 comments

I am now the author of two TYPO3 extensions published in TER (TYPO3 Extension Repository). These extensions are frontend plugins that add functionality to the mm_forum extension.

exinit_latesttopics displays the latest forum topics in a box; exinit_pollwidget displays an AJAX box for forum polls, which makes voting possible on any page.

Written by Hinnerk

June 25th, 2008 at 12:20 am

My New Jobs since May 2008

without comments

Since May, I have been employed by epublica GmbH, Hamburg, doing Perl development mainly for the XING web platform. Have a look at their brand-new office in the heart of the city, upstairs from XING.

I am also working as a freelancer for the TYPO3 agency EXINIT GmbH & Co. KG, Hamburg, doing TYPO3 extension development in PHP.

Written by Hinnerk

June 25th, 2008 at 12:10 am

Hinnerk Altenburg a.k.a. Hinnerk Voss

without comments

I got married and took my wife’s family name! From now on, I am no longer called Hinnerk Voss but Hinnerk Altenburg.

Ich habe geheiratet und den Namen meiner Frau angenommen! Von nun an heiße ich nicht mehr Hinnerk Voss, sondern Hinnerk Altenburg.

Written by Hinnerk

October 26th, 2007 at 12:04 pm

Posted in Deutsch,English


Further Research on Brachytherapy at University of Marburg

without comments

The Prostate Center of the University of Marburg has released a description of their research topics, one of which is follow-up research related to my diploma thesis.

Written by Hinnerk

March 24th, 2007 at 9:18 pm

Posted in English
