Code Profiling Techniques (Practical mod

9.5.1. Profiling with Devel::DProf

Devel::DProf collects information on the execution time of a Perl script and of the subroutines in that script.

Let's take for example the diagnostics pragma and measure the impact of its usage on the program compilation and execution speed. This pragma extends the terse diagnostics normally emitted by both the Perl compiler and the Perl interpreter, augmenting them with the more verbose and endearing descriptions found in the perldiag manpage. We have claimed that this pragma should not be used on a production server. We are going to use Devel::DProf to explain our claim.

We will run a benchmark, once with diagnostics enabled and once disabled, on a subroutine called test_code( ).

The code inside the subroutine does either a lexical or a numeric comparison of two strings. It assigns one string to another if the condition tests true, but the condition is always false. To demonstrate the diagnostics pragma overhead, the comparison operator that we use in Example 9-7 is intentionally wrong. It should be a string comparison (eq), and we use a numeric one (= =).

Example 9-7. bench_diagnostics.pl

use Benchmark;
use diagnostics;
use strict;

my $count = 50000;

disable diagnostics;
my $t1 = timeit($count,\&test_code);

enable  diagnostics;
my $t2 = timeit($count,\&test_code);

print "Off: ",timestr($t1),"\n";
print "On : ",timestr($t2),"\n";

sub test_code {
    my ($a, $b) = qw(foo bar);
    my $c;
    if ($a =  = $b) {
        $c = $a;
    }
}

For only a few lines of code we get:

Off:  1 wallclock secs ( 0.81 usr +  0.00 sys =  0.81 CPU)
On : 13 wallclock secs (12.54 usr +  0.01 sys = 12.55 CPU)

With diagnostics enabled, the subroutine test_code( ) is 16 times slower (12.55/0.81: remember that we're talking in CPU time, not wallclock seconds) than with diagnostics disabled!

Now let's fix the comparison the way it should be, by replacing = = with eq, so we get:

my ($a, $b) = qw(foo bar);
my $c;
if ($a eq $b) {
    $c = $a;
}

and run the same benchmark again:

Off:  1 wallclock secs ( 0.57 usr +  0.00 sys =  0.57 CPU)
On :  1 wallclock secs ( 0.56 usr +  0.00 sys =  0.56 CPU)

Now there is no overhead at all. The diagnostics pragma slows things down only when warnings are generated.

After we have verified that using the diagnostics pragma might add a big overhead to execution runtime, let's use code profiling to understand why this happens. We use Devel::DProf to profile the code shown in Example 9-8.

Example 9-8. diagnostics.pl

use diagnostics;

test_code( );
sub test_code {
    my($a, $b) = qw(foo bar);
    my $c;
    if ($a =  = $b) {
        $c = $a;
    }
}

Run it with the profiler enabled, and then create the profiling statistics with the help of dprofpp:

panic% perl -d:DProf diagnostics.pl
panic% dprofpp

Total Elapsed Time = 0.342236 Seconds
  User+System Time = 0.335420 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 92.1   0.309  0.358      1   0.3089 0.3578  main::BEGIN
 14.9   0.050  0.039   3161   0.0000 0.0000  diagnostics::unescape
 2.98   0.010  0.010      2   0.0050 0.0050  diagnostics::BEGIN
 0.00   0.000 -0.000      2   0.0000      -  Exporter::import
 0.00   0.000 -0.000      2   0.0000      -  Exporter::export
 0.00   0.000 -0.000      1   0.0000      -  Config::BEGIN
 0.00   0.000 -0.000      1   0.0000      -  Config::TIEHASH
 0.00   0.000 -0.000      2   0.0000      -  Config::FETCH
 0.00   0.000 -0.000      1   0.0000      -  diagnostics::import
 0.00   0.000 -0.000      1   0.0000      -  main::test_code
 0.00   0.000 -0.000      2   0.0000      -  diagnostics::warn_trap
 0.00   0.000 -0.000      2   0.0000      -  diagnostics::splainthis
 0.00   0.000 -0.000      2   0.0000      -  diagnostics::transmo
 0.00   0.000 -0.000      2   0.0000      -  diagnostics::shorten
 0.00   0.000 -0.000      2   0.0000      -  diagnostics::autodescribe

It's not easy to see what is responsible for this enormous overhead, even if main::BEGINseems to be running most of the time. To get the full picture we must see the OPs tree, which shows us who calls whom, so we run:

panic% dprofpp -T

The output is:

main::BEGIN
  diagnostics::BEGIN
     Exporter::import
        Exporter::export
  diagnostics::BEGIN
     Config::BEGIN
     Config::TIEHASH
     Exporter::import
        Exporter::export
  Config::FETCH
  Config::FETCH
  diagnostics::unescape
  .....................
  3159 times [diagnostics::unescape] snipped
  .....................
  diagnostics::unescape
  diagnostics::import
diagnostics::warn_trap
  diagnostics::splainthis
     diagnostics::transmo
     diagnostics::shorten
     diagnostics::autodescribe
main::test_code
  diagnostics::warn_trap
     diagnostics::splainthis
        diagnostics::transmo
        diagnostics::shorten
        diagnostics::autodescribe
  diagnostics::warn_trap
     diagnostics::splainthis
        diagnostics::transmo
        diagnostics::shorten
       diagnostics::autodescribe

So we see that 2 executions of diagnostics::BEGIN and 3,161 of diagnostics::unescape are responsible for most of the running overhead.

If we comment out the diagnostics module, we get:

Total Elapsed Time = 0.079974 Seconds
  User+System Time = 0.059974 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 0.00   0.000 -0.000      1   0.0000      -  main::test_code

It is possible to profile code running under mod_perl with the Devel::DProf module, available on CPAN. However, you must have PerlChildExitHandler enabled during the mod_perl build process. When the server is started, Devel::DProf installs an END block to write the tmon.out file. This block will be called at server shutdown. Here is how to start and stop a server with the profiler enabled:

panic% setenv PERL5OPT -d:DProf
panic% httpd -X -d `pwd` &
... make some requests to the server here ...
panic% kill `cat logs/httpd.pid`
panic% unsetenv PERL5OPT
panic% dprofpp

The Devel::DProf package is a Perl code profiler. It will collect information on the execution time of a Perl script and of the subroutines in that script (remember that print( ) and map( ) are just like any other subroutines you write, but they come bundled with Perl!).

Another approach is to use Apache::DProf, which hooks Devel::DProf into mod_perl. The Apache::DProf module will run a Devel::DProf profiler inside the process and write the tmon.out file in the directory $ServerRoot/logs/dprof/$$ (make sure that it's writable by the server!) when the process is shut down (where $$ is the PID of the process). All it takes to activate this module is to modify httpd.conf.

You can test for a command-line switch in httpd.conf. For example, to test if the server was started with -DPERLDPROF, use:

<Location /perl>
     SetHandler perl-script
     PerlHandler Apache::Registry
     <IfDefine PERLDPROF>
         PerlModule Apache::DProf
     </IfDefine>
</Location>

And to activate profiling, use:

panic% httpd -X -DPERLDPROF &

Remember that any PerlHandler that was pulled in before Apache::DProf in the httpd.conf or startup.pl file will not have code-debugging information inserted. To run dprofpp, chdir to $ServerRoot/logs/dprof/$$[36] and run:

[36]Look up the ServerRoot directive's value in httpd.conf to figure out what your $ServerRoot is.

panic% dprofpp

Use the command-line options for dropfpp(1) if a nondefault output is desired, as explained in the dropfpp manpage. You might especially want to look at the -r switch to display wallclock times (more relevant in a web-serving environment) and the -l switch to sort by number of subroutine calls.

If you are running Perl 5.6.0 or higher, take a look at the new module Devel::Profiler (Version 0.04 as of this writing), which is supposed to be a drop-in replacement for Apache::DProf, with improved functionality and stability.

9.5.2. Profiling with Devel::SmallProf

The Devel::SmallProf profiler is focused on the time taken for a program run on a line-by-line basis. It is called "small" because it's supposed to impose very little extra load on the machine (speed- and memory-wise) while profiling the code.

Let's take a look at the simple example shown in Example 9-9.

Example 9-9. table_gen.pl

for (1..1000) {
    my @rows = ( );
    push @rows, Tr(  map { td($_) } 'a'..'d' );
    push @rows, Tr(  map { td($_) } 'e'..'h' );
    my $var = table(@rows);
}
sub table { my @rows  = @_;    return "<table>\n@rows</table>\n";}
sub Tr    { my @cells = @_;    return "<tr>@cells</tr>\n";       }
sub td    { my $cell  = shift; return "<td>$cell</td>";          }

It creates the same HTML table in $var, with the cells of the table filled with single letters. The functions table( ), Tr( ), and td( ) insert the data into appropriate HTML tags. Notice that we have used Tr( ) and not tr( ), since the latter is a built-in Perl function, and we have used the same function name as in CGI.pm that does the same thing. If we print $var we will see something like this:

<table>
 <tr><td>a</td> <td>b</td> <td>c</td> <td>d</td></tr>
 <tr><td>e</td> <td>f</td> <td>g</td> <td>h</td></tr>
</table>

We have looped a thousand times through the same code in order to get a more precise speed measurement. If the code runs very quickly we won't be able to get any meaningful results from just one loop.

If we run this code with Devel::SmallProf:

panic% perl -d:SmallProf table_gen.pl

we get the following output in the autogenerated smallprof.out file:

count wall tm  cpu time line
  1001 0.003855 0.030000     1:  for (1..1000) {
  1000 0.004823 0.040000     2:      my @rows = ( );
  5000 0.272651 0.410000     3:      push @rows, Tr(  map { td($_) }
  5000 0.267107 0.360000     4:      push @rows, Tr(  map { td($_) }
  1000 0.067115 0.120000     5:      my $var = table(@rows);
     0 0.000000 0.000000     6:  }
  3000 0.033798 0.080000     7:  sub table { my @rows  = @_;    return
  6000 0.078491 0.120000     8:  sub Tr    { my @cells = @_;    return
 24000 0.267353 0.490000     9:  sub td    { my $cell  = shift; return
     0 0.000000 0.000000    10:

We can see in the CPU time column that Perl spends most of its time in the td( ) function; it's also the code that's visited by Perl the most times. In this example we could find this out ourselves without much effort, but if the code is longer it will be harder to find the lines of code where Perl spends most of its time. So we sort the output by the third column as a numerical value, in descending order:

panic% sort -k 3nr,3 smallprof.out | less
  24000 0.267353 0.490000     9:  sub td   { my $cell  = shift; return
   5000 0.272651 0.410000     3:      push @rows, Tr(  map { td($_) }
   5000 0.267107 0.360000     4:      push @rows, Tr(  map { td($_) }
   1000 0.067115 0.120000     5:      my $var = table(@rows);
   6000 0.078491 0.120000     8:  sub Tr    { my @cells = @_;    return
   3000 0.033798 0.080000     7:  sub table { my @rows  = @_;    return
   1000 0.004823 0.040000     2:      my @rows = ( );
   1001 0.003855 0.030000     1:  for (1..1000) {

According to the Devel::SmallProf manpage, the wallclock's measurements are fairly accurate (we suppose that they're correct on an unloaded machine), but CPU clock time is always more accurate. That's because if it takes more than one CPU time slice for a directive to complete, the time that some other process uses CPU is counted in the wallclock counts. Since the load on the same machine may vary greatly from moment to moment, it's possible that if we rerun the same test a few times we will get inconsistent results.

Let's try to improve the td( ) function and at the same time the Tr( ) and table( ) functions. We will not copy the passed arguments, but we will use them directly in all three functions. Example 9-10 shows the new version of our script.

Example 9-10. table_gen2.pl

for (1..1000) {
    my @rows = ( );
    push @rows, Tr(  map { td($_) } 'a'..'d' );
    push @rows, Tr(  map { td($_) } 'e'..'h' );
    my $var = table(@rows);
}
sub table { return "<table>\n@_</table>\n";}
sub Tr    { return "<tr>@_</tr>\n";      }
sub td    { return "<td>@_</td>";        }

Now let's rerun the code with the profiler:

panic% perl -d:SmallProf table_gen2.pl

The results are much better now—only 0.34 CPU clocks are spent in td( ), versus 0.49 in the earlier run:

panic% sort -k 3nr,3 smallprof.out | less
   5000 0.279138 0.400000  4:      push @rows, Tr(  map { td($_) }
  16000 0.241350 0.340000  9:  sub td    { return "<td>@_</td>";        }
   5000 0.269940 0.320000  3:      push @rows, Tr(  map { td($_) }
   4000 0.050050 0.130000  8:  sub Tr    { return "<tr>@_</tr>\n";      }
   1000 0.065324 0.080000  5:      my $var = table(@rows);
   1000 0.006650 0.010000  2:      my @rows = ( );
   2000 0.020314 0.030000  7:  sub table{ return "<table>\n@_</table>\n";}
   1001 0.006165 0.030000  1:  for (1..1000) {

You will also notice that Devel::SmallProf reports that the functions were executed different numbers of times in the two runs. That's because in our original example all three functions had two statements on each line, but in the improved version they each had only one. Devel::SmallProf looks at the code after it's been parsed and optimized by Perl—thus, if some optimizations took place, it might not be exactly the same as the code that you wrote.

In most cases you will probably find Devel::DProf more useful than Devel::SmallProf, as it allows you to analyze the code by subroutine and not by line.

Just as there is the Apache::DProf equivalent for Devel::DProf, there is the Apache::SmallProf equivalent for Devel::SmallProf. It uses a configuration similar to Apache::DProf—i.e., it is registered as a PerlFixupHandler—but it also requires Apache::DB. Therefore, to use it you should add the following configuration to httpd.conf:

<Perl>
    if (Apache->define('PERLSMALLPROF')) {
        require Apache::DB;
        Apache::DB->init;
    }
</Perl>

<Location /perl>
    SetHandler perl-script
    PerlHandler Apache::Registry
    <IfDefine PERLSMALLPROF>
        PerlFixupHandler Apache::SmallProf
    </IfDefine>
</Location>

Now start the server:

panic% httpd -X -DPERLSMALLPROF &

This will activate Apache::SmallProf::handler during the request. As a result, the profile files will be written to the $ServerRoot/logs/smallprof/ directory. Unlike with Devel::SmallProf, the profile is split into several files based on package name. For example, if CGI.pm was used, one of the generated profile files will be called CGI.pm.prof.