Program: fixstyle (Perl Cookbook, 2nd Edition)

1.23. Program: fixstyle

Old words	New words
bonnet	hood
rubber	eraser
lorry	truck
trousers	pants

The program in Example 1-4 is a filter that changes all occurrences of each element in the first set to the corresponding element in the second set.

When called without filename arguments, the program is a simple filter. If filenames are supplied on the command line, an in-place edit writes the changes to the files, with the original versions saved in a file with a ".orig" extension. See Recipe 7.16 for a description. A -v command-line option writes notification of each change to standard error.

The table of original strings and their replacements is stored below _ _END_ _ in the main program, as described in Recipe 7.12. Each pair of strings is converted into carefully escaped substitutions and accumulated into the $code variable like the popgrep2 program in Recipe 6.10.

A -t check to test for an interactive run check tells whether we're expecting to read from the keyboard if no arguments are supplied. That way if users forget to give an argument, they aren't wondering why the program appears to be hung.

Example 1-4. fixstyle

    #!/usr/bin/perl -w
    # fixstyle - switch first set of <DATA> strings to second set
    #   usage: $0 [-v] [files ...]
    use strict;
    my $verbose = (@ARGV && $ARGV[0] eq '-v' && shift);
    if (@ARGV) {
      $^I = ".orig";          # preserve old files
    } else {
      warn "$0: Reading from stdin\n" if -t STDIN;
    }
    my $code = "while (<>) {\n";
    # read in config, build up code to eval
    while (<DATA>) {
      chomp;
      my ($in, $out) = split /\s*=>\s*/;
      next unless $in && $out;
      $code .= "s{\\Q$in\\E}{$out}g";
      $code .= "&& printf STDERR qq($in => $out at \$ARGV line \$.\\n)"
                                                          if $verbose;
      $code .= ";\n";
    }
    $code .= "print;\n}\n";
    eval "{ $code } 1" || die;
    _ _END_ _
    analysed        => analyzed
    built-in        => builtin
    chastized       => chastised
    commandline     => command-line
    de-allocate     => deallocate
    dropin          => drop-in
    hardcode        => hard-code
    meta-data       => metadata
    multicharacter  => multi-character
    multiway        => multi-way
    non-empty       => nonempty
    non-profit      => nonprofit
    non-trappable   => nontrappable
    pre-define      => predefine
    preextend       => pre-extend
    re-compiling    => recompiling
    reenter         => re-enter
    turnkey         => turn-key

Example 1-5 is a version that's slower for few changes but faster when there are many changes.

Example 1-5. fixstyle2

    #!/usr/bin/perl -w
    # fixstyle2 - like fixstyle but faster for many many changes
    use strict;
    my $verbose = (@ARGV && $ARGV[0] eq '-v' && shift);
    my %change = ( );
    while (<DATA>) {
      chomp;
      my ($in, $out) = split /\s*=>\s*/;
      next unless $in && $out;
      $change{$in} = $out;
    }
    if (@ARGV) {
      $^I = ".orig";
    } else {
      warn "$0: Reading from stdin\n" if -t STDIN;
    }
    while (<>) {
      my $i = 0;
      s/^(\s+)// && print $1;         # emit leading whitespace
      for (split /(\s+)/, $_, -1) {   # preserve trailing whitespace
          print( ($i++ & 1) ? $_ : ($change{$_} || $_));
      }
    }
    _ _END_ _
    analysed        => analyzed
    built-in        => builtin
    chastized       => chastised
    commandline     => command-line
    de-allocate     => deallocate
    dropin          => drop-in
    hardcode        => hard-code
    meta-data       => metadata
    multicharacter  => multi-character
    multiway        => multi-way
    non-empty       => nonempty
    non-profit      => nonprofit
    non-trappable   => nontrappable
    pre-define      => predefine
    preextend       => pre-extend
    re-compiling    => recompiling
    reenter         => re-enter
    turnkey         => turn-key

This version breaks each line into chunks of whitespace and words, which isn't a fast operation. It then uses those words to look up their replacements in a hash, which is much faster than a substitution. So the first part is slower, the second faster. The difference in speed depends on the number of matches.

If you don't care about keeping the whitespace separating each word constant, the second version can run as fast as the first, even for a few changes. If you know a lot about your input, collapse whitespace into single blanks by plugging in this loop:

# very fast, but whitespace collapse
while (<>) {
  for (split) {
      print $change{$_} || $_, " ";
  }
  print "\n";
}

my $pid = open(STDOUT, "|-");
die "cannot fork: $!" unless defined $pid;
unless ($pid) {             # child
      while (<STDIN>) {
      s/ $//;
      print;
  }
  exit;
}


1.22. Soundex Matching		1.24. Program: psgrep