CQ: Perl tutorial

(If you got here by searching for "CQ Perl", you may be looking for the Cute Queries module over on CPAN.)

This entry has seven examples of Perl tricks, idioms, and shortcuts that I think will help
make code more maintainable and readable, and will give you a glimpse of some of the power
of the language. The target audience is Unix/Linux users who have some programming
background and some basic knowledge of Perl and regular expressions.

Unless otherwise specified, each example assumes that the following special scalar is set:

$\ = "\n";

$\ is the "output record separator", meaning the thing that is added automatically to any
print command.

Item 1 : split, $_, and anonymous lists

If we have a variable, $string, that we want to pull the third word out of, the standard
approach is as follows:

$string = "The quick brown fox..."; # Define the string
@array = split / /,$string;         # Split the string by spaces, feed to @array
$answer = $array[2];          # Arrays are 0-indexed, [2] is the third item
print $answer;

Executing this gives us the expected answer of "brown", but to get there we had to define
an array, pass two variables to &split, create a new variable to hold the array element we
wanted, and then pass it to the print statement. A lot of needless work.

A moderately experienced programmer could see that there isn't a need for $answer, and we
can immediately reduce the script to:

$string = "The quick brown fox...";
@array = split / /,$string;
print $array[2];

There are other ways to simplify this, some more strange looking (at first) than others.
There is no need for the array variable in this case; we can index directly into the list
that split returns, using what I'll call an anonymous list (the Perl documentation calls
this a list slice).

An anonymous list simply means that the list never gets a variable name, and Perl discards
it as soon as the statement is done with it, saving memory. This is a great tool when
you're only interested in one item in a function's return list. To create an anonymous
list out of a function, simply wrap the call in parentheses and put the index after it:

$string = "The quick brown fox...";
$answer = (split / /,$string)[2];
print $answer;

The split function itself can be simplified, as both arguments passed to it are optional.
The default scalar to act on is the special variable $_, and the default pattern behaves
like /\s+/, or "one or more whitespace characters", except that it also skips any leading
whitespace. For the layout of our sentence, / / and /\s+/ match the same items.
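
The difference between / / and the default only shows up on messier input. Here is a
small sketch that compares the two on a made-up string with leading and doubled spaces
(it is not one of the examples in this tutorial). The variable names are mine, and
split ' ' is the explicit spelling of what a bare split does:

```perl
use strict;
use warnings;

my $messy = "  The quick  brown fox";      # leading and doubled spaces

my @by_single_space = split / /, $messy;   # cuts at every single space
my @by_default      = split ' ', $messy;   # what a bare "split" does to $_

print scalar(@by_single_space), " fields\n";   # 7 fields, some of them empty
print scalar(@by_default), " fields\n";        # 4 fields
print "third word: $by_default[2]\n";          # third word: brown
```

With / /, every extra space produces an empty field, so [2] would no longer be the
third word.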

If we replace $string with $_, we can save a lot of typing:

$_ = "The quick brown fox...";
$answer = (split)[2];
print $answer;

"Hey wait! Can't I just say 'print (split)[2]'?" No, you can't. The parentheses and
brackets are ambiguous when passed to the print function that way: Perl reads (split) as
print's complete argument list and then trips over the [2]. You can, however, wrap
the whole mess in another set of parentheses:

$_ = "The quick brown fox...";
print ((split)[2]);

So now we're down to two lines, no extra array or scalar variables, and a net savings of
44 characters. Not bad.

Trying to trim things down to this level adds other complications, such as not being able
to quote or interpolate the results. 'print "The answer is ((split)[2])";' will print out
the literal text "The answer is ((split)[2])". Two workarounds to that are to scoot the
anonymous list outside of the quotes, like so:

print "The answer is " . ((split)[2]);

...or to use another interpolation method, the anonymous array. An array is like a list,
but it has additional properties, such as returning the number of items if it is called in
a scalar context. For example:

@array = qw/red green blue/;
print scalar @array;

will return the number 3, the number of items in the array. An anonymous array can be
interpolated into a quoted string by using the @{} and ${} dereference operators. The
first works on the array as a whole, the second works on an element to be specified later.

To create an anonymous array out of a function like &split that returns a list, surround
the function call in brackets, e.g., [split / /, $_], or just [split]. Putting that into a
quoted string to be printed and referencing only one element looks a little odd:

$_ = "The quick brown fox...";
print "The answer is ${[split]}[2]";

That's a little too much punctuation for my taste, so I like to stick with lists and
avoid quotes whenever I can.

Item 2 : for(), the array/list helper

This example will operate on the following array:

@array = qw/ one two three four five six seven /;

The qw// function returns a list given a whitespace delimited string. The above line is
the same as saying: "@array = ('one', 'two', 'three', ...)", but it is much easier to
type, and it reads better.

The for() loop works in two ways. It can iterate through the items of a list, or
it can work like a C or Java for() loop. The line:

for (my $n=0; $n<=$#array; $n++)  { print $array[$n]; }

...performs the same function as this line:

for (@array) { print $_; }    # $_ is set to the current element in a for() loop

... or this one, which uses a shortcut syntax allowed when you only have one thing to do
in a conditional loop:

print $_ for @array;

To make things even better, we can remember that $_ is the default variable for most
functions, like print:

print for @array;

Each of the above statements will print out the list you expect, the numbers one, two,
etc., each on their own line. They will each be on their own line only because we set $\,
the "output record separator" to a newline, otherwise we would have to interpolate a
newline and the array elements:

$\ = undef; # The default output record separator value (my cannot be used on $\)
for (my $n=0; $n<=$#array; $n++)  { print "$array[$n]\n"; }
print "$_\n" for @array;

Since for() operates on lists and not explicitly on arrays, we can do some fancy tricks,
like throwing in the grep() function to only display list items containing the letter e:

for (grep {/e/} @array) { print $_; }  # same as: print for qw/one three five seven/;

The list that is iterated through here is the result of grepping @array for /e/. On each
iteration, $_ is updated, just like the for(@array) lines above. This can be simplified
(although it will look funny until you're used to it) to:

print for grep {/e/} @array;

You may notice at this point that print and grep are functions, and that for (used here as
a statement modifier) sits between them. This tells us that the line is evaluated starting
on the right, with each result handed to whatever is on its left, all of which can be done
pretty much free of punctuation. It also looks like for() returns a list to print() and
calls print() one time. It doesn't. for() iterates through its list, and runs the
statement on the left once per element, updating $_ as it goes. Its sister loop modifier,
while(), can sit in the same position, but it repeats the statement for as long as its
condition is true rather than walking through a list.
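
One way to convince yourself that the statement runs once per element is to count.
This is just a sketch; @seen and $calls are helper variables I made up for the
demonstration:

```perl
use strict;
use warnings;

my @array = qw/ one two three four five six seven /;
my @seen;        # records each value that $_ takes
my $calls = 0;   # counts how often the statement on the left runs

# The do-block runs once for every element that grep lets through.
do { push @seen, $_; $calls++ } for grep { /e/ } @array;

print "ran $calls times: @seen\n";   # ran 4 times: one three five seven
```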

Item 3 : More array helpers, shift/unshift and pop/push

While for() operates on lists or arrays, pop/push and shift/unshift only operate on
arrays. The functions shift and unshift operate on the left side of an array. If no array
is specified, shift works on @_, the array used to pass arguments to a subroutine (outside
of a subroutine it works on @ARGV instead). unshift always needs to be told which array to
use.

shift() returns the first element of an array, and deletes it from the array.

@array = qw /two three four/;
$answer = shift @array;

In this example, @array gets shortened to ('three', 'four'), and the scalar $answer
becomes 'two'.

unshift @array, $answer;

This sets @array back to what it was initially.

As I hinted at above, shift() is commonly used in subroutines to pull values out of the @_
array. @_ is the default array shift() operates on, and the array subroutines use to pass
options.

sub do_stuff  {
    $first = shift;
    $second = shift;
    print "First - $first";
    print "Second - $second";
}
 
&do_stuff ('one', 'two'); # prints "First - one\nSecond - two"

The list ('one', 'two') gets passed to &do_stuff, which assigns it to @_. The shift
statements operate on @_ by default, pulling the first item off the array for each call.
The array @_ never needs to be explicitly mentioned, however. This is still a lot of code
for something so simple, and can be reduced a couple obvious ways:

sub do_stuff  {
    print "First - " . shift;
    print "Second - " . shift;
}

Or shift() can be removed and @_ can be referenced:

sub do_stuff { print "First - $_[0]\nSecond - $_[1]"; }
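
Another common way to unpack @_ is to copy it into named variables with a single
list assignment. Here is a minimal sketch of that approach. The sub name
describe_pair is made up, and it returns the string rather than printing it so the
result is easy to check:

```perl
use strict;
use warnings;

sub describe_pair {
    my ($first, $second) = @_;   # copy both arguments out of @_ in one step
    return "First - $first\nSecond - $second";
}

print describe_pair('one', 'two'), "\n";
```

Unlike shift, this leaves @_ untouched, which matters if the sub looks at its
arguments again later.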

The functions pop and push do the same things as shift and unshift, but operate on the
right side of an array. A cool use for pop() is in conjunction with a split() function to
pull off the last word of a string:

$_ = "The quick brown fox";
@array = split;
print pop @array;       # prints 'fox'

Cool, but less efficient than using a regular expression to do the same thing:

$_ = "The quick brown fox";
/\s(\S+)\s*$/; # Matches the last set of non-whitespace (\S) in a string
print $1;      # $1 is the match in the first set of parentheses, undef if no match

Or we could exploit a previous example and the fact that a negative list index starts from
the right:

$_ = "The quick brown fox";
print ((split)[-1]);    # prints 'fox'

The power of shift() and pop(), though, is that they can be called repeatedly to
iterate through, and ultimately destroy, an array.
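
A sketch of that kind of loop follows. The names @queue and @processed are mine, and
uc (uppercase) is only a stand-in for whatever real work each element needs:

```perl
use strict;
use warnings;

my @queue = qw/one two three/;
my @processed;

# Testing the array itself, rather than the value shift returns, keeps the
# loop going even when an element is 0 or an empty string.
while (@queue) {
    my $item = shift @queue;     # take the leftmost element
    push @processed, uc $item;   # stand-in for real work
}

print "processed: @processed\n";                 # processed: ONE TWO THREE
print "left in queue: ", scalar(@queue), "\n";   # left in queue: 0
```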

A common trick is to mix push and shift (or unshift and pop) to "rotate" an array:

@array = qw/one two three/;
push @array, shift @array;  # @array is now ('two', 'three', 'one');
unshift @array, pop @array; # @array is now ('one', 'two', 'three');

Item 4 : Hashes, references, and embedding

People get choked up about hashes until they use them a little bit, then they never go
back. In fact, it was the combined power of hashes and regular expressions that won me
over to Perl. Explaining what they are and simple usage is easy: Hashes are arrays, just
replacing numeric indices with strings. When referring to hashes, use % instead of @, {}
instead of [], unique names instead of indices, and order doesn't matter.

Consider these two snippets, both trying to print out llama's password:

# Using arrays
@usernames = qw /camel llama owl gecko ram panther/;
@passwords = qw /l337  hax0r 9e9 0wnzz j00 ph34rm3/;
$count = 0;
for(0..$#usernames) { # returns a list from 0 to the last index of @usernames
                      # For more on the .. operator, see item 6
    print $passwords[$count] if $usernames[$count] eq 'llama';
    $count++;
}
 

# Using hashes
%passwords = ('camel'=>'l337', 'llama'=>'hax0r', 'owl'=>'9e9', 'gecko'=>'0wnzz',
'ram'=>'j00', 'panther'=>'ph34rm3');
print $passwords{'llama'};

The second example is clearly easier. The => symbol in the hash definition means the same
thing as a comma, and is used only for the visual separation. Defining a hash can be done
one element at a time, for example :

$passwords{'sk8r'}='k-r4d';

...or by passing it a list. The order of list items is important when passing a list to a
hash constructor. For each element of the list, one item is the key, and the next is the
value. This means that my %passwords = line from above could be changed to:

%passwords = qw/camel l337 llama hax0r owl 9e9 gecko 0wnzz ram j00 panther ph34rm3/;

...which is perfectly legal, and easier to type, but much harder to read. Use the => and
comma methods to help you follow the code later. In this case, an ounce of prevention...

Hash keys must always be strings. Hash values, however, can be anything a scalar can be: a
string, number, or reference to another data type. A common use for hashes is to have
their values be references to other hashes. This sounds confusing at first, but consider
the following:

%users = ();   # Start with an empty hash
$users{llama}{password}     = 'hax0r';
$users{llama}{screen_width} = 800;
$users{llama}{last}        = 1067978749;  # Nov 4, 2003, about 15:45
 
$users{camel}{password}     = 'l337';
$users{camel}{screen_width} = 1024;
$users{camel}{last}        = 0;        # Never logged on after signing up

Here %users is a hash whose keys are usernames (the quotes are optional for keys, by the
way), and whose values are other hashes. This syntax goes by a number of different names:
embedded hash, hash of hashes, HoH. What it really is, though, is a hash of anonymous
hash references. It could also be defined like so:

%users = ('llama' => {'password'=>'hax0r', 'screen_width'=>800, 'last'=>1067978749},
'camel' => {'password'=>'l337' , 'screen_width'=>1024, 'last'=>0} );

%users has only two items: the key 'llama', whose value is a reference to an anonymous
hash, and the key 'camel', whose value is also an anonymous hashref. If I printed out the
elements of %users, it would look like this:

while ( ($key,$value) = each %users ) { print "$key : $value"; }
__OUTPUT__
camel : HASH(0x20226a30)
llama : HASH(0x20226988)

So the values themselves are references to anonymous hashes, meaning (as I stated in item
1) that they don't take up a variable name, and as soon as Perl doesn't need them, it
tries to reclaim the memory they take up. Don't fret about them going away unexpectedly,
though. You would need to clear %users or explicitly change a value for a hashref to get
erased, i.e., you would have to want it gone.

My current favorite module, Data::Dumper, gives us output about %users that seems more
sensible:

use Data::Dumper;
print Dumper \%users;
__OUTPUT__
$VAR1 = {
          'camel' => {
                       'last' => 0,
                       'screen_width' => 1024,
                       'password' => 'l337'
                     },
          'llama' => {
                       'last' => 1067978749,
                       'screen_width' => 800,
                       'password' => 'hax0r'
                     }
        };

It should be clear that a hash of hashes can be pretty handy for simple database-like
constructs. What's llama's password? Look in $users{llama}{password}. When did camel last
log on? $users{camel}{last}.

I introduced the "each" operator above. That's pretty handy, too. It returns a single
key/value pair from a hash. Calling it again returns the next pair, and that will
continue until the end of the hash is reached, after which it returns an empty list. So
you can iterate through a hash with a while() loop that calls each. The alternate way,
which is slightly slower on large hashes, but takes up fewer variables, is using the
"keys" operator. "keys" returns a list of a hash's keys. "values" returns a list of a
hash's values, but there is no easy way to correlate them back to their keys, so it is
rarely used.

for (keys %users) { print "$_ : $users{$_}"; }

and

while ( ($key,$value) = each %users ) { print "$key : $value"; }

output the same thing.
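
Neither loop promises any particular order, since hashes are unordered. When the order
matters, one option is to sort the keys first. Here is a sketch that walks the whole
%users structure that way. The variable names $name and $setting and the @lines array
are my own:

```perl
use strict;
use warnings;

my %users = (
    llama => { password => 'hax0r', screen_width => 800,  last => 1067978749 },
    camel => { password => 'l337',  screen_width => 1024, last => 0 },
);

my @lines;
for my $name (sort keys %users) {
    # $users{$name} is a hash reference; %{ } turns it back into a hash
    for my $setting (sort keys %{ $users{$name} }) {
        push @lines, "$name.$setting = $users{$name}{$setting}";
    }
}
print "$_\n" for @lines;
```

The first line printed is "camel.last = 0", and the output is the same on every run.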

A "reference" is like a C pointer. Setting a scalar to a reference usually involves a
backslash. De-referencing a reference usually involves using an idiom such as $$, @{}, or
%{}. The code

$hash_reference = \%users;

will make $hash_reference refer to the %users hash. This is useful for passing a hash to
a subroutine without needing to pass every key and value. Now if I print $hash_reference,
I'll see something like HASH(0x.....). If I want to operate on the contents of %users, I
would use the $$ idiom. For example, llama throws out his 14 inch monitor and buys a 17
inch, changing his resolution to 1024x768:

$$hash_reference{llama}{screen_width} = 1024;

Some people prefer to use the -> dereference method, like so:

$hash_reference->{llama}{screen_width} = 1024;

but I personally don't like the way that looks. Their meaning and speed are identical,
though. Perl sees them as the same operation at compile time.

The above lines don't do us any good, though, they just take up more characters. What is
more useful is getting a reference to $users{llama} so that we can cycle through all of
llama's settings. Fortunately, $users{llama} is already a reference, so no backslash is
needed in the assignment.

$llama = $users{llama};
for (keys %$llama)   { print "$_ : $$llama{$_}"; }
__OUTPUT__
last : 1067978749
screen_width : 800
password : hax0r

Notice here that %$llama refers to the hash that $llama is a reference to, and that
$$llama{x} refers to a single value in the hash that $llama refers to. $llama is a
reference, and %$llama and $$llama dereference it in hash and element context,
respectively. (This type of thing took me a long time to get used to.)
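
Passing a hash to a subroutine by reference, which I mentioned earlier, might look like
the following sketch. set_width is a made-up helper, not something from a module:

```perl
use strict;
use warnings;

my %users = (
    llama => { password => 'hax0r', screen_width => 800 },
    camel => { password => 'l337',  screen_width => 1024 },
);

# set_width (hypothetical) receives a reference, so it changes the caller's
# hash instead of a copy of it.
sub set_width {
    my ($users_ref, $name, $width) = @_;
    $$users_ref{$name}{screen_width} = $width;   # or $users_ref->{$name}{screen_width}
    return;
}

set_width(\%users, 'llama', 1024);
print "llama: $users{llama}{screen_width}\n";   # llama: 1024
```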

Item 5 : The power of the command line

This example assumes that the file "test.txt" has the following content:

line 1
line 2
line 3

Perl can be called from a command line, and it can do some powerful things by smart use of
certain command line switches. The -e switch should be considered a given in command line
Perl. That is the "expression" switch, used to pass Perl statements to the interpreter.

perl -e 'print "Hello World\n"'

The above line does just what you expect, it prints Hello World followed by a linefeed to
the terminal. You can also do simple math:

perl -e 'print 5+6'

The above will print 11 with no newline after it. There are a number of approaches to get
the newline printed after the answer. All of these print the same thing:

perl -e 'print 5+6,"\n"'         # multiple arguments to print, not concatenation
perl -e '$\="\n"; print 5+6'     # Set output record separator to linefeed
perl -e '$ans=5+6; print "$ans\n"'  # Set scalar, then interpolate it with linefeed

Each of these is a little more silly than the previous. The easiest way to get our beloved
linefeed (which becomes more important when multiple things will be printed) is to use the
-l switch. -l (mnemonic aid = Linefeed) chomps linefeeds off of input lines, and adds them
to output lines.

perl -le 'print 5+6'

Concise, easy to read, and 4 characters shorter than the best answer from above.

The real power of commandline Perl comes from the -p and -n switches (Print and No-print).
Both switches tell Perl to iterate through each line of input, assign each line to $_, and
pass $_ to the given expression. -p will print $_ after each iteration, -n will not. Here
is a simple example that simply mimics the cat command:

perl -pe 1 test.txt
__OUTPUT__
line 1
line 2
line 3

This will take each line of test.txt, assign it to $_, and pass it to the expression "1".
As you might expect, "1" doesn't do anything, and so $_ is finally printed unchanged, and
so the output is the same as the input file.

If I wanted to simply see the contents of a file, I wouldn't use that command. I would say
"cat test.txt". If I change my simple "1" expression to a regular expression statement, I
have the functionality of sed:

perl -pe s/line/potato/ test.txt
__OUTPUT__
potato 1
potato 2
potato 3

(By the way, since "s/line/potato/" doesn't have any special shell characters or spaces, I
don't need quotes around it on the command line.) If I really wanted to just change the
first occurrence of "line" on each line of the file to "potato", I would probably say "sed
s/line/potato/ test.txt" instead. Once you get into the advanced stuff, sed and Perl have
stylistic differences, but similar power. For example, these two commands both print the
same output:

sed '/line 2/!s/line/potato/' test.txt
perl -pe 's/line(?! 2)/potato/' test.txt
__OUTPUT__
potato 1
line 2
potato 3

The sed command says "Where 'line 2' is not found, search for 'line' and replace with
'potato'." The Perl command says "Search for 'line' that is not followed by ' 2' and
replace it with 'potato'." To me, the Perl line is more readable. Your mileage may vary.

The -n switch (-p's counterpart) is a little more handy in real-world applications for the
simple fact that it doesn't dump a bunch of output you don't need to STDOUT. It wraps the
same while loop around the given expression, but doesn't print $_ after every iteration. A
simple use for that is to print only when you've found what you're looking for. In other
words, grep.

perl -ne 'print if /2$/' test.txt
__OUTPUT__
line 2

It is easier to type "grep 2$ test.txt", though.
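
It can help to see the loop that -n and -p put around the expression. The following
script is only an approximation of what Perl does for 'perl -pe s/line/potato/ test.txt'.
To stay self-contained it reads the three lines from an in-memory string instead of
test.txt, and it collects the result in $output before printing it. Both of those are my
own additions:

```perl
use strict;
use warnings;

my $input = "line 1\nline 2\nline 3\n";   # stands in for test.txt
open my $in, '<', \$input or die "cannot open in-memory file: $!";

my $output = '';
while (<$in>) {          # -n and -p both supply this loop
    s/line/potato/;      # the -e expression
    $output .= $_;       # -p adds a print here; -n leaves it out
}
close $in;

print $output;
```

Deleting the line that appends to $output gives the -n behavior: the loop still runs,
but nothing comes out unless the expression prints it.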

The -a switch stands for "autosplit", but I use the mnemonic aid "Awk" instead. Combine
this with the -n switch (and -l to add newlines), and you have the power of Awk with Perl
syntax. The -a switch splits a line by whitespace into the array @F. The split can use
other regexes if specified with the -F switch, which I put immediately after -a.

perl -lane 'print $F[1]' test.txt
__OUTPUT__
1
2
3

Which is the same as "awk '{print $2}' test.txt". (Remember that arrays are 0 indexed, so
$F[1] is the second element, the same as awk's $2.) Splitting on /n/ instead of whitespace
looks like this:

perl -aFn -lne 'print $F[1]' test.txt
__OUTPUT__
e 1
e 2
e 3

The same as "awk -Fn '{print $2}' test.txt".

So far, I haven't done anything from the Perl command line that I couldn't do with cat,
sed, grep, or awk in fewer keystrokes. I could argue that the power of Perl combined with
the above commandline switches outweighs anything that the other programs could do
individually, but I'll save that for a simple metrics-grabber example in item 7. Instead,
I'll introduce the coup de grace -- the -i switch.

The -i switch is the in-place file editor. Used in conjunction with -p or -n, the in-place
editor reads a line in, passes it to the -e expression, and all output (if there is any)
replaces the original file. An optional extension can be placed after -i to back up the
original file. Here are a couple of examples from a shell capture:

$ ls -l test.txt*
-rw-------   1 curtis  staff            21 Nov 20 16:00 test.txt
$ cat test.txt
line 1
line 2
line 3
$ perl -i~ -pe s/line/potato/ test.txt
$ cat test.txt
potato 1
potato 2
potato 3
$ ls -l test.txt*
-rw-------   1 curtis  staff            27 Dec 01 10:13 test.txt
-rw-------   1 curtis  staff            21 Nov 20 16:00 test.txt~
$ perl -i -ne 'print unless /3$/' test.txt
$ cat test.txt
potato 1
potato 2
$ ls -l test.txt*
-rw-------   1 curtis  staff            18 Dec 01 10:14 test.txt
-rw-------   1 curtis  staff            21 Nov 20 16:00 test.txt~
$ cat test.txt~
line 1
line 2
line 3
$

Initially, test.txt* returns only one file, which cat shows contains "line 1" through
"line 3". The first Perl command specifies "~" as the backup extension, and replaces all
"line"s with "potato"es. Notice the backup file keeps the original file date.

The next Perl command omits the backup extension. If I used "~" again, my original backup
would be overwritten. This time, I'm printing each line with no changes, but not printing
anything if the line ends with 3. Doing a cat of test.txt afterwards shows that the third
line has been deleted.

The -i switch gives the Perl command line significant muscle. Compare the following Perl
commands with the same processes using temp files, sed and mv:

perl -i -pe s/line/potato/ test.txt
sed s/line/potato/ test.txt > tmpfile; mv tmpfile test.txt

perl -i~ -pe s/line/potato/ test.txt
sed s/line/potato/ test.txt > tmpfile; mv test.txt test.txt~; mv tmpfile test.txt

When working with live files and the -i switch, don't get crazy. A small mistake can cost
you your original file:

$ cat test.txt
line 1
line 2
line 3
$ perl -i -ne 1 test.txt
$ cat test.txt
$

I meant to say "-pe" instead of "-ne", which would have preserved the file. Double-check
what you're doing, and always use a backup extension.
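
The same in-place edit can be done from inside a script by setting the special variable
$^I, which holds the value of the -i switch. This is a sketch rather than a recipe: it
builds its own copy of test.txt in a temporary directory (via File::Temp) so it cannot
hurt a real file, and slurp is a helper I made up for looking at the results:

```perl
use strict;
use warnings;
use File::Temp qw/tempdir/;

# slurp (hypothetical helper): return a whole file as one string
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "cannot read $path: $!";
    local $/;                    # no input separator: read everything at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/test.txt";
open my $out, '>', $file or die "cannot write $file: $!";
print {$out} "line 1\nline 2\nline 3\n";
close $out or die "cannot close $file: $!";

{
    local ($^I, @ARGV) = ('~', $file);   # like -i~ with test.txt as the argument
    while (<>) {
        s/line/potato/;
        print;                           # lands in the new test.txt, not on the terminal
    }
}

print slurp($file);        # the edited file
print slurp("$file~");     # the backup, still the original text
```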

Item 6 : Ternary and range/flip-flop operators

These examples operate on the file test3.txt, which is a pseudo-XML file and contains the
following:

<data>don't print
<data>don't print
<range>
<data>print me
<data>print me, too
<end range>
<data>don't print
<data>don't print

A ternary (or tertiary) operation is a three-part operation. In Perl, as in C and other
C-based languages, the "?" and ":" characters are used to divide the test and the true and
false values. The basic layout is:

test ? true value : false value

For example:

print $var > 3 ? "Greater" : "Not greater";

This will print "Greater" if $var is greater than 3, "Not greater" otherwise. The true and
false values can be functions, variables, or literals. Ternary operators are just a
cleaner way of representing a simple if/else construct. The above line is the same as:

if ($var > 3){print "Greater";}
else {print "Not greater";}

Ternary operators may seem like so much fluff, but when their true and false values are
functions, they can help clean up the logic of otherwise stringy code. Consider the
following code, which prints all <data> lines between <range> and <end range>, and then
prints a final line count:

while(<>)   {
    $flag = 1 if /<range>/;
    $flag = 0 if /<end range>/;
    if ($flag && /<data>/)   {
        print;
        $cnt++;
    }
}
print "$cnt\n";
__OUTPUT__
<data>print me
<data>print me, too
2

...and similar code using a conveniently placed ternary operator to kill two lines of code
and one level of nesting:

while(<>)   {
    $flag = 1 if /<range>/;
    $flag = 0 if /<end range>/;
    $flag && /<data>/ ? $cnt++ : next;
    print;
}
print "$cnt\n";
__OUTPUT__
<data>print me
<data>print me, too
2

A little bit nicer, and if I was already nested 3 blocks deep, this would be the
difference between readable and unreadable code. A common use for ternary operators is in
conjunction with Perl's built-in "wantarray" function. Called inside a subroutine,
"wantarray" can tell if the calling line is using the subroutine in scalar or list
context. "@array = some_function($var)" is calling some_function in list context, whereas
"$scalar = some_function($var)" is calling some_function in scalar context. Subroutines
that detect whether a list or a scalar is called for often end like this:

return wantarray ? @the_array : $the_scalar;

A good fit for the ternary operator, and perfect for readability.
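
Here is a complete, if contrived, sketch of such a subroutine. The sub name words is
made up. It hands back the words themselves in list context and just the word count in
scalar context:

```perl
use strict;
use warnings;

# words (hypothetical): the list of words in list context,
# the number of words in scalar context.
sub words {
    my ($text) = @_;
    my @found = split ' ', $text;
    return wantarray ? @found : scalar @found;
}

my @list  = words("The quick brown fox");   # list context
my $count = words("The quick brown fox");   # scalar context

print "list: @list\n";      # list: The quick brown fox
print "count: $count\n";    # count: 4
```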

Since we're looking for data that is in a range, (between <range> and <end range>), this
is a good place to talk about the range operator, "..". The range operator is usually used
to output a list, but in some cases it is used to do some interesting text manipulation.
The basic use is with a for loop:

perl -e 'for (1..5) {print $_}'
__OUTPUT__
12345

The "1..5" returns a list of (1, 2, 3, 4, 5), which the for loop iterates through, calling
print 5 times. The for loop can be avoided:

perl -e 'print 1..5'
__OUTPUT__
12345

When an alpha range is specified, it does what you'd expect:

perl -e 'print a..t'
__OUTPUT__
abcdefghijklmnopqrst

Alpha ranges increment the same way they do in Excel columns. After Z comes AA, AB, AC,
etc. "print aa..zz" would print out 676 pairs of letters. Neat, but not used much in
real-world code.

When the range operator is used with regular expressions as its operands, it takes on
some special properties, and because of them is sometimes referred to as the flip-flop
operator. The range operator will return false until its first regular expression finds a
match. Once a match is found, it will return true until the second regex finds a match.
After that, it starts looking for the first regex again.

This will let us access the range of lines in test3.txt that we are interested in very
easily:

while(<>) {
if ( /<range>/ .. /<end range>/ ) { print; }
}
__OUTPUT__
<range>
<data>print me
<data>print me, too
<end range>

To do the same thing as some of the examples above, we need to filter out only lines that
contain <data>, and also keep a running count:

while(<>) { if (/<range>/../<end range>/ and /<data>/) { print; $cnt++; } }
print "$cnt\n";

Handy and compact.
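
For trying the flip-flop without creating test3.txt, here is a self-contained sketch
that keeps some of the same lines in an array instead of reading a file. The @lines and
@kept arrays are my own additions:

```perl
use strict;
use warnings;

my @lines = (
    "<data>don't print",
    "<range>",
    "<data>print me",
    "<data>print me, too",
    "<end range>",
    "<data>don't print",
);

my @kept;
for (@lines) {
    # The flip-flop is true from the <range> line through the <end range> line.
    push @kept, $_ if (/<range>/ .. /<end range>/) && /<data>/;
}

print "$_\n" for @kept;       # the two "print me" lines
print scalar(@kept), "\n";    # 2
```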

Item 7 : Adding strings, sprintf, x, localtime, and file tests

I'll sum up this tutorial by solving the problem of reliable metrics gathering by date
with a small one-liner that uses some tricks covered in the above examples, and a couple
new ones I'll introduce here.

localtime:

Perl's localtime function takes a time (defaults to the current time if none is specified)
and returns either a list of numbers giving that time's second, minute, hour, etc. (in
list context), or a formatted string of that local time (in scalar context). It is the
counterpart of gmtime, which returns the same thing as localtime, but with no timezone
adjustments.

perl -e '$\=","; print for localtime'
__OUTPUT__
37,33,13,1,11,103,1,334,0,

 37   Seconds
 33   Minutes
 13   Hours
  1   Day of month
 11   Month, 0 indexed
103   Years since 1900
  1   Day of week, 0 indexed
334   Julian day of year, 0 indexed
  0   Daylight savings time flag
 
$ perl -le 'print scalar localtime'                      (Now)
Mon Dec  1 13:39:03 2003
$ perl -le 'print scalar localtime(0)'                   (The "epoch")
Wed Dec 31 19:00:00 1969
$ perl -le 'print scalar gmtime(0)'                      (The epoch according to GMT)
Thu Jan  1 00:00:00 1970

In list context, the month number is 0-indexed, and the year is the number of years since
1900. So, in order to get the current YYYYMMDD, I need to add 1900 to the year, and 1 to
the month.

perl -le '@now=localtime; $now[5]+=1900; $now[4]+=1; print @now[5,4,3];'
__OUTPUT__
2003121

Close to what I'm looking for, which is 20031201. Perl uses C's printf and sprintf
functions, so I can use those for formatting:

perl -e '@now=localtime; printf "%04d%02d%02d", $now[5]+1900, $now[4]+1, $now[3];'
__OUTPUT__
20031201

x:

The x operator is a simple repeat operator for strings. "b" x 3 is the same as "bbb". In
our above example, the %04d for the year could be replaced by %02d, since the year is
already the correct number of digits. Then the quoted string could be replaced by
"%02d"x3, which will save a few characters:

perl -e '@now=localtime; printf "%02d"x3, $now[5]+1900, $now[4]+1, $now[3];'
__OUTPUT__
20031201

Adding strings:

Perl is a "loosely typed" language, meaning that you don't need to declare that a variable
is an integer, a long, or a string. Variables are cast and converted appropriately as
needed. If I add 1900 to "103", I get 2003. If I add 19000100 to "1031101", I get
20031201. Consider the following:

$now = sprintf "%02d" x 3, (localtime)[5,4,3];  # Returns "1031101"

Here we are assigning to $now the results of our sprintf command (printf can only be used
for printing, not for variable assignment). The sprintf command uses "%02d%02d%02d" as its
template, and operates on the elements at indices 5, 4, and 3 (year, month, and day of
month) of the anonymous list returned by localtime.

The result is a string, but if I add 19000100 to it, the string gets cast as an integer
automatically, returning to me the desired integer that is today's YYYYMMDD date:
20031201. Here are a pair of one-liners demonstrating this concept:

$ perl -le '$now = sprintf "%02d" x 3, (localtime)[5,4,3]; print $now'
1031101
$ perl -le 'print( 19000100 + sprintf"%02d"x3,(localtime)[5,4,3] )'
20031201
$
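
If the arithmetic trick feels too clever, the same date can be built with a plain
sprintf. Here is a sketch of both, side by side. The sub names are made up, and I use
gmtime instead of localtime only so that the result doesn't depend on the time zone of
the machine running it. The timestamp 1070285943 is Dec 1, 2003, 13:39:03 GMT:

```perl
use strict;
use warnings;

# Plain version: fix up the year and month, then format.
sub yyyymmdd_plain {
    my ($epoch) = @_;
    my ($day, $month, $year) = (gmtime $epoch)[3, 4, 5];
    return sprintf "%04d%02d%02d", $year + 1900, $month + 1, $day;
}

# Arithmetic version from the text: adding 19000100 bumps the year by 1900
# and the month by 1 in one go.
sub yyyymmdd_added {
    my ($epoch) = @_;
    return 19000100 + sprintf "%02d" x 3, (gmtime $epoch)[5, 4, 3];
}

print yyyymmdd_plain(1070285943), "\n";   # 20031201
print yyyymmdd_added(1070285943), "\n";   # 20031201
```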

File tests and the stat command:

Perl has some built-in file tests that mimic the Unix test command. -e returns true if a
file exists. -s returns its size. -d returns true if what is being looked at is a
directory. There are a few others, but I won't go into them here.

print -e "file.txt" ? "It exists" : "It doesn't exist";

If file.txt exists, this ternary operation will say so.

$total_sizes += -s "file.txt" if -e "file.txt";

If file.txt exists, its size (in bytes) will get added to $total_sizes.

$_ = "file.txt"; $total_sizes += -s if -e && !-d;

Same as the previous example, but only if file.txt is not a directory. This exploits the
fact that $_ is the default variable for all the file test commands.

The Perl stat command tells us a lot about the file being looked at. Like localtime, it
returns a numerical list of attributes. stat returns a 13 element list in the following
order:

 0 dev      device number of filesystem
 1 ino      inode number
 2 mode     file mode  (type and permissions)
 3 nlink    number of (hard) links to the file
 4 uid      numeric user ID of file's owner
 5 gid      numeric group ID of file's owner
 6 rdev     the device identifier (special files only)
 7 size     total size of file, in bytes
 8 atime    last access time in seconds since the epoch
 9 mtime    last modify time in seconds since the epoch
10 ctime    inode change time in seconds since the epoch
11 blksize  preferred block size for file system I/O
12 blocks   actual number of blocks allocated

The expression "(stat 'file.txt')[9]" gives us the time file.txt was last modified, in
seconds since the epoch. This can be fed to localtime to give us a readable time:

$ perl -le 'print ((stat "modules.txt")[9])'
1066321983
$ perl -le 'print scalar localtime 1066321983'
Thu Oct 16 12:33:03 2003
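
To experiment with stat without having a modules.txt lying around, the following sketch
creates a scratch file with File::Temp and stamps it with the same modification time
using utime. The scratch file and its contents are my own invention:

```perl
use strict;
use warnings;
use File::Temp qw/tempfile/;

# Make a scratch file so the example does not depend on modules.txt existing.
my ($fh, $file) = tempfile(UNLINK => 1);
print {$fh} "hello\n";
close $fh or die "cannot close $file: $!";

# Set the access and modification times to the timestamp used above.
utime 1066321983, 1066321983, $file or die "cannot set times on $file: $!";

my ($size, $mtime) = (stat $file)[7, 9];   # elements 7 and 9 of stat's list
print "size:  $size\n";
print "mtime: $mtime\n";
print "-s agrees with stat\n" if $size == -s $file;
```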

So modules.txt was last modified Thursday, October 16. What if I just wanted the year,
month and day?

$_="modules.txt";
$mod_time = (stat)[9];
print( 19000100 + sprintf"%02d"x3,(localtime $mod_time)[5,4,3] );
__OUTPUT__
20031016
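
For anyone who finds the 19000100 arithmetic hard to follow, here is the same calculation
spelled out, next to an alternative using strftime from the core POSIX module. The sub
names ymd_trick and ymd_strftime are made up for this sketch. Both work in local time,
so they should agree with each other on any given machine.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# ymd_trick and ymd_strftime are hypothetical names for this sketch.

# The trick from the text, spelled out. localtime returns the year as
# years since 1900 and the month as 0..11, so gluing the three fields
# together and adding 19000100 adds 1900 to the year and 1 to the month.
sub ymd_trick {
    my ($epoch) = @_;
    my ($year, $mon, $mday) = (localtime $epoch)[5, 4, 3];
    return 19000100 + sprintf("%02d%02d%02d", $year, $mon, $mday);
}

# The same result via strftime, which needs no offsets at all.
sub ymd_strftime {
    my ($epoch) = @_;
    return strftime("%Y%m%d", localtime $epoch);
}

# Usage: both columns should match for the current time.
my $now = time;
print ymd_trick($now), " ", ymd_strftime($now), "\n";
```

In a script that other people have to maintain I would reach for the strftime version;
the arithmetic version earns its keep only when every character counts.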

All this can be squished down into an ugly one-liner, eliminating $mod_time:

perl -e '$_="modules.txt";print(19000100+sprintf"%02d"x3,(localtime((stat)[9]))[5,4,3])'

It isn't pretty, but given a filename it is a reliable, relatively compact way to get the
last modified date in YYYYMMDD format. Why am I so concerned about that format? Because it
makes a great hash key: the keys fall into chronological order with a plain sort.

Suppose I want to know how many bytes of data come through my system each
day. Using some command line switches, file tests, regular expressions, the above ugly
one-liner, and what we know about hashes, we can build a list of dates and total file
sizes for files modified on those days.

We'll need to pipe in the output of the find command and use the -n switch to iterate
through each filename. Each output line from "find" will arrive in $_ (the -l switch
strips the trailing newline). We can take the current filename $_, test that it is not a
directory, and add its file size to the hash value $t{YYYYMMDD}. All that can be done with
the following command line:

find .| perl -lne '$t{19000100+sprintf"%02d"x3,(localtime((stat)[9]))[5,4,3]}+=-s if !-d'

find .            the find command
$t{...}           hash key: last mod time of $_ in YYYYMMDD format
+=-s if !-d       add size of $_ if it isn't a directory

This will give us a nice hash, but it won't print anything out. Since the hash values will
change as the iterations progress, we don't want to print out values as we go. But we
still want the power of the -n switch to iterate over find's output. This can be solved by
introducing the END block.

BEGIN and END:

The BEGIN codeblock is code that is executed as soon as it is compiled, before the rest of
the script is even parsed. It is typically used to extend the module search path (@INC) so
Perl can find modules that aren't in the regular Perl path. For example:

BEGIN { push @INC, "$ENV{HOME}/my_modules"; }  # Perl does not expand ~ like the shell does
use my_cool_module;
...
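
The lib pragma is the more common way to do the same job these days. A minimal sketch,
assuming a module directory called my_modules under your home directory (the directory
name is only an illustration):

```perl
use strict;
use warnings;

# "my_modules" is a hypothetical directory name for this sketch.
# use lib runs at compile time, so it takes effect before any later
# "use" statement is processed, and it puts the directory at the
# front of @INC rather than at the end.
use lib "$ENV{HOME}/my_modules";

# Show the resulting search path, one directory per line.
print "$_\n" for @INC;
```

Putting the directory at the front means your own copy of a module wins over one
installed system-wide, which is usually what you want while developing.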

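In contrast, END codeblocks run as late as possible, just before the interpreter exits,
even if the script is exiting because of a die (but not if it is killed by a signal or
replaced by another program via exec). It is most often seen on the command line: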

perl -lne '$cnt++; END{print $cnt}' file.txt

This will iterate through each line of file.txt, adding one to $cnt for each line. When
finished, $cnt will be printed, effectively giving us a Perl line-counter.

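When you're pressed for space, you can get the same effect without the word "END". The -n
switch wraps your code in a while (<>) { ... } loop, so a closing brace followed by an
open brace ends that loop early and leaves the final statements in a bare block that runs
once, after the last line has been read. Like so: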

perl -lne '$cnt++;}{print $cnt' file.txt
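
To see why this works, it helps to write out roughly what Perl ends up running once -n has
wrapped its loop around the code. This is a sketch of that expansion, put inside a sub
(the name count_lines is made up) so it can be called on a named file:

```perl
use strict;
use warnings;

# count_lines is a hypothetical wrapper for this sketch; the braces
# inside it mirror what -n builds around '$cnt++;}{print $cnt'.
sub count_lines {
    local @ARGV = @_;      # <> reads the files named in @ARGV
    my $cnt = 0;
    while (<>) {           # opening of the loop, supplied by -n
        $cnt++;            # the code typed before "}{"
    }                      # the "}" typed in the one-liner closes the loop
    {                      # the "{" typed in the one-liner opens a bare block
        return $cnt;       # the one-liner prints $cnt here instead
    }                      # closing brace supplied by -n
}

# Usage: count the lines of the running script, if it exists as a file.
print count_lines($0), "\n" if -f $0;
```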

So now we have all the tools we need to create a one-liner to do metrics. Adding to the
ugly example from earlier, I'm adding a check for the /in/ directory (change to /out/ for
outbound counts), and at the end sorting the %t hash's keys and printing each key-value
pair:

find .|perl -lne'$t{19000100+sprintf"%02d"x3,(localtime((stat)[9]))[5,4,3]}+=-s if!-d && m!/in/!}{print"$_ $t{$_}"for sort keys%t'

Running this from a directory I use for archiving yields the following:

> find .|perl -lne'$t{19000100+sprintf"%02d"x3,(localtime((stat)[9]))[5,4,3]}+=-s if!-d && m!/in/!}{print"$_ $t{$_}"for sort keys%t'
20031023 163
20031105 2465
20031106 3377
20031107 3814
20031108 2636
20031111 3916
20031112 3016
20031113 3035
20031114 3764
20031115 2826
20031118 2579
20031119 3764
20031120 3605
20031121 3270
20031122 2928
20031125 2864
20031126 3251
20031127 3726
>

Ugly, but fast and portable, and I run it about once per week before archiving.

This last item is an extreme example of golfing, or trying to compress as much work into
as few characters as possible. I included it to show Perl's power, but it has an obvious
limitation in readability.

For regular scripting, though, readability is king. If it works, but no one can read it,
then it doesn't work.
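
For comparison, here is one way the same metrics job might look as a regular script. It is
only a sketch: the sub name tally_by_day and the script name tally.pl are made up, and I've
swapped the find pipe for the core File::Find module and the 19000100 arithmetic for
strftime, so it is not a character-for-character translation of the one-liner.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use POSIX qw(strftime);

# tally_by_day is a hypothetical name for this sketch. Given a list of
# paths, it returns a hash ref mapping YYYYMMDD (last modified date,
# local time) to the total size in bytes of the matching files.
sub tally_by_day {
    my @paths = @_;
    my %total;
    for my $path (@paths) {
        next unless $path =~ m{/in/};      # only paths under an /in/ directory
        my @st = stat($path) or next;      # skip anything that can't be stat'ed
        next if -d _;                      # "_" reuses the stat just done
        my $day = strftime("%Y%m%d", localtime $st[9]);   # field 9 is mtime
        $total{$day} += $st[7];                           # field 7 is size
    }
    return \%total;
}

# Usage: perl tally.pl .
# Collect every path under the directories named on the command line,
# then print one "date total" line per day in chronological order.
if (@ARGV) {
    my @paths;
    find({ no_chdir => 1, wanted => sub { push @paths, $File::Find::name } }, @ARGV);
    my $total = tally_by_day(@paths);
    print "$_ $total->{$_}\n" for sort keys %$total;
}
```

It is several times longer than the one-liner, but each step has a name and a comment,
and changing /in/ to /out/ no longer means squinting at punctuation.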
