In the modern world of fawningly friendly e-retailing, cookies play an essential role in allowing web sites to recognize previous users and to greet them like long-lost, rich, childless uncles. Cookies offer the webmaster a way of remembering her visitors. The cookie is a bit of text, often containing a unique ID number, that is contained in the HTTP header. You can get Apache to concoct and send it automatically, but it is not very hard to do it yourself, and then you have more control over what is happening. You can also get Perl modules to help: CGI.pm and CGI::Cookie. But, as before, we think it is better to start as close as you can to the raw material.
The client's browser keeps a list of cookies and web sites. When the user goes back to a web site, the browser will automatically return the cookie, provided it hasn't expired. If a cookie does not arrive in the header, you, as webmaster, might like to assume that this is a first visit. If there is a cookie, you can tie up the site name and ID number in the cookie with any data you stored the last time someone visited you from that browser. For instance, when we visit Amazon, a cozy message appears: "Welcome back Peter — or Ben — Laurie," because our browser has returned the cookie Amazon sent us on our last visit, and the Amazon system recognizes it.
A cookie is a text string. Its minimum content is Name=Value, and these can be anything you like, except semicolon, comma, or whitespace. If you absolutely must have these characters, use URL encoding (described earlier: "&" becomes "%26", etc.). A useful sort of cookie would be something like this:
Butterthlies=8335562231
Butterthlies identifies the web site that issued it — necessary on a server that hosts many sites. 8335562231 is the ID number assigned to this visitor on his last visit. To prevent hackers upsetting your dignity by inventing cookies that turn out to belong to other customers, you need to generate a rather large random number from an unguessable seed,[57] or protect them cryptographically.
[57]See Larry Wall, Jon Orwant, and Tom Christiansen's Programming Perl (O'Reilly, 2000): "srand" p. 224.
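For the random number, a minimal sketch in Perl (the seed expression is the one suggested in the srand discussion cited in the footnote; a serious site would use a cryptographic source instead):

# Seed the generator from hard-to-guess process data, then
# make a visitor ID of up to ten digits.
srand(time() ^ ($$ + ($$ << 15)));
my $id = int(rand(9_999_999_999));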
These are other possible fields in a cookie:

expires=date
The date, in the form Wdy, DD-Mon-YYYY HH:MM:SS GMT, after which the browser discards the cookie.

domain=domain_name
The domain for which the cookie is valid.

path=path
The subset of URLs on the site for which the cookie is valid.

secure
The cookie will only be returned over a secure (HTTPS) connection.

A note on expiry dates: Mozilla 3.x and up understands two-digit years up to "37" (2037). Mozilla 4.x understands at least up to "50" (2050) in two-digit form, but also understands four-digit years, which can probably reach up to 9999. Your best bet for sending a long-life cookie is to give it an expiry date late in the year "37".
The fields are separated by semicolons, thus:
Butterthlies=8335562231; expires=Mon, 27-Apr-2020 13:46:11 GMT
An incoming cookie appears in the Perl variable $ENV{'HTTP_COOKIE'}. If you are using CGI.pm, you can get it dissected automatically; otherwise, you need to take it apart using the usual Perl tools, identify the user and do whatever you want to do to it.
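For instance, a minimal sketch of dissecting the cookie by hand (the hash name %cookies is our own choice):

# Split the incoming cookie string into name=value pairs.
my %cookies;
foreach my $pair (split /;\s*/, $ENV{'HTTP_COOKIE'} || '') {
    my ($name, $value) = split /=/, $pair, 2;
    $cookies{$name} = $value;
}
# $cookies{'Butterthlies'} now holds this visitor's ID, if any.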
To send a cookie, you write it into the HTTP header, with the prefix Set-Cookie:
Set-Cookie: Butterthlies=8335562231;expires=Mon, 27-Apr-2020 13:46:11 GMT
And don't forget the terminating \n on the line, plus the extra \n that completes the HTTP headers.
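In Perl, the whole header might be printed like this (a sketch, using the values from the example above):

print "Content-Type: text/html\n";
print "Set-Cookie: Butterthlies=8335562231; expires=Mon, 27-Apr-2020 13:46:11 GMT\n";
print "\n";    # the blank line that ends the headers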
It has to be said that some people object to cookies — but do they mind if the bartender recognizes them and pours a Bud when they go for a beer? Some sites find it worthwhile to announce in their Privacy Statement that they don't use them.
But you can, if you wish, get Apache to handle the whole thing for you with the directives that follow. In our opinion, Apache cookies are really only useful for tracking visitors through the site — for after-the-fact log file analysis.
To recapitulate: if a site is serving cookies and it gets a request from a user whose browser doesn't send one, the site will create one and issue it. The browser will then store the cookie for as long as CookieExpires allows (see later) and send it every time the user goes to your URL.
However, all Apache does is store the user's cookie in the appropriate log. You have to discover that it's there and do something about it. This will necessarily involve a script (and quite an awkward one too since it has to trawl the log files), so you might just as well do the whole cookie thing in your script and leave these directives alone: it will probably be easier.
CookieName

CookieName name
Server config, virtual host, directory, .htaccess
CookieName allows you to set the name of the cookie served out. The default name is Apache. The new name can contain the characters A-Z, a-z, 0-9, _, and -.
CookieLog

CookieLog filename
Server config, virtual host
CookieLog sets a filename relative to the server root for a file in which to log the cookies. It is more usual to configure a field with LogFormat and catch the cookies in the central log (see Chapter 10).
CookieTracking

CookieTracking [on|off]
Server config, virtual host, directory, .htaccess

CookieTracking on switches on the sending and logging of cookies; it is off by default.

CookieExpires

CookieExpires expiry-period
Server config, virtual host
This directive sets an expiration time on the cookie. Without it, the cookie has no expiration date — not even a very faraway one — and this means that it evaporates at the end of the session. The expiry-period can be given as a number of seconds or in a format such as "2 weeks 3 days 7 hours". If the second format is used, the string must be enclosed in double quotes. Valid time periods are as follows:
years
months
weeks
days
hours
minutes
seconds
The Config file is as follows:
User webuser
Group webgroup
ServerName my586
DocumentRoot /usr/www/APACHE3/site.first/htdocs
TransferLog logs/access_log
CookieName "my_apache_cookie"
CookieLog logs/CookieLog
CookieTracking on
CookieExpires 10000
In the log file we find:
192.168.123.1.5653981376312508 "GET / HTTP/1.1" [05/Feb/2001:12:31:52 +0000]
192.168.123.1.5653981376312508 "GET /catalog_summer.html HTTP/1.1" [05/Feb/2001:12:31:55 +0000]
192.168.123.1.5653981376312508 "GET /bench.jpg HTTP/1.1" [05/Feb/2001:12:31:55 +0000]
192.168.123.1.5653981376312508 "GET /tree.jpg HTTP/1.1" [05/Feb/2001:12:31:55 +0000]
192.168.123.1.5653981376312508 "GET /hen.jpg HTTP/1.1" [05/Feb/2001:12:31:55 +0000]
192.168.123.1.5653981376312508 "GET /bath.jpg HTTP/1.1" [05/Feb/2001:12:31:55 +0000]
From time to time a CGI script needs to send someone an email. If it's via a link selected by the user, use the HTML construct:
<A HREF="mailto:administrator@butterthlies.com">Click here to email the administrator</A>
The user's normal email system will start up, with the address inserted.
If you want an email to be sent automatically, without the client's collaboration or even her knowledge, then use the Unix sendmail program (see man sendmail). To call it from Perl (A is an arbitrary filehandle):
open A, "| sendmail -t" or die "couldn't open sendmail pipe $!";
A Win32 equivalent to sendmail seems to be at http://pages.infinit.net/che/blat/blat_f.html. However, the pages are in French. To download, click on "ici" ("here") in the line:

Une version récente est ici. ("A recent version is here.")
Alternatively, and possibly safer to use, there is the CPAN Mail::Mailer module.
The format of an email is pretty well what you see when you compose one via Netscape or MSIE: addressee, copies, subject, and message appear on separate lines; they are written separated by \n. You would put the message into a Perl variable like this:
$msg=qq(To: fred\@hissite.com\nCC: bill\@elsewhere.com\nSubject: party tonight\n\nBe at Jane's by 8.00\n);
Notice the double \n at the end of the email header. When the message is all set up, it reads:
print A $msg;
close A or die "couldn't send email $!";
and away it goes.
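Alternatively, the same message sent via Mail::Mailer might look like this (a sketch; the 'sendmail' transport is one of several the module supports):

use Mail::Mailer;

my $mailer = Mail::Mailer->new('sendmail');
$mailer->open({
    To      => 'fred@hissite.com',
    Cc      => 'bill@elsewhere.com',
    Subject => 'party tonight',
});
print $mailer "Be at Jane's by 8.00\n";    # the message body
$mailer->close or die "couldn't send email: $!";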
Most webmasters will be passionately anxious that their creations are properly indexed by the search engines on the Web, so that the teeming millions may share the delights they offer. At the time of writing, the search engines were coming under a good deal of criticism for being slow, inaccurate, arbitrary, and often plain wrong. One of the more serious criticisms alleged that sites that offered large numbers of separate pages produced by scripts from databases (in other words, most of the serious e-commerce sites) were not being properly indexed. According to one estimate, only 1 page in 500 would actually be found. This invisible material is often called "The Dark Web."
The Netcraft survey of June 2000 visited about 16 million web sites. At the same time Google claimed to be the most comprehensive search engine with 2 million sites indexed. This meant that, at best, only one site in eight could then be found via the best search engine. Perhaps wisely, Google now does not claim a number of sites. Instead it claims (as of August, 2001) to index 1,387,529,000 web pages. Since the Netcraft survey for July 2001 showed 31 million sites (http://www.netcraft.com/Survey/Reports/200107/graphs.html), the implication is that the average site has only 44 pages — which seems too few by a long way and suggests that a lot of sites are not being indexed at all.
The reason seems to be that the search engines spend most of their time and energy fighting off "spam" — attempts to get pages better ratings than they deserve. The spammers used CGI scripts long before databases became prevalent on the Web, so the search engines developed ways of detecting scripts. If their suspicions were triggered, suspect sites would not be indexed. No one outside the search-engine programming departments really knows the truth of the matter — and they aren't telling — but the mythology is that they don't like URLs that contain characters such as "!" or "?", the word "cgi-bin", or the like.
Several commercial development systems betray themselves like this, but if you write your own scripts and serve them up with Apache, you can produce pages that cannot be distinguished from static HTML. Working with script2_html and the corresponding Config file shown earlier, the trick is this:
Remove cgi-bin/ from HREF or ACTION statements. We now have, for instance:
<A HREF="/script2_html/whole_database">Click here to see whole database</A>
Add the line:
ScriptAliasMatch /script(.*) /usr/www/APACHE3/APACHE3/cgi-bin/script$1
to your Config file. The effect is that any URL that begins with /script is caught. The odd-looking (.*) is a Perl regular-expression construct, borrowed by Apache, and means "remember all the characters that follow the word script". They reappear in the variable $1 and are tacked onto /usr/www/APACHE3/APACHE3/cgi-bin/script.
As a result, when you click the link, the URL that gets executed, and which the search engines see, is http://www.butterthlies.com/script2_html/whole_database. The fatal words cgi-bin have disappeared, and there is nothing to show that the page returned is not static HTML. Well, apart from the perhaps equally fatal words script or database, which might give the game away . . . but you get the idea.
Another search-engine problem is that most of them cannot make their way through HTML frames. Since many web pages use them, this is a worry and makes one wonder whether the search engines are living in the same time frame as the rest of us. The answer is to provide a cruder home page, with links to all the pages you want indexed, in a <NOFRAMES> area. See your HTML reference book. A useful tool is a really old browser that also does not understand frames, so you can see your pages the way the search engines do. We use a Win 3.x copy of NCSA's Mosaic (download it from http://www.ncsa.uiuc.edu).
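The skeleton of such a page might look like this (a sketch; the frame and link names are partly invented for illustration):

<FRAMESET COLS="25%,75%">
<FRAME SRC="menu.html">
<FRAME SRC="main.html">
</FRAMESET>
<NOFRAMES>
<P>Welcome to Butterthlies Inc. Our pages:</P>
<A HREF="/catalog_summer.html">Summer catalog</A>
<A HREF="/catalog_autumn.html">Autumn catalog</A>
</NOFRAMES>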
The <NOFRAMES> tag will tend to pick out the search engines, but it is not infallible. A more positive way to detect their presence is to watch to see whether the client tries to open the file robots.txt. This is a standard filename that contains instructions to spiders to keep them to the parts of the site you want. See the tutorial at http://www.searchengineworld.com/robots/robots_tutorial.htm. The RFC is at http://www.robotstxt.org/wc/norobots-rfc.html. If the visitor goes for robots.txt, you can safely assume that it is a spider and serve up a simple dish.
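The file itself is simple. This one, for instance, asks all well-behaved spiders to keep out of /cgi-bin:

User-agent: *
Disallow: /cgi-bin/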
The search engines all have their own quirks. Google, for instance, ranks a site by the number of other pages that link to it — which is democratic but tends to hide the quirky bit of information that just interests you. The engines come and go with dazzling rapidity, so if you are in for the long haul, it is probably best to register your site with the big ones and forget about the whole problem. One of us (PL) has a medical encyclopedia (http://www.medic-planet.com). It logs the visits of search engines. After a heart-stopping initial delay of about three months when nothing happened, it now gets visits from several spiders every day and gets a steady flow of visitors that is remarkably constant from month to month.
If you want to make serious efforts to seduce the search engines, look for further information at http://searchengineforms.com and http://searchenginewatch.com.
Debugging CGI scripts can be tiresome because until they are pretty well working, nothing happens on the browser screen. If possible, it is a good idea to test a script every time you change it by running it locally from the command line before you invoke it from the Web. Perl will scan it, looking for syntax errors before it tries to run it. These error reports, which you will find repeated in the error log when you run under Apache, will save you a lot of grief.
Similarly, try out your MySQL calls from the command line to make sure they work before you embed them in a script.
Keep an eye on the Apache error log: it will often give you a useful clue, though it can also be bafflingly silent even though things are clearly going wrong. A common cause of silent misbehavior is a bad call to MySQL: the DBI call never returns, so your script hangs without an explanation in the error log.
As long as you have printed an HTTP header, something (but not necessarily what you want) will usually appear on the browser screen. You can use this fact to debug your scripts, by printing variables or by putting print markers (GOT TO 1<BR>, GOT TO 2<BR>, and so on) through the code so that you can find out where it goes wrong. (<BR> is the HTML tag for a line break.) This doesn't always work, because these debugging messages may appear in weird places on the screen — or not at all — depending on how thoroughly you have confused the browser. You can also print to error_log from your script:
print STDERR "thing\n";
or use:
warn "thing\n";
If you have an HTML document that sets up frames, and your script prints anything else on the same page, the frames will not appear. This can be really puzzling.
You can see the HTML that was actually sent to the browser by putting the cursor on the page, right-clicking the mouse, and selecting View Source (or similar, depending on your flavor of browser).
When working with a database, it is often useful to print out the $query variable before the database is accessed. It is worth remembering that although scripts that invoke MySQL will often run from the command line (with various convincing error messages caused by variables not being properly set up), if queries go wrong when the script is run by Apache, they tend to hang without necessarily writing anything to error_log. Often the problem is caused by getting the quote marks wrong or by invoking incorrect field names in the query.
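For instance (a sketch, assuming a DBI handle $dbh and a variable $item have already been set up; the table and field names are invented):

my $query = "SELECT price FROM catalog WHERE item = '$item'";
print STDERR "about to run: $query\n";    # turns up in error_log
my $sth = $dbh->prepare($query) or die $dbh->errstr;
$sth->execute() or die $sth->errstr;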
A common, but enigmatic, message in error_log is: Premature end of script headers. This signals that the HTTP header went wrong and can be caused by several different mistakes:
Your script refused to run at all. Run it from the command line and correct any Perl errors. Try making it executable with chmod +x <scriptname>.
Your script has the wrong permissions to run under Apache.
The HTTP headers weren't printed, or the final \n was left off them.
It generated an error before printing headers — look above in the error log.
Occasionally, these simple tricks do not work, and you need to print variables to a file to follow what is going on. If you print your error messages to STDERR, they will appear in the error log. Alternatively, if you want errors printed to your own file, remember that any program executed by Apache belongs to the useless webuser, and it can only write files without permission problems in webuser's home directory. You can often elicit useful error messages by using:
open B, ">>/home/webserver/script_errors" or die "couldn't open: $!";
print B "got to line 45\n";    # or print whatever variables you need to see
close B;
Sometimes you have to deal with a bit of script that prints no page. For instance, when WorldPay (described in Chapter 12) has finished with a credit card transaction, it can call a link to your web site again. You probably will want the script to write the details of the transaction to the database, but there is no browser to print debugging messages. The only way out is to print them to a file, as earlier.
If you are programming your script in Perl, the CGI::Carp module can be helpful. However, most other languages[58] that you might want to use for CGI do not have anything so useful.
[58]We'll include ordinary shell scripts as "languages," which, in many senses, they are.
If you are programming in a high-level language and want to run a debugger, it is usually impossible to do so directly. However, it is possible to simulate the environment in which an Apache script runs. The first thing to do is to become the user that Apache runs as. Then, remember that Apache always runs a script in the script's own directory, so go to that directory. Next, Apache passes most of the information a script needs in environment variables. Determine what those environment variables should be (either by thinking about it or, more reliably, by temporarily replacing your CGI with one that executes env, as illustrated earlier), and write a little script that sets them then runs your CGI (possibly under a debugger). Since Apache sets a vast number of environment variables, it is worth knowing that most CGI scripts use relatively few of them — usually only QUERY_STRING (or PATH_INFO, less often). Of course, if you wrote the script and all its libraries, you'll know what it used, but that isn't always the case. So, to give a concrete example, suppose we wanted to debug some script written in C. We'd go into .../cgi-bin and write a script called, say, debug.cgi, that looked something like this:
#!/bin/sh
QUERY_STRING='2315_order=20&2316_order=10&card_type=Amex'
export QUERY_STRING
gdb mycgi
We'd run it by typing:
chmod +x debug.cgi
./debug.cgi
Once gdb came up, we'd hit r<CR>, and the script would run.[59]
[59]Obviously, if we really wanted to debug it, we'd set some breakpoints first.
A couple of things may trip you up here. The first is that if the script expects the POST method — that is, if REQUEST_METHOD is set to POST — the script will (if it is working correctly) expect the QUERY_STRING to be supplied on its standard input rather than in the environment. Most scripts use a library to process the query string, so the simple solution is to not set REQUEST_METHOD for debugging, or to set it to GET instead. If you really must use POST, then the script would become:
#!/bin/sh
REQUEST_METHOD=POST
export REQUEST_METHOD
mycgi << EOF
2315_order=20&2316_order=10&card_type=Amex
EOF
Note that this time we didn't run the debugger, for the simple reason that the debugger also wants input from standard input. To accommodate that, put the query string in some file, and tell the debugger to use that file for standard input (in gdb 's case, that means type r < yourfile).
The second tricky thing occurs if you are using Perl and the standard Perl module CGI.pm. In this case, CGI helpfully detects that you aren't running under Apache and prompts for the query string. It also wants the individual items separated by newlines instead of ampersands. The simple solution is to do something very similar to the solution to the POST problem we just discussed, except with newlines.
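So an offline test of a CGI.pm-based script might look like this (a sketch along the lines of the POST example above; mycgi.pl stands in for your script):

#!/bin/sh
mycgi.pl << EOF
2315_order=20
2316_order=10
card_type=Amex
EOF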
Security should be the sensible webmaster's first and last concern. This list of questions, all of which you should ask yourself, is from Sysadmin: The Journal for Unix System Administrators, at http://www.samag.com/current/feature.shtml. See also Chapter 11 and Chapter 12.
Is all input parsed to ensure that the input is not going to make the CGI script do something unexpected? Is the CGI script eliminating or escaping shell metacharacters if the data is going to be passed to a subshell? Is all form input being checked to ensure that all values are legal? Is text input being examined for malicious HTML tags? (A sketch of this kind of check follows the list.)
Is the CGI script starting subshells? If so, why? Is there a way to accomplish the same thing without starting a subshell?
Is the CGI script relying on possibly insecure environment variables such as PATH?
If the CGI script is written in C, or another language that doesn't support safe string and array handling, is there any case in which input could cause the CGI script to write past the end of a buffer or array?
If the CGI script is written in Perl, is taint checking being used?
Is the CGI script SUID or SGID? If so, does it really need to be? If it is running as the superuser, does it really need that much privilege? Could a less privileged user be set up? Does the CGI script give up its extra privileges when no longer needed?
Are there any programs or files in CGI directories that don't need to be there or should not be there, such as shells and interpreters?
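As an example of the input checking asked about in the first item, a minimal sketch in Perl (the permitted character set is our own choice and would depend on the field):

use CGI qw(param);

my $input = param('name') || '';
# Allow only letters, digits, underscore, space, and a little
# punctuation; refuse everything else before it can reach a subshell.
unless ($input =~ /^[\w .,-]*$/) {
    print "Content-Type: text/html\n\n";
    print "<P>Sorry: illegal characters in input.</P>\n";
    exit;
}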
Perl can help. Put this at the top of your scripts:
#! /usr/local/bin/perl -w -T
use strict;
....
The -w flag to Perl prints various warning messages at runtime. -T switches on taint checking, which prevents the malicious program the Bad Guys send you disguised as data from doing anything bad. The line use strict checks that your variables are properly declared.
On security questions in general, you might like to look at Lincoln Stein's well regarded "Secure CGI FAQ" at http://www-genome.wi.mit.edu/WWW/faqs/www-security-faq.html.