Lab Assignment Solution and Comments, Foundations of Computing

Revised March 28. Use your browser's Reload or Refresh button to get the latest version.


Problem statement

The assignment was to write an awk program that checks file type. Here is the detailed problem statement.

A solution

Here is my solution, which I handed out in class:

# afile - file type detection program in Awk

# Check each of these patterns on every input line
# Use variables to remember whether we have seen each pattern anywhere in file
# Awk variables are initialized to FALSE (0 or ""), any other values mean TRUE

# HTML and Scheme use positive logic
# Make identification if pattern is found anywhere in file

/<html>|<HTML>/ { html = 1 }   # Found HTML file.  Any nonzero value means TRUE
/\(define \(/     { scheme = 1 } # Found Scheme file.  Use \ to escape (

# Numeric and Text require negative logic 
# Disqualify if forbidden character is found anywhere in file
# Use ^ complement operator in character class

/[^0-9$%*./^= ]/ { not_numeric = 1 } # Found non-numeric character
/[^\000-\177]/   { not_text = 1 }    # Found non-ASCII byte

# When we reach the end of the file, more than one variable might be TRUE.
# Use if-else (similar to Scheme cond) to establish precedence

END {
    if (html)
        print "HTML"
    else if (scheme)
        print "Scheme"
    else if (!not_numeric)
        print "Numeric"
    else if (!not_text)
        print "Text"
    else
        print "Other"
}

Scoring

I awarded up to four points for each solution, one point each for:

Other solutions

Many solutions were quite different from mine. Here are some interesting approaches.

Many solutions read the entire file into one big string and then, in the END section, checked that string for each pattern. With this method, it is not necessary to use variables to remember what was seen on each line. It looks like this:

# contents variable stores entire file contents in one string
# This rule is executed for each line, $0 is the entire line
# The awk string concatenation operator is juxtaposition
    { contents = contents $0 " " } 

# At END, check contents in precedence order
END {
       if (contents ~ /<html>|<HTML>/)  { print "HTML" }
       else ...
       ...
    }
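
A minimal sketch of how this approach might be completed, for readers who want to try it. The function name classify_whole is hypothetical, and the Scheme and Numeric patterns are assumptions borrowed from the handed-out solution above, not taken from any submitted program. The byte test that separates Text from Other is left out to keep the sketch short.

```shell
#!/bin/sh
# Sketch only: classify_whole is a hypothetical name, and the Scheme and
# Numeric patterns are borrowed from the handed-out solution above.
classify_whole() {
  awk '
    # Append every line to one string; juxtaposition concatenates.
    { contents = contents $0 " " }

    # At END, test the accumulated string in precedence order.
    END {
      if (contents ~ /<html>|<HTML>/)          print "HTML"
      else if (contents ~ /\(define \(/)       print "Scheme"
      else if (contents !~ /[^0-9$%*.\/^= ]/)  print "Numeric"
      else                                     print "Text or Other"  # byte test omitted
    }
  ' "$@"
}

printf '%s\n' '(define (square x)' '  (* x x))' | classify_whole   # prints Scheme
```

Joining lines with a space is harmless for the Numeric test because a space is one of the permitted characters.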

Some solutions used alternate (but equally effective) ways of expressing the logic needed to check for the numeric and text types.

/[^0-9]/ { print "This line contains characters that are not digits" }
# $0 is the entire input line, !~ is the not-match operator
$0 !~ /[^0-9]/ { print "This line is all digits" }
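
To see the two forms side by side, the sketch below runs both rules over three sample lines. The function name digit_report and the line-number prefix are illustrative additions. Note that an empty line counts as all digits under the second rule, because it contains no non-digit character.

```shell
#!/bin/sh
# Sketch only: digit_report is a hypothetical name.
digit_report() {
  awk '
    /[^0-9]/       { print NR ": has a non-digit" }   # a forbidden character was found
    $0 !~ /[^0-9]/ { print NR ": all digits" }        # no forbidden character anywhere
  ' "$@"
}

printf '%s\n' '12345' '12a45' '' | digit_report
# 1: all digits
# 2: has a non-digit
# 3: all digits
```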

Some solutions used exit where the file type could be classified immediately.

/<html>|<HTML>/ { print "HTML"; exit }
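
One point to watch with this approach: exit in a pattern-action rule stops reading input, but awk still runs the END rule afterwards. If the program also prints in END, it can print twice. The sketch below contrasts a naive version with one that records the decision in a variable. Both function names are hypothetical and neither is taken from a submitted solution.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical.

# Prints twice for an HTML file: exit skips the remaining input,
# but awk still runs the END rule afterwards.
classify_naive() {
  awk '
    /<html>|<HTML>/ { print "HTML"; exit }
    END             { print "Other" }
  ' "$@"
}

# Records the decision in a variable so that END prints exactly once.
classify_guarded() {
  awk '
    /<html>|<HTML>/ { type = "HTML"; exit }
    END             { print (type != "" ? type : "Other") }
  ' "$@"
}

printf '%s\n' '<html>' 'hello' | classify_naive     # prints HTML, then Other
printf '%s\n' '<html>' 'hello' | classify_guarded   # prints HTML
```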

Some errors

This section describes some of the errors people made.

Not checking the entire file

A file is not numeric (or text) if it contains even one non-numeric character (or non-text character). That character might appear at the end of the last line in the file, so you have to check the entire file. Any solution that exits or prints Numeric or Text before reading the entire file must be wrong.
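
A small sketch of this failure, using a three-line input whose only forbidden character is the last one. The function names are hypothetical, and the character class is the one from the handed-out solution with the slash escaped for portability.

```shell
#!/bin/sh
# Sketch only: numeric_early and numeric_full are hypothetical names.

# Wrong: decides after looking at the first line only.
numeric_early() {
  awk '
    {
      if ($0 ~ /[^0-9$%*.\/^= ]/) print "Not numeric"
      else                        print "Numeric"
      exit
    }
  ' "$@"
}

# Right: remembers any forbidden character and decides at END.
numeric_full() {
  awk '
    /[^0-9$%*.\/^= ]/ { not_numeric = 1 }
    END { print (not_numeric ? "Not numeric" : "Numeric") }
  ' "$@"
}

printf '%s\n' '1 2 3' '4 5 6' '7 8 x' | numeric_early   # prints Numeric (wrong)
printf '%s\n' '1 2 3' '4 5 6' '7 8 x' | numeric_full    # prints Not numeric
```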

Not checking the numeric file type correctly

The numeric type was allowed to contain digits, spaces, and a specific list of special characters including $, %, * etc. It was not sufficient to just check that the characters were not alphabetic, because not all non-alphabetic characters were permitted.
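
The difference shows up on a line such as 10 # 20, which has no letters but does contain a character outside the permitted set. In the sketch below the function names are hypothetical, and the permitted set is an assumption copied from the handed-out solution, so it may not match the full list in the assignment.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; the allowed set is copied
# from the handed-out solution and may not match the full assignment list.

# Too permissive: only rejects letters.
no_letters() {
  awk '
    /[A-Za-z]/ { bad = 1 }
    END { print (bad ? "Not numeric" : "Numeric") }
  ' "$@"
}

# Stricter: rejects anything outside the allowed set.
allowed_only() {
  awk '
    /[^0-9$%*.\/^= ]/ { bad = 1 }
    END { print (bad ? "Not numeric" : "Numeric") }
  ' "$@"
}

printf '%s\n' '10 # 20' | no_letters     # prints Numeric (wrong)
printf '%s\n' '10 # 20' | allowed_only   # prints Not numeric
```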

Incorrect use of logical operators

This does not work as intended:

if (afile == "HTML" || "Scheme" || "Numeric" || "Text")
   print afile
else
   print "Other"

In awk, any nonzero number or nonempty string means true, so "Scheme" and the other string constants always evaluate to true and the else branch can never be reached. Each operand of the || (or) operator must contain a complete test. This is the correct way to express what was probably intended:

if (afile == "HTML" || afile == "Scheme" || afile == "Numeric" || afile == "Text")
   ...
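
The two forms can be compared directly with a value that should fall through to Other. This is a sketch only: the function names are hypothetical, and the value Binary is just an arbitrary string that is none of the four types.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical.

# Buggy form: "Scheme" alone is a nonempty string, so the condition is always true.
check_buggy() {
  awk -v afile="$1" 'BEGIN {
    if (afile == "HTML" || "Scheme" || "Numeric" || "Text")
      print afile
    else
      print "Other"
  }'
}

# Fixed form: every operand of || is a complete comparison.
check_fixed() {
  awk -v afile="$1" 'BEGIN {
    if (afile == "HTML" || afile == "Scheme" || afile == "Numeric" || afile == "Text")
      print afile
    else
      print "Other"
  }'
}

check_buggy Binary   # prints Binary (the else branch is never reached)
check_fixed Binary   # prints Other
```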

Incorrect or redundant logic in if ... else if ...

The last else branch in a cascaded if ... else if ... else ... should be the default action that is executed when all the previous if tests are false. There should not be another if after the last else.

In this case the final if that tests for Text is redundant, because the test for Other at the top has already ruled out any character outside the text range:

if (input ~ /[^\000-\176]/) { print "Other" }
else 
...
else 
if (input ~ /[\000-\176]/) { print "Text" }

The previous example is not incorrect, but this would be sufficient:

if (input ~ /[^\000-\176]/) { print "Other" }
else 
...
else { print "Text" }

Obscure combinations of patterns and operators in if ...

These are not errors, but I found them hard to understand.

Several solutions contained this rule:

if (input !~ /[A-Za-z]/ && /[0-9]/) { print "Numeric" }

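It prints Numeric when input contains no alphabetic characters and the current line $0 contains at least one digit. It works because the match operators ~ and !~ bind tighter than && (see the precedence table on p. 46 of the Awk book), so the condition is parsed as (input !~ /[A-Za-z]/) && (/[0-9]/), and the bare pattern /[0-9]/ is matched against $0.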
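
A small experiment can show how awk parses this condition. The sketch compares the original condition with a fully parenthesized form in which !~ is applied first and the bare /[0-9]/ is read as $0 ~ /[0-9]/. The function name show_parse is hypothetical, and input is set by hand from the command line instead of being built up from the file.

```shell
#!/bin/sh
# Sketch only: show_parse is a hypothetical name. It sets input by hand and
# evaluates the condition once per line read, so $0 is the current line.
show_parse() {
  awk -v input="$1" '{
    implicit = (input !~ /[A-Za-z]/ && /[0-9]/)
    explicit = ((input !~ /[A-Za-z]/) && ($0 ~ /[0-9]/))
    print implicit, explicit
  }'
}

echo '42'  | show_parse '12 34'   # prints 1 1
echo 'abc' | show_parse '12 34'   # prints 0 0: the digit test looks at $0, not input
echo '42'  | show_parse 'x12'     # prints 0 0: input contains a letter
```

The second case is the one to notice: input is all digits and spaces, yet the condition is false because the current line has no digit.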

Another solution used a pattern in an if:

if (/<html>/ || /<HTML>/) { x = "HTML"; exit }

I would have thought a match operator ~ was needed. Apparently this matches against the whole input line $0 implicitly.
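
This can be checked by evaluating the bare form and the explicit $0 ~ form on the same lines. The function name bare_vs_explicit is hypothetical, and the sketch is only a quick check on two sample lines.

```shell
#!/bin/sh
# Sketch only: bare_vs_explicit is a hypothetical name.
bare_vs_explicit() {
  awk '{
    bare     = (/<html>/ || /<HTML>/)               # patterns with no match operator
    explicit = ($0 ~ /<html>/ || $0 ~ /<HTML>/)     # same test written out in full
    print bare, explicit
  }' "$@"
}

printf '%s\n' 'plain text' '<HTML>' | bare_vs_explicit
# 0 0
# 1 1
```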

Test data

I did not require that test data and sample test runs be handed in, but of course you should have tested your program. Some obviously incorrect solutions would have been exposed by a simple test.

Some solutions came with sample test runs on simple one-line test cases that would not have been sufficient to expose errors.

An adequate set of test runs would include:

Checking a whole directory

We can invoke afile from a shell script to handle a whole directory at once:

#!/bin/sh
# afile-loop: invoke afile for all the files in a directory
# $1 is first command line argument, should be a directory
# afile script must be in working directory when you run this command
for f in "$1"/*
do
  if [ -d "$f" ]  # test if $f is a directory - Awk chokes on directories
  then
    echo "$(basename "$f") is a directory"
  else
    echo "$(basename "$f") is $(awk -f afile < "$f")"
  fi
done

Here it is in action:

$ ./afile-loop /usr/java/jdk1.3
COPYRIGHT is Text
LICENSE is Text
README is Other
README.html is HTML
bin is a directory
demo is a directory
.. etc. ...
man is a directory
src.jar is Other