UNIX Unleashed, Part 3
Part III — Networking with NetWare
Awk, Awk
Perl
The C Programming Language
15
Awk, Awk
By Ann Marshall
Overview
Uses
Features
Brief History
Fundamentals
Entering Awk from the Command Line
Files for Input
The Program File
Specifying Output on the Command Line
Patterns and Actions
Input
Fields
Program Format
A Note on awk Error Messages
Print Selected Fields
Program Components
The Input File and Program
Patterns
BEGIN and END
Expressions
String Matching
Range Patterns
Compound Patterns
Actions
Variables
Naming
Awk in a Shell Script
Built-in Variables
Conditions (No IFs, &&s or buts)
The if Statement
The Conditional Statement
Patterns as Conditions
Loops
Increment and Decrement
The While Statement
The Do Statement
The For Statement
Loop Control
Strings
Built-In String Functions
String Constants
Arrays
Array Specialties
Arithmetic
Operators
Numeric Functions
Input and Output
Input
The Getline Statement
Output
The printf Statement
Closing Files and Pipes
Command Line Arguments
Passing Command Line Arguments
Setting Variables on the Command Line
Functions
Function Definition
Parameters
Variables
Function Calls
The Return Statement
Writing Reports
BEGIN and END Revisited
The Built-in System Function
Advanced Concepts
Multi-Line Records
Multidimensional Arrays
Summary
Further Reading
Obtaining Source Code
15
Awk, Awk
By Ann Marshall
Overview
The UNIX utility awk is a pattern matching and processing language with considerably
more power than you may realize. It searches one or more specified files, checking for
records that match a specified pattern. If awk finds a match, the corresponding action is
performed. A simple concept, but it results in a powerful tool. Often an awk program is
only a few lines long, and because of this, an awk program is often written, used, and
discarded. A traditional programming language, such as Pascal or C, would take more
thought, more lines of code, and hence, more time. Short awk programs arise from two of
its built-in features: the amount of predefined flexibility and the number of details that are
handled by the language automatically. Together, these features allow the manipulation
of large data files in short (often single-line) programs, and make awk stand apart from
other programming languages. Certainly any time you spend learning awk will pay
dividends in improved productivity and efficiency.
Uses
The uses for awk vary from the simple to the complex. Originally awk was intended for
various kinds of data manipulation. Intentionally omitting parts of a file, counting
occurrences in a file, and writing reports are naturals for awk.
Awk uses the syntax of the C programming language, so if you know C, you have an idea
of awk syntax. If you are new to programming or don't know C, learning awk will
familiarize you with many of the C constructs.
Examples of where awk can be helpful abound. Computer-aided manufacturing, for
example, is plagued with nonstandardization, so the output of a computer that's running a
particular tool is quite likely to be incompatible with the input required for a different
tool. Rather than write any complex C program, this type of simple data transformation is
a perfect awk task.
One real problem of computer-aided manufacturing today is that no standard format yet
exists for the program running the machine. Therefore, the output from Computer A
running Machine A probably is not the input needed for Computer B running Machine B.
Although Machine A is finished with the material, Machine B is not ready to accept it.
Production halts while someone edits the file so it meets Computer B's needed format.
This is a perfect and simple awk task.
Due to the amount of built-in automation within awk, it is also useful for rapid
prototyping or trying out an idea that could later be implemented in another language.
Features
Reflecting the UNIX environment, awk's features resemble the structures of both C and
shell scripts. Highlights include its flexibility, predefined variables, automation,
standard program constructs, conventional variable types, powerful output formatting
borrowed from C, and ease of use.
The flexibility means that most tasks may be done more than one way in awk. With the
application in mind, the programmer chooses which method to use. The built-in
variables already provide many of the tools to do what is needed. Awk is highly
automated. For instance, awk automatically retrieves each record, separates it into fields,
and does type conversion when needed without programmer request. Furthermore, there
are no variable declarations. Awk includes the "usual" programming constructs for the
control of program flow: an if statement for two way decisions and do, for and while
statements for looping. Awk also includes its own notational shorthand to ease typing.
(This is UNIX after all!) Awk borrows the printf() statement from C to allow "pretty" and
versatile formats for output. These features combine to make awk user friendly.
Brief History
Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan created awk in 1977. (The
name is from the creators' last initials.) In 1985, more features were added, creating nawk
(new awk). For quite a while, nawk remained exclusively the property of AT&T Bell
Labs. Although it became part of System V for Release 3.1, some versions of UNIX, like
SunOS, keep both awk and nawk due to a syntax incompatibility. Others, like System V,
run nawk under the name awk (although System V has nawk too). The Free Software
Foundation's GNU project introduced its own version of awk, gawk, based on the IEEE POSIX
awk standard (Institute of Electrical and Electronics Engineers, Inc., IEEE Standard for
Information Technology, Portable Operating System Interface, Part 2: Shell and Utilities
Volume 2, ANSI approved 4/5/93), which differs from both awk and nawk. Linux, a freely
distributed UNIX-like system for PCs, uses gawk rather than awk or nawk. Throughout this
chapter I have used the word awk when any of the three will do. The versions are mostly
upwardly compatible. Awk is the oldest, then nawk, then POSIX awk, then gawk, as
shown below. I have used the notation version++ to denote a concept that began in that
version and continues through any later versions.
NOTE: Because of syntax differences, awk code is not guaranteed to run unchanged under nawk.
However, except as noted, all the concepts of awk are implemented in nawk (and gawk).
Where it matters, I have specified the version.
Figure 15.1. The evolution of awk.
Refer to the end of the chapter for more information and further resources on awk and its
derivatives.
Fundamentals
This section introduces the basics of the awk programming language. Although my
discussion first skims the surface of each topic to familiarize you with how awk
functions, later sections of the chapter go into greater detail. One feature of awk that
almost continually holds true is this: you can do most tasks more than one way. The
command line exemplifies this. First, I explain the variety of ways awk may be called
from the command line—using files for input, the program file, and possibly an output
file. Next, I introduce the main construct of awk, which is the pattern action statement.
Then, I explain the fundamental ways awk can read and transform input. I conclude the
section with a look at the format of an awk program.
Entering Awk from the Command Line
In its simplest form, awk takes the material you want to process from standard input and
displays the results to standard output (the monitor). You write the awk program on the
command line. The following table shows the various ways you can enter awk and input
material for processing.
You can either specify explicit awk statements on the command line, or, with the -f flag,
specify an awk program file that contains a series of awk commands. In addition to the
standard UNIX design allowing for standard input and output, you can, of course, use file
redirection in your shell, too, so awk 'program' < inputfile gives the same results as
awk 'program' inputfile (where 'program' stands for your awk statements).
To save the output in a file, again use file redirection: awk 'program' > outputfile does the trick.
Helpfully, awk can work with multiple input files at once if they are specified on the
command line.
The most common use of awk is as part of a command pipe, where it
filters the output of another command. An example is ls -l | awk '{print $3}', which prints
just the third column of each line of the ls output. Awk scripts can become quite
complex, so if you have a standard set of filter rules that you'd like to apply to a file, with
the output sent directly to the printer, you could use something like awk -f myawkscript
inputfile | lp.
TIP: If you opt to specify your awk script on the command line, you'll find it
best to use single quotes to let you use spaces and to ensure that the command shell
doesn't falsely interpret any portion of the command.
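As a small illustration of awk used as a filter in a pipe, the sketch below sends three made-up lines (standing in for the output of a command such as ls -l) through a single-quoted awk program that keeps only the third column. The sample data and the shell variable name out are assumptions for this example only.

```shell
# Sample lines stand in for the output of some other command (hypothetical data).
out=$(printf '%s\n' 'a1 b1 c1' 'a2 b2 c2' 'a3 b3 c3' |
      awk '{print $3}')   # single quotes keep the shell from expanding $3
echo "$out"
```

Without the single quotes, the shell would try to expand $3 itself before awk ever saw the program.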
Files for Input
These input and output places can be changed if desired. You can specify an input file by
typing the name of the file after the program with a blank space between the two. If you
name no input file, the input enters the awk environment from your workstation keyboard
(standard input). To signal the end of the input, type Ctrl+d. The program on the command line
executes on the input you just entered and the results are displayed on the monitor
(the standard output).
Here's a simple little awk command that echoes all lines I type, prefacing each with the
number of words (or fields, in awk parlance, hence the NF variable for number of fields)
in the line. (Note that Ctrl+d means that while holding down the Control key you should
press the d key).
$ awk '{print NF ": " $0}'
I am testing my typing.
A quick brown fox jumps when vexed by lazy ducks.
Ctrl+d
5: I am testing my typing.
10: A quick brown fox jumps when vexed by lazy ducks.
$ _
You can also name more than one input file on the command line, causing the combined
files to act as one input. Naming the same file twice is one way of making multiple passes
through one input file.
TIP: Keep in mind that the correct ordering on the command line is crucial for
your program to work correctly: files are read from left to right, so if you want to have
file1 and file2 read in that order, you'll need to specify them as such on the command
line.
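The following sketch, under the assumption of two small scratch files created just for the demonstration (file1 and file2 are hypothetical names), shows how files named on the command line are read left to right as one continuous input. NR is awk's running record count.

```shell
dir=$(mktemp -d)                       # scratch directory for the demo
printf 'one\ntwo\n' > "$dir/file1"     # hypothetical input files
printf 'three\n'    > "$dir/file2"
# NR keeps counting across both files, so they behave as one input.
out=$(awk '{print NR ": " $0}' "$dir/file1" "$dir/file2")
echo "$out"
rm -r "$dir"
```

Swapping the two file names on the command line would put the line from file2 first.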
The Program File
With awk's automatic type conversion, a file of names and a file of numbers entered in
the reverse order at the command line generate strange-looking output rather than an
error message. For longer programs, it is simpler to put the program in a file
and specify the name of the file on the command line. The -f option does this. Notice that
the program file name must follow -f immediately; any input files still come last on
the command line.
NOTE: Versions of awk that meet the POSIX awk specifications are allowed to
have multiple -f options. The program files are read in the order given and treated as one
program, so you can use this to run several program files against the same input.
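A minimal sketch of the -f option follows. The program file name fields.awk and the scratch directory are assumptions made for the example; the program itself is the field-counting one-liner shown earlier, moved into a file.

```shell
dir=$(mktemp -d)
# fields.awk is a hypothetical program file holding a single awk statement.
cat > "$dir/fields.awk" <<'EOF'
{ print NF ": " $0 }
EOF
# The program file name follows -f; the input here arrives on standard input.
out=$(printf 'I am testing my typing.\n' | awk -f "$dir/fields.awk")
echo "$out"
rm -r "$dir"
```

Because the program lives in a file, no shell quoting is needed around the awk statements.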
Specifying Output on the Command Line
Output from awk may be redirected to a file or piped to another program (see Chapter 4).
The command awk '/^5/ {print $0}' | grep 3, for example, will result in just those lines that
start with the digit five (that's what the awk part does) and also contain the digit three (the
grep command). If you wanted to save that output to a file, by contrast, you could use
awk '/^5/ {print $0}' > results and the file results would contain all lines that begin with the
digit 5. If you opt for neither of these courses, the output of awk will be displayed on
your screen directly, which can be quite useful in many instances, particularly when
you're developing or fine-tuning your awk script.
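The two commands just described can be tried on a few made-up numbers, as in this sketch. The input values and the scratch directory are assumptions; the file name results comes from the text.

```shell
dir=$(mktemp -d)
# Lines beginning with 5 are saved to a file named results.
printf '%s\n' 531 135 503 52 | awk '/^5/ {print $0}' > "$dir/results"
# The same awk output, piped to grep so only lines that also contain a 3 remain.
out=$(printf '%s\n' 531 135 503 52 | awk '/^5/ {print $0}' | grep 3)
saved=$(cat "$dir/results")
echo "$out"
rm -r "$dir"
```

Here the saved file holds 531, 503 and 52, while the piped version drops 52 because it contains no 3.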
Patterns and Actions
Awk programs are divided into three main blocks: the BEGIN block, the per-record
processing block, and the END block. Unless explicitly stated, all statements to awk
appear in the per-record block (you'll see later where the other blocks can come in
particularly handy for programming, though).
Statements within awk are divided into two parts: a pattern, telling awk what to match,
and a corresponding action, telling awk what to do when a line matching the pattern is
found. The action part of a pattern action statement is enclosed in curly braces ({}) and
may be multiple statements. Either part of a pattern action statement may be omitted. An
action with no specified pattern matches every record of the input file you want to search
(that's how the earlier example of {print $0} worked). A pattern without an action
indicates that you want the matching input records to be copied to the output file as they
are (i.e., printed).
The example of /^5/ {print $0} is an example of a two-part statement: the pattern here is
all lines that begin with the digit five (the ^ indicates that it should appear at the
beginning of the line: without it the pattern would say any line that includes the digit five)
and the action is print the entire line verbatim. ($0 is shorthand for the entire line.)
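The three forms of a pattern action statement can be compared side by side. This is a sketch on made-up records; the variable names are chosen for the example only.

```shell
data=$(printf '%s\n' '5 apples' '15 pears' '52 plums')   # made-up records
both=$(printf '%s\n' "$data" | awk '/^5/ {print $2}')    # pattern and action
pattern_only=$(printf '%s\n' "$data" | awk '/^5/')       # no action: matching lines are printed
action_only=$(printf '%s\n' "$data" | awk '{print $1}')  # no pattern: action runs on every line
printf '%s\n' "$both" "$pattern_only" "$action_only"
```

Note that 15 pears is skipped by /^5/ because the 5 is not at the start of the line.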
Input
Awk automatically scans, in order, each record of the input file, testing it against each
pattern action statement in the awk program. Unless otherwise set, awk assumes each record is a
single line. (See "Multi-Line Records" in the "Advanced Concepts" section for how to
change this.) If the input file has blank lines in it, the blank lines count as records too.
Awk automatically retrieves each record for analysis; there is no read statement in awk.
A programmer may also disrupt the automatic input order in one of two ways: the next and
exit statements. The next statement tells awk to retrieve the next record from the input
file and continue without running the current input record through the remaining
pattern action statements in the program. For example, if you are doing a crossword
puzzle and all the letters of a word are formed by previous words, most likely you
wouldn't even bother to read that clue but simply skip to the clue below; this is how the
next statement would work, if your list of clues were the input. The other method of
disrupting the usual flow of input is through the exit statement. The exit statement
transfers control to the END block, if one is specified, or quits the program, as if all the
input had been read; suppose the arrival of a friend ends your interest in the crossword
puzzle, but you still put the paper away. Within the END block, an exit statement causes
the program to quit.
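A short sketch of next and exit follows, assuming a made-up input in which lines starting with # should be skipped and the word STOP marks where processing should end.

```shell
# Hypothetical input: lines starting with # are skipped; STOP ends processing.
out=$(printf '%s\n' '# header' 'alpha' 'beta' 'STOP' 'gamma' |
  awk '/^#/   { next }      # skip this record, fetch the next one
       /STOP/ { exit }      # jump to END as if input were exhausted
              { print }
       END    { print "done after record " NR }')
echo "$out"
```

The record gamma is never read, and the END block still runs after the exit statement.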
An input record refers to the entire line of a file including any characters, spaces, or Tabs.
The spaces and tabs are called whitespace.
TIP: If you think that your input file may include both spaces and tabs, you
can save yourself a lot of confusion by ensuring that all tabs become spaces with the
expand program. It works like this: expand filename | awk '{ stuff }'.
The whitespace in the input file and the whitespace in the output file are not related and
any whitespace you want in the output file, you must explicitly put there.
Fields
A group of characters in the input record or output file is called a field. Fields are
predefined in awk: $1 is the first field, $2 is the second, $3 is the third, and so on. $0
indicates the entire line. Fields are separated by a field separator (any single character
including Tab), held in the variable FS. Unless you change it, FS has a space as its value;
with this default, any run of spaces and tabs separates fields.
FS may be changed either by starting the program file with the following statement:
BEGIN {FS = "char" }
or by setting the -Fchar command line option, where char is the field separator
character you want to use.
One file that you might have viewed which demonstrates where changing the field
separator could be helpful is the /etc/passwd file that defines all user accounts. Rather
than having the different fields separated by spaces or tabs, the password file is structured
with lines:
news:?:6:11:USENET News:/usr/spool/news:/bin/ksh
Each field is separated by a colon! You could change each colon to a space (with sed, for
example), but that wouldn't work too well: notice that the fifth field, USENET News,
contains a space already. Better to change the field separator. If you wanted to just have a
list of the fifth fields in each line, therefore, you could use the simple awk command awk
-F: '{print $5}' /etc/passwd.
Likewise, the built-in variable OFS holds the value of the output field separator. OFS also
has a default value of a space. It, too, may be changed by placing the following line at the
start of a program.
BEGIN {OFS = "char" }
If you want to automatically translate the passwd file so that it lists only the first and
fifth fields, separated by a tab, you can therefore use the awk script:
BEGIN { FS=":" ; OFS="\t" }
{ print $1, $5 }
Notice here that the script contains two blocks: the BEGIN block and the main per-input
line block. Also notice that most of the work is done automatically.
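To try the separator settings without touching the real password file, the sketch below runs the same two-block program on a single made-up passwd-style line. The sample line and the variable names are assumptions for the example.

```shell
# A made-up passwd-style line; the real /etc/passwd is not read.
line='news:x:6:11:USENET News:/usr/spool/news:/bin/ksh'
out=$(printf '%s\n' "$line" |
  awk 'BEGIN { FS = ":"; OFS = "\t" } { print $1, $5 }')
echo "$out"
```

The fifth field keeps its internal space because only colons split the input, and the comma in the print statement is what inserts the tab held in OFS.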
Program Format
With a few noted exceptions, awk programs are free format. The interpreter ignores any
blank lines in a programfile. Add them to improve the readability of your program
whenever you wish. The same is true for Tabs and spaces between operators and the parts
of a program. Therefore, these two lines are treated identically by the awk interpreter.
$4 == 2 {print "Two"}
$4 == 2 { print "Two" }
If more than one statement appears on a line, you'll need to separate them with a
semicolon, as shown above in the BEGIN block for the passwd file translator. If you stick
with one-command-per-line then you won't need to worry too much about the
semicolons. There are a couple of spots, however, where the semicolon must always be
used: before an else statement or when included in the syntax of a statement. (See the
"Loops" or "The Conditional Statement" sections.) However, you may always put a
semicolon at the end of a statement.
The other format restriction for awk programs is that at least the opening curly bracket of
the action half of a pattern action statement must be on the same line as the
accompanying pattern, if both pattern and action exist. Thus, the following examples all do
the same thing.
The first shows all statements on one line:
$2==0 {print ""; print ""; print "";}
The second with the first statement on the same line as the pattern to match:
$2==0 { print ""
print ""
print ""}
and finally as spread out as possible:
$2==0 {
print ""
print ""
print ""
}
When the second field of the input file is equal to 0, awk prints three blank lines to the
output file.
NOTE: Notice that print "" prints a blank line to the output file, whereas the
statement print alone prints the current input line.
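The difference between print and print "" can be seen in a small sketch on two made-up records, where the second field decides whether a blank line follows.

```shell
out=$(printf '%s\n' 'a 0' 'b 1' |
  awk '$2 == 0 { print; print "" }   # the record itself, then one blank line
       $2 != 0 { print }')
echo "$out"
```

The first record is followed by an empty line; the second is printed on its own.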
When you look at an awk program file, you may also find commentary within. Anything
typed from a # to the end of the line is considered a comment and is ignored by awk.
Comments are notes to anyone reading the program, explaining what is going on in words, not
computerese.
A Note on awk Error Messages
Awk error messages (when they appear) tend to be cryptic. Often, due to the brevity of
the program, a typo is easily found. Not all errors are as obvious; I have scattered some
examples of errors throughout this chapter.
Print Selected Fields
Awk includes three ways to specify printing. The first is implied. A pattern without an
action assumes that the action is to print. The two ways of actively commanding awk to
print are print and printf. For now, I am going to stick to using only implied printing
and the print statement. printf is discussed in a later section ("Input and Output") and is used
mainly for precise output. This section demonstrates the first two types of printing
through some step-by-step examples.
Program Components
If I want to be sure the System Administrator spelled my name correctly in the
/etc/passwd file, I enter an awk command to find a match but omit an action. The
following command line puts a list on-screen.
$ awk '/Ann/' /etc/passwd
amarshal:oPWwC9qVWI/ps:2005:12:Ann Marshall:/usr/grad/amarshal:/bin/csh
andhs26:0TFnZSVwcua3Y:2488:23:DeAnn O'Neal:/usr/lstudent/andhs26:/bin/csh
alewis:VYfz4EatT4OoA:2623:22:Annie Lewis:/usr/lteach/alewis:/bin/csh
cmcintyr:0FciKEDDMkauU:2630:22:Carol Ann McIntyre:/usr/lteach/cmcintyr:/bin/csh
jflanaga:ShrMnyDwLI/mM:2654:22:JoAnn Flanagan:/usr/lteach/jflanaga:/bin/csh
lschultz:mic35ZiFj9zWk:3060:22:Lee Ann Schultz, :/usr/lteach/lschultz:/bin/csh
akestle:job57Lb5/ofoE:3063:22:Ann Kestle.:/usr/lteach/akestle:/bin/csh
bakehs59:yRYV6BtcW7wFg:3075:23:DeAnna Adlington, Baker :/usr/bakehs59:/bin/csh
ahernan:AZZPQNCkw6ffs:3144:23:Ann Hernandez:/usr/lstudent/ahernan:/bin/csh
$ _
I look on the monitor and see the correct spelling.
ERROR NOTE: For the sake of making a point, suppose I had chosen the pattern
/Anne/. A quick glance above shows that there would be no matches. Entering awk
'/Anne/' /etc/passwd will therefore produce nothing but another system prompt on the
monitor. This can be confusing if you expect output. The same goes the other way;
above, I wanted the name Ann, but the names DeAnn, Annie, JoAnn and DeAnna matched, too.
Choosing a pattern that is too long or too short can cause a needless headache.
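One possible way to narrow such a match is sketched below. It assumes made-up passwd-style lines and uses awk's ~ match operator, which has not been introduced at this point in the chapter, to test only the fifth field and to require that it begin with Ann followed by a space.

```shell
# Made-up passwd-style lines; only the fifth (name) field is tested.
out=$(printf '%s\n' 'u1:x:1:1:Ann Marshall:/h/u1:/bin/sh' \
                    'u2:x:2:1:Annie Lewis:/h/u2:/bin/sh' \
                    'u3:x:3:1:DeAnn Jones:/h/u3:/bin/sh' |
  awk -F: '$5 ~ /^Ann / { print $5 }')
echo "$out"
```

This is only one choice of pattern; it would miss a name such as Carol Ann, so the pattern still has to fit what you are actually looking for.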
TIP: If a pattern match is not found, look for a typo in the pattern you are
trying to match.
Printing specified fields of an ASCII (plain text) file is a straightforward awk task.
Because this program example is so short, only the input is in a file. The first input file,
"sales", is a file of car sales by month. Each line consists of a salesperson's name,
followed by three monthly sales figures. The last field is that person's total
sales for the quarter.
The Input File and Program
$ cat sales
John Anderson,12,23,7,42
Joe Turner,10,25,15,50
Susan Greco,15,13,18,46
Bob Burmeister,8,21,17,46
The following command line prints the salesperson's name and the total sales for the first
quarter.
$ awk -F, '{print $1,$5}' sales
John Anderson 42
Joe Turner 50
Susan Greco 46
Bob Burmeister 46
A comma (,) between field variables indicates that I want OFS applied between output
fields as shown in a previous example. Remember without the comma, no field separator
will be used, and the displayed output fields (or output file) will all run together.
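The effect of the comma can be checked on one line of the sales data, as in this sketch. The variable names are chosen for the example.

```shell
with_comma=$(echo 'John Anderson,12,23,7,42' | awk -F, '{print $1, $5}')
without=$(echo 'John Anderson,12,23,7,42' | awk -F, '{print $1 $5}')
echo "$with_comma"   # OFS (a space by default) appears between the fields
echo "$without"      # the two fields run together
```

Without the comma, awk concatenates the two fields into a single string before printing.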
TIP: Putting two field separators in a row inside a print statement creates a
syntax error with the print statement; however, using the same field twice in a single print
statement is valid syntax. For example:
awk '{print($1,$1)}'
Patterns
A pattern is the first half of an awk program statement. In awk there are six accepted
pattern types. This section discusses each of the six in detail. You have already seen a
couple of them, including BEGIN, and a specified, slash-delimited pattern, in use. Awk
has many string matching capabilities arising from patterns, and the use of regular
expressions in patterns. A range pattern locates a sequence. All patterns except range
patterns may be combined in a compound pattern.
I began the chapter by saying awk was a pattern-match and process language. This
section explores exactly what is meant by a pattern match. As you'll see, what kind of
pattern you can match depends on exactly how you're using the awk pattern specification
notation.
BEGIN and END
The two special patterns BEGIN and END may be used to indicate a match, either before
the first input record is read, or after the last input record is read, respectively. Some
versions of awk require that, if used, BEGIN must be the first pattern of the program and,
if used, END must be the last pattern of the program. While not necessarily a
requirement, it is nonetheless an excellent habit to get into, so I encourage you to do so,
as I do throughout this chapter. Using the BEGIN pattern for initializing variables is
common (although variables can be passed from the command line to the program too;
see "Command Line Arguments"). The END pattern is used for things that are
input-dependent, such as totals.
If I want to know how many lines are in a given program, I type the following line:
$ awk 'END {print "Total lines: " NR}' myprogram
I see Total lines: 256 on the monitor and therefore know that the file myprogram has 256
lines. At any point while awk is processing the file, the variable NR counts the number of
records read so far. NR at the end of a file has a value equal to the number of lines in the
file.
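To watch NR grow as records are read, the sketch below prints it for each line of a made-up three-line input and again in the END block.

```shell
out=$(printf 'x\ny\nz\n' |
  awk '{ print NR, $0 } END { print "Total lines: " NR }')
echo "$out"
```

Inside the END block, NR still holds the count of the last record read, which is why it gives the total.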
How might you see a BEGIN block in use? Your first thought might be to initialize
variables, but if it's a numeric value, it's automatically initialized to zero before its first
use. Instead, perhaps you're building a table of data and want to have some columnar
headings. With this in mind, here's a simple awk script that shows you all the accounts
that people named Dave have on your computer:
BEGIN {
FS=":" # remember that the passwd file uses colons
OFS="\t" # we're setting the output to a TAB
print "Account","Username"
}
/Dav/ {print $1, $5}
Here's what it looks like in action (we've called this file "daves.awk", though the
program matches Dave and David, of course):
$ awk -f daves.awk /etc/passwd
Account Username
andrews Dave Andrews
d3 David Douglas Dunlap
daves Dave Smith
taylor Dave Taylor
Note that you could also easily have a summary of the total number of matched accounts
by adding a variable that's incremented for each match, then printing it in the END block.
Here's one way to do it:
BEGIN { FS=":" ; OFS="\t" # input colon separated, output tab separated
print "Account","Username"
}
/Dav/ {print $1, $5 ; matches++ }
END { print "A total of " matches " matches." }