UNIX Unleashed, Part 3

Part III — Networking with NetWare

Awk, Awk

Perl

The C Programming Language

 15

o Awk, Awk

o By Ann Marshall

 Overview

 Uses

 Features

 Brief History

 Fundamentals

 Entering Awk from the Command Line

 Files for Input

 The Program File

 Specifying Output on the Command Line

 Patterns and Actions

 Input

 Fields

 Program Format

 A Note on awk Error Messages

 Print Selected Fields

 Program Components

 The Input File and Program

 Patterns

 BEGIN and END

 Expressions

 String Matching

 Range Patterns

 Compound Patterns

 Actions

 Variables

 Naming

 Awk in a Shell Script

 Built-in Variables

 Conditions (No IFs, &&s or buts)

 The if Statement

 The Conditional Statement

 Patterns as Conditions

 Loops

 Increment and Decrement

 The While Statement

 The Do Statement

 The For Statement

 Loop Control

 Strings

 Built-In String Functions

 String Constants

 Arrays

 Array Specialties

 Arithmetic

 Operators

 Numeric Functions

 Input and Output

 Input

 The Getline Statement

 Output

 The printf Statement

 Closing Files and Pipes

 Command Line Arguments

 Passing Command Line Arguments

 Setting Variables on the Command Line

 Functions

 Function Definition

 Parameters

 Variables

 Function Calls

 The Return Statement

 Writing Reports

 BEGIN and END Revisited

 The Built-in System Function

 Advanced Concepts

 Multi-Line Records

 Multidimensional Arrays

 Summary

 Further Reading

 Obtaining Source Code

15

Awk, Awk

By Ann Marshall

Overview

The UNIX utility awk is a pattern matching and processing language with considerably

more power than you may realize. It searches one or more specified files, checking for

records that match a specified pattern. If awk finds a match, the corresponding action is

performed. A simple concept, but it results in a powerful tool. Often an awk program is

only a few lines long, and because of this, an awk program is often written, used, and

discarded. A traditional programming language, such as Pascal or C, would take more

thought, more lines of code, and hence, more time. Short awk programs arise from two of

its built-in features: the amount of predefined flexibility and the number of details that are

handled by the language automatically. Together, these features allow the manipulation

of large data files in short (often single-line) programs, and make awk stand apart from

other programming languages. Certainly any time you spend learning awk will pay

dividends in improved productivity and efficiency.

Uses

The uses for awk vary from the simple to the complex. Originally awk was intended for

various kinds of data manipulation. Intentionally omitting parts of a file, counting

occurrences in a file, and writing reports are naturals for awk.

Awk uses the syntax of the C programming language, so if you know C, you have an idea

of awk syntax. If you are new to programming or don't know C, learning awk will

familiarize you with many of the C constructs.

Examples of where awk can be helpful abound. Computer-aided manufacturing, for

example, is plagued with nonstandardization, so the output of a computer that's running a

particular tool is quite likely to be incompatible with the input required for a different

tool. Rather than write any complex C program, this type of simple data transformation is

a perfect awk task.

One real problem of computer-aided manufacturing today is that no standard format yet

exists for the program running the machine. Therefore, the output from Computer A

running Machine A probably is not the input needed for Computer B running Machine B.

Although Machine A is finished with the material, Machine B is not ready to accept it.

Production halts while someone edits the file so it meets Computer B's needed format.

This is a perfect and simple awk task.

Due to the amount of built-in automation within awk, it is also useful for rapid

prototyping or trying out an idea that could later be implemented in another language.

Features

Reflecting the UNIX environment, awk features resemble the structures of both C and

shell scripts. Highlights include flexibility, predefined variables, automation,

standard program constructs, conventional variable types, powerful output formatting

borrowed from C, and ease of use.

The flexibility means that most tasks may be done more than one way in awk. With the

application in mind, the programmer chooses which method to use. The built-in

variables already provide many of the tools to do what is needed. Awk is highly

automated. For instance, awk automatically retrieves each record, separates it into fields,

and does type conversion when needed without programmer request. Furthermore, there

are no variable declarations. Awk includes the "usual" programming constructs for the

control of program flow: an if statement for two-way decisions and do, for, and while

statements for looping. Awk also includes its own notational shorthand to ease typing.

(This is UNIX after all!) Awk borrows the printf() statement from C to allow "pretty" and

versatile formats for output. These features combine to make awk user friendly.

Brief History

Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan created awk in 1977. (The

name is from the creators' last initials.) In 1985, more features were added, creating nawk

(new awk). For quite a while, nawk remained exclusively the property of AT&T Bell

Labs. Although it became part of System V for Release 3.1, some versions of UNIX, like

SunOS, keep both awk and nawk due to a syntax incompatibility. Others, like System V,

run nawk under the name awk (although System V has nawk too). The Free Software

Foundation's GNU project introduced its own version, gawk, based on the IEEE POSIX

awk standard (Institute of Electrical and Electronics Engineers, Inc., IEEE Standard for

Information Technology, Portable Operating System Interface, Part 2: Shell and Utilities,

Volume 2, ANSI approved 4/5/93), which differs from both awk and nawk. Linux, a freely

available UNIX-like system for PCs, uses gawk rather than awk or nawk. Throughout this

chapter I have used the word awk when any of the three supports the concept. The versions

are mostly upwardly compatible. Awk is the oldest, then nawk, then POSIX awk, then gawk, as

shown below. I have used the notation version++ to denote a concept that began in that

version and continues through any later versions.

NOTE: Because of syntax differences, awk code cannot always be run unchanged under nawk.

However, except as noted, all the concepts of awk are implemented in nawk (and gawk).

Where it matters, I have specified the version.

Figure 15.1. The evolution of awk.

Refer to the end of the chapter for more information and further resources on awk and its

derivatives.

Fundamentals

This section introduces the basics of the awk programming language. Although my

discussion first skims the surface of each topic to familiarize you with how awk

functions, later sections of the chapter go into greater detail. One feature of awk that

almost continually holds true is this: you can do most tasks more than one way. The

command line exemplifies this. First, I explain the variety of ways awk may be called

from the command line—using files for input, the program file, and possibly an output

file. Next, I introduce the main construct of awk, which is the pattern action statement.

Then, I explain the fundamental ways awk can read and transform input. I conclude the

section with a look at the format of an awk program.

Entering Awk from the Command Line

In its simplest form, awk takes the material you want to process from standard input and

displays the results to standard output (the monitor). You write the awk program on the

command line. The following table shows the various ways you can enter awk and input

material for processing.

You can either specify explicit awk statements on the command line, or, with the -f flag,

specify an awk program file that contains a series of awk commands. In addition to the

standard UNIX design allowing for standard input and output, you can, of course, use file

redirection in your shell, too, so awk 'program' < inputfile is functionally identical to awk 'program' inputfile.

To save the output in a file, again use file redirection: awk 'program' > outputfile does the trick.

Helpfully, awk can work with multiple input files at once if they are specified on the

command line.

The most common way to see people use awk is as part of a command pipe, where it's

filtering the output of a command. An example is ls -l | awk '{print $3}', which would print

just the third column of each line of the ls command. Awk scripts can become quite

complex, so if you have a standard set of filter rules that you'd like to apply to a file, with

the output sent directly to the printer, you could use something like awk -f myawkscript

inputfile | lp.

TIP: If you opt to specify your awk script on the command line, you'll find it

best to use single quotes to let you use spaces and to ensure that the command shell

doesn't falsely interpret any portion of the command.
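
The effect of the quoting can be seen in a small sketch. The sample text and the shell arguments set with set -- are invented for illustration; only the quoting differs between the two commands.

```shell
# Single quotes: the shell passes the program to awk untouched,
# so $2 means the second field of each input line.
single_quoted=$(printf 'alpha beta gamma\n' | awk '{ print $2 }')
echo "$single_quoted"

# Double quotes: the shell replaces $2 with ITS OWN second argument
# before awk ever runs. Here that argument is set to 3 for the demo,
# so awk actually receives the program { print 3 }.
set -- one 3
double_quoted=$(printf 'alpha beta gamma\n' | awk "{ print $2 }")
echo "$double_quoted"
```

The first command prints beta; the second prints 3, because awk never saw a field reference at all.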

Files for Input

These input and output places can be changed if desired. You can specify an input file by

typing the name of the file after the program with a blank space between the two. If you

name no file, the input enters the awk environment from your workstation keyboard (standard input).

To signal the end of the input, type Ctrl+d. The program on the command line

executes on the input you just entered and the results are displayed on the monitor

(the standard output).

Here's a simple little awk command that echoes all lines I type, prefacing each with the

number of words (or fields, in awk parlance, hence the NF variable for number of fields)

in the line. (Note that Ctrl+d means that while holding down the Control key you should

press the d key).

$ awk '{print NF ": " $0}'

I am testing my typing.

A quick brown fox jumps when vexed by lazy ducks.

Ctrl+d

5: I am testing my typing.

10: A quick brown fox jumps when vexed by lazy ducks.

$ _
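
The same word count can be tried without typing at the keyboard by piping the two sentences into awk. This is a minimal sketch that uses NF (the number of fields) joined to the line with a literal ": " string.

```shell
# Feed the two sample sentences to awk on standard input.
# NF is the number of fields in the current record; $0 is the whole record.
counts=$(printf 'I am testing my typing.\nA quick brown fox jumps when vexed by lazy ducks.\n' |
  awk '{ print NF ": " $0 }')
echo "$counts"
```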

You can also name more than one input file on the command line, causing the combined

files to act as one input. This is one way of having multiple runs through one input file.

TIP: Keep in mind that the correct ordering on the command line is crucial for

your program to work correctly: files are read from left to right, so if you want to have

file1 and file2 read in that order, you'll need to specify them as such on the command

line.
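
A short sketch of the left-to-right rule, assuming two throwaway files named file1 and file2 created in a temporary directory. The built-in variable NR keeps counting across files, so the numbering shows the order in which awk read them.

```shell
dir=$(mktemp -d)
printf 'one\ntwo\n' > "$dir/file1"
printf 'three\n'    > "$dir/file2"

# file1 is read first, then file2; awk treats them as one combined input.
in_order=$(awk '{ print NR, $0 }' "$dir/file1" "$dir/file2")
# Swapping the names on the command line swaps the order of the records.
reversed=$(awk '{ print NR, $0 }' "$dir/file2" "$dir/file1")

echo "$in_order"
echo "$reversed"
rm -r "$dir"
```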

The Program File

With awk's automatic type conversion, a file of names and a file of numbers entered in

the reverse order at the command line generate strange-looking output rather than an

error message. That is why for longer programs, it is simpler to put the program in a file

and specify the name of the file on the command line. The -f option does this. Notice that

the -f option takes the program file name as its argument, so the two appear together

before any input files; the input file remains the last parameter.

NOTE: Versions of awk that meet the POSIX awk specifications are allowed to

have multiple -f options. You can use this for running multiple programs using the same

input.
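
As an illustration of the -f option, the following sketch writes a tiny program file and runs it against a small input file. The file names first.awk and fruit, and their contents, are hypothetical.

```shell
dir=$(mktemp -d)

# Hypothetical program file: print the first field of every record.
cat > "$dir/first.awk" <<'EOF'
# first.awk: print only the first field of each input line
{ print $1 }
EOF

printf 'red apple\ngreen pear\n' > "$dir/fruit"

# The program file follows -f; the input file comes last.
first=$(awk -f "$dir/first.awk" "$dir/fruit")
echo "$first"
rm -r "$dir"
```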

Specifying Output on the Command Line

Output from awk may be redirected to a file or piped to another program (see Chapter 4).

The command awk '/^5/ {print $0}' | grep 3, for example, will result in just those lines that

start with the digit five (that's what the awk part does) and also contain the digit three (the

grep command). If you wanted to save that output to a file, by contrast, you could use

awk '/^5/ {print $0}' > results and the file results would contain all lines that begin with the

digit 5. If you opt for neither of these courses, the output of awk will be displayed on

your screen directly, which can be quite useful in many instances, particularly when

you're developing—or fine tuning—your awk script.
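
Both forms can be tried on a small made-up input. In this sketch the file name numbers and its contents are assumptions chosen so that the two commands give different results.

```shell
dir=$(mktemp -d)
printf '512\n530\n73\n5\n' > "$dir/numbers"

# Piped: awk keeps lines that start with 5, grep keeps those that also contain 3.
piped=$(awk '/^5/ { print $0 }' "$dir/numbers" | grep 3)

# Redirected: the same awk output is saved in a file named results.
awk '/^5/ { print $0 }' "$dir/numbers" > "$dir/results"
saved=$(cat "$dir/results")

echo "$piped"
echo "$saved"
rm -r "$dir"
```

Only 530 survives the pipe, while the results file holds all three lines that start with 5.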

Patterns and Actions

Awk programs are divided into three main blocks: the BEGIN block, the per-record

processing block, and the END block. Unless explicitly stated, all statements to awk

appear in the per-record block (you'll see later where the other blocks can come in

particularly handy for programming, though).

Statements within awk are divided into two parts: a pattern, telling awk what to match,

and a corresponding action, telling awk what to do when a line matching the pattern is

found. The action part of a pattern action statement is enclosed in curly braces ({}) and

may be multiple statements. Either part of a pattern action statement may be omitted. An

action with no specified pattern matches every record of the input file you want to search

(that's how the earlier example of {print $0} worked). A pattern without an action

indicates that you want matching input records to be copied to the output as they are (i.e.,

printed).

The example of /^5/ {print $0} is an example of a two-part statement: the pattern here is

all lines that begin with the digit five (the ^ indicates that it should appear at the

beginning of the line: without it the pattern would say any line that includes the digit five)

and the action is print the entire line verbatim. ($0 is shorthand for the entire line.)
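
The three forms (pattern only, action only, and both) can be compared side by side. The three input lines here are invented for the sketch.

```shell
input='51 apples
17 pears
5 plums'

# Pattern with no action: matching records are printed as they are.
pattern_only=$(printf '%s\n' "$input" | awk '/^5/')

# Action with no pattern: the action runs for every record.
action_only=$(printf '%s\n' "$input" | awk '{ print $2 }')

# Pattern and action together: the action runs only for matching records.
both=$(printf '%s\n' "$input" | awk '/^5/ { print $2 }')

echo "$pattern_only"
echo "$action_only"
echo "$both"
```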

Input

Awk automatically scans, in order, each record of the input file, testing it against each pattern

action statement in the awk program. Unless otherwise set, awk assumes each record is a

single line. (See the sections "Advanced Concepts","Multi-line Records" for how to

change this.) If the input file has blank lines in it, the blank lines count as a record too.

Awk automatically retrieves each record for analysis; there is no read statement in awk.

A programmer may also disrupt the automatic input order in one of two ways: the next and

exit statements. The next statement tells awk to retrieve the next record from the input

file and continue without running the current input record through the remaining portion

of pattern action statements in the program. For example, if you are doing a crossword

puzzle and all the letters of a word are formed by previous words, most likely you

wouldn't even bother to read that clue but simply skip to the clue below; this is how the

next statement would work, if your list of clues were the input. The other method of

disrupting the usual flow of input is through the exit statement. The exit statement

transfers control to the END block—if one is specified—or quits the program, as if all the

input has been read; suppose the arrival of a friend ends your interest in the crossword

puzzle, but you still put the paper away. Within the END block, an exit statement causes

the program to quit.
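
A minimal sketch of next and exit working together, using four invented input lines. The record that starts with skip never reaches the print action, and the record that starts with stop ends the input early, so the last line is never read.

```shell
result=$(printf 'skip this\nkeep one\nstop\nkeep two\n' | awk '
/^skip/ { next }                  # go straight to the next record
/^stop/ { exit }                  # act as if the input had ended
        { print "saw: " $0 }      # reached only by records not handled above
END     { print "records read: " NR }')
echo "$result"
```

The END block still runs after exit, and NR shows that only three records were read.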

An input record refers to the entire line of a file including any characters, spaces, or Tabs.

The spaces and tabs are called whitespace.

TIP: If you think that your input file may include both spaces and tabs, you

can save yourself a lot of confusion by ensuring that all tabs become spaces with the

expand program. It works like this: expand filename | awk '{ stuff }'.

The whitespace in the input file and the whitespace in the output file are not related and

any whitespace you want in the output file, you must explicitly put there.

Fields

A group of characters in the input record or output file is called a field. Fields are

predefined in awk: $1 is the first field, $2 is the second, $3 is the third, and so on. $0

indicates the entire line. Fields are separated by a field separator (any single character

including Tab), held in the variable FS. Unless you change it, FS has a space as its value, which awk treats specially: any run of spaces or tabs separates fields.

FS may be changed by either starting the programfile with the following statement:

BEGIN {FS = "char" }

or by setting the -Fchar command line option where char is the selected field separator

character you want to use.

One file that you might have viewed which demonstrates where changing the field

separator could be helpful is the /etc/passwd file that defines all user accounts. Rather

than having the different fields separated by spaces or tabs, the password file is structured

with lines:

news:?:6:11:USENET News:/usr/spool/news:/bin/ksh

Each field is separated by a colon! You could change each colon to a space (with sed, for

example), but that wouldn't work too well: notice that the fifth field, USENET News,

contains a space already. Better to change the field separator. If you wanted to just have a

list of the fifth fields in each line, therefore, you could use the simple awk command awk

-F: '{print $5}' /etc/passwd.

Likewise, the built-in variable OFS holds the value of the output field separator. OFS also

has a default value of a space. It, too, may be changed by placing the following line at the

start of a program.

BEGIN {OFS = "char" }

If you want to automatically translate the passwd file so that it lists only the first and

fifth fields, separated by a tab, you can therefore use the awk script:

BEGIN { FS=":" ; OFS="\t" }

{ print $1, $5 }

Notice here that the script contains two blocks: the BEGIN block and the main per-input

line block. Also notice that most of the work is done automatically.
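
Both ways of setting the separators can be tried without touching the real /etc/passwd. This sketch uses the news line from above plus one invented guest line as sample data.

```shell
# Sample records in passwd format; the guest line is made up for illustration.
sample='news:?:6:11:USENET News:/usr/spool/news:/bin/ksh
guest:?:99:99:Guest Account:/tmp:/bin/sh'

# Field separator set with the -F command line option.
by_flag=$(printf '%s\n' "$sample" | awk -F: '{ print $5 }')

# Input and output separators set in a BEGIN block; "\t" is a tab.
by_begin=$(printf '%s\n' "$sample" |
  awk 'BEGIN { FS = ":"; OFS = "\t" } { print $1, $5 }')

echo "$by_flag"
echo "$by_begin"
```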

Program Format

With a few noted exceptions, awk programs are free format. The interpreter ignores any

blank lines in a programfile. Add them to improve the readability of your program

whenever you wish. The same is true for Tabs and spaces between operators and the parts

of a program. Therefore, these two lines are treated identically by the awk interpreter.

$4 == 2 {print "Two"}

$4 == 2 { print "Two" }

If more than one statement appears on a line, you'll need to separate them with a

semicolon, as shown above in the BEGIN block for the passwd file translator. If you stick

with one-command-per-line then you won't need to worry too much about the

semicolons. There are a couple of spots, however, where the semicolon must always be

used: before an else statement or when included in the syntax of a statement. (See the

"Loops" or "The Conditional Statement" sections.) However, you may always put a

semicolon at the end of a statement.

The other format restriction for awk programs is that at least the opening curly bracket of

the action half of a pattern action statement must be on the same line as the

accompanying pattern, if both pattern and action exist. Thus, the following examples all do

the same thing.

The first shows all statements on one line:

$2==0 {print ""; print ""; print "";}

The second with the first statement on the same line as the pattern to match:

$2==0 { print ""

print ""

print ""}

and finally as spread out as possible:

$2==0 {

print ""

print ""

print ""

}

When the second field of an input record is equal to 0, awk prints three blank lines to the

output file.

NOTE: Notice that print "" prints a blank line to the output file, whereas the

statement print alone prints the current input line.

When you look at an awk program file, you may also find commentary within. Anything

typed from a # to the end of the line is considered a comment and is ignored by awk.

They are notes to anyone reading the program to explain what is going on in words, not

computerese.
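
The formatting rules, the difference between print "" and print, and comments can all be seen in one small sketch. The two input lines are invented.

```shell
out=$(printf 'first 0\nsecond 1\n' | awk '
# Records whose second field is 0 get a blank line, then the record itself.
$2 == 0 {
    print ""      # an empty string prints a blank line
    print         # print alone prints the current record
}')
echo "$out"
```

Only the first record matches, so the output is one blank line followed by first 0.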

A Note on awk Error Messages

Awk error messages (when they appear) tend to be cryptic. Often, due to the brevity of

the program, a typo is easily found. Not all errors are as obvious; I have scattered some

examples of errors throughout this chapter.

Print Selected Fields

Awk includes three ways to specify printing. The first is implied. A pattern without an

action assumes that the action is to print. The two ways of actively commanding awk to

print are print and printf(). For now, I am going to stick to using only implied printing

and the print statement. printf is discussed in a later section ("Input and Output") and is used

mainly for precise output. This section demonstrates the first two types of printing

through some step-by-step examples.

Program Components

If I want to be sure the System Administrator spelled my name correctly in the

/etc/passwd file, I enter an awk command to find a match but omit an action. The

following command line puts a list on-screen.

$ awk '/Ann/' /etc/passwd

amarshal:oPWwC9qVWI/ps:2005:12:Ann Marshall:/usr/grad/amarshal:/bin/csh
andhs26:0TFnZSVwcua3Y:2488:23:DeAnn O'Neal:/usr/lstudent/andhs26:/bin/csh
alewis:VYfz4EatT4OoA:2623:22:Annie Lewis:/usr/lteach/alewis:/bin/csh
cmcintyr:0FciKEDDMkauU:2630:22:Carol Ann McIntyre:/usr/lteach/cmcintyr:/bin/csh
jflanaga:ShrMnyDwLI/mM:2654:22:JoAnn Flanagan:/usr/lteach/jflanaga:/bin/csh
lschultz:mic35ZiFj9zWk:3060:22:Lee Ann Schultz, :/usr/lteach/lschultz:/bin/csh
akestle:job57Lb5/ofoE:3063:22:Ann Kestle.:/usr/lteach/akestle:/bin/csh
bakehs59:yRYV6BtcW7wFg:3075:23:DeAnna Adlington, Baker :/usr/bakehs59:/bin/csh
ahernan:AZZPQNCkw6ffs:3144:23:Ann Hernandez:/usr/lstudent/ahernan:/bin/csh

$ _

I look on the monitor and see the correct spelling.

ERROR NOTE: For the sake of making a point, suppose I had chosen the pattern

/Anne/. A quick glance above shows that there would be no matches. Entering awk

'/Anne/' /etc/passwd will therefore produce nothing but another system prompt to the

monitor. This can be confusing if you expect output. The same goes the other way;

above, I wanted the name Ann, but the names DeAnn, Annie, and DeAnna matched, too.

Sometimes choosing a pattern too long or too short can cause an unneeded headache.

TIP: If a pattern match is not found, look for a typo in the pattern you are

trying to match.

Printing specified fields of an ASCII (plain text) file is a straightforward awk task.

Because this program example is so short, only the input is in a file. The first input file,

"sales", is a file of car sales by month. The file consists of each salesperson's name,

followed by three monthly sales figures. The last field is that person's total sales for the

quarter.

The Input File and Program

$ cat sales

John Anderson,12,23,7,42

Joe Turner,10,25,15,50

Susan Greco,15,13,18,46

Bob Burmeister,8,21,17,46

The following command line prints the salesperson's name and the total sales for the first

quarter.

$ awk -F, '{print $1,$5}' sales

John Anderson 42

Joe Turner 50

Susan Greco 46

Bob Burmeister 46

A comma (,) between field variables indicates that I want OFS applied between output

fields as shown in a previous example. Remember without the comma, no field separator

will be used, and the displayed output fields (or output file) will all run together.
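
The effect of the comma is easy to check on a single record from the sales file. A minimal sketch:

```shell
line='John Anderson,12,23,7,42'

# With a comma, print places OFS (a space by default) between the fields.
with_comma=$(printf '%s\n' "$line" | awk -F, '{ print $1, $5 }')

# Without a comma, the two fields are joined with nothing between them.
no_comma=$(printf '%s\n' "$line" | awk -F, '{ print $1 $5 }')

echo "$with_comma"
echo "$no_comma"
```

The first form prints John Anderson 42; the second runs the fields together as John Anderson42.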

TIP: Putting two field separators in a row inside a print statement creates a

syntax error with the print statement; however, using the same field twice in a single print

statement is valid syntax. For example:

awk '{print($1,$1)}'

Patterns

A pattern is the first half of an awk program statement. In awk there are six accepted

pattern types. This section discusses each of the six in detail. You have already seen a

couple of them, including BEGIN, and a specified, slash-delimited pattern, in use. Awk

has many string matching capabilities arising from patterns, and the use of regular

expressions in patterns. A range pattern locates a sequence. All patterns except range

patterns may be combined in a compound pattern.

I began the chapter by saying awk was a pattern-match and process language. This

section explores exactly what is meant by a pattern match. As you'll see, what kind of

pattern you can match depends on exactly how you're using the awk pattern specification

notation.

BEGIN and END

The two special patterns BEGIN and END may be used to indicate a match, either before

the first input record is read, or after the last input record is read, respectively. Some

versions of awk require that, if used, BEGIN must be the first pattern of the program and,

if used, END must be the last pattern of the program. While not necessarily a

requirement, it is nonetheless an excellent habit to get into, so I encourage you to do so,

as I do throughout this chapter. Using the BEGIN pattern for initializing variables is

common (although variables can be passed from the command line to the program too;

see "Command Line Arguments"). The END pattern is used for things which are input-dependent, such as totals.

If I want to know how many lines are in a given program, I type the following line:

$ awk 'END {print "Total lines: " NR}' myprogram

I see Total lines: 256 on the monitor and therefore know that the file myprogram has 256

lines. At any point while awk is processing the file, the variable NR counts the number of

records read so far. NR at the end of a file has a value equal to the number of lines in the

file.
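
The line count can be tried on any input. This sketch pipes three invented lines to awk instead of naming a file; note that NR is written without a dollar sign, because it is a variable, not a field.

```shell
# END runs once, after the last record; by then NR holds the record count.
total=$(printf 'a\nb\nc\n' | awk 'END { print "Total lines: " NR }')
echo "$total"
```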

How might you see a BEGIN block in use? Your first thought might be to initialize

variables, but if it's a numeric value, it's automatically initialized to zero before its first

use. Instead, perhaps you're building a table of data and want to have some columnar

headings. With this in mind, here's a simple awk script that shows you all the accounts

that people named Dave have on your computer:

BEGIN {
    FS=":"      # remember that the passwd file uses colons
    OFS="\t"    # we're setting the output to a TAB
    print "Account","Username"
}

/Dav/ {print $1, $5}

Here's what it looks like in action (we've called this file daves.awk, though the

program matches Dave and David, of course):

$ awk -f daves.awk /etc/passwd

Account Username

andrews Dave Andrews

d3 David Douglas Dunlap

daves Dave Smith

taylor Dave Taylor

Note that you could also easily have a summary of the total number of matched accounts

by adding a variable that's incremented for each match, then in the END block output in

some manner. Here's one way to do it:

BEGIN { FS=":" ; OFS="\t"   # input colon separated, output tab separated
    print "Account","Username"
}

/Dav/ {print $1, $5 ; matches++ }

END { print "A total of " matches " matches."}
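
To try a script of this shape without reading the real /etc/passwd, the records can be supplied inline. In this sketch the three account lines, including their user IDs and home directories, are invented.

```shell
# Made-up passwd-style records; only two of them contain "Dav".
sample='andrews:?:101:20:Dave Andrews:/home/andrews:/bin/sh
mjones:?:102:20:Mary Jones:/home/mjones:/bin/sh
d3:?:103:20:David Douglas Dunlap:/home/d3:/bin/sh'

report=$(printf '%s\n' "$sample" | awk '
BEGIN { FS = ":"; OFS = "\t"; print "Account", "Username" }   # header first
/Dav/ { print $1, $5; matches++ }                             # count each match
END   { print "A total of " matches " matches." }')
echo "$report"
```

The variable matches needs no initialization: it starts out as zero the first time it is incremented.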
