Copyright

Full Description

About the Author

Python & XML

Christopher A. Jones

Fred L. Drake, Jr.

Publisher: O'Reilly

First Edition January 2002

ISBN: 0-596-00128-2, 384 pages

Python is an ideal language for manipulating XML, and this new volume

gives you a solid foundation for using these two languages together.

Complete with practical examples that highlight common application tasks,

the book starts with the basics then quickly progresses to complex topics

like transforming XML with XSLT and querying XML with XPath. It also

explores more advanced subjects, such as SOAP and distributed web

services.

Dedication

Preface

Audience

Organization

Conventions Used in This Book

How to Contact Us

Acknowledgments

1. Python and XML

1.1 Key Advantages of XML

1.2 The XML Specifications

1.3 The Power of Python and XML

1.4 What Can We Do with It?

2. XML Fundamentals

2.1 XML Structure in a Nutshell

2.2 Document Types and Schemas

2.3 Types of Conformance

2.4 Physical Structures

2.5 Constructing XML Documents

2.6 Document Type Definitions

2.7 Canonical XML

2.8 Going Beyond the XML Specification

3. The Simple API for XML

3.1 The Birth of SAX

3.2 Understanding SAX

3.3 Reading an Article

3.4 Searching File Information

3.5 Building an Image Index

3.6 Converting XML to HTML

3.7 Advanced Parser Factory Usage

3.8 Native Parser Interfaces

4. The Document Object Model

4.1 The DOM Specifications

4.2 Understanding the DOM

4.3 Python DOM Offerings

4.4 Retrieving Information

4.5 Changing Documents

4.6 Building a Web Application

4.7 Going Beyond SAX and DOM

5. Querying XML with XPath

5.1 XPath at a Glance

5.2 Where Is XPath Used?

5.3 Location Paths

5.4 XPath Arithmetic Operators

5.5 XPath Functions

5.6 Compiling XPath Expressions

6. Transforming XML with XSLT

6.1 The XSLT Specification

6.2 XSLT Processors

6.3 Defining Stylesheets

6.4 Using XSLT from the Command Line

6.5 XSLT Elements

6.6 A More Complex Example

6.7 Embedding XSLT Transformations in Python

6.8 Choosing a Technique

7. XML Validation and Dialects

7.1 Working with DTDs

7.2 Validation at Runtime

7.3 The BillSummary Example

7.4 Dialects, Frameworks, and Workflow

7.5 What Does ebXML Offer?

8. Python Internet APIs

8.1 Connecting Web Sites

8.2 Working with URLs

8.3 Opening URLs

8.4 Connecting with HTTP

8.5 Using the Server Classes

9. Python, Web Services, and SOAP

9.1 Python Web Services Support

9.2 The Emerging SOAP Standard

9.3 Python SOAP Options

9.4 Example SOAP Server and Client

9.5 What About XML-RPC?

10. Python and Distributed Systems Design

10.1 Sample Application and Flow Analysis

10.2 Understanding the Scope

10.3 Building the Database

10.4 Building the Profiles Access Class

10.5 Creating an XML Data Store

10.6 The XML Switch

10.7 Running the XML Switch

10.8 A Web Application

A. Installing Python and XML Tools

A.1 Installing Python

A.2 Installing PyXML

A.3 Installing 4Suite

B. XML Definitions

B.1 XML Definitions

C. Python SAX API

D. Python DOM API

D.1 4DOM Extensions

E. Working with MSXML3.0

E.1 Setting Up MSXML3.0

E.2 Basic DOM Operations

E.3 MSXML3.0 Support for XSLT

E.4 Handling Parsing Errors

E.5 MSXML3.0 Reference

F. Additional Python XML Tools

F.1 Pyxie

F.2 Python XML Tools

F.3 XML Schema Validator

F.4 Sab-pyth

F.5 Redfoot

F.6 XML Components for Zope

F.7 Online Resources

Colophon

Dedication

We would like to dedicate this book to Frank Willison, O'Reilly Editor-in-Chief and Python Champion

——Christopher A. Jones and Fred L. Drake, Jr.

Frank will be remembered in the Python community for the several great

Python books that he made possible, memories of his participation in

many Python conferences, and his Frankly Speaking columns. The Python

world (and the world at large) won't be the same without Frank.

——Guido van Rossum, Python creator

Preface

This book comes to you as a result of the collaboration of two authors who became

interested in the topic in very different ways. Hopefully our motivations will help you

understand what we each bring to the book, and perhaps prove to be at least a little

entertaining as well.

Chris Jones started using XML several years ago, and began using Python more recently.

As a consultant for major companies in the Seattle area, he first used XML as the core

data format for web site content in a home-grown publishing system in 1997. But he

really became an XML devotee when developing an open source engine, which

eventually became the key technology for Planet 7 Technologies. As a consultant, he

continues to use XML on an almost daily basis for everything from configuration files to

document formats.

Chris began dabbling in Python because he thought it was a clean, object-oriented

alternative to Perl. A long-time Unix user (but one who frequently finds himself working

with Windows in Seattle), he has grown accustomed to scripting languages that place the

full Unix API in the hands of developers. Having used far too much Java and ASP in web

development over the years, he found Python a refreshing way to keep object-orientation

while still accessing Unix sockets and threads—all with the convenience of a scripting

language.

The combination of Python and XML brings great power to the developer. While XML is

a potent technology, it requires the programmer to use objects, interfaces, and strings.

Python does so as well, and therefore provides an excellent playpen for XML

development. The number of XML tools for Python is growing all the time, and Chris

can produce an XML solution in far less time using Python than he can with Java or C++.

Of course, the cross-platform nature of Python keeps our work consistently usable

whether we're developing on Windows, Linux, or a Unix variant—the combination of

which we both seem to find powerful.

Fred Drake came to Python and XML from a different avenue, arriving at Python before

XML. He discovered Python while in graduate school experimenting with a number of

programming languages. After recognizing Python as an excellent language for rapid

development, he convinced his advisors that he should be able to write his master's

project using Python. In the course of developing the project, he became increasingly

interested in the Python community. He then made his first contributions to the Python

standard library, and in so doing became noticed by a group of Python programmers

working on distributed systems projects at the research organization of CNRI. The group

was led by Guido van Rossum, the creator of Python. Fred joined the team and learned

more about distributed systems and gluing systems together than he ever expected

possible, and he loved it.

While still in graduate school, Fred argued that Python's documentation should be

converted to a more structured language called SGML. After a few years at CNRI, he

began to do just that, and was able to sink his teeth into the documentation more

vigorously. The SGML migration path eventually changed to an XML migration path as

XML acceptance grew. Though that goal has not yet been achieved (he is still working on

it), Fred has substantially changed the way the documentation is maintained, and it now

represents one of the most structured applications of the typesetting and document

markup system developed by Donald Knuth and Leslie Lamport.

Over time, the team from CNRI became increasingly focused on the development of

Python, and moved on to form PythonLabs. Fred remained active in XML initiatives

around Python and pushed to add XML support to the standard library. Once this was

achieved, he returned to the task of migrating the Python documentation to XML, and

hopes to complete this project soon.

Audience

This book is for anyone interested in learning about using Python to build XML

applications. The bulk of the material is suited for programmers interested in using XML

as a data interchange format or as a transformable format for web content, but the first

half of the book is also useful to those interested in building more document-oriented

applications.

We do not assume that you know anything about XML, but we do assume that you have

looked at Python enough that you are comfortable reading straightforward Python code;

however, you do not need to be a Python guru. If you do not know at least a little Python,

please consult one of the many excellent books that introduce the language, such as

Learning Python, by Mark Lutz and David Ascher (O'Reilly, 1999). For the

sections where web applications are developed, it helps to be familiar with general

concepts related to web operations, such as HTTP and HTML forms, but sufficient

information is included to get you started with basic CGI scripting.

Organization

This book is divided into ten chapters and six appendixes, as follows:

Chapter 1

This chapter offers a broad overview of XML and why Python is particularly

well-suited to XML processing.

Chapter 2

This chapter provides a good introduction to XML for newcomers and a refresher

for programmers who have some familiarity with the standard.

Chapter 3

This chapter gives a detailed introduction to using Python with the SAX interface,

for generating parse events from an XML data stream.

Chapter 4

This chapter provides an introduction to working with DOM, which is the

dominant object-oriented, tree-based API to an XML document.

Chapter 5

This chapter discusses using a traversal language to extract portions of documents

that meet your application's requirements.

Chapter 6

This chapter details using XSLT to perform transformations on XML documents.

Chapter 7

This chapter discusses validating XML generated from other sources.

Chapter 8

This chapter provides an overview of Python's high-level support for Internet

protocols, including tools for building both clients and servers for HTTP.

Chapter 9

This chapter offers discussion of and examples showing how to build and use web

services with Python.

Chapter 10

This chapter is an extended example that shows a variety of approaches to

applying Python in constructing an XML-based distributed system.

Appendix A

This appendix provides instructions on installing Python and the major XML

packages used throughout this book.

Appendix B

This appendix gives a list of definitions from the XML specification and a Python

script to extract them from the specification itself.

Appendix C

This appendix offers detailed API information for using the dominant event-based

XML interface in Python.

Appendix D

This appendix provides detailed interface documentation for using the standard

tree-oriented API for XML from Python.

Appendix E

This appendix gives information on Microsoft's XML libraries available for

Python.

Appendix F

This appendix is a summary of the many additional tools that are available for

using XML with Python, and a list of starting points for additional information on

the Web.

Conventions Used in This Book

The following typographical conventions are used throughout this book:

Bold

Used for the occasional reference to labels in graphical user interfaces, as well as

user input.

Italic

Used for commands, URLs, filenames, file extensions, directory or folder names,

emphasis, and new terms where they are defined.

Constant width

Used for constructs from programming languages, HTML, and XML, both within

running text and in listings.

Constant width italic

Used for general placeholders that indicate that an item should be replaced by

some actual value in your own program. Most importantly, this font is used for

formal parameters when discussing the signatures of API methods.

How to Contact Us

We have tested and verified all the information in this book to the best of our abilities,

but you may find that features have changed or that we have let errors slip through the

production of the book. Please let us know of any errors that you find, as well as

suggestions for future editions, by writing to:

O'Reilly & Associates, Inc.

1005 Gravenstein Highway North

Sebastopol, CA 95472

1-800-998-9938 (in the United States or Canada)

1-707-829-0515 (international/local)

1-707-829-0104 (fax)

You can also send us messages electronically. To be put on the mailing list or to request a

catalog, send email to:

[email protected]

To ask technical questions or comment on the book, send email to:

[email protected]

We have a web site for the book, where we'll list examples, errata, and any plans for

future editions. You can access this page at:

http://www.oreilly.com/catalog/pythonxml/

For more information about this book and others, see the O'Reilly web site:

http://www.oreilly.com/

Acknowledgments

While it is impossible to individually acknowledge everyone that had a hand in getting

this book from an idea to the printed work you now hold in your hand, we would like to

recognize and thank a few of these special people.

We are both very grateful for the support of our families, without which this would not

have even gotten started. Chris would like to thank his family (Barb, Miles, and

Katherine); without their support he would never get any writing completed, ever. Fred

owes a great deal of gratitude to his wife (Cathy), who spent many a lonely evening

wondering if he'd remember to come to bed. His children (William, Christopher, and

Erin) made sure he didn't forget why he spends so much time on all this. Those late-night

trips to the coffee shop with Erin will never be forgotten!

We'd especially like to thank Guido van Rossum and Fred's compatriots at PythonLabs

(Tim Peters, Jeremy Hylton, and Barry Warsaw) for making sure Python could grow to

be such a wonderful tool for building applications, and for leading the incredible

community efforts which have gone into both Python itself and the excellent selection of

additional packages of Python code.

Python's development has been beleaguered by regular employment changes, but we all

owe a debt of gratitude to the employers of the contributors and the PythonLabs team.

Now at Zope Corporation (formerly Digital Creations), PythonLabs has finally found a

home that offers both a rich environment for Python and a comfortable place to settle

down. Previous employers of Python's lead developers, including the Corporation for

National Research Initiatives (CNRI) and Stichting Mathematisch Centrum, deserve

credit for allowing Python to germinate and blossom.

Our reviewers' efforts were invaluable and made this book what it is today. (They were

helpful, and showed great faith in our ability to pull this off, even when we weren't so

sure.) Martin von Löwis, Paul Prescod, Simon St.Laurent, Greg Wilson, and Frank

Willison all contributed generously of their time and helped to ensure that our mistakes

were noticed. The feedback they provided, both from a development and from a technical

support perspective, was invaluable. Any mistakes in the finished book are our own. Fred

Drake, who began working on this project as a technical reviewer, must still answer for

any mistakes he's introduced!

Many people at O'Reilly played an important part in the development of this book, and

without the help of their editorial staff, this book would seem rambling and incoherent

(well, more so at least!). Laura Lewin deserves special recognition. Without her editorial

skill and faith in our ability to present the important aspects of our subject, you wouldn't

be reading this; her penchant for reminding us of the big picture when we became mired

in the particulars of topics kept us on track and focused. Frank Willison deserves a great

deal of credit not only for bringing Laura to O'Reilly, but for shepherding O'Reilly's

efforts to bring together their line of books on Python; we'll all miss him. Finally, we'd

like to thank the production staff at O'Reilly for their hard work in getting the book to

print.

Chapter 1. Python and XML

Python and XML are two very different animals, each with a rich history. Python is a

full-scale programming language that has grown from scripting world roots in a very

organic way, through the vision and guidance of Python's inventor, Guido van Rossum.

Guido continues to take into account the needs of Python developers as Python matures.

XML, on the other hand, though strongly impacted by the ideas of a small cadre of

visionaries, has grown from standards-committee roots. It has seen both quiet adoption

and wrenching battles over its future. Why bother putting the two technologies together?

Before the Python/XML combination, there seemed no easy or effective way to work

with XML in a distributed environment. Developers were forced to rely on a variety of

tools used in awkward combination with one another. We used shell scripting and Perl to

process text and interact with the operating system, and then used Java XML APIs for

processing XML and network programming. The shell provided an excellent means of

file manipulation and interaction with the Unix system, and Perl was a good choice for

simple text manipulation, providing access to the Unix APIs. Unfortunately, neither

sported a sophisticated object model. Java, on the other hand, featured an object-oriented

environment, a robust platform API for network programming, threads, and graphical

user interface (GUI) application development. But with Java, we found an immediate

lack of text manipulation power; scripting languages typically provided strong text

processing. Python presented a perfect solution, as it combines the strengths of all of

these various options.

Like most scripting languages, Python features excellent text and file manipulation

capabilities. Yet, unlike most scripting languages, Python sports a powerful object-oriented

environment with a robust platform API for network programming, threads, and

graphical user interface development. It can be extended with components written in C

and C++ with ease, allowing it to be connected to most existing libraries. To top it off,

Python has been shown to be more portable than other popular interpreted languages,

running comfortably on platforms ranging from massively parallel Connection Machines to

personal digital assistants and other embedded systems. As a result, Python is an

excellent choice for XML programming and distributed application development.

It could be said that Python brings sanity and robustness to the scripting world, much in

the same way that Java once did to the C++ world. As always, there are trade-offs. In

moving from C++ to Java, you find a simpler language with stronger object-oriented

underpinnings. Changing to a simpler language further removed from the low-level

details of memory management and the hardware, you gain robustness and an improved

ability to locate coding errors. You also encounter a rich API equipped with easy thread

management, network programming, and support for Internet technologies and protocols.

As may be expected, this flexibility comes at a cost: you also encounter some reduced

performance when comparing it with languages such as C and C++.

Likewise, when choosing a scripting language such as Python over C, C++, or even Java,

you do make some concessions. You trade performance for robustness and for the ability

to develop more rapidly. In the area of enterprise and Internet systems development,

choosing reliable software, flexible design, and rapid growth and deployment are factors

that outweigh the performance gains you might get by using a language such as C++. If

you do need some of the performance back, you can still implement speed-sensitive

components of your application in C or C++, but you can avoid doing so until you have

profiling data to help you pinpoint what is really a problem and what only might be a

problem. (How to perform the analysis and write extensions in C/C++ is a topic for other

books.)

Regardless of your feelings on scripting languages, Java, or C++, this book focuses on

XML and the Python language. For those who are new to XML, we will start with an

overview of why it is interesting, and then we'll move on to using it from Python and

seeing how we make our XML applications easier to create.

1.1 Key Advantages of XML

XML has a few key advantages that make it the data language of choice on the Internet.

These advantages were designed into XML from the beginning, and, in fact, are what

make it so appealing to Internet developers.

1.1.1 Application Neutrality

First, XML is both human- and machine-readable. This is not a subtle point. Have you

ever tried to read a Microsoft Word document with a text editor? You can't if it was saved

as a .doc file, because the information in a .doc document is in a binary (computer

readable only) format, even though most Word documents primarily consist of text. A

Word document cannot be shared with any other application besides Word—unless that

application has been taught the intricacies of Word's binary format. In this case, the

application must also be taught to expect changes in Word's format each time there is a

new release from Microsoft.

This sounds annoying for the developer, but how bad is it, really? After all, Word is

incredibly popular, so it must not be too hard to figure out. Let's look at the top of the

Word file that contains this chapter:

Ï_ࡱ_á > _ ÿ _ _ B_ _ D_ _

ÿÿÿ ?_ @_ A_ ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ

ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ

ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ

ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ

ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ

ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ

ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿì¥Á 7 _ _¿ _ _ >_ _

bjbjU_U_ __ 0¸_ 7| 7| W_ _ C

ÿÿ_ ÿÿ_ ÿÿ_ l Ê_

Ê_ Ê_ Ê_ Ê_ Ê_ Ê_ ¶ _

This certainly looks familiar to anyone who has ever opened a Word file with a text

editor. We don't see our recognizable text (the content we intended) so we must assume it

is buried deep in the file. Determining what the true content is and where it is can be

difficult, but it shouldn't be. It is our data, after all. Let's try another supported format:

"Rich Text Format," or RTF. Unlike the .doc file, this format is text-based, and should

therefore be a bit easier to decipher. We search down in the file to find the start of our

text:

\par }\pard \s34\qr

\li0\ri0\sb80\sa480\sl240\slmult0\widctlpar\aspalpha\aspnum\faauto\out

linelevel0\widctlpar\aspalpha\aspnum\faauto\outlinelevel0\pnrauth1\pnr

date-967302179\pnrnot1\adjustright\rin0\lin0\itap0 {\b0\fs48 Combining

Python and XML}{

\b0\deleted\fs48\revauthdel1\revdttmdel-2041034726 Fundamentals}{\b0\f

s48\revised\revauth1\revdttm-2041034726 ?}{\b0\fs48

\par }\pard\plain \qj

This is better. The chapter title is visible, so we can try to decipher the structure from that

point forward. The markup appears to be complex, and there's a hint of an old version of

the chapter title. To extract the text we actually want, we need to understand the Word

model for revision tracking, which still presents many challenges.

XML, on the other hand, is application-neutral. In other words, an XML document is

usually processed by an XML parser or processor, but if one is not available, an XML

document can be easily read and parsed. Data kept in XML is not trapped within the

constraints of one particular software application. The ability to read rich data files can

become very valuable when, for example, 20 years from now, you dig up a CD-ROM of

old business forms that you suddenly find you need again. Will QuickBooks still allow

you to extract this same data in 2021? With XML, you can read the data with any text

editor.

Let's look at this chapter in XML. Using markup from a common document type for

software manuals and documentation (DocBook), it appears somewhat verbose, and

doesn't include change-tracking information, but we can identify the text quite easily

now:

<chapter>

<title>Python and XML</title>

<para>Python and XML are two very different animals, each with a

rich history. Python is a full-scale programming language that has grown

from scripting world roots, and has done so in a very organic way

Note that additional characters appear in the document (other than the document content);

these are called markup (or tags). We saw this in the RTF version of the document as

well, but there were many more bits of text that were difficult to decipher, and we can

reasonably surmise that the strange data in the MS Word document would correspond to

this in some way. Were this a book on RTF, you would quickly surmise two things: RTF

is much more like a printer control language than the example of XML we just looked at,

and writing a program that understands RTF would be quite difficult. In this book, we're

going to show you that XML can be used to define languages that fit your application,

and that creating programs that can decipher XML is not a difficult task, especially with

the help of Python.
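As a small illustration of that last claim, the sketch below pulls the chapter title out of a DocBook fragment like the one shown earlier. It assumes Python 3 and the standard library's xml.dom.minidom module. The fragment has been closed off by hand so that it is well-formed, and the helper name text_of is our own invention rather than part of any API.

```python
from xml.dom import minidom

# A well-formed version of the DocBook excerpt (closing tags added by hand).
CHAPTER = """<chapter>
<title>Python and XML</title>
<para>Python and XML are two very different animals, each with a
rich history.</para>
</chapter>"""


def text_of(node):
    """Concatenate the character data found directly inside an element."""
    return "".join(
        child.data for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    )


doc = minidom.parseString(CHAPTER)
title = doc.getElementsByTagName("title")[0]
chapter_title = text_of(title)
print(chapter_title)
```

Nothing here depends on knowing how the document was produced. The element names alone are enough to find the content, which was not true of the Word or RTF versions.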

1.1.2 Hierarchical Structure

XML is hierarchical, and allows you to choose your own tag names. This is quite

different from HTML. In XML, you are free to create elements of any type, and stack

other elements within those elements. For example, consider an address entry:

<?xml version="1.0"?>

<address>

<name>Bubba McBubba</name>

<street>123 Happy Go Lucky Ln.</street>

<city>Seattle</city><state>WA</state><zip>98056</zip>

</address>

In the above well-formed XML code, I came up with a few record names and then

lumped them together with data. XML processing software, such as a parser (which you

use to interpret the syntactic constructs in an XML document), would be able to represent

this data in many ways, because its structure has been communicated. For example, if we

were to look at what an application programmer might write in source code, we could

turn this record into an object initialized this way:

addr = Address()

addr.name = "Bubba McBubba"

addr.street = "123 Happy Go Lucky Ln."

addr.city = "Seattle"

addr.state = "WA"

addr.zip = "98056"

This approach makes XML well-suited as a format for many serialized objects. (There

are some constructs for which XML is not so well suited, including many formats for

large numerical datasets used in scientific computing.) XML's hierarchical structure

makes it easy to apply the concept of object interfaces to documents—it's quite simple to

build application-specific objects directly from the information stream, given mappings

from element names to object types. We later see that we can model more than simple

hierarchical structures with XML.
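One way to picture the mapping from element names to object types is the sketch below. It assumes Python 3 and the standard library's xml.etree.ElementTree module. The Address class, the ELEMENT_TYPES table, and the build_object function are hypothetical names chosen for illustration. The sketch handles only a flat record such as the address entry above.

```python
import xml.etree.ElementTree as ET

ADDRESS_XML = """<?xml version="1.0"?>
<address>
<name>Bubba McBubba</name>
<street>123 Happy Go Lucky Ln.</street>
<city>Seattle</city><state>WA</state><zip>98056</zip>
</address>"""


class Address:
    """Plain container; attributes are filled in from child elements."""


# Hypothetical mapping from element names to object types.
ELEMENT_TYPES = {"address": Address}


def build_object(xml_text):
    """Create the object registered for the root element and populate it."""
    root = ET.fromstring(xml_text)
    obj = ELEMENT_TYPES[root.tag]()
    for child in root:
        # Each child element becomes an attribute named after its tag.
        setattr(obj, child.tag, child.text)
    return obj


addr = build_object(ADDRESS_XML)
print(addr.name, addr.city, addr.zip)
```

A real application would probably validate the element names before turning them into attributes. The sketch skips that step to keep the correspondence between the two listings easy to see.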

1.1.3 Platform Neutrality

Remember that XML is cross-platform. While this is mainly a feature of its text-based

format, it's still very much true. The use of certain text encodings ensures that there are

no misconceptions among platforms as to the arrangement of an XML document.

Therefore, it's easy to pass an XML purchase order from a Unix machine to a wireless

personal digital assistant. XML is designed for use in conjunction with existing Internet

infrastructure using HTTP, SSL, and other messaging protocols as they evolve. These

qualities make XML lend itself to distributed applications; it has been successfully used

as a foundation for message queuing systems, instant messaging applications, and remote

procedure call frameworks. We examine these applications further in Chapter 9 and

Chapter 10. It also means that the document example given earlier is more than simply

application-neutral, and can be readily moved from one type of machine to another

without loss of information. A chapter of a technical book can be written by a

programmer on his or her favorite flavor of Unix, and then sent to a publisher using book

composition software on a Macintosh. The many difficult format conversions can be

avoided.

1.1.4 International Language Support

As the Internet becomes increasingly pervasive in our daily lives, we become more aware

of the world around us — it is a culture-rich and diversified place. As technologists,

however, we are still learning the significance of making our software work in ways that

support more than one language at a time; making our text-processing routines "8-bit

safe" is not only no longer sufficient, it's no longer even close.

Standards bodies all over the world have come up with ways that computers can

interchange text written in their national languages, and sometimes they've come up with

several, each having varying degrees of acceptance. Unfortunately, most applications do

not include information about which language or interchange standard their data is

written in, so it is difficult to share information across the cultural and linguistic

boundaries the different standards represent. Sometimes it is difficult to share information

within such boundaries if multiple standards are prominent.

The difficulties are compounded by very substantial cultural differences that present

themselves about how text is handled. There are many different writing systems in

addition to the western European left-to-right, top-to-bottom style in which this book is

written; right-to-left is not uncommon, and top-to-bottom "lines" of text arranged

right-to-left on the page is used in China. Hebrew uses a right-to-left writing system, but

numbers are written using Arabic numerals from left to right. Other systems support

textual annotations written in parallel with the text. Consider what happens when a

document includes text from different writing systems!

Standards bodies are aware of this problem, and have been working on solutions for

years. The editors of the XML specification have wisely avoided proposing new solutions

to most of these issues, and are instead choosing to build on the work of experts on the

topic and existing standards.

The International Organization for Standardization (ISO) and the Unicode Consortium

(http://www.unicode.org/) have arrived at a single standard that, while not perfect, is

perhaps the most capable standard attempting to unify the world's text representations,

with the intent that all languages and alphabets (including ideographic and hieroglyphic

character sets) are representable. The standard is known as ISO/IEC 10646, or more

commonly, Unicode. Not all national standards bodies have agreed that Unicode is the

standard for all future text interchange applications, especially in Asia, but there is

widespread belief that Unicode is the best thing available to serve everyone. The standard

deals with issues including multidirectional text, capitalization rules, and encoding

algorithms that can be used to ensure various properties of data streams. The standard

does not deal specifically with language issues that are not tied intimately to character

issues. Software sensitive to natural language may still need to do a lot beyond using

Unicode to ensure proper collation of names in a particular language (or multiple

languages!). Some languages will require substantial additional support for proper text

rendering (Arabic, for instance, which requires different letterforms for characters based

on their position within a word and based on neighboring letterforms).
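Some of the character properties mentioned above can be inspected directly. The sketch below assumes Python 3 and its standard unicodedata module; the sample characters and variable names are our own choices. It looks up the bidirectional class the standard assigns to a few characters, and shows one capitalization rule that is not a simple one-to-one mapping.

```python
import unicodedata

samples = {
    "Latin A": "A",
    "Hebrew alef": "\u05d0",
    "Arabic alef": "\u0627",
    "European digit": "1",
}

# Bidirectional class of each character, as recorded in the Unicode database.
bidi = {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}
for label, cls in bidi.items():
    print(label, cls)

# Capitalization is not always one character to one character:
# the German sharp s uppercases to two letters.
upper_sharp_s = "\u00df".upper()
print(upper_sharp_s)
```

These lookups only expose the properties. Laying out mixed-direction text or choosing Arabic letterforms still requires the additional rendering support described above.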

The World Wide Web Consortium (W3C) made a simple and masterful stroke to make it

easier to use both the older interchange standards and Unicode. It required that all XML

documents be Unicode, and specified that they must describe their own encoding in such

a way that all XML processors were able to determine what encoding the document was

written in. A few specific encodings must be recognized by all processors, so that it is

always possible to generate XML that can be read anywhere and represent all of the

world's characters. There is also a feature that allows the content of XML documents to

be labeled with the actual language it is written in, but that's not used as much as it could

be at this time.

Since XML documents are Unicode documents, the languages of the world are supported.

The use of Unicode and encodings in XML is discussed in some detail in Chapter 2.

Unicode strings have been a part of Python since Version 2.0, and the Python standard

library includes support for a large number of encodings.
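To make the self-describing encoding concrete, here is a minimal sketch, assuming Python 3 and the standard library's xml.etree.ElementTree module. The greeting document is invented for the example. The same text is written out in two different encodings, each stream declares its own encoding, and the parser recovers identical Unicode text from both. The sketch also reads the xml:lang label.

```python
import xml.etree.ElementTree as ET

# Namespace that the reserved xml: prefix is bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

text = ('<?xml version="1.0" encoding="iso-8859-1"?>\n'
        '<greeting xml:lang="fr">Caf\u00e9</greeting>')
latin1_bytes = text.encode("iso-8859-1")
utf8_bytes = text.replace("iso-8859-1", "utf-8").encode("utf-8")

# The byte streams differ, but each declares its own encoding,
# so the parser recovers the same Unicode text from both.
root_latin1 = ET.fromstring(latin1_bytes)
root_utf8 = ET.fromstring(utf8_bytes)
language = root_latin1.get("{%s}lang" % XML_NS)
print(root_latin1.text == root_utf8.text, language)
```

The application never has to guess the encoding, because the declaration travels with the document.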

1.2 The XML Specifications

In the trade press, we often see references to how XML "now supports" some

particular industry-specific application. The article that follows is often confused,

offering some small morsel of information about an industry consortium that has released

a new specification for an XML-based language to support interoperability of data within

the consortium's industry. As technical people, we usually note that it doesn't apply to the

industries we're involved in, or else it does, but the specification is too early a draft to be

useful. In fact, our managers will probably agree with us most of the time, or they'll be

privy to some relevant information that causes them to disagree. If we step up the

corporate ladder a couple more rungs, however, we often find an increase in the level of

confusion over XML. Sometimes, this is accompanied by either a call to "adopt XML"

(too often with a list of particular specifications that are not intended to be used together),

or a reaction that XML is too immature to use at all.

So we need to think about just what we can work with that will meet the following

criteria:

• It must make technical sense for our application.

• It should be sufficiently well-defined that implementation is possible.

• It must be able to be explained and justified to (at least) our direct managers.

• It won't freak out the upper management.

OK, we're technical people, so we may have to ignore that last item; it certainly won't be

covered in this book. In fact, most of this really can't be covered in technical material.

There are many specifications in various stages of maturity, and most are specific to one

industry or another. However, we can point out what the foundation specifications are,

because those you will need regardless of your industry or other requirements.

1.2.1 XML 1.0 Recommendation

The XML specification itself is a document created and maintained by the W3C. As of

this writing, the current version is Extensible Markup Language (XML) 1.0 (Second

Edition), and is available from the W3C web site at http://www.w3.org/TR/REC-xml.

(The second edition differs from the first only in that some editorial corrections and

clarifications have been made; the specification is stable.)

XML itself is not a markup language, but a meta-language that can be used to define

specific markup languages. In this, it inherits much from SGML. The specification covers

five aspects of markup languages:

• Range of structural forms which can be marked

• Specific syntax of markup components

• A schema language used to define specific languages

• Definition of validity constraints

• Minimum requirements for processing tools
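The syntax rules in this list are what a parser enforces when it reads a document. The following sketch assumes Python 3 and the standard library's xml.dom.minidom module; the function name is_well_formed is hypothetical. It shows a parser accepting a well-formed fragment and rejecting one with a missing end tag. It checks syntax only, because the parser used here does not validate documents against a schema.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False on a syntax error."""
    try:
        minidom.parseString(xml_text)
    except ExpatError:
        return False
    return True


good = is_well_formed("<address><city>Seattle</city></address>")
bad = is_well_formed("<address><city>Seattle</address>")  # <city> never closed
print(good, bad)
```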
