
Dive Into Python


Chapter 8. HTML Processing

8.1. Diving in

I often see questions on comp.lang.python like “How can I list all the [headers|images|links] in my HTML document?” “How do I parse/translate/munge the text of my HTML document but leave the tags alone?” “How can I add/remove/quote attributes of all my HTML tags at once?” This chapter will answer all of these questions.
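As a rough illustration of the first question (listing all the links), here is a minimal sketch. It is not the chapter's own code: it assumes Python 3, where sgmllib is no longer available, and uses the standard library's html.parser.HTMLParser in its place. The names LinkLister and list_links are hypothetical.

```python
from html.parser import HTMLParser


class LinkLister(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; tag names arrive lowercased
        if tag == "a":
            self.urls.extend(value for name, value in attrs if name == "href")


def list_links(html_text):
    """Return the href values found in html_text, in document order."""
    parser = LinkLister()
    parser.feed(html_text)
    parser.close()
    return parser.urls


sample = '<p><a href="index.html">Home</a> and <a href="toc.html">Contents</a></p>'
print(list_links(sample))  # ['index.html', 'toc.html']
```

The same pattern should extend to images or headers by testing for a different tag name in handle_starttag.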

Here is a complete, working Python program in two parts. The first part, BaseHTMLProcessor.py, is a generic tool to help you process HTML files by walking through the tags and text blocks. The second part, dialect.py, is an example of how to use BaseHTMLProcessor.py to translate the text of an HTML document but leave the tags alone. Read the doc strings and comments to get an overview of what's going on. Most of it will seem like black magic, because it's not obvious how any of these class methods ever get called. Don't worry, all will be revealed in due time.
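To take some of the mystery out of how handler methods get called, the sketch below records every callback a parser makes while it walks a small fragment. It assumes Python 3 and html.parser.HTMLParser as a stand-in for SGMLParser, so the method names (handle_starttag, handle_endtag) differ from the unknown_starttag and unknown_endtag methods used in this chapter. CallTracer is a hypothetical name.

```python
from html.parser import HTMLParser


class CallTracer(HTMLParser):
    """Record each handler call the parser makes (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.calls = []

    def handle_starttag(self, tag, attrs):
        self.calls.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.calls.append(("end", tag))

    def handle_data(self, data):
        self.calls.append(("data", data))


tracer = CallTracer()
# feed() is what drives the callbacks: the parser scans the text and
# invokes the matching handler for each tag or text block it finds
tracer.feed('<pre class="screen">hello</pre>')
tracer.close()

for call in tracer.calls:
    print(call)
# ('start', 'pre', [('class', 'screen')])
# ('data', 'hello')
# ('end', 'pre')
```

The subclass never calls its own handlers; the base class does so from inside feed(), which is why the methods in the listing that follows can look as if nothing ever invokes them.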

Example 8.1. BaseHTMLProcessor.py

If you have not already done so, you can download this and other examples used in this book.

from sgmllib import SGMLParser
import htmlentitydefs

class BaseHTMLProcessor(SGMLParser):
    def reset(self):
        # extend (called by SGMLParser.__init__)
        self.pieces = []
        SGMLParser.reset(self)

    def unknown_starttag(self, tag, attrs):
        # called for each start tag
        # attrs is a list of (attr, value) tuples
        # e.g. for <pre class="screen">, tag="pre", attrs=[("class", "screen")]
        # Ideally we would like to reconstruct original tag and attributes, but
        # we may end up quoting attribute values that weren't quoted in the source
        # document, or we may change the type of quotes around the attribute value
        # (single to double quotes).
        # Note that improperly embedded non-HTML code (like client-side Javascript)
        # may be parsed incorrectly by the ancestor, causing runtime script errors.
        # All non-HTML code must be enclosed in HTML comment tags (<!-- code -->)
        # to ensure that it will pass through this parser unaltered (in handle_comment).
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
        self.pieces.append("<%(tag)s%(strattrs)s>" % locals())

    def unknown_endtag(self, tag):
        # called for each end tag, e.g. for </pre>, tag will be "pre"
        # Reconstruct the original end tag.
        self.pieces.append("</%(tag)s>" % locals())

    def handle_charref(self, ref):
        # called for each character reference, e.g. for "&#160;", ref will be "160"
        # Reconstruct the original character reference.
        self.pieces.append("&#%(ref)s;" % locals())

    def handle_entityref(self, ref):
        # called for each entity reference, e.g. for "&copy;", ref will be "copy"
        # Reconstruct the original entity reference.
        self.pieces.append("&%(ref)s" % locals())
        # standard HTML entities are closed with a semicolon; other entities are not
        if htmlentitydefs.entitydefs.has_key(ref):
            self.pieces.append(";")

    def handle_data(self, text):
        # called for each block of plain text, i.e. outside of any tag and
        # not containing any character or entity references
        # Store the original text verbatim.
        self.pieces.append(text)

    def handle_comment(self, text):
        # called for each HTML comment, e.g. <!-- insert Javascript code here -->
        # Reconstruct the original comment.
        # It is especially important that the source document enclose client-side
        # code (like Javascript) within comments so it can pass through this
        # processor undisturbed; see comments in unknown_starttag for details.
        self.pieces.append("<!--%(text)s-->" % locals())

    def handle_pi(self, text):
        # called for each processing instruction, e.g. <?instruction>
        # Reconstruct original processing instruction.
        self.pieces.append("<?%(text)s>" % locals())

    def handle_decl(self, text):
        # called for the DOCTYPE, if present, e.g.
        # <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        #     "http://www.w3.org/TR/html4/loose.dtd">
        # Reconstruct original DOCTYPE
        self.pieces.append("<!%(text)s>" % locals())

    def output(self):
        """Return processed HTML as a single string"""
        return "".join(self.pieces)
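The listing above targets Python 2, where sgmllib and htmlentitydefs exist. For readers on Python 3, the following is a rough, unofficial port of the same round-trip idea. It assumes html.parser.HTMLParser and html.entities as substitutes, and the class name RoundTripProcessor is hypothetical. It is a sketch of the technique, not the book's code.

```python
from html.entities import entitydefs
from html.parser import HTMLParser


class RoundTripProcessor(HTMLParser):
    """Rebuild an HTML document piece by piece (hypothetical Python 3 port)."""

    def __init__(self):
        # convert_charrefs=False makes the parser report character and
        # entity references as separate events instead of decoding them
        super().__init__(convert_charrefs=False)

    def reset(self):
        # extend (called by HTMLParser.__init__)
        self.pieces = []
        super().reset()

    def handle_starttag(self, tag, attrs):
        # attribute values are always re-quoted with double quotes;
        # a valueless attribute would be written as name="None"
        strattrs = "".join(' %s="%s"' % (key, value) for key, value in attrs)
        self.pieces.append("<%s%s>" % (tag, strattrs))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % name)

    def handle_entityref(self, name):
        self.pieces.append("&%s" % name)
        # only standard HTML entities get the closing semicolon back
        if name in entitydefs:
            self.pieces.append(";")

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_comment(self, data):
        self.pieces.append("<!--%s-->" % data)

    def handle_pi(self, data):
        self.pieces.append("<?%s>" % data)

    def handle_decl(self, decl):
        self.pieces.append("<!%s>" % decl)

    def output(self):
        """Return processed HTML as a single string"""
        return "".join(self.pieces)


sample = '<p class="intro">caf&eacute; &#169; 2004</p><!-- note -->'
processor = RoundTripProcessor()
processor.feed(sample)
processor.close()
print(processor.output() == sample)  # True
```

The output matches the input here because the sample already uses lowercase tags and double-quoted attributes. As the comments in unknown_starttag point out, other inputs may come back with different quoting, so an exact round trip is not guaranteed in general.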

Example 8.2. dialect.py

import re
