Thư viện tri thức trực tuyến
Kho tài liệu với 50,000+ tài liệu học thuật
© 2023 Siêu thị PDF - Kho tài liệu học thuật hàng đầu Việt Nam

Tài liệu SQL Antipatterns- P5 ppt
Nội dung xem thử
Mô tả chi tiết
SOLUTION: USE THE RIGHT TOOL FOR THE JOB 201
inverted index is a list of all words one might search for. In a manyto-many relationship, the index associates these words with the text
entries that contain the respective word. That is, a word like crash can
appear in many bugs, and each bug may match many other keywords.
This section shows how to design an inverted index.
First, define a table Keywords to list keywords for which users search,
and define an intersection table BugsKeywords to establish a many-tomany relationship:
Download Search/soln/inverted-index/create-table.sql
CREATE TABLE Keywords (
keyword_id SERIAL PRIMARY KEY,
keyword VARCHAR(40) NOT NULL,
UNIQUE KEY (keyword)
);
CREATE TABLE BugsKeywords (
keyword_id BIGINT UNSIGNED NOT NULL,
bug_id BIGINT UNSIGNED NOT NULL,
PRIMARY KEY (keyword_id, bug_id),
FOREIGN KEY (keyword_id) REFERENCES Keywords(keyword_id),
FOREIGN KEY (bug_id) REFERENCES Bugs(bug_id)
);
Next, add a row to BugsKeywords for every keyword that matches the
description text for a given bug. We can use substring-match query
to determine these matches using LIKE or regular expressions. This is
nothing more costly than the naive searching method described in the
“Antipattern” section, but we gain efficiency because we only need to
perform the search once. If we save the result in the intersection table,
all subsequent searches for the same keyword are much faster.
Next, we write a stored procedure to make it easier to search for a
given keyword.3
If the word has already been searched, the query is
faster because the rows in BugsKeywords are a list of the documents
that contain the keyword. If no one has searched for the given keyword
before, we need to search the collection of text entries the hard way.
Download Search/soln/inverted-index/search-proc.sql
CREATE PROCEDURE BugsSearch(keyword VARCHAR(40))
BEGIN
SET @keyword = keyword;
3. This example stored procedure uses MySQL syntax.
Report erratum
this copy is (P1.0 printing, May 2010)
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
SOLUTION: USE THE RIGHT TOOL FOR THE JOB 202
➊ PREPARE s1 FROM 'SELECT MAX(keyword_id) INTO @k FROM Keywords
WHERE keyword = ?';
EXECUTE s1 USING @keyword;
DEALLOCATE PREPARE s1;
IF (@k IS NULL) THEN
➋ PREPARE s2 FROM 'INSERT INTO Keywords (keyword) VALUES (?)';
EXECUTE s2 USING @keyword;
DEALLOCATE PREPARE s2;
➌ SELECT LAST_INSERT_ID() INTO @k;
➍ PREPARE s3 FROM 'INSERT INTO BugsKeywords (bug_id, keyword_id)
SELECT bug_id, ? FROM Bugs
WHERE summary REGEXP CONCAT(''[[:<:]]'', ?, ''[[:>:]]'')
OR description REGEXP CONCAT(''[[:<:]]'', ?, ''[[:>]]'')';
EXECUTE s3 USING @k, @keyword, @keyword;
DEALLOCATE PREPARE s3;
END IF;
➎ PREPARE s4 FROM 'SELECT b.* FROM Bugs b
JOIN BugsKeywords k USING (bug_id)
WHERE k.keyword_id = ?';
EXECUTE s4 USING @k;
DEALLOCATE PREPARE s4;
END
➊ Search for the user-specified keyword. Return either the integer
primary key from Keywords.keyword_id or null if the word has not
been seen previously.
➋ If the word was not found, insert it as a new word.
➌ Query for the primary key value generated in Keywords.
➍ Populate the intersection table by searching Bugs for rows containing the new keyword.
➎ Finally, query the full rows from Bugs that match the keyword_id,
whether the keyword was found or had to be inserted as a new
entry.
Now we can call this stored procedure and pass the desired keyword.
The procedure returns the set of matching bugs, whether it has to
calculate the matching bugs and populate the intersection table for a
new keyword or whether it simply benefits from the result of an earlier
search.
Download Search/soln/inverted-index/search-proc.sql
CALL BugsSearch('crash');
Report erratum
this copy is (P1.0 printing, May 2010)
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
SOLUTION: USE THE RIGHT TOOL FOR THE JOB 203
There’s another piece to this solution: we need to define a trigger to
populate the intersection table as each new bug is inserted. If you need
to support edits to bug descriptions, you may also have to write a trigger
to reanalyze text and add or delete rows in the BugsKeywords table.
Download Search/soln/inverted-index/trigger.sql
CREATE TRIGGER Bugs_Insert AFTER INSERT ON Bugs
FOR EACH ROW
BEGIN
INSERT INTO BugsKeywords (bug_id, keyword_id)
SELECT NEW.bug_id, k.keyword_id FROM Keywords k
WHERE NEW.description REGEXP CONCAT('[[:<:]]', k.keyword, '[[:>:]]')
OR NEW.summary REGEXP CONCAT('[[:<:]]', k.keyword, '[[:>:]]');
END
The keyword list is populated naturally as users perform searches, so
we don’t need to fill the keyword list with every word found in the
knowledge-base articles. On the other hand, if we can anticipate likely
keywords, we can easily run a search for them, thus bearing the initial
cost of being the first to search for each keyword so that doesn’t fall on
our users.
I used an inverted index for my knowledge-base application that I described at the start of this chapter. I also enhanced the Keywords table
with an additional column num_searches. I incremented this column
each time a user searched for a given keyword so I could track which
searches were most in demand.
You don’t have to use SQL to solve every problem.
Report erratum
this copy is (P1.0 printing, May 2010)
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
Enita non sunt multiplicanda praeter necessitatem
(Latin, “entities are not to be multiplied beyond necessity”).
William of Ockham
Chapter 18
Spaghetti Query
Your boss is on the phone with his boss, and he waves to you to come
over. He covers his phone receiver with his hand and whispers to you,
“The executives are in a budget meeting, and we’re going to have our
staff cut unless we can feed my VP some statistics to prove that our
department keeps a lot of people busy. I need to know how many products we work on, how many developers fixed bugs, the average bugs
fixed per developer, and how many of our fixed bugs were reported by
customers. Right now!”
You leap to your SQL tool and start writing. You want all the answers at
once, so you make one complex query, hoping to do the least amount
of duplicate work and therefore produce the results faster.
Download Spaghetti-Query/intro/report.sql
SELECT COUNT(bp.product_id) AS how_many_products,
COUNT(dev.account_id) AS how_many_developers,
COUNT(b.bug_id)/COUNT(dev.account_id) AS avg_bugs_per_developer,
COUNT(cust.account_id) AS how_many_customers
FROM Bugs b JOIN BugsProducts bp ON (b.bug_id = bp.bug_id)
JOIN Accounts dev ON (b.assigned_to = dev.account_id)
JOIN Accounts cust ON (b.reported_by = cust.account_id)
WHERE cust.email NOT LIKE '%@example.com'
GROUP BY bp.product_id;
The numbers come back, but they seem wrong. How did we get dozens
of products? How can the average bugs fixed be exactly 1.0? And it
wasn’t the number of customers; it was the number of bugs reported
by customers that your boss needs. How can all the numbers be so far
off? This query will be a lot more complex than you thought.
Your boss hangs up the phone. “Never mind,” he sighs. “It’s too late.
Let’s clean out our desks.”
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
OBJECTIVE: DECREASE SQL QUERIES 205
18.1 Objective: Decrease SQL Queries
One of the most common places where SQL programmers get stuck is
when they ask, “How can I do this with a single query?” This question is
asked for virtually any task. Programmers have been trained that one
SQL query is difficult, complex, and expensive, so they reason that two
SQL queries must be twice as bad. More than two SQL queries to solve
a problem is generally out of the question.
Programmers can’t reduce the complexity of their tasks, but they want
to simplify the solution. They state their goal with terms like “elegant”
or “efficient,” and they think they’ve achieved those goals by solving the
task with a single query.
18.2 Antipattern: Solve a Complex Problem in One Step
SQL is a very expressive language—you can accomplish a lot in a single
query or statement. But that doesn’t mean it’s mandatory or even a
good idea to approach every task with the assumption it has to be done
in one line of code. Do you have this habit with any other programming
language you use? Probably not.
Unintended Products
One common consequence of producing all your results in one query
is a Cartesian product. This happens when two of the tables in the
query have no condition restricting their relationship. Without such a
restriction, the join of two tables pairs each row in the first table to
every row in the other table. Each such pairing becomes a row of the
result set, and you end up with many more rows than you expect.
Let’s see an example. Suppose we want to query our bugs database to
count the number of bugs fixed, and the number of bugs open, for a
given product. Many programmers would try to use a query like the
following to calculate these counts:
Download Spaghetti-Query/anti/cartesian.sql
SELECT p.product_id,
COUNT(f.bug_id) AS count_fixed,
COUNT(o.bug_id) AS count_open
FROM BugsProducts p
LEFT OUTER JOIN Bugs f ON (p.bug_id = f.bug_id AND f.status = 'FIXED')
LEFT OUTER JOIN Bugs o ON (p.bug_id = o.bug_id AND o.status = 'OPEN')
WHERE p.product_id = 1
GROUP BY p.product_id;
Report erratum
this copy is (P1.0 printing, May 2010)
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.