Seven Databases in Seven Weeks
What Readers Are Saying About
Seven Databases in Seven Weeks, Second Edition
Choosing a database is perhaps one of the most important architectural decisions
a developer can make. Seven Databases in Seven Weeks provides a fantastic tour
of different technologies and makes it easy to add each to your engineering toolbox.
➤ Dave Parfitt
Senior Site Reliability Engineer, Mozilla
By comparing each database technology to a tool you’d find in any workshop, the
authors of Seven Databases in Seven Weeks provide a practical and well-balanced
survey of a very diverse and highly varied datastore landscape. Anyone looking
to get a handle on the database options available to them as a data platform
should read this book and consider the trade-offs presented for each option.
➤ Matthew Oldham
Director of Data Architecture, Graphium Health
Reading this book felt like some of my best pair-programming experiences. It
showed me how to get started, kept me engaged, and encouraged me to experiment
on my own.
➤ Jesse Hallett
Open Source Mentor
This book will really give you an overview of what’s out there so you can choose
the best tool for the job.
➤ Jesse Anderson
Managing Director, Big Data Institute
Seven Databases in Seven Weeks,
Second Edition
A Guide to Modern Databases and the NoSQL Movement
Luc Perkins
with Eric Redmond
and Jim R. Wilson
The Pragmatic Bookshelf
Raleigh, North Carolina
Many of the designations used by manufacturers and sellers to distinguish their products
are claimed as trademarks. Where those designations appear in this book, and The Pragmatic
Programmers, LLC was aware of a trademark claim, the designations have been printed in
initial capital letters or in all capitals. The Pragmatic Starter Kit, The Pragmatic Programmer,
Pragmatic Programming, Pragmatic Bookshelf, PragProg and the linking g device are trademarks of The Pragmatic Programmers, LLC.
Every precaution was taken in the preparation of this book. However, the publisher assumes
no responsibility for errors or omissions, or for damages that may result from the use of
information (including program listings) contained herein.
Our Pragmatic books, screencasts, and audio books can help you and your team create
better software and have more fun. Visit us at https://pragprog.com.
The team that produced this book includes:
Publisher: Andy Hunt
VP of Operations: Janet Furlow
Managing Editor: Brian MacDonald
Supervising Editor: Jacquelyn Carter
Series Editor: Bruce A. Tate
Copy Editor: Nancy Rapoport
Indexing: Potomac Indexing, LLC
Layout: Gilson Graphics
For sales, volume licensing, and support, please contact [email protected].
For international rights, please contact [email protected].
Copyright © 2018 The Pragmatic Programmers, LLC.
All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted,
in any form, or by any means, electronic, mechanical, photocopying, recording, or otherwise,
without the prior consent of the publisher.
Printed in the United States of America.
ISBN-13: 978-1-68050-253-4
Encoded using the finest acid-free high-entropy binary digits.
Book version: P1.0—April 2018
Contents
Acknowledgments
Preface
1. Introduction
    It Starts with a Question
    The Genres
    Onward and Upward
2. PostgreSQL
    That’s Post-greS-Q-L
    Day 1: Relations, CRUD, and Joins
    Day 2: Advanced Queries, Code, and Rules
    Day 3: Full Text and Multidimensions
    Wrap-Up
3. HBase
    Introducing HBase
    Day 1: CRUD and Table Administration
    Day 2: Working with Big Data
    Day 3: Taking It to the Cloud
    Wrap-Up
4. MongoDB
    Hu(mongo)us
    Day 1: CRUD and Nesting
    Day 2: Indexing, Aggregating, Mapreduce
    Day 3: Replica Sets, Sharding, GeoSpatial, and GridFS
    Wrap-Up
5. CouchDB
    Relaxing on the Couch
    Day 1: CRUD, Fauxton, and cURL Redux
    Day 2: Creating and Querying Views
    Day 3: Advanced Views, Changes API, and Replicating Data
    Wrap-Up
6. Neo4J
    Neo4j Is Whiteboard Friendly
    Day 1: Graphs, Cypher, and CRUD
    Day 2: REST, Indexes, and Algorithms
    Day 3: Distributed High Availability
    Wrap-Up
7. DynamoDB
    DynamoDB: The “Big Easy” of NoSQL
    Day 1: Let’s Go Shopping!
    Day 2: Building a Streaming Data Pipeline
    Day 3: Building an “Internet of Things” System Around DynamoDB
    Wrap-Up
8. Redis
    Data Structure Server Store
    Day 1: CRUD and Datatypes
    Day 2: Advanced Usage, Distribution
    Day 3: Playing with Other Databases
    Wrap-Up
9. Wrapping Up
    Genres Redux
    Making a Choice
    Where Do We Go from Here?
A1. Database Overview Tables
A2. The CAP Theorem
    Eventual Consistency
    CAP in the Wild
    The Latency Trade-Off
Bibliography
Index
Acknowledgments
A book with the size and scope of this one is never the work of just the authors,
even if there are three of them. It requires the effort of many very smart people
with superhuman eyes spotting as many mistakes as possible and providing
valuable insights into the details of these technologies.
We’d like to thank, in no particular order, all of the folks who provided their
time and expertise:
Dave Parfitt, Jerry Sievert, Jesse Hallett,
Matthew Oldham, Ben Rady, Nick Capito,
Jesse Anderson, Sean Moubry
Finally, thanks to Bruce Tate for his experience and guidance.
We’d also like to sincerely thank the entire team at the Pragmatic Bookshelf.
Thanks for entertaining this audacious project and seeing us through it. We’re
especially grateful to our editor, Jackie Carter. Your patient feedback made
this book what it is today. Thanks to the whole team who worked so hard to
polish this book and find all of our mistakes.
For anyone we missed, we hope you’ll accept our apologies. Any omissions
were certainly not intentional.
From Eric: Dear Noelle, you’re not special; you’re unique, and that’s so much
better. Thanks for living through another book. Thanks also to the database
creators and committers for providing us something to write about and make
a living at.
From Luc: First, I have to thank my wonderful family and friends for making
my life a charmed one from the very beginning. Second, I have to thank a
handful of people who believed in me and gave me a chance in the tech industry
at different stages of my career: Lucas Carlson, Marko and Saša Gargenta,
Troy Howard, and my co-author Eric Redmond for inviting me on board to
prepare the most recent edition of this book. My journey in this industry has
changed my life and I thank all of you for crucial breakthroughs.
From Jim: First, I want to thank my family: Ruthy, your boundless patience
and encouragement have been heartwarming. Emma and Jimmy, you’re two
smart cookies, and your daddy loves you always. Also, a special thanks to all
the unsung heroes who monitor IRC, message boards, mailing lists, and bug
systems ready to help anyone who needs you. Your dedication to open source
keeps these projects kicking.
Preface
If we use oil extraction as a metaphor for understanding data in the contemporary world, then databases flat-out constitute—or are deeply intertwined
with—all aspects of the extraction chain, from the fields to the refineries,
drills, and pumps. If you want to harness the potential of data—which has
perhaps become as vital to our way of life as oil—then you need to understand
databases because they are quite simply the most important piece of modern
data equipment.
Databases are tools, a means to an end. But like any complex tool, databases
also harbor their own stories and embody their own ways of looking at the
world. The better you understand databases, the more capable you’ll be of
tapping into the ever-growing corpus of data at our disposal. That enhanced
understanding could lead to anything from undertaking fun side projects to
embarking on a career change or starting your own data-driven company.
Why a NoSQL Book
What exactly does the term NoSQL even mean? Which types of systems does
the term include? How will NoSQL impact the practice of making great software? These were questions we wanted to answer—as much for ourselves as
for others.
Looking back more than a half-decade later, the rise of NoSQL isn’t so much
buzzworthy as it is an accepted fact. You can still read plenty of articles about
NoSQL technologies on Hacker News, TechCrunch, or even WIRED, but the
tenor of those articles has changed from starry-eyed prophecy (“NoSQL will
change everything!”) to more standard reporting (“check out this new Redis
feature!”). NoSQL is now a mainstay and a steadily maturing one at that.
But don’t write a eulogy for relational databases—the “SQL” in “NoSQL”—just
yet. Although NoSQL databases have gained significant traction in the technological landscape, it’s still far too early to declare “traditional” relational
database models as dead or even dying. From the release of Google’s BigQuery
and Spanner to continued rapid development of MySQL, PostgreSQL, and
others, relational databases are showing no signs of slowing down. NoSQL
hasn’t killed SQL; instead, the galaxy of uses for data has expanded, and
both paradigms continue to grow and evolve to keep up with the demand.
So read this book as a guide to powerful, compelling databases with similar
worldviews—not as a guide to the “new” way of doing things or as a nail in the
coffin of SQL. We’re writing a second edition so that a new generation of data
engineers, application developers, and others can get both a high-level understanding of the field and a deep dive into specific databases in one place.
Why Seven Databases
This book’s format originally came to us when we read Bruce Tate’s exemplary
Seven Languages in Seven Weeks [Tat10] many years ago. That book’s style of
progressively introducing languages struck a chord with us. We felt teaching
databases in the same manner would provide a smooth medium for tackling
some of these tough NoSQL questions while also creating conceptual bridges
across chapters.
What’s in This Book
This book is aimed at experienced application developers, data engineers,
tech enthusiasts, and others who are seeking a well-rounded understanding
of the modern database landscape. Prior database experience is not strictly
required, but it helps.
After a brief introduction, this book tackles a series of seven databases
chapter by chapter. The databases were chosen to span five different database
genres or styles, which are discussed in Chapter 1, Introduction, on page 1.
In order, the databases covered are PostgreSQL, Apache HBase, MongoDB,
Apache CouchDB, Neo4j, DynamoDB, and Redis.
Each chapter is designed to be taken as a long weekend’s worth of work, split
up into three days. Each day ends with exercises that expand on the topics
and concepts just introduced, and each chapter culminates in a wrap-up
discussion that summarizes the good and bad points about the database.
You may choose to move a little faster or slower, but it’s important to grasp
each day’s concepts before continuing. We’ve tried to craft examples that
explore each database’s distinguishing features. To really understand what
these databases have to offer, you have to spend some time using them, and
that means rolling up your sleeves and doing some work.
Although you may be tempted to skip chapters, we designed this book to be
read linearly. Some concepts, such as mapreduce, are introduced in depth
in earlier chapters and then skimmed over in later ones. The goal of this book
is to give you a solid understanding of the modern database field, so we recommend you read every chapter.
What This Book Is Not
Before reading this book, you should know what it won’t cover.
This Is Not an Installation Guide
Installing the databases in this book is sometimes easy, sometimes a bit of
a challenge, and sometimes downright frustrating. For some databases, you’ll
be able to use stock packages or tools such as apt-get (on many Linux systems)
or Homebrew (if you’re a macOS user), and for others you may need to compile
from source. We’ll point out some useful tips here and there, but by and large
you’re on your own. Cutting out installation steps allows us to pack in more
useful examples and a discussion of concepts, which is what you really came
for anyway, right?
Administration Manual? We Think Not
In addition to installation, this book will also not cover everything you’d find
in an administration manual. Each of these databases offers myriad options,
settings, switches, and configuration details, most of which are well covered
online in each database’s official documentation and on forums such as
Stack Overflow. We’re much more interested in teaching you useful concepts
and providing full immersion than we are in focusing on the day-to-day
operations. Though the characteristics of the databases can change based
on operational settings—and we discuss these characteristics in some chapters
—we won’t be able to go into all the nitty-gritty details of all possible configurations. There simply isn’t space!
A Note to Windows Users
This book is inherently about choices, predominantly open source software
on *nix platforms. Microsoft environments tend to strive for an integrated
environment, which limits many choices to a smaller predefined set. As such,
the databases we cover are open source and are developed by (and largely
for) users of *nix systems. This is not our own bias so much as a reflection
of the current state of affairs.
Consequently, our tutorial-esque examples are presumed to be run in a *nix
shell. If you run Windows and want to give it a try anyway, we recommend
setting up Bash on Windows [1] or Cygwin [2] to give you the best shot at success.
You may also want to consider running a Linux virtual machine.
Code Examples and Conventions
This book contains code in a variety of languages. In part, this is a consequence of the databases that we cover. We’ve attempted to limit our choice
of languages to Ruby/JRuby and JavaScript. We prefer command-line tools
to scripts, but we will introduce other languages to get the job done—such
as PL/pgSQL (Postgres) and Cypher (Neo4j). We’ll also explore writing some
server-side JavaScript applications with Node.js.
Except where noted, code listings are provided in full, usually ready to be
executed at your leisure. Samples and snippets are syntax highlighted according to the rules of the language involved. Shell commands are prefixed by $
for *nix shells or by a different token for database-specific shells (such as >
in MongoDB).
Credits
Apache, Apache HBase, Apache CouchDB, HBase, CouchDB, and the HBase
and CouchDB logos are trademarks of The Apache Software Foundation. Used
with permission. No endorsement by The Apache Software Foundation is
implied by the use of these marks.
Online Resources
The Pragmatic Bookshelf’s page for this book [3] is a great resource. There you’ll
find downloads for all the source code presented in this book. You’ll also find
feedback tools such as a community forum and an errata submission form
where you can recommend changes to future releases of the book.
Thanks for coming along with us on this journey through the modern database
landscape.
Luc Perkins, Eric Redmond, and Jim R. Wilson
April 2018
1. https://msdn.microsoft.com/en-us/commandline/wsl/about
2. http://www.cygwin.com/
3. http://pragprog.com/book/pwrdata/seven-databases-in-seven-weeks
CHAPTER 1
Introduction
The non-relational database paradigm—we’ll call it NoSQL throughout this
book, following now-standard usage—is no longer the fledgling upstart that
it once was. When the NoSQL alternative to relational databases came on the
scene, the “old” model was the de facto option for problems big and small.
Today, that relational model is still going strong and for many reasons:
• Databases such as PostgreSQL, MySQL, Microsoft SQL Server, and Oracle,
amongst many others, are still widely used, discussed, and actively
developed.
• Knowing how to run SQL queries remains a highly sought-after skill for
software engineers, data analysts, and others.
• There remains a vast universe of use cases for which a relational database
is still beyond any reasonable doubt the way to go.
But at the same time, NoSQL has risen far beyond its initial upstart status
and is now a fixture in the technology world. The concepts surrounding it,
such as the CAP theorem, are widely discussed at programming conferences,
on Hacker News, on Stack Overflow, and beyond. Schemaless design, massive
horizontal scaling capabilities, simple replication, new query methods that
don’t feel like SQL at all—these hallmarks of NoSQL have all gone mainstream.
Not long ago, a Fortune 500 CTO might have looked at NoSQL solutions with
bemusement if not horror; now, a CTO would be crazy not to at least consider
them for some of their workloads.
In this book, we explore seven databases across a wide spectrum of database
styles. We start with a relational database, PostgreSQL, largely for the sake of
comparison (though Postgres is quite interesting in its own right). From there,
things get a lot stranger as we wade into a world of databases united above
all by what they aren’t. In the process of reading this book, you will learn the