Graduate Texts in Physics
Massimiliano Bonamente
Statistics and Analysis of Scientific Data
Second Edition
Graduate Texts in Physics
Series editors
Kurt H. Becker, Polytechnic School of Engineering, Brooklyn, USA
Jean-Marc Di Meglio, Université Paris Diderot, Paris, France
Sadri Hassani, Illinois State University, Normal, USA
Bill Munro, NTT Basic Research Laboratories, Atsugi, Japan
Richard Needs, University of Cambridge, Cambridge, UK
William T. Rhodes, Florida Atlantic University, Boca Raton, USA
Susan Scott, Australian National University, Acton, Australia
H. Eugene Stanley, Boston University, Boston, USA
Martin Stutzmann, TU München, Garching, Germany
Andreas Wipf, Friedrich-Schiller-Univ Jena, Jena, Germany
Graduate Texts in Physics
Graduate Texts in Physics publishes core learning/teaching material for graduate- and advanced-level undergraduate courses on topics of current and emerging fields
within physics, both pure and applied. These textbooks serve students at the
MS- or PhD-level and their instructors as comprehensive sources of principles,
definitions, derivations, experiments and applications (as relevant) for their mastery
and teaching, respectively. International in scope and relevance, the textbooks
correspond to course syllabi sufficiently to serve as required reading. Their didactic
style, comprehensiveness and coverage of fundamental material also make them
suitable as introductions or references for scientists entering, or requiring timely
knowledge of, a research field.
More information about this series at http://www.springer.com/series/8431
Massimiliano Bonamente
Statistics and Analysis
of Scientific Data
Second Edition
Massimiliano Bonamente
University of Alabama
Huntsville
Alabama, USA
ISSN 1868-4513 ISSN 1868-4521 (electronic)
Graduate Texts in Physics
ISBN 978-1-4939-6570-0 ISBN 978-1-4939-6572-4 (eBook)
DOI 10.1007/978-1-4939-6572-4
Library of Congress Control Number: 2016957885
1st edition: © Springer Science+Business Media New York 2013
2nd edition: © Springer Science+Business Media LLC 2017
© Springer Science+Business Media New York 2017
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made.
Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer Science+Business Media LLC
The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.
To Giorgio and Alida, who taught me the
value of a book.
To Carlo and Gaia, to whom I teach the
same.
And to Kerry, with whom I share the love
of books, and everything else.
Preface to the First Edition
Across all sciences, a quantitative analysis of data is necessary to assess the
significance of experiments, observations, and calculations. This book was written
over a period of 10 years, as I developed an introductory graduate course on
statistics and data analysis at the University of Alabama in Huntsville. My goal
was to put together the material that a student needs for the analysis and statistical
interpretation of data, including an extensive set of applications and problems that
illustrate the practice of statistical data analysis.
The literature offers a variety of books on statistical methods and probability
theory. Some are primarily on the mathematical foundations of statistics, some
are purely on the theory of probability, and others focus on advanced statistical
methods for specific sciences. This textbook contains the foundations of probability,
statistics, and data analysis methods that are applicable to a variety of fields—
from astronomy to biology, business sciences, chemistry, engineering, physics, and
more—with equal emphasis on mathematics and applications. The book is therefore
not specific to a given discipline, nor does it attempt to describe every possible
statistical method. Instead, it focuses on the fundamental methods that are used
across the sciences and that are at the basis of more specific techniques that can
be found in more specialized textbooks or research articles.
This textbook covers probability theory and random variables, maximum-likelihood methods for single variables and two-variable datasets, and more complex
topics of data fitting, estimation of parameters, and confidence intervals. Among the
topics that have recently become mainstream, Monte Carlo Markov chains occupy
a special role. The last chapter of the book provides a comprehensive overview of
Markov chains and Monte Carlo Markov chains, from theory to implementation.
I believe that a description of the mathematical properties of statistical tests is
necessary to understand their applicability. This book therefore contains mathematical derivations that I considered particularly useful for a thorough understanding of
the subject; the book refers the reader to other sources in case of mathematics that
goes beyond that of basic calculus. The reader who is not familiar with calculus may
skip those derivations and continue with the applications.
Nonetheless, statistics is necessarily slanted toward applications. To highlight
the relevance of the statistical methods described, I have reported original data
from four fundamental scientific experiments from the past two centuries: J.J.
Thomson’s experiment that led to the discovery of the electron, G. Mendel’s data
on plant characteristics that led to the law of independent assortment of species,
E. Hubble’s observation of nebulae that uncovered the expansion of the universe,
and K. Pearson’s collection of biometric characteristics in the UK in the early
twentieth century. These experiments are used throughout the book to illustrate how
statistical methods are applied to actual data and are used in several end-of-chapter
problems. The reader will therefore have an opportunity to see statistics in action
on these classic experiments and several additional examples.
The material presented in this book is aimed at upper-level undergraduate
students or beginning graduate students. The reader is expected to be familiar
with basic calculus, and no prior knowledge of statistics or probability is assumed.
Professional scientists and researchers will find it a useful reference for fundamental
methods such as maximum-likelihood fit, error propagation formulas, goodness of
fit and model comparison, Monte Carlo methods such as the jackknife and bootstrap,
Monte Carlo Markov chains, Kolmogorov-Smirnov tests, and more. All subjects
are complemented by an extensive set of numerical tables that make the book
completely self-contained.
The material presented in this book can be comfortably covered in a one-semester
course and has several problems at the end of each chapter that are suitable as
homework assignments or exam questions. Problems are both of theoretical and
numerical nature, so that emphasis is equally placed on conceptual and practical
understanding of the subject. Several datasets, including those in the four “classic
experiments,” are used across several chapters, and the students can therefore use
them in applications of increasing difficulty.
Huntsville, AL, USA Massimiliano Bonamente
Preface to the Second Edition
The second edition of Statistics and Analysis of Scientific Data was motivated by
the overall goal to provide a textbook that is mathematically rigorous and easy to
read and use as a reference at the same time. Basically, it is a book for both the
student who wants to learn in detail the mathematical underpinnings of statistics
and the reader who simply wants a practical description of how to apply a given statistical method or to use the book as a reference.
To this end, I first decided that a clearer demarcation between theoretical and
practical topics would improve the readability of the book. As a result, several pages
(i.e., mathematical derivations) are now clearly marked throughout the book with a
vertical line, to indicate material that is primarily aimed at those readers who seek
a more thorough mathematical understanding. Those parts are not required to learn
how to apply the statistical methods presented in the book. For the reader who uses
this book as a reference, this makes it easy to skip such sections and go directly
to the main results. At the end of each chapter, I also provide a summary of key
concepts, intended for a quick look-up of the results of each chapter.
Secondly, certain existing material needed substantial re-organization and expansion. The second edition now comprises 16 chapters, versus ten in the first
edition. A few chapters (Chap. 6 on mean, median, and averages, Chap. 9 on multi-variable regression, and Chap. 11 on systematic errors and intrinsic scatter) contain
material that is substantially new. In particular, the topic of multi-variable regression
was introduced because of its use in many fields such as business and economics,
where it is common to apply the regression method to many independent variables.
Other chapters originate from re-arranging existing material more effectively. Some
of the numerical tables in both the main body and the appendix have been expanded
and re-arranged, so that the reader will find it even easier to use them for a variety
of applications and as a reference.
The second edition also contains a new classic experiment, that of the measurement of iris characteristics by R.A. Fisher and E. Anderson. These new data are used
primarily to illustrate the method of regression with many independent variables.
The textbook now features a total of five classic experiments (including G. Mendel’s
data on the independent assortment of species, J.J. Thomson’s data on the discovery
of the electron, K. Pearson’s collection of data of biometric characteristics, and
E. Hubble’s measurements of the expansion of the universe). These data and their
analysis provide a unique way to learn the statistical methods presented in the book
and a resource for the student and the teacher alike. Many of the end-of-chapter
problems are based on these experimental data.
Finally, the new edition contains corrections to a number of typos that had
inadvertently entered the manuscript. I am very much in debt to many of my students
at the University of Alabama in Huntsville for pointing out these typos to me over the
past few years, in particular, to Zachary Robinson, who has patiently gone through
much of the text to find typographical errors.
Huntsville, AL, USA Massimiliano Bonamente
Acknowledgments
In my early postdoc years, I was struggling to solve a complex data analysis
problem. My longtime colleague and good friend Dr. Marshall Joy of NASA’s
Marshall Space Flight Center one day walked down to my office and said something
like “Max, I have a friend in Chicago who told me that there is a method that maybe
can help us with our problem. I don’t understand any of it, but here’s a paper that
talks about Monte Carlo Markov chains. See if it can help us.” That conversation
led to the appreciation of one of the most powerful tools of statistics and data analysis
and opened the door for virtually all the research papers that I wrote ever since. For
over a decade, Marshall taught me how to be careful in the analysis of data and
interpretation of results—and always used a red felt-tip marker to write comments
on my papers.
The journey leading to this book started about 10 years ago, when Prof. A. Gordon Emslie, currently provost at Western Kentucky University, and I decided to offer
a new course in data analysis and statistics for graduate students in our department.
Gordon’s uncanny ability to solve virtually any problem presented to him—and
likewise make even the experienced scientist stumble with his questions—has been
a great source of inspiration for this book.
Some of the material presented in this book is derived from Prof. Kyle Siegrist’s
lectures on probability and stochastic processes at the University of Alabama in
Huntsville. Kyle reinforced my love for mathematics and motivated my desire to
emphasize both mathematics and applications for the material presented in this
book.
Contents
1 Theory of Probability ...................................................... 1
1.1 Experiments, Events, and the Sample Space ........................ 1
1.2 Probability of Events................................................. 2
1.2.1 The Kolmogorov Axioms .................................. 2
1.2.2 Frequentist or Classical Method ........................... 3
1.2.3 Bayesian or Empirical Method............................. 4
1.3 Fundamental Properties of Probability .............................. 4
1.4 Statistical Independence ............................................. 5
1.5 Conditional Probability .............................................. 7
1.6 A Classic Experiment: Mendel’s Law of Heredity
and the Independent Assortment of Species ........................ 8
1.7 The Total Probability Theorem and Bayes’ Theorem .............. 10
2 Random Variables and Their Distributions ............................. 17
2.1 Random Variables .................................................... 17
2.2 Probability Distribution Functions .................................. 19
2.3 Moments of a Distribution Function ................................ 20
2.3.1 The Mean and the Sample Mean........................... 21
2.3.2 The Variance and the Sample Variance .................... 22
2.4 A Classic Experiment: J.J. Thomson’s Discovery
of the Electron ........................................................ 23
2.5 Covariance and Correlation Between Random Variables .......... 26
2.5.1 Joint Distribution and Moments of Two
Random Variables .......................................... 26
2.5.2 Statistical Independence of Random Variables............ 28
2.6 A Classic Experiment: Pearson’s Collection of Data on
Biometric Characteristics ............................................ 30
3 Three Fundamental Distributions: Binomial, Gaussian,
and Poisson ................................................................. 35
3.1 The Binomial Distribution ........................................... 35
3.1.1 Derivation of the Binomial Distribution ................... 35
3.1.2 Moments of the Binomial Distribution .................... 38
3.2 The Gaussian Distribution ........................................... 40
3.2.1 Derivation of the Gaussian Distribution
from the Binomial Distribution ............................ 40
3.2.2 Moments and Properties of the Gaussian Distribution.... 44
3.2.3 How to Generate a Gaussian Distribution
from a Standard Normal.................................... 45
3.3 The Poisson Distribution ............................................. 45
3.3.1 Derivation of the Poisson Distribution..................... 46
3.3.2 Properties and Interpretation of the Poisson
Distribution ................................................. 47
3.3.3 The Poisson Distribution and the Poisson Process........ 48
3.3.4 An Example on Likelihood and Posterior
Probability of a Poisson Variable .......................... 49
3.4 Comparison of Binomial, Gaussian, and Poisson Distributions ... 51
4 Functions of Random Variables and Error Propagation .............. 55
4.1 Linear Combination of Random Variables .......................... 55
4.1.1 General Mean and Variance Formulas..................... 55
4.1.2 Uncorrelated Variables and the 1/√N Factor ............. 56
4.2 The Moment Generating Function .................................. 58
4.2.1 Properties of the Moment Generating Function ........... 59
4.2.2 The Moment Generating Function of the
Gaussian and Poisson Distribution ........................ 59
4.3 The Central Limit Theorem.......................................... 61
4.4 The Distribution of Functions of Random Variables ............... 64
4.4.1 The Method of Change of Variables ....................... 65
4.4.2 A Method for Multi-dimensional Functions .............. 66
4.5 The Law of Large Numbers.......................................... 68
4.6 The Mean of Functions of Random Variables ...................... 69
4.7 The Variance of Functions of Random Variables
and Error Propagation Formulas..................................... 70
4.7.1 Sum of a Constant .......................................... 72
4.7.2 Weighted Sum of Two Variables........................... 72
4.7.3 Product and Division of Two Random Variables.......... 73
4.7.4 Power of a Random Variable ............................... 74
4.7.5 Exponential of a Random Variable ........................ 75
4.7.6 Logarithm of a Random Variable .......................... 75
4.8 The Quantile Function and Simulation of Random Variables...... 76
4.8.1 General Method to Simulate a Variable ................... 78
4.8.2 Simulation of a Gaussian Variable ......................... 79
5 Maximum Likelihood and Other Methods to Estimate
Variables .................................................................... 85
5.1 The Maximum Likelihood Method for Gaussian Variables ........ 85
5.1.1 Estimate of the Mean ....................................... 86
5.1.2 Estimate of the Variance.................................... 87
5.1.3 Estimate of Mean for Non-uniform Uncertainties ........ 88
5.2 The Maximum Likelihood Method for Other Distributions........ 90
5.3 Method of Moments.................................................. 91
5.4 Quantiles and Confidence Intervals ................................. 93
5.4.1 Confidence Intervals for a Gaussian Variable ............. 94
5.4.2 Confidence Intervals for the Mean of a Poisson
Variable ..................................................... 97
5.5 Bayesian Methods for the Poisson Mean............................ 102
5.5.1 Bayesian Expectation of the Poisson Mean ............... 102
5.5.2 Bayesian Upper and Lower Limits for a
Poisson Variable ............................................ 103
6 Mean, Median, and Average Values of Variables ....................... 107
6.1 Linear and Weighted Average ....................................... 107
6.2 The Median ........................................................... 109
6.3 The Logarithmic Average and Fractional
or Multiplicative Errors .............................................. 109
6.3.1 The Weighted Logarithmic Average ....................... 110
6.3.2 The Relative-Error Weighted Average ..................... 113
7 Hypothesis Testing and Statistics ......................................... 117
7.1 Statistics and Hypothesis Testing .................................... 117
7.2 The χ² Distribution ................................................... 122
7.2.1 The Probability Distribution Function ..................... 122
7.2.2 Moments and Other Properties............................. 125
7.2.3 Hypothesis Testing ......................................... 126
7.3 The Sampling Distribution of the Variance ......................... 127
7.4 The F Statistic ........................................................ 131
7.4.1 The Probability Distribution Function ..................... 132
7.4.2 Moments and Other Properties............................. 133
7.4.3 Hypothesis Testing ......................................... 134
7.5 The Sampling Distribution of the Mean
and the Student’s t Distribution ...................................... 137
7.5.1 Comparison of Sample Mean with Parent Mean .......... 137
7.5.2 Comparison of Two Sample Means and
Hypothesis Testing ......................................... 141
8 Maximum Likelihood Methods for Two-Variable Datasets ........... 147
8.1 Measurement of Pairs of Variables .................................. 147
8.2 Maximum Likelihood Method for Gaussian Data .................. 149
8.3 Least-Squares Fit to a Straight Line, or Linear Regression ........ 150
8.4 Multiple Linear Regression .......................................... 151
8.4.1 Best-Fit Parameters for Multiple Regression.............. 152
8.4.2 Parameter Errors and Covariances for
Multiple Regression ........................................ 153
8.4.3 Errors and Covariance for Linear Regression ............. 154
8.5 Special Cases: Identical Errors or No Errors Available ............ 155
8.6 A Classic Experiment: Edwin Hubble’s Discovery
of the Expansion of the Universe .................................... 157
8.7 Maximum Likelihood Method for Non-linear Functions .......... 160
8.8 Linear Regression with Poisson Data ............................... 160
9 Multi-Variable Regression ................................................ 165
9.1 Multi-Variable Datasets .............................................. 165
9.2 A Classic Experiment: The R.A. Fisher and
E. Anderson Measurements of Iris Characteristics ................. 166
9.3 The Multi-Variable Linear Regression .............................. 168
9.4 Tests for Significance of the Multiple Regression Coefficients .... 170
9.4.1 T-Test for the Significance of Model Components........ 170
9.4.2 F-Test for Goodness of Fit ................................. 172
9.4.3 The Coefficient of Determination .......................... 174
10 Goodness of Fit and Parameter Uncertainty ............................ 177
10.1 Goodness of Fit for the χ²_min Fit Statistic ............................ 177
10.2 Goodness of Fit for the Cash C Statistic ............................ 180
10.3 Confidence Intervals of Parameters for Gaussian Data ............. 181
10.3.1 Confidence Interval on All Parameters .................... 183
10.3.2 Confidence Intervals on Reduced Number
of Parameters ............................................... 184
10.4 Confidence Intervals of Parameters for Poisson Data .............. 186
10.5 The Linear Correlation Coefficient .................................. 187
10.5.1 The Probability Distribution Function ..................... 188
10.5.2 Hypothesis Testing ......................................... 190
11 Systematic Errors and Intrinsic Scatter ................................. 195
11.1 What to Do When the Goodness-of-Fit Test Fails .................. 195
11.2 Intrinsic Scatter and Debiased Variance ............................. 196
11.2.1 Direct Calculation of the Intrinsic Scatter ................. 196
11.2.2 Alternative Method to Estimate the Intrinsic Scatter ..... 197
11.3 Systematic Errors..................................................... 198
11.4 Estimate of Model Parameters with Systematic Errors
or Intrinsic Scatter.................................................... 200
12 Fitting Two-Variable Datasets with Bivariate Errors .................. 203
12.1 Two-Variable Datasets with Bivariate Errors ....................... 203
12.2 Generalized Least-Squares Linear Fit to Bivariate Data ........... 204
12.3 Linear Fit Using Bivariate Errors in the χ² Statistic ................ 209
13 Model Comparison ........................................................ 211
13.1 The F Test ............................................................ 211
13.1.1 F-Test for Two Independent χ² Measurements ........... 212
13.1.2 F-Test for an Additional Model Component .............. 214
13.2 Kolmogorov–Smirnov Tests ......................................... 216
13.2.1 Comparison of Data to a Model............................ 216
13.2.2 Two-Sample Kolmogorov–Smirnov Test .................. 219