
Probability & Statistics

A rigorous treatment of probability theory and statistical inference — from sigma-algebras to hypothesis testing and Bayesian methods.



Probability Spaces & Axioms

The measure-theoretic foundation of probability: sample spaces, sigma-algebras, and probability measures.

📐 Foundations of Probability

Definition — Probability Space A probability space is a triple \((\Omega, \mathcal{F}, P)\) where:

\(\Omega\) is the sample space (set of all outcomes)
\(\mathcal{F}\) is a sigma-algebra on \(\Omega\) (collection of events closed under complementation and countable unions)
\(P: \mathcal{F} \to [0,1]\) is a probability measure with \(P(\Omega)=1\) and countable additivity

Theorem — Kolmogorov's Axioms (Consequences) From the axioms:

\(P(\emptyset) = 0\), \(P(A^c) = 1 - P(A)\)
Inclusion-exclusion: \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
Continuity: If \(A_n \uparrow A\), then \(P(A_n) \to P(A)\)
Boole's inequality: \(P\!\left(\bigcup_{n=1}^{\infty}A_n\right) \le \sum_{n=1}^{\infty}P(A_n)\)

Definition — Conditional Probability & Independence \(P(A|B) = \frac{P(A \cap B)}{P(B)}\) for \(P(B) > 0\). Events \(A, B\) are independent if \(P(A \cap B) = P(A)P(B)\). A collection \(\{A_i\}\) is mutually independent if every finite sub-collection satisfies the product rule.

Example

Show that pairwise independence does not imply mutual independence.

Let \(\Omega = \{1,2,3,4\}\) with uniform probability. Define \(A = \{1,2\}\), \(B = \{1,3\}\), \(C = \{1,4\}\). Then \(P(A)=P(B)=P(C)=1/2\), \(P(A\cap B) = P(A\cap C) = P(B\cap C) = 1/4\) — pairwise independent. But \(P(A\cap B\cap C) = P(\{1\}) = 1/4 \neq 1/8 = P(A)P(B)P(C)\). Not mutually independent.
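The counterexample is small enough to verify by brute force. The following Python sketch enumerates the three events under the uniform measure; the helper name `prob` is an illustrative choice and not part of the original example.

```python
from fractions import Fraction
from itertools import combinations

# Uniform measure on the four-point sample space from the example.
omega = {1, 2, 3, 4}
events = {"A": {1, 2}, "B": {1, 3}, "C": {1, 4}}


def prob(event):
    """Probability of an event under the uniform measure on omega."""
    return Fraction(len(event & omega), len(omega))


# Pairwise check: P(E ∩ F) == P(E) P(F) for every pair of events.
pairwise = all(
    prob(events[e] & events[f]) == prob(events[e]) * prob(events[f])
    for e, f in combinations(events, 2)
)

# Triple check: P(A ∩ B ∩ C) versus P(A) P(B) P(C).
triple_lhs = prob(events["A"] & events["B"] & events["C"])
triple_rhs = prob(events["A"]) * prob(events["B"]) * prob(events["C"])

print(pairwise)                # True
print(triple_lhs, triple_rhs)  # 1/4 1/8
```

Exact rational arithmetic via `Fraction` avoids any floating-point doubt about whether the product rule holds.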

Random Variables & Distributions

Discrete and continuous random variables, joint distributions, and conditional distributions.

🎲 Distribution Functions

Definition — Random Variable A random variable \(X: \Omega \to \mathbb{R}\) is a measurable function: \(\{X \le x\} \in \mathcal{F}\) for all \(x \in \mathbb{R}\). The CDF is \(F_X(x) = P(X \le x)\).

Properties of the CDF: \(F\) is right-continuous, non-decreasing, \(\lim_{x\to-\infty}F(x) = 0\), \(\lim_{x\to\infty}F(x) = 1\). For a continuous r.v., the PDF satisfies \(f(x) = F'(x)\) and \(P(a < X \le b) = \int_a^b f(x)\,dx\).

Definition — Joint, Marginal, Conditional For random variables \((X,Y)\):

Joint PDF: \(f_{X,Y}(x,y)\) with \(\int\!\!\int f_{X,Y}\,dx\,dy = 1\)
Marginal: \(f_X(x) = \int f_{X,Y}(x,y)\,dy\)
Conditional: \(f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}\)

\(X, Y\) are independent iff \(f_{X,Y}(x,y) = f_X(x)f_Y(y)\).

Example

Let \((X,Y)\) have joint PDF \(f(x,y) = 6(1-y)\) on the triangle \(0 < x < y < 1\). Find \(f_X(x)\).

\(f_X(x) = \int_x^1 6(1-y)\,dy = 6\left[y - \frac{y^2}{2}\right]_x^1 = 6\left(\frac{1}{2} - x + \frac{x^2}{2}\right) = 3(1-x)^2\) for \(0 < x < 1\). The lower limit is \(x\) because the joint density is nonzero only where \(y > x\).
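As a sanity check on the marginal, the integral over \(y\) can be approximated numerically and compared with \(3(1-x)^2\). This is a minimal sketch using a midpoint rule; the function names and the step count are assumptions made for illustration.

```python
def joint_pdf(x, y):
    """Joint density 6(1 - y) on the triangle 0 < x < y < 1, zero elsewhere."""
    return 6.0 * (1.0 - y) if 0.0 < x < y < 1.0 else 0.0


def marginal_x(x, steps=10_000):
    """Midpoint-rule approximation of the integral of joint_pdf(x, y) over y in (x, 1)."""
    width = (1.0 - x) / steps
    return sum(joint_pdf(x, x + (k + 0.5) * width) for k in range(steps)) * width


for x in (0.1, 0.5, 0.9):
    # The two printed values should agree closely at each x.
    print(x, marginal_x(x), 3 * (1 - x) ** 2)
```

Because the integrand is linear in \(y\), the midpoint rule is essentially exact here, so any visible disagreement would point to a wrong integration limit rather than to discretization error.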

Expectation, Variance & Generating Functions

Moments, moment generating functions, and characteristic functions — the analytical tools of probability.

📊 Moments & Generating Functions

Definition — Moment Generating Function The MGF of \(X\) is \(M_X(t) = E[e^{tX}]\), defined for \(t\) in a neighborhood of 0. Key property: \(E[X^n] = M_X^{(n)}(0)\). If \(M_X(t) = M_Y(t)\) in a neighborhood of 0, then \(X\) and \(Y\) have the same distribution.

Definition — Characteristic Function The characteristic function \(\varphi_X(t) = E[e^{itX}]\) always exists (unlike the MGF). It uniquely determines the distribution. For independent \(X, Y\): \(\varphi_{X+Y}(t) = \varphi_X(t)\varphi_Y(t)\).

Theorem — Variance & Covariance Properties

\(\text{Var}(X) = E[X^2] - (E[X])^2 \ge 0\)
\(\text{Var}(aX+b) = a^2\text{Var}(X)\)
\(\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X,Y)\)
Cauchy-Schwarz: \(|\text{Cov}(X,Y)| \le \sqrt{\text{Var}(X)\text{Var}(Y)}\), so \(|\rho(X,Y)| \le 1\)

Example

Find the MGF of the Poisson distribution with parameter \(\lambda\) and derive its mean and variance.

\(M_X(t) = \sum_{k=0}^{\infty}e^{tk}\frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda}\sum_{k=0}^{\infty}\frac{(\lambda e^t)^k}{k!} = e^{-\lambda}e^{\lambda e^t} = e^{\lambda(e^t - 1)}\). \(M'_X(t) = \lambda e^t e^{\lambda(e^t-1)}\), so \(E[X] = M'(0) = \lambda\). \(M''_X(0) = \lambda + \lambda^2\), so \(\text{Var}(X) = \lambda + \lambda^2 - \lambda^2 = \lambda\).
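The moment formulas \(E[X] = M'(0)\) and \(E[X^2] = M''(0)\) can be checked numerically by differentiating the closed-form MGF with central finite differences. The sketch below assumes a step size `h = 1e-4`, which is an arbitrary illustrative choice.

```python
import math


def poisson_mgf(t, lam):
    """Closed-form MGF of Poisson(lam): exp(lam * (e^t - 1))."""
    return math.exp(lam * (math.exp(t) - 1.0))


def mean_and_variance_from_mgf(lam, h=1e-4):
    """Approximate M'(0) and M''(0) by central differences, then form the variance."""
    m_plus = poisson_mgf(h, lam)
    m_zero = poisson_mgf(0.0, lam)
    m_minus = poisson_mgf(-h, lam)
    first = (m_plus - m_minus) / (2 * h)               # approximates E[X]
    second = (m_plus - 2 * m_zero + m_minus) / (h * h)  # approximates E[X^2]
    return first, second - first ** 2


mean, variance = mean_and_variance_from_mgf(3.0)
print(mean, variance)  # both should be close to lam = 3
```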

Limit Theorems

Modes of convergence and the great limit theorems: laws of large numbers and the central limit theorem.

🔄 Modes of Convergence

Definition — Convergence Modes For a sequence of random variables \(\{X_n\}\):

Almost sure (a.s.): \(P\!\left(\lim_{n\to\infty}X_n = X\right) = 1\)
In probability: \(P(|X_n - X| > \varepsilon) \to 0\) for all \(\varepsilon > 0\)
In \(L^p\): \(E[|X_n - X|^p] \to 0\)
In distribution: \(F_{X_n}(x) \to F_X(x)\) at all continuity points of \(F_X\)

Hierarchy: a.s. \(\Rightarrow\) in probability \(\Rightarrow\) in distribution. Also \(L^p \Rightarrow\) in probability.

⚖️ Laws of Large Numbers & CLT

Theorem — Weak Law of Large Numbers (WLLN) If \(X_1, X_2, \ldots\) are i.i.d. with \(E[X_i] = \mu\) and \(\text{Var}(X_i) < \infty\), then \(\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{P} \mu\).

Theorem — Strong Law of Large Numbers (SLLN) If \(X_1, X_2, \ldots\) are i.i.d. with \(E[|X_1|] < \infty\) and \(E[X_1] = \mu\), then \(\bar{X}_n \xrightarrow{\text{a.s.}} \mu\). The finite variance assumption is not needed — only finite first moment.

Theorem — Central Limit Theorem (CLT) If \(X_1, X_2, \ldots\) are i.i.d. with \(E[X_i] = \mu\) and \(\text{Var}(X_i) = \sigma^2 < \infty\), then: \[\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} = \frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} N(0,1)\]

Example

Use the CLT to approximate \(P(S_{100} > 55)\) where \(S_{100} = \sum_{i=1}^{100}X_i\) and \(X_i \sim \text{Bernoulli}(1/2)\).

\(\mu = 1/2\), \(\sigma^2 = 1/4\), \(n = 100\). \(E[S_{100}]=50\), \(\text{Var}(S_{100})=25\). \[P(S_{100}>55) = P\!\left(\frac{S_{100}-50}{5}>1\right) \approx 1 - \Phi(1) \approx 1 - 0.8413 = 0.1587\]
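Since \(S_{100}\) is binomial, the exact tail is computable and can be set beside the normal approximation. The sketch below also evaluates a continuity-corrected version, \(1 - \Phi(1.1)\), which is a standard refinement that the worked example does not use; it is included here only for comparison.

```python
import math
from statistics import NormalDist

n, p = 100, 0.5
mean = n * p                       # 50
sd = math.sqrt(n * p * (1 - p))    # 5

# Exact tail: S_100 > 55 means S_100 >= 56 for an integer-valued sum.
exact = sum(math.comb(n, k) for k in range(56, n + 1)) / 2 ** n

std_normal = NormalDist()
clt_plain = 1 - std_normal.cdf((55 - mean) / sd)         # 1 - Phi(1)
clt_corrected = 1 - std_normal.cdf((55.5 - mean) / sd)   # 1 - Phi(1.1)

print(f"exact={exact:.4f} plain={clt_plain:.4f} corrected={clt_corrected:.4f}")
```

For a lattice-valued sum like this one, the continuity-corrected value typically lands closer to the exact tail than the plain approximation does.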

Statistical Inference

Point estimation, hypothesis testing, and the classical theorems of mathematical statistics.

🎯 Estimation Theory

Definition — Maximum Likelihood Estimator (MLE) For a parametric model \(\{f(x|\theta)\}\), the MLE is \(\hat{\theta}_{\text{MLE}} = \arg\max_\theta L(\theta) = \arg\max_\theta \prod_{i=1}^n f(x_i|\theta)\). Equivalently, maximize the log-likelihood \(\ell(\theta) = \sum \ln f(x_i|\theta)\).

Theorem — Cramer-Rao Lower Bound For an unbiased estimator \(T\) of \(\theta\) based on \(n\) i.i.d. observations, under the usual regularity conditions: \[\text{Var}(T) \ge \frac{1}{nI(\theta)}\] where \(I(\theta) = E\!\left[-\frac{\partial^2 \ln f}{\partial\theta^2}\right]\) is the Fisher information of a single observation. An unbiased estimator achieving equality is the UMVUE (uniformly minimum variance unbiased estimator).

Theorem — Rao-Blackwell If \(T\) is an unbiased estimator of \(\theta\) and \(S\) is a sufficient statistic, then \(T^* = E[T|S]\) is also unbiased and satisfies \(\text{Var}(T^*) \le \text{Var}(T)\). Combined with a complete sufficient statistic, this yields the UMVUE.

⚡ Hypothesis Testing

Theorem — Neyman-Pearson Lemma For testing \(H_0: \theta = \theta_0\) vs \(H_1: \theta = \theta_1\), the most powerful test of size \(\alpha\) rejects \(H_0\) when the likelihood ratio \(\Lambda = \frac{L(\theta_1)}{L(\theta_0)}\) exceeds a threshold \(k_\alpha\). No other test of size \(\le \alpha\) has greater power.

Definition — Likelihood Ratio Test For composite hypotheses, the generalized likelihood ratio test uses: \[\Lambda = \frac{\sup_{\theta \in \Theta_0}L(\theta)}{\sup_{\theta \in \Theta}L(\theta)}\] Reject \(H_0\) when \(\Lambda \le c\). Under regularity conditions, \(-2\ln\Lambda \xrightarrow{d} \chi^2_k\) where \(k = \dim\Theta - \dim\Theta_0\).

Example

For \(X_1,\ldots,X_n\) i.i.d. \(N(\mu,\sigma_0^2)\) with known \(\sigma_0^2\), find the MLE of \(\mu\) and verify it achieves the CRLB.

\(\ell(\mu) = -\frac{n}{2}\ln(2\pi\sigma_0^2) - \frac{1}{2\sigma_0^2}\sum(x_i-\mu)^2\). Setting \(\ell'(\mu)=0\): \(\hat\mu = \bar{X}\). Fisher information: \(I(\mu) = 1/\sigma_0^2\). CRLB: \(\sigma_0^2/n\). Since \(\text{Var}(\bar{X}) = \sigma_0^2/n\), the MLE achieves the CRLB and is the UMVUE.
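The claim that \(\ell(\mu)\) peaks at the sample mean can be checked on a small data set by evaluating the log-likelihood over a grid. The data values, the variance `sigma0_sq`, and the grid spacing below are all hypothetical choices made for illustration.

```python
import math
from statistics import fmean


def log_likelihood(mu, data, sigma0_sq):
    """Log-likelihood of N(mu, sigma0_sq) for i.i.d. data with known variance."""
    n = len(data)
    sum_sq = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma0_sq) - sum_sq / (2 * sigma0_sq)


data = [4.1, 5.3, 4.8, 6.0, 5.2]   # hypothetical observations
sigma0_sq = 1.5                    # hypothetical known variance

mu_hat = fmean(data)               # closed-form MLE: the sample mean

# Grid of candidate values centred on the sample mean, spacing 0.01.
grid = [mu_hat + d / 100 for d in range(-200, 201)]
best_on_grid = max(grid, key=lambda mu: log_likelihood(mu, data, sigma0_sq))

crlb = sigma0_sq / len(data)       # Cramer-Rao bound sigma_0^2 / n

print(mu_hat, best_on_grid, crlb)
```

The grid maximizer coincides with the sample mean, and `crlb` equals the exact variance of \(\bar{X}\) for this model.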

Bayesian Inference & Regression

The Bayesian paradigm and linear regression models.

🔮 Bayesian Framework & Regression

Theorem — Bayes' Theorem (Continuous) Given prior \(\pi(\theta)\) and likelihood \(f(x|\theta)\), the posterior distribution is: \[\pi(\theta|x) = \frac{f(x|\theta)\pi(\theta)}{\int f(x|\theta)\pi(\theta)\,d\theta} \propto f(x|\theta)\pi(\theta)\] The Bayesian estimator under squared error loss is the posterior mean \(E[\theta|x]\).

Definition — Conjugate Priors A prior \(\pi(\theta)\) is conjugate to a likelihood \(f(x|\theta)\) if the posterior \(\pi(\theta|x)\) belongs to the same family. Key pairs:

Normal likelihood + Normal prior \(\to\) Normal posterior
Binomial likelihood + Beta prior \(\to\) Beta posterior
Poisson likelihood + Gamma prior \(\to\) Gamma posterior

Theorem — Gauss-Markov In the linear model \(\mathbf{Y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}\) with \(X\) of full column rank, \(E[\boldsymbol{\varepsilon}]=\mathbf{0}\), and \(\text{Cov}(\boldsymbol{\varepsilon})=\sigma^2 I\), the OLS estimator \(\hat{\boldsymbol{\beta}} = (X^TX)^{-1}X^T\mathbf{Y}\) is the BLUE (Best Linear Unbiased Estimator): it has the smallest variance among all linear unbiased estimators.
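For a single regressor plus intercept, the formula \((X^TX)^{-1}X^T\mathbf{Y}\) reduces to a 2x2 system that can be solved by hand. The sketch below does this with exact fractions on made-up, noise-free data; the function name `ols_simple` and the data are assumptions for illustration, and it computes the estimator only, without demonstrating the optimality claim.

```python
from fractions import Fraction


def ols_simple(xs, ys):
    """Solve the normal equations (X^T X) b = X^T y for design columns [1, x]."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx            # determinant of X^T X; nonzero if xs vary
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [Fraction(k) for k in range(4)]
ys = [1 + 2 * x for x in xs]

intercept, slope = ols_simple(xs, ys)
print(intercept, slope)  # 1 2
```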

Example

Let \(X|\theta \sim \text{Poisson}(\theta)\) with prior \(\theta \sim \text{Gamma}(\alpha,\beta)\). Find the posterior after observing \(x_1,\ldots,x_n\).

Likelihood: \(L(\theta) \propto \theta^{\sum x_i}e^{-n\theta}\). Prior: \(\pi(\theta) \propto \theta^{\alpha-1}e^{-\beta\theta}\). Posterior: \(\pi(\theta|x) \propto \theta^{\alpha + \sum x_i - 1}e^{-(\beta+n)\theta}\), which is \(\text{Gamma}(\alpha + \sum x_i,\; \beta + n)\). The posterior mean is \(\frac{\alpha + \sum x_i}{\beta + n}\) — a weighted average of the prior mean \(\alpha/\beta\) and the sample mean \(\bar{x}\).
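The conjugate update amounts to two additions, which the following sketch makes explicit. It treats \(\beta\) as the rate parameter, consistent with the prior kernel \(e^{-\beta\theta}\) above; the prior values and the counts are hypothetical.

```python
from fractions import Fraction


def gamma_poisson_update(alpha, beta, counts):
    """Posterior (shape, rate) for a Gamma(alpha, beta) prior, with beta the rate."""
    return alpha + sum(counts), beta + len(counts)


alpha, beta = Fraction(2), Fraction(1)   # hypothetical prior, mean alpha/beta = 2
counts = [3, 0, 2, 4, 1, 5]              # hypothetical Poisson observations

post_alpha, post_beta = gamma_poisson_update(alpha, beta, counts)
post_mean = post_alpha / post_beta

# Weighted-average form: weight beta/(beta+n) on the prior mean, n/(beta+n) on x-bar.
n = len(counts)
x_bar = Fraction(sum(counts), n)
weight_prior = beta / (beta + n)
weighted = weight_prior * (alpha / beta) + (1 - weight_prior) * x_bar

print(post_alpha, post_beta)   # 17 7
print(post_mean, weighted)     # 17/7 17/7
```

The weight on the prior mean is \(\beta/(\beta+n)\), so the data dominate as \(n\) grows.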

Key Takeaways

The sigma-algebra framework makes probability rigorous; pairwise independence is strictly weaker than mutual independence.
The MGF and characteristic function uniquely determine distributions and simplify moment calculations.
SLLN needs only finite first moment; CLT needs finite variance — know the minimal assumptions.
The Cramer-Rao bound provides the best-case variance; Rao-Blackwell gives a constructive path to UMVUE.
Neyman-Pearson is optimal for simple hypotheses; generalized LRT extends to composite hypotheses.
Conjugate priors allow closed-form Bayesian updates; the Gauss-Markov theorem underpins linear regression.

Practice Problems

Problem 1

Let \(X \sim \text{Exp}(\lambda)\). Compute the characteristic function \(\varphi_X(t)\) and use it to find \(E[X]\) and \(E[X^2]\).

Solution

\(\varphi_X(t) = \frac{\lambda}{\lambda - it}\). \(E[X] = \frac{1}{i}\varphi'(0) = 1/\lambda\). \(\varphi''(t) = \frac{-2\lambda}{(\lambda-it)^3}\), \(E[X^2] = \frac{1}{i^2}\varphi''(0) = 2/\lambda^2\). So \(\text{Var}(X) = 1/\lambda^2\).
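The relations \(E[X] = \varphi'(0)/i\) and \(E[X^2] = \varphi''(0)/i^2\) can be checked numerically with Python's built-in complex numbers. This sketch uses central finite differences with an assumed step `h = 1e-4`; the function names are illustrative.

```python
def exp_cf(t, lam):
    """Characteristic function of Exp(lam): lam / (lam - i t)."""
    return lam / (lam - 1j * t)


def moments_from_cf(lam, h=1e-4):
    """Approximate E[X] and E[X^2] from finite-difference derivatives at t = 0."""
    d1 = (exp_cf(h, lam) - exp_cf(-h, lam)) / (2 * h)
    d2 = (exp_cf(h, lam) - 2 * exp_cf(0.0, lam) + exp_cf(-h, lam)) / (h * h)
    first = (d1 / 1j).real          # E[X]   = phi'(0) / i
    second = (d2 / 1j ** 2).real    # E[X^2] = phi''(0) / i^2
    return first, second


lam = 2.0
first, second = moments_from_cf(lam)
print(first, second)  # close to 1/lam = 0.5 and 2/lam^2 = 0.5
```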

Problem 2

Prove that convergence in probability does not imply almost sure convergence by constructing a counterexample.

Solution

The "typewriter sequence": on \([0,1)\) with Lebesgue measure, let \(X_n = \mathbf{1}_{I_n}\) where \(I_n\) cycles through \([0,1), [0,1/2), [1/2,1), [0,1/3),\ldots\), so the interval lengths shrink to 0. Then \(P(|X_n|>\varepsilon) \le P(X_n = 1)\), which is the length of \(I_n\) and tends to 0 (convergence in probability to 0). But for each \(\omega\), \(X_n(\omega)=1\) infinitely often and \(X_n(\omega)=0\) infinitely often, so \(X_n(\omega)\) converges for no \(\omega\), and \(X_n\) does not converge a.s.
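The behaviour of the typewriter sequence at a fixed point can be observed directly. The sketch below generates the intervals block by block (block \(m\) splits \([0,1)\) into \(m\) equal pieces) and evaluates \(X_n(\omega)\) at one hypothetical point \(\omega = 1/3\); the function names are illustrative.

```python
from fractions import Fraction
from itertools import count, islice


def typewriter_intervals():
    """Yield [j/m, (j+1)/m) for m = 1, 2, 3, ... and j = 0, ..., m-1."""
    for m in count(1):
        for j in range(m):
            yield Fraction(j, m), Fraction(j + 1, m)


def x_values(omega, n_terms):
    """First n_terms values of X_n(omega), the indicator of the n-th interval."""
    return [
        1 if lo <= omega < hi else 0
        for lo, hi in islice(typewriter_intervals(), n_terms)
    ]


omega = Fraction(1, 3)   # hypothetical sample point
n_terms = 55             # the first 10 complete blocks: 1 + 2 + ... + 10

values = x_values(omega, n_terms)
lengths = [hi - lo for lo, hi in islice(typewriter_intervals(), n_terms)]

print(sum(values))   # 10: exactly one hit per block, so hits never stop
print(lengths[-1])   # 1/10: P(X_n = 1) shrinks toward 0
```

Each block partitions \([0,1)\), so every \(\omega\) is hit exactly once per block, which is why the ones recur forever even though their probability vanishes.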

Problem 3

For \(X_1,\ldots,X_n\) i.i.d. \(\text{Uniform}(0,\theta)\), find the MLE of \(\theta\) and determine if it is unbiased.

Solution

\(L(\theta) = \theta^{-n}\mathbf{1}(\theta \ge X_{(n)})\) where \(X_{(n)} = \max X_i\). This is maximized at \(\hat\theta = X_{(n)}\). \(E[X_{(n)}] = \frac{n}{n+1}\theta \neq \theta\), so it is biased. The unbiased version is \(\frac{n+1}{n}X_{(n)}\).
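The bias of \(X_{(n)}\) shows up clearly in simulation. The following sketch averages the sample maximum over many replications; the values of `theta`, `n`, the trial count, and the seed are arbitrary illustrative choices.

```python
import random


def mean_of_mle(theta, n, trials, seed=0):
    """Monte Carlo average of max(X_1, ..., X_n) for X_i ~ Uniform(0, theta)."""
    rng = random.Random(seed)   # fixed seed so the run is reproducible
    total = 0.0
    for _ in range(trials):
        total += max(rng.uniform(0.0, theta) for _ in range(n))
    return total / trials


theta, n = 2.0, 5
avg_mle = mean_of_mle(theta, n, trials=20_000)
avg_corrected = (n + 1) / n * avg_mle   # bias-corrected estimator (n+1)/n * X_(n)

print(avg_mle, n / (n + 1) * theta)     # simulated mean vs n/(n+1) * theta
print(avg_corrected, theta)             # corrected mean vs theta
```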

Problem 4

State and apply the Neyman-Pearson lemma to test \(H_0: \mu = 0\) vs \(H_1: \mu = 1\) for a single observation \(X \sim N(\mu, 1)\) at level \(\alpha = 0.05\).

Solution

Likelihood ratio: \(\Lambda = \frac{f(x|1)}{f(x|0)} = e^{x - 1/2}\), which is increasing in \(x\). Reject \(H_0\) when \(\Lambda > k\), i.e., \(x > c\). At \(\alpha=0.05\): \(P_0(X>c) = 0.05\), so \(c = z_{0.05} = 1.645\). Reject if \(X > 1.645\). Power: \(P_1(X>1.645) = P(Z>0.645) \approx 0.2595\).
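The critical value and power can be reproduced with the standard library's `statistics.NormalDist`. This is a minimal sketch of the calculation; the variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.05
null_dist = NormalDist(mu=0.0, sigma=1.0)   # distribution of X under H0
alt_dist = NormalDist(mu=1.0, sigma=1.0)    # distribution of X under H1

# Reject when X > c, with c chosen so that P_0(X > c) = alpha.
critical = null_dist.inv_cdf(1 - alpha)

# Power is the rejection probability under the alternative.
power = 1 - alt_dist.cdf(critical)

print(round(critical, 3), round(power, 3))
```

With a single observation and means only one standard deviation apart, the power comes out at roughly one in four, so the most powerful test is still a weak one here.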

Problem 5

Show that the Beta posterior mean from a Binomial-Beta conjugate model is a weighted average of the prior mean and the sample proportion.

Solution

Prior: \(\theta \sim \text{Beta}(\alpha,\beta)\), data: \(X \sim \text{Bin}(n,\theta)\). Posterior: \(\theta|X=x \sim \text{Beta}(\alpha+x, \beta+n-x)\). Posterior mean: \(\frac{\alpha+x}{\alpha+\beta+n} = \frac{\alpha+\beta}{\alpha+\beta+n}\cdot\frac{\alpha}{\alpha+\beta} + \frac{n}{\alpha+\beta+n}\cdot\frac{x}{n}\). This is a weighted average of prior mean \(\alpha/(\alpha+\beta)\) and sample proportion \(x/n\), with weights proportional to "prior sample size" \(\alpha+\beta\) and actual sample size \(n\).
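The weighted-average identity can be confirmed with exact arithmetic. In the sketch below the prior parameters and the data (`n = 10`, `x = 7`) are hypothetical values chosen for illustration.

```python
from fractions import Fraction


def beta_binomial_posterior(alpha, beta, n, x):
    """Posterior Beta parameters after x successes in n Binomial trials."""
    return alpha + x, beta + n - x


alpha, beta = Fraction(2), Fraction(3)   # hypothetical prior, mean 2/5
n, x = 10, 7                             # hypothetical data, proportion 7/10

a_post, b_post = beta_binomial_posterior(alpha, beta, n, x)
post_mean = a_post / (a_post + b_post)

# Weights proportional to the "prior sample size" alpha + beta and to n.
w_prior = (alpha + beta) / (alpha + beta + n)
weighted = w_prior * alpha / (alpha + beta) + (1 - w_prior) * Fraction(x, n)

print(a_post, b_post)        # 9 6
print(post_mean, weighted)   # 3/5 3/5
```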

Self-Assessment Quiz

1. The Strong Law of Large Numbers requires:

A Finite variance

B Finite first moment

C Existence of the MGF

D Bounded support

2. The Cramer-Rao lower bound for the variance of an unbiased estimator of \(\theta\) based on \(n\) i.i.d. observations is:

A \(\frac{1}{I(\theta)}\)

B \(\frac{1}{nI(\theta)}\)

C \(\frac{1}{n^2 I(\theta)}\)

D \(\frac{\sigma^2}{n}\)

3. In the hierarchy of convergence, which implication is TRUE?

A In distribution \(\Rightarrow\) in probability

B Almost sure \(\Rightarrow\) in probability

C In probability \(\Rightarrow\) almost sure

D In distribution \(\Rightarrow\) in \(L^2\)

4. The Neyman-Pearson lemma provides the most powerful test for:

A Simple vs simple hypotheses

B Composite vs composite hypotheses

C Simple vs one-sided composite

D Normal distributions only

5. In the Gauss-Markov theorem, OLS is the BLUE. What does "Best" refer to?

A Most consistent

B Minimum variance among linear unbiased estimators

C Minimum sum of squared errors

D Most admissible