Sampling & Experimental Design
A comprehensive study of survey sampling techniques and experimental design methods — from simple random sampling to factorial experiments and ANOVA.
Simple Random Sampling
Stratified & Systematic
PPS & Ratio-Regression
CRD & Block Designs
BIBD & Factorial
ANOVA & Comparisons
01
Simple Random Sampling
Fundamental sampling technique: SRS with and without replacement, estimation of population parameters, and sample size determination.
SRS Framework
Definition — Simple Random Sampling Without Replacement (SRSWOR)
A sample of size \(n\) drawn from a population of \(N\) units such that every subset of size \(n\) has equal probability \(\binom{N}{n}^{-1}\) of being selected. The inclusion probability for each unit is \(\pi_i = n/N\).
Theorem — Estimation of Population Mean
Under SRSWOR, the sample mean \(\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i\) is an unbiased estimator of the population mean \(\bar{Y} = \frac{1}{N}\sum_{i=1}^N Y_i\):
\[E[\bar{y}] = \bar{Y}, \qquad V(\bar{y}) = \frac{S^2}{n}\left(1 - \frac{n}{N}\right)\]
where \(S^2 = \frac{1}{N-1}\sum_{i=1}^N(Y_i - \bar{Y})^2\) and the factor \((1 - n/N)\) is the finite population correction (fpc). An unbiased estimator of \(V(\bar{y})\) is:
\[\hat{V}(\bar{y}) = \frac{s^2}{n}\left(1 - \frac{n}{N}\right), \quad s^2 = \frac{1}{n-1}\sum_{i=1}^n(y_i - \bar{y})^2\]
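The unbiasedness and variance results above can be checked by brute force on a tiny population. The following is a minimal sketch, assuming a hypothetical five-unit population and \(n = 2\); it enumerates every SRSWOR sample and compares the exact moments of \(\bar{y}\) with the formula \(\frac{S^2}{n}(1 - n/N)\). The function names are illustrative only.

```python
from fractions import Fraction
from itertools import combinations


def srswor_moments(population, n):
    """Exact mean and variance of the sample mean over all SRSWOR samples."""
    samples = list(combinations(population, n))  # all C(N, n) equally likely subsets
    means = [Fraction(sum(s), n) for s in samples]
    expectation = sum(means) / len(means)
    variance = sum((m - expectation) ** 2 for m in means) / len(means)
    return expectation, variance


def srswor_formula_variance(population, n):
    """V(ybar) = (S^2 / n) * (1 - n/N), where S^2 uses the divisor N - 1."""
    N = len(population)
    pop_mean = Fraction(sum(population), N)
    S2 = sum((y - pop_mean) ** 2 for y in population) / (N - 1)
    return S2 / n * (1 - Fraction(n, N))


population = [3, 7, 8, 12, 15]  # hypothetical values, N = 5
expectation, variance = srswor_moments(population, 2)
formula = srswor_formula_variance(population, 2)
print(expectation, variance, formula)
```

Exact rational arithmetic is used so that the enumerated variance and the formula can be compared with `==` rather than a tolerance.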
Definition — SRS With Replacement (SRSWR)
Each draw selects a unit with probability \(1/N\), independently. The variance of \(\bar{y}\) under SRSWR is:
\[V_{\text{WR}}(\bar{y}) = \frac{\sigma^2}{n}, \quad \sigma^2 = \frac{1}{N}\sum_{i=1}^N(Y_i - \bar{Y})^2\]
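Note: \(V_{\text{WOR}}(\bar{y}) = \frac{N-n}{N-1}\,V_{\text{WR}}(\bar{y}) \le V_{\text{WR}}(\bar{y})\), with equality only when \(n = 1\), so SRSWOR is at least as efficient as SRSWR and strictly more efficient for \(n > 1\).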
Theorem — Sample Size Determination
To estimate \(\bar{Y}\) with margin of error \(e\) at confidence level \(1-\alpha\):
\[n = \frac{N z_{\alpha/2}^2 S^2}{N e^2 + z_{\alpha/2}^2 S^2} = \frac{n_0}{1 + n_0/N}, \qquad n_0 = \frac{z_{\alpha/2}^2 S^2}{e^2}\]
This follows from setting \(z_{\alpha/2}^2 V(\bar{y}) = e^2\) with the SRSWOR variance above. For proportions, use \(S^2 = \frac{N}{N-1}P(1-P) \approx P(1-P)\), where \(P\) is the population proportion.
Example
A population of \(N = 500\) has \(S^2 = 25\). Find the sample size needed to estimate \(\bar{Y}\) within \(\pm 1\) at 95% confidence.
Using \(z_{0.025} = 1.96\): \(n = \frac{500 \times (1.96)^2 \times 25}{500 \times 1 + (1.96)^2 \times 25} = \frac{500 \times 3.8416 \times 25}{500 + 96.04} = \frac{48020}{596.04} \approx 80.6\). Take \(n = 81\).
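The calculation can be scripted. This is a minimal sketch that solves \(z_{\alpha/2}^2 V(\bar{y}) = e^2\) for \(n\) using the SRSWOR variance \(\frac{S^2}{n}(1 - n/N)\) stated earlier, which gives \(n = n_0/(1 + n_0/N)\) with \(n_0 = z^2 S^2/e^2\). The function name and the default \(z = 1.96\) are assumptions made for illustration.

```python
import math


def srs_sample_size(N, S2, e, z=1.96):
    """Sample size for estimating a mean under SRSWOR.

    n0 is the infinite-population size; dividing by (1 + n0/N)
    applies the finite population correction.
    """
    n0 = z ** 2 * S2 / e ** 2
    n_exact = n0 / (1 + n0 / N)
    return n_exact, math.ceil(n_exact)  # always round up to meet the target


n_exact, n_required = srs_sample_size(N=500, S2=25, e=1)
print(round(n_exact, 2), n_required)
```

Rounding up rather than to the nearest integer keeps the margin of error at or below the target.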
02
Stratified & Systematic Sampling
Stratification for variance reduction, allocation methods, and systematic sampling schemes.
Stratified Sampling
Definition — Stratified Random Sampling
The population of \(N\) units is divided into \(L\) non-overlapping strata of sizes \(N_1, N_2, \ldots, N_L\) with \(\sum N_h = N\). An independent SRS of size \(n_h\) is drawn from stratum \(h\). The stratified estimator of \(\bar{Y}\) is:
\[\bar{y}_{st} = \sum_{h=1}^L W_h \bar{y}_h, \quad W_h = \frac{N_h}{N}\]
with \(V(\bar{y}_{st}) = \sum_{h=1}^L W_h^2 \frac{S_h^2}{n_h}\left(1 - \frac{n_h}{N_h}\right)\).
Theorem — Optimal (Neyman) Allocation
The variance \(V(\bar{y}_{st})\) is minimized for fixed total sample size \(n = \sum n_h\) when:
\[n_h = n \cdot \frac{N_h S_h}{\sum_{k=1}^L N_k S_k}\]
Under Neyman allocation: \(V_{\text{opt}}(\bar{y}_{st}) = \frac{1}{n}\left(\sum_{h=1}^L W_h S_h\right)^2 - \frac{1}{N}\sum_{h=1}^L W_h S_h^2\).
If costs vary by stratum with per-unit cost \(c_h\), the cost-optimal allocation is \(n_h \propto N_h S_h / \sqrt{c_h}\).
Definition — Proportional Allocation
\(n_h = n \cdot W_h = n \cdot N_h / N\). Under proportional allocation:
\[V_{\text{prop}}(\bar{y}_{st}) = \frac{1}{n}\left(1 - \frac{n}{N}\right)\sum_{h=1}^L W_h S_h^2\]
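When the strata are large enough that terms of order \(1/N_h\) can be ignored, \(V_{\text{prop}}(\bar{y}_{st}) \le V_{\text{SRS}}(\bar{y})\), and the gain, approximately \(\frac{1}{n}\left(1 - \frac{n}{N}\right)\sum_{h=1}^L W_h(\bar{Y}_h - \bar{Y})^2\), grows as the stratum means differ more.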
Example
A population has 3 strata: \(N_1=200, N_2=300, N_3=500\) with \(S_1=10, S_2=15, S_3=20\). For \(n=100\), find the Neyman allocation.
\(\sum N_h S_h = 200(10) + 300(15) + 500(20) = 2000 + 4500 + 10000 = 16500\).
\(n_1 = 100 \times 2000/16500 \approx 12\), \(n_2 = 100 \times 4500/16500 \approx 27\), \(n_3 = 100 \times 10000/16500 \approx 61\).
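The allocation in this example can be reproduced with a few lines of code. The sketch below assumes the stratum sizes and standard deviations from the example; the function name is illustrative, and simple rounding is used only because it happens to sum to \(n\) here.

```python
def neyman_allocation(N_h, S_h, n):
    """Exact (unrounded) Neyman allocation: n_h proportional to N_h * S_h."""
    weights = [N * S for N, S in zip(N_h, S_h)]
    total = sum(weights)
    return [n * w / total for w in weights]


N_h = [200, 300, 500]
S_h = [10, 15, 20]
exact = neyman_allocation(N_h, S_h, n=100)
rounded = [round(x) for x in exact]
print(exact, rounded, sum(rounded))
```

In general, independently rounded allocations need not add up to \(n\), so a final adjustment step may be required.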
Systematic & Cluster Sampling
Definition — Systematic Sampling
For a population of \(N = nk\) units, select a random start \(r \in \{1, 2, \ldots, k\}\) and include units \(r, r+k, r+2k, \ldots, r+(n-1)k\). The estimator \(\bar{y}_{sys}\) is unbiased for \(\bar{Y}\) but its variance depends on the intraclass correlation:
\[V(\bar{y}_{sys}) = \frac{S^2}{n}\left(1 + (n-1)\rho_{sys}\right)\frac{N-1}{N}\]
where \(\rho_{sys}\) measures the correlation within systematic samples. When the population is in random order, systematic sampling approximates SRSWOR.
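A minimal sketch of the selection rule follows, assuming a hypothetical population of \(N = 12\) values with a linear trend and \(k = 4\). It lists all \(k\) possible systematic samples and shows that their means average to the population mean, which is the unbiasedness property stated above. The helper name and its `start` and `rng` parameters are illustrative.

```python
import random


def systematic_sample(values, k, start=None, rng=random):
    """Return units r, r+k, r+2k, ... for a 1-based start r in {1, ..., k}."""
    if len(values) % k != 0:
        raise ValueError("this sketch assumes N = n * k exactly")
    r = rng.randint(1, k) if start is None else start
    return values[r - 1::k]


values = list(range(1, 13))  # hypothetical population with a linear trend
k = 4
n = len(values) // k

# Enumerate the k possible samples instead of drawing one at random.
sample_means = [sum(systematic_sample(values, k, start=r)) / n for r in range(1, k + 1)]
print(sample_means, sum(sample_means) / k, sum(values) / len(values))
```

Because this population has a trend, the \(k\) sample means differ from one another, which illustrates why the variance depends on how the units are ordered.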
Definition — Cluster Sampling
The population is divided into \(M\) clusters. A sample of \(m\) clusters is selected by SRS, and all units within selected clusters are observed. The unbiased estimator of the population total is:
\[\hat{Y}_{cl} = \frac{M}{m}\sum_{i=1}^m Y_i\]
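where \(Y_i\) is the total of the \(i\)-th sampled cluster. For the same number of observed units, cluster sampling is less efficient than SRS when units within a cluster tend to be alike (internally homogeneous clusters, positive intraclass correlation); it performs best when each cluster is internally heterogeneous.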
03
PPS Sampling & Ratio-Regression Estimation
Unequal probability sampling methods and auxiliary information for improved estimation.
PPS Sampling
Definition — PPS Sampling (With Replacement)
Each unit \(i\) is selected with probability \(p_i\) proportional to a known size measure \(z_i\): \(p_i = z_i / \sum z_j\). The Hansen-Hurwitz estimator of the population total \(Y\) based on \(n\) draws with replacement is:
\[\hat{Y}_{HH} = \frac{1}{n}\sum_{i=1}^n \frac{y_i}{p_i}, \quad V(\hat{Y}_{HH}) = \frac{1}{n}\sum_{i=1}^N p_i\left(\frac{Y_i}{p_i} - Y\right)^2\]
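The estimator is a plain average of the expanded values \(y_i/p_i\). The sketch below assumes a hypothetical four-unit population and checks unbiasedness exactly by summing over all ordered pairs of draws for \(n = 2\); the data and function name are illustrative.

```python
from fractions import Fraction
from itertools import product


def hansen_hurwitz_total(y_drawn, p_drawn):
    """Average of y_i / p_i over the n draws (with replacement)."""
    n = len(y_drawn)
    return sum(y / p for y, p in zip(y_drawn, p_drawn)) / n


z = [10, 20, 30, 40]   # size measures
Y = [30, 50, 95, 110]  # hypothetical population values
p = [Fraction(zi, sum(z)) for zi in z]

# Exact expectation: weight each ordered pair of draws by its probability.
expected = Fraction(0)
for draws in product(range(len(Y)), repeat=2):
    prob = p[draws[0]] * p[draws[1]]
    estimate = hansen_hurwitz_total([Y[i] for i in draws], [p[i] for i in draws])
    expected += prob * estimate

print(expected, sum(Y))
```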
Theorem — Horvitz-Thompson Estimator
For any without-replacement sampling design with inclusion probabilities \(\pi_i = P(\text{unit } i \in s) > 0\) for every unit, the Horvitz-Thompson estimator is:
\[\hat{Y}_{HT} = \sum_{i \in s} \frac{y_i}{\pi_i}\]
It is unbiased: \(E[\hat{Y}_{HT}] = Y\). Its variance is:
\[V(\hat{Y}_{HT}) = \sum_{i=1}^N \sum_{j=1}^N (\pi_{ij} - \pi_i\pi_j)\frac{Y_i}{\pi_i}\frac{Y_j}{\pi_j}\]
where \(\pi_{ij}\) is the joint inclusion probability of units \(i\) and \(j\), with \(\pi_{ii} = \pi_i\).
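As a sanity check on the double-sum variance, it can be evaluated for SRSWOR, where \(\pi_i = n/N\) and \(\pi_{ij} = \frac{n(n-1)}{N(N-1)}\) for \(i \ne j\), and compared with \(N^2\,\frac{S^2}{n}(1 - n/N)\), the variance of \(N\bar{y}\). The sketch below assumes a hypothetical five-unit population and takes \(\pi_{ii} = \pi_i\) on the diagonal; all names are illustrative.

```python
from fractions import Fraction


def ht_variance(Y, pi, pi_joint):
    """Double-sum Horvitz-Thompson variance, using pi_ii = pi_i on the diagonal."""
    N = len(Y)
    total = Fraction(0)
    for i in range(N):
        for j in range(N):
            pij = pi[i] if i == j else pi_joint(i, j)
            total += (pij - pi[i] * pi[j]) * Y[i] * Y[j] / (pi[i] * pi[j])
    return total


Y = [3, 7, 8, 12, 15]  # hypothetical population values
N, n = len(Y), 2
pi = [Fraction(n, N)] * N


def srswor_joint(i, j):
    """Joint inclusion probability of two distinct units under SRSWOR."""
    return Fraction(n * (n - 1), N * (N - 1))


v_ht = ht_variance(Y, pi, srswor_joint)

# Variance of N * ybar under SRSWOR, for comparison.
Y_bar = Fraction(sum(Y), N)
S2 = sum((y - Y_bar) ** 2 for y in Y) / (N - 1)
v_srs_total = N ** 2 * S2 / n * (1 - Fraction(n, N))
print(v_ht, v_srs_total)
```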
Ratio & Regression Estimation
Definition — Ratio Estimator
When an auxiliary variable \(x\) correlated with \(y\) is available and \(\bar{X}\) is known:
\[\hat{\bar{Y}}_R = \frac{\bar{y}}{\bar{x}} \cdot \bar{X} = \hat{R} \cdot \bar{X}, \qquad \hat{R} = \frac{\bar{y}}{\bar{x}}\]
The ratio estimator is biased but consistent. The approximate variance under SRSWOR is:
\[V(\hat{\bar{Y}}_R) \approx \frac{1-f}{n}\left(S_y^2 + R^2 S_x^2 - 2R S_{xy}\right)\]
where \(f = n/N\) and \(R = \bar{Y}/\bar{X}\).
Definition — Regression Estimator
The linear regression estimator is:
\[\hat{\bar{Y}}_{lr} = \bar{y} + b(\bar{X} - \bar{x})\]
where \(b = s_{xy}/s_x^2\) is the sample regression coefficient. The approximate variance is:
\[V(\hat{\bar{Y}}_{lr}) \approx \frac{1-f}{n} S_y^2(1 - \rho^2)\]
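where \(\rho\) is the population correlation between \(x\) and \(y\). To this first-order (large-sample) approximation, the regression estimator is at least as efficient as the ratio estimator, with equality when the population regression line passes through the origin, that is, when \(\rho S_y / S_x = R\).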
Example
In a survey, \(\bar{y} = 50\), \(\bar{x} = 40\), \(\bar{X} = 42\). Compute the ratio and regression estimates given \(b = 1.1\).
Ratio estimate: \(\hat{\bar{Y}}_R = (50/40) \times 42 = 1.25 \times 42 = 52.5\).
Regression estimate: \(\hat{\bar{Y}}_{lr} = 50 + 1.1(42 - 40) = 50 + 2.2 = 52.2\).
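Both estimates are one-line computations. A minimal sketch using the numbers from this example, with illustrative function names:

```python
def ratio_estimate(y_bar, x_bar, X_bar):
    """Ratio estimator of the population mean: (ybar / xbar) * Xbar."""
    return y_bar / x_bar * X_bar


def regression_estimate(y_bar, x_bar, X_bar, b):
    """Linear regression estimator: ybar + b * (Xbar - xbar)."""
    return y_bar + b * (X_bar - x_bar)


ratio = ratio_estimate(50, 40, 42)
regression = regression_estimate(50, 40, 42, b=1.1)
print(ratio, round(regression, 4))
```

Both estimators shift \(\bar{y}\) upward here because the sample mean of \(x\) fell below the known population mean \(\bar{X}\).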
04
Completely Randomized & Block Designs
Classical experimental designs: CRD, RBD, and Latin Square Design with ANOVA.
CRD and One-Way ANOVA
Definition — Completely Randomized Design (CRD)
In a CRD, \(k\) treatments are randomly assigned to \(N\) experimental units, with \(n_i\) units receiving treatment \(i\) and \(N = \sum_{i=1}^k n_i\). The linear model is:
\[Y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \quad i=1,\ldots,k,\; j=1,\ldots,n_i\]
where \(\tau_i\) is the \(i\)-th treatment effect and \(\varepsilon_{ij} \sim N(0, \sigma^2)\) i.i.d.
Theorem — ANOVA F-Test for CRD
The ANOVA decomposition is \(SS_T = SS_{Tr} + SS_E\) where:
\[SS_{Tr} = \sum_{i=1}^k n_i(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2, \quad SS_E = \sum_{i=1}^k\sum_{j=1}^{n_i}(Y_{ij} - \bar{Y}_{i\cdot})^2\]
Under \(H_0: \tau_1 = \cdots = \tau_k = 0\):
\[F = \frac{SS_{Tr}/(k-1)}{SS_E/(N-k)} \sim F_{k-1, N-k}\]
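The decomposition can be computed directly from raw data. The sketch below assumes a small hypothetical data set with three treatments and returns the two sums of squares and the F statistic; comparing F with a critical value would need an F table or a statistics library, which is outside this sketch.

```python
def one_way_anova(groups):
    """Return (SS_Tr, SS_E, F) for a CRD; groups is a list of lists of responses."""
    N = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / N
    means = [sum(g) / len(g) for g in groups]
    # Between-treatment variation, weighted by group size n_i.
    ss_tr = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-treatment (error) variation.
    ss_e = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    f_stat = (ss_tr / (k - 1)) / (ss_e / (N - k))
    return ss_tr, ss_e, f_stat


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical responses
ss_tr, ss_e, f_stat = one_way_anova(groups)
print(round(ss_tr, 4), round(ss_e, 4), round(f_stat, 4))
```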
RBD and Latin Square Design
Definition — Randomized Block Design (RBD)
Experimental units are grouped into \(b\) homogeneous blocks. Each block receives all \(k\) treatments exactly once. The model is:
\[Y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij}, \quad i=1,\ldots,k,\; j=1,\ldots,b\]
The ANOVA decomposition: \(SS_T = SS_{Tr} + SS_B + SS_E\) with \(df_E = (k-1)(b-1)\).
Theorem — Efficiency of RBD Relative to CRD
The relative efficiency of RBD to CRD is:
\[RE = \frac{(b-1)MS_B + b(k-1)MS_E}{(bk-1)MS_E}\]
When \(MS_B > MS_E\) (blocks are effective), \(RE > 1\) and RBD is more efficient.
Definition — Latin Square Design (LSD)
A \(k \times k\) Latin square simultaneously controls two blocking factors (rows and columns). Each treatment appears exactly once in each row and each column. The model is:
\[Y_{ij(l)} = \mu + \alpha_i + \beta_j + \tau_l + \varepsilon_{ij(l)}\]
where \(\alpha_i\) = row effect, \(\beta_j\) = column effect, \(\tau_l\) = treatment effect. Error df = \((k-1)(k-2)\).
Example
In an RBD with \(k = 4\) treatments and \(b = 5\) blocks, \(SS_{Tr} = 120\), \(SS_B = 80\), \(SS_E = 48\). Test for treatment effects at \(\alpha = 0.05\).
\(MS_{Tr} = 120/3 = 40\), \(MS_E = 48/12 = 4\). \(F = 40/4 = 10\). With \(df = (3, 12)\), \(F_{0.05,3,12} = 3.49\). Since \(10 > 3.49\), we reject \(H_0\) and conclude significant treatment differences.
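The same example can be worked from the sums of squares in code, which also gives the relative efficiency of the RBD from the formula above. This is a minimal sketch with an illustrative function name; the critical value 3.49 is taken from the example rather than computed.

```python
def rbd_summary(ss_tr, ss_b, ss_e, k, b):
    """Mean squares, treatment F statistic and RE of RBD relative to CRD."""
    ms_tr = ss_tr / (k - 1)
    ms_b = ss_b / (b - 1)
    ms_e = ss_e / ((k - 1) * (b - 1))
    f_stat = ms_tr / ms_e
    rel_eff = ((b - 1) * ms_b + b * (k - 1) * ms_e) / ((b * k - 1) * ms_e)
    return {"MS_Tr": ms_tr, "MS_B": ms_b, "MS_E": ms_e, "F": f_stat, "RE": rel_eff}


summary = rbd_summary(ss_tr=120, ss_b=80, ss_e=48, k=4, b=5)
reject_h0 = summary["F"] > 3.49  # critical value F_{0.05,3,12} from the example
print(summary, reject_h0)
```

Here \(MS_B = 20\) exceeds \(MS_E = 4\), so the relative efficiency is above 1 and blocking was worthwhile in this example.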
05
BIBD & Factorial Experiments
Balanced incomplete block designs and \(2^k\) factorial experiments with confounding.
Balanced Incomplete Block Designs
Definition — BIBD Parameters
A Balanced Incomplete Block Design BIBD\((v, b, r, k, \lambda)\) is an arrangement of \(v\) treatments in \(b\) blocks, each of size \(k < v\), such that each treatment appears in \(r\) blocks and every pair of treatments appears together in \(\lambda\) blocks. The fundamental relations are:
\[bk = vr, \qquad \lambda(v-1) = r(k-1)\]
Fisher's inequality: \(b \ge v\). A BIBD with \(b = v\) (and hence \(r = k\)) is called symmetric.
Theorem — Intrablock Analysis of BIBD
The adjusted treatment sum of squares in a BIBD is:
\[SS_{Tr(adj)} = \frac{k}{\lambda v}\sum_{i=1}^v Q_i^2\]
where \(Q_i = T_i - \frac{1}{k}\sum_{j} n_{ij}B_j\) is the adjusted treatment total (\(T_i\) = treatment total, \(B_j\) = block total, \(n_{ij}\) = incidence). The variance of an adjusted treatment contrast is \(V(\hat{\tau}_i - \hat{\tau}_j) = \frac{2k\sigma^2}{\lambda v}\).
Example
Verify that a design with \(v = 4, b = 6, r = 3, k = 2, \lambda = 1\) satisfies the BIBD conditions.
Check: \(bk = 6 \times 2 = 12 = 4 \times 3 = vr\). Also \(\lambda(v-1) = 1 \times 3 = 3 = 3 \times 1 = r(k-1)\). Fisher's inequality: \(b = 6 \ge 4 = v\). All conditions are satisfied. This is the design with all \(\binom{4}{2} = 6\) pairs as blocks.
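The parameter check and the design itself can both be verified programmatically. The sketch below solves the two fundamental relations for \(b\) and \(r\), then confirms that the all-pairs design on 4 treatments has constant replication and constant pair concurrence. Function names are illustrative, and integer solutions are only a necessary condition for existence.

```python
from collections import Counter
from itertools import combinations


def bibd_b_and_r(v, k, lam):
    """Solve lam(v-1) = r(k-1) and bk = vr; return (b, r), or None if non-integer."""
    if (lam * (v - 1)) % (k - 1):
        return None
    r = lam * (v - 1) // (k - 1)
    if (v * r) % k:
        return None
    return v * r // k, r


def is_balanced(blocks, v):
    """Check equal block sizes, equal replication, and equal pair concurrence."""
    sizes = {len(block) for block in blocks}
    reps = Counter(t for block in blocks for t in block)
    pairs = Counter(p for block in blocks for p in combinations(sorted(block), 2))
    rep_counts = {reps[t] for t in range(1, v + 1)}
    pair_counts = {pairs[p] for p in combinations(range(1, v + 1), 2)}
    return len(sizes) == 1 and len(rep_counts) == 1 and len(pair_counts) == 1


blocks = list(combinations(range(1, 5), 2))  # all 6 pairs of treatments 1..4
params = bibd_b_and_r(v=4, k=2, lam=1)
balanced = is_balanced(blocks, v=4)
print(params, balanced)
```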
\(2^k\) Factorial Experiments
Definition — \(2^k\) Factorial Design
A \(2^k\) factorial experiment studies \(k\) factors, each at 2 levels. There are \(2^k\) treatment combinations. The main effect of factor \(A\) is:
\[A = \frac{1}{2^{k-1}}\left[\sum y(\text{A high}) - \sum y(\text{A low})\right]\]
The \(AB\) interaction measures the differential effect of \(A\) at the two levels of \(B\).
Definition — Confounding in \(2^k\) Designs
When the block size is smaller than \(2^k\), certain effects are confounded with blocks. In a \(2^k\) design in \(2^p\) blocks, \(2^p - 1\) effects are confounded: \(p\) independently chosen effects together with their \(2^p - p - 1\) generalized interactions. The confounded effects are usually chosen to be high-order interactions. The defining contrast subgroup determines the block structure.
For a \(2^3\) design in 2 blocks confounding \(ABC\): assign treatment combinations with an even number of letters from \(\{A, B, C\}\) to Block 1, and odd to Block 2.
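The parity rule can be sketched in a few lines. The code below generates the eight treatment combinations of a \(2^3\) design and splits them by whether they contain an even or odd number of letters; the helper names are illustrative.

```python
from itertools import product


def treatment_label(levels, factors="abc"):
    """Standard label: letters of the factors at the high level, '(1)' if none."""
    label = "".join(f for f, level in zip(factors, levels) if level)
    return label or "(1)"


def blocks_confounding_abc():
    """Split the 2^3 treatment combinations by parity of the number of letters."""
    even_block, odd_block = [], []
    for levels in product((0, 1), repeat=3):
        target = even_block if sum(levels) % 2 == 0 else odd_block
        target.append(treatment_label(levels))
    return even_block, odd_block


block1, block2 = blocks_confounding_abc()
print(sorted(block1), sorted(block2))
```

Every combination in one block carries the same sign in the \(ABC\) contrast, so the block difference and the \(ABC\) effect cannot be separated.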
Example
In a \(2^3\) factorial, the treatment totals are: \((1)=10\), \(a=30\), \(b=20\), \(ab=40\), \(c=15\), \(ac=35\), \(bc=25\), \(abc=45\). Calculate the main effect of \(A\).
Contrast for \(A\): \([A] = a - (1) + ab - b + ac - c + abc - bc = 30 - 10 + 40 - 20 + 35 - 15 + 45 - 25 = 80\).
Main effect: \(A = [A] / 2^{k-1} = 80 / 4 = 20\). The effect of changing factor A from low to high increases the response by 20 on average.
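Any effect in a \(2^k\) design can be estimated with the same sign rule: a treatment combination gets a factor of \(+1\) for each letter of the effect it contains and \(-1\) for each it lacks. The sketch below applies this to the totals from the example, assuming a single replicate; the function name and the `replicates` parameter are illustrative.

```python
def factorial_effect(totals, letters, k, replicates=1):
    """Estimate an effect such as 'a' (main effect A) or 'ab' (AB interaction)."""
    contrast = 0
    for label, total in totals.items():
        present = set() if label == "(1)" else set(label)
        sign = 1
        for factor in letters:
            sign *= 1 if factor in present else -1
        contrast += sign * total
    return contrast / (replicates * 2 ** (k - 1))


totals = {"(1)": 10, "a": 30, "b": 20, "ab": 40,
          "c": 15, "ac": 35, "bc": 25, "abc": 45}

effect_A = factorial_effect(totals, "a", k=3)
effect_B = factorial_effect(totals, "b", k=3)
effect_AB = factorial_effect(totals, "ab", k=3)
print(effect_A, effect_B, effect_AB)
```

With these totals the \(AB\) interaction is zero, which says that the effect of \(A\) is the same at both levels of \(B\).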
06
ANOVA & Multiple Comparisons
Analysis of variance for fixed, random, and mixed models with multiple comparison procedures.
Fixed, Random & Mixed Models
Definition — Fixed vs Random Effects
In the one-way model \(Y_{ij} = \mu + \tau_i + \varepsilon_{ij}\):
- Fixed effects: \(\tau_i\) are unknown constants with \(\sum \tau_i = 0\). Interest is in specific treatment means.
- Random effects: \(\tau_i \sim N(0, \sigma_\tau^2)\) are random variables. Interest is in variance components \(\sigma_\tau^2\).
- Mixed models: Some factors are fixed, others random. The expected mean squares determine the correct F-test denominators.
Theorem — Expected Mean Squares
In the one-way random effects model with balanced data (\(n\) observations per treatment):
\[E[MS_{Tr}] = \sigma^2 + n\sigma_\tau^2, \quad E[MS_E] = \sigma^2\]
Hence \(F = MS_{Tr}/MS_E\) tests \(H_0: \sigma_\tau^2 = 0\), and the ANOVA estimator of the variance component is \(\hat{\sigma}_\tau^2 = (MS_{Tr} - MS_E)/n\).
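The expected mean squares can be illustrated by simulation. The following is a rough Monte Carlo sketch, assuming hypothetical values \(k = 5\), \(n = 10\), \(\sigma^2 = 8\), \(\sigma_\tau^2 = 4\) and \(\mu = 0\); the averages of \(MS_{Tr}\) and \(MS_E\) over many simulated data sets should land near \(\sigma^2 + n\sigma_\tau^2 = 48\) and \(\sigma^2 = 8\), up to simulation noise.

```python
import random


def simulate_mean_squares(k, n, sigma2, sigma_tau2, reps=2000, seed=0):
    """Average MS_Tr and MS_E over simulated balanced random-effects data sets."""
    rng = random.Random(seed)
    ms_tr_sum = 0.0
    ms_e_sum = 0.0
    for _ in range(reps):
        groups = []
        for _ in range(k):
            tau = rng.gauss(0, sigma_tau2 ** 0.5)  # random treatment effect
            groups.append([tau + rng.gauss(0, sigma2 ** 0.5) for _ in range(n)])
        means = [sum(g) / n for g in groups]
        grand_mean = sum(means) / k
        ms_tr_sum += n * sum((m - grand_mean) ** 2 for m in means) / (k - 1)
        ms_e_sum += sum((y - m) ** 2 for g, m in zip(groups, means) for y in g) / (k * (n - 1))
    return ms_tr_sum / reps, ms_e_sum / reps


ms_tr_avg, ms_e_avg = simulate_mean_squares(k=5, n=10, sigma2=8.0, sigma_tau2=4.0)
sigma_tau2_hat = (ms_tr_avg - ms_e_avg) / 10  # ANOVA estimator applied to the averages
print(ms_tr_avg, ms_e_avg, sigma_tau2_hat)
```

In a single data set the ANOVA estimator can come out negative when \(MS_{Tr} < MS_E\); how to handle that case is a separate choice not covered here.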
Multiple Comparison Procedures
Definition — Tukey's Honestly Significant Difference (HSD)
For pairwise comparison of \(k\) treatment means with equal sample sizes \(n\), two means are significantly different if:
\[|\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot}| > q_{\alpha}(k, N-k)\sqrt{\frac{MS_E}{n}}\]
where \(q_{\alpha}(k, \nu)\) is the upper \(\alpha\) quantile of the Studentized range distribution. Tukey's HSD controls the family-wise error rate at \(\alpha\).
Definition — Duncan's Multiple Range Test
Duncan's test uses varying critical values depending on the number of means in the range. For means ranked \(\bar{Y}_{(1)} \le \cdots \le \bar{Y}_{(k)}\), compare:
\[\bar{Y}_{(j)} - \bar{Y}_{(i)} \text{ against } r_\alpha(p, \nu)\sqrt{MS_E/n}\]
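where \(p = j - i + 1\) is the number of ordered means spanned by the comparison. Duncan's test is more liberal than Tukey's HSD: it declares more differences significant but does not hold the family-wise error rate at \(\alpha\).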
Example
In a CRD with \(k = 3\) treatments, \(n = 10\) per treatment, \(MS_E = 5.0\), and treatment means \(\bar{Y}_1 = 20\), \(\bar{Y}_2 = 24\), \(\bar{Y}_3 = 22\). Using \(q_{0.05}(3, 27) = 3.506\), perform Tukey's HSD.
HSD \(= 3.506\sqrt{5.0/10} = 3.506 \times 0.707 = 2.479\).
\(|\bar{Y}_2 - \bar{Y}_1| = 4 > 2.479\): significant.
\(|\bar{Y}_3 - \bar{Y}_1| = 2 < 2.479\): not significant.
\(|\bar{Y}_2 - \bar{Y}_3| = 2 < 2.479\): not significant.
Conclusion: Treatment 2 is significantly different from Treatment 1.
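The pairwise comparisons can be automated once the Studentized range critical value is known. The sketch below takes \(q_{0.05}(3, 27) = 3.506\) from the example as an input rather than computing it; the function name is illustrative.

```python
from itertools import combinations
from math import sqrt


def tukey_hsd(means, ms_e, n, q_crit):
    """Return the HSD threshold and a list of (i, j, |difference|, significant)."""
    hsd = q_crit * sqrt(ms_e / n)
    comparisons = []
    for (i, mean_i), (j, mean_j) in combinations(enumerate(means, start=1), 2):
        diff = abs(mean_i - mean_j)
        comparisons.append((i, j, diff, diff > hsd))
    return hsd, comparisons


hsd, comparisons = tukey_hsd(means=[20, 24, 22], ms_e=5.0, n=10, q_crit=3.506)
significant_pairs = [(i, j) for i, j, _, sig in comparisons if sig]
print(round(hsd, 3), significant_pairs)
```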
Key Takeaways
- SRSWOR is at least as efficient as SRSWR, and strictly more efficient for \(n > 1\); the gain is reflected in the finite population correction factor \((1 - n/N)\).
- Neyman allocation minimizes variance by sampling more from strata with higher variability.
- The Horvitz-Thompson estimator is unbiased under any sampling design whose inclusion probabilities are known and strictly positive for every unit.
- Blocking (RBD, LSD) removes known sources of variation; BIBD handles cases where block size is smaller than the number of treatments.
- In \(2^k\) factorial experiments, confounding sacrifices information on high-order interactions to reduce block size.
- Tukey's HSD controls the family-wise error rate; use expected mean squares to identify the correct F-test denominator in mixed models.
Practice Problems
Problem 1
A population of \(N = 1000\) is divided into 2 strata with \(N_1 = 400, S_1 = 8\) and \(N_2 = 600, S_2 = 12\). For \(n = 80\), compare the variances under proportional and Neyman allocation.
Solution
Proportional: \(n_1 = 32, n_2 = 48\). \(V_{prop} = \frac{0.92}{80}(0.4 \times 64 + 0.6 \times 144) = \frac{0.92}{80}(25.6 + 86.4) = \frac{0.92 \times 112}{80} = 1.288\).
Neyman: \(\sum N_h S_h = 400(8) + 600(12) = 10400\). \(n_1 = 80 \times 3200/10400 \approx 25, n_2 \approx 55\). \(V_{opt} = \frac{1}{80}(0.4 \times 8 + 0.6 \times 12)^2 - \frac{1}{1000}(0.4 \times 64 + 0.6 \times 144) = \frac{10.4^2}{80} - \frac{112}{1000} = 1.352 - 0.112 = 1.240\). Since \(1.240 < 1.288\), Neyman allocation gives the lower variance; the gain over proportional allocation arises because the stratum standard deviations differ.
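The comparison can be checked numerically using the two variance formulas from the stratified sampling unit, including the \(\frac{1}{N}\sum W_h S_h^2\) correction term in the Neyman variance. This is a minimal sketch with illustrative function names; it uses the exact (unrounded) allocations implied by the formulas.

```python
def var_proportional(W, S, n, N):
    """V_prop = (1/n)(1 - n/N) * sum(W_h * S_h^2)."""
    return (1 - n / N) / n * sum(w * s ** 2 for w, s in zip(W, S))


def var_neyman(W, S, n, N):
    """V_opt = (1/n)(sum W_h S_h)^2 - (1/N) * sum(W_h * S_h^2)."""
    first = sum(w * s for w, s in zip(W, S)) ** 2 / n
    correction = sum(w * s ** 2 for w, s in zip(W, S)) / N
    return first - correction


W = [0.4, 0.6]  # stratum weights N_h / N
S = [8, 12]     # stratum standard deviations
v_prop = var_proportional(W, S, n=80, N=1000)
v_opt = var_neyman(W, S, n=80, N=1000)
print(round(v_prop, 3), round(v_opt, 3))
```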
Problem 2
In a PPS with-replacement sample of size \(n = 5\) from a population of 4 units with sizes \(z = (10, 20, 30, 40)\), unit 2 is selected with value \(y_2 = 50\). What is its contribution to the Hansen-Hurwitz estimator?
Solution
\(p_2 = 20/100 = 0.2\). Contribution = \(y_2 / p_2 = 50 / 0.2 = 250\). The HH estimator averages all such contributions: \(\hat{Y}_{HH} = \frac{1}{n}\sum_{i=1}^n y_i/p_i\).
Problem 3
Does a BIBD exist with parameters \(v = 6, k = 3, \lambda = 1\)? If so, find \(b\) and \(r\).
Solution
From \(\lambda(v-1) = r(k-1)\): \(1 \times 5 = r \times 2\), so \(r = 5/2\), which is not an integer. Therefore, no such BIBD exists. The necessary conditions are not sufficient, but here a necessary condition already fails.
Problem 4
In a \(2^2\) factorial experiment with two replicates, the treatment totals are \((1) = 20\), \(a = 40\), \(b = 30\), \(ab = 60\). Find the main effects and interaction.
Solution
Contrasts: \([A] = a - (1) + ab - b = 40 - 20 + 60 - 30 = 50\). \([B] = b - (1) + ab - a = 30 - 20 + 60 - 40 = 30\). \([AB] = (1) - a - b + ab = 20 - 40 - 30 + 60 = 10\).
With \(2^k \cdot r = 4 \times 2 = 8\) total observations: effects = contrast / \(2^{k-1} \cdot r\) = contrast / 4. So \(A = 12.5\), \(B = 7.5\), \(AB = 2.5\).
Problem 5
In a random effects one-way ANOVA with \(k = 5\) groups and \(n = 10\) per group, \(MS_{Tr} = 48\) and \(MS_E = 8\). Estimate the variance component \(\sigma_\tau^2\) and test \(H_0: \sigma_\tau^2 = 0\).
Solution
\(\hat{\sigma}_\tau^2 = (MS_{Tr} - MS_E)/n = (48 - 8)/10 = 4\). \(F = 48/8 = 6\) with \(df = (4, 45)\). Since \(F_{0.05, 4, 45} \approx 2.58\), we reject \(H_0\) and conclude that the random treatment variance component is significantly positive.
Self-Assessment Quiz
1. In SRSWOR, the finite population correction factor is:
A \(n/N\)
B \(1 - n/N\)
C \(N/(N-n)\)
D \((N-1)/N\)
2. Neyman allocation assigns larger samples to strata with:
A Larger population sizes only
B Larger products \(N_h S_h\)
C Higher stratum means
D Equal sample sizes across all strata
3. The Horvitz-Thompson estimator requires knowledge of:
A Population mean
B Population variance
C Inclusion probabilities \(\pi_i\)
D Auxiliary variable values
4. In a BIBD with \(v = 7, k = 3, \lambda = 1\), the number of blocks is:
A 7
B 14
C 21
D 10
5. In a random effects one-way ANOVA, \(E[MS_{Tr}]\) equals:
A \(\sigma^2\)
B \(\sigma^2 + n\sigma_\tau^2\)
C \(\sigma^2 + n\sum\tau_i^2/(k-1)\)
D \(\sigma^2 + (n-1)\sigma_\tau^2\)