
Sampling & Experimental Design

A comprehensive study of survey sampling techniques and experimental design methods — from simple random sampling to factorial experiments and ANOVA.



Simple Random Sampling

Fundamental sampling technique: SRS with and without replacement, estimation of population parameters, and sample size determination.

🎯 SRS Framework

Definition — Simple Random Sampling Without Replacement (SRSWOR) A sample of size \(n\) drawn from a population of \(N\) units such that every subset of size \(n\) has equal probability \(\binom{N}{n}^{-1}\) of being selected. The inclusion probability for each unit is \(\pi_i = n/N\).

Theorem — Estimation of Population Mean Under SRSWOR, the sample mean \(\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i\) is an unbiased estimator of the population mean \(\bar{Y} = \frac{1}{N}\sum_{i=1}^N Y_i\): \[E[\bar{y}] = \bar{Y}, \qquad V(\bar{y}) = \frac{S^2}{n}\left(1 - \frac{n}{N}\right)\] where \(S^2 = \frac{1}{N-1}\sum_{i=1}^N(Y_i - \bar{Y})^2\) and the factor \((1 - n/N)\) is the finite population correction (fpc). An unbiased estimator of \(V(\bar{y})\) is: \[\hat{V}(\bar{y}) = \frac{s^2}{n}\left(1 - \frac{n}{N}\right), \quad s^2 = \frac{1}{n-1}\sum_{i=1}^n(y_i - \bar{y})^2\]
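As a rough illustration of the theorem, the following Python sketch computes \(\bar{y}\) and \(\hat{V}(\bar{y})\) with the fpc. The function name `srswor_mean_estimate` and the sample values are hypothetical, chosen only for illustration.

```python
from statistics import mean, variance

def srswor_mean_estimate(sample, N):
    """Return (ybar, estimated variance of ybar) for an SRSWOR sample."""
    n = len(sample)
    ybar = mean(sample)
    s2 = variance(sample)      # sample variance with divisor n - 1
    fpc = 1 - n / N            # finite population correction
    return ybar, (s2 / n) * fpc

# Hypothetical sample of n = 5 units from a population of N = 50
ybar, v_hat = srswor_mean_estimate([12, 15, 11, 14, 13], N=50)
```

Here \(s^2 = 2.5\) and the fpc is \(0.9\), so the estimated variance is \((2.5/5)(0.9) = 0.45\).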

Definition — SRS With Replacement (SRSWR) Each draw selects a unit with probability \(1/N\), independently. The variance of \(\bar{y}\) under SRSWR is: \[V_{\text{WR}}(\bar{y}) = \frac{\sigma^2}{n}, \quad \sigma^2 = \frac{1}{N}\sum_{i=1}^N(Y_i - \bar{Y})^2\] Note: since \(S^2(1 - n/N) = \sigma^2 (N-n)/(N-1)\), we have \(V_{\text{WOR}}(\bar{y}) \le V_{\text{WR}}(\bar{y})\), with equality only when \(n = 1\). SRSWOR is therefore at least as efficient as SRSWR.

Theorem — Sample Size Determination To estimate \(\bar{Y}\) with margin of error \(e\) at confidence level \(1-\alpha\) under SRSWOR, set \(z_{\alpha/2}^2 V(\bar{y}) = e^2\) and solve for \(n\): \[n = \frac{N z_{\alpha/2}^2 S^2}{N e^2 + z_{\alpha/2}^2 S^2} = \frac{n_0}{1 + n_0/N}, \quad n_0 = \frac{z_{\alpha/2}^2 S^2}{e^2}\] For proportions, use \(S^2 = \frac{N}{N-1}P(1-P) \approx P(1-P)\) where \(P\) is the population proportion.

Example

A population of \(N = 500\) has \(S^2 = 25\). Find the sample size needed to estimate \(\bar{Y}\) within \(\pm 1\) at 95% confidence.

Using \(z_{0.025} = 1.96\): \(n = \frac{500 \times (1.96)^2 \times 25}{500 \times 1 + (1.96)^2 \times 25} = \frac{500 \times 3.8416 \times 25}{500 + 96.04} = \frac{48020}{596.04} \approx 80.6\). Take \(n = 81\).
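The calculation can be sketched in Python by solving \(z_{\alpha/2}^2 V(\bar{y}) = e^2\) with the SRSWOR variance \(V(\bar{y}) = \frac{S^2}{n}(1 - n/N)\), which gives the two-step form \(n_0 = z_{\alpha/2}^2 S^2/e^2\), \(n = n_0/(1 + n_0/N)\). The function name `srswor_sample_size` is hypothetical, and the sketch assumes the normal approximation holds.

```python
import math
from statistics import NormalDist

def srswor_sample_size(N, S2, e, conf=0.95):
    """Smallest integer n with z^2 * V(ybar) <= e^2 under SRSWOR."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # about 1.96 for 95%
    n0 = z**2 * S2 / e**2                         # size ignoring the fpc
    return math.ceil(n0 / (1 + n0 / N))           # adjust for finite N, round up

n_required = srswor_sample_size(N=500, S2=25, e=1)
```

Rounding up rather than to the nearest integer keeps the margin of error at or below \(e\).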

Stratified & Systematic Sampling

Stratification for variance reduction, allocation methods, and systematic sampling schemes.

📊 Stratified Sampling

Definition — Stratified Random Sampling The population of \(N\) units is divided into \(L\) non-overlapping strata of sizes \(N_1, N_2, \ldots, N_L\) with \(\sum N_h = N\). An independent SRS of size \(n_h\) is drawn from stratum \(h\). The stratified estimator of \(\bar{Y}\) is: \[\bar{y}_{st} = \sum_{h=1}^L W_h \bar{y}_h, \quad W_h = \frac{N_h}{N}\] with \(V(\bar{y}_{st}) = \sum_{h=1}^L W_h^2 \frac{S_h^2}{n_h}\left(1 - \frac{n_h}{N_h}\right)\).

Theorem — Optimal (Neyman) Allocation The variance \(V(\bar{y}_{st})\) is minimized for fixed total sample size \(n = \sum n_h\) when: \[n_h = n \cdot \frac{N_h S_h}{\sum_{k=1}^L N_k S_k}\] Under Neyman allocation: \(V_{\text{opt}}(\bar{y}_{st}) = \frac{1}{n}\left(\sum_{h=1}^L W_h S_h\right)^2 - \frac{1}{N}\sum_{h=1}^L W_h S_h^2\).

If costs vary by stratum with per-unit cost \(c_h\), the cost-optimal allocation is \(n_h \propto N_h S_h / \sqrt{c_h}\).

Definition — Proportional Allocation \(n_h = n \cdot W_h = n \cdot N_h / N\). Under proportional allocation: \[V_{\text{prop}}(\bar{y}_{st}) = \frac{1}{n}\left(1 - \frac{n}{N}\right)\sum_{h=1}^L W_h S_h^2\] Ignoring terms of order \(1/N_h\), \(V_{\text{prop}}(\bar{y}_{st}) \le V_{\text{SRS}}(\bar{y})\) for the same \(n\), and the gain grows with the differences among the stratum means.

Example

A population has 3 strata: \(N_1=200, N_2=300, N_3=500\) with \(S_1=10, S_2=15, S_3=20\). For \(n=100\), find the Neyman allocation.

\(\sum N_h S_h = 200(10) + 300(15) + 500(20) = 2000 + 4500 + 10000 = 16500\). \(n_1 = 100 \times 2000/16500 \approx 12\), \(n_2 = 100 \times 4500/16500 \approx 27\), \(n_3 = 100 \times 10000/16500 \approx 61\).
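The same allocation can be reproduced with a short Python sketch. The function name `neyman_allocation` is hypothetical; note that simple rounding is used here, which in general need not sum exactly to \(n\).

```python
def neyman_allocation(n, sizes, sds):
    """Exact (unrounded) Neyman allocation n_h = n * N_h S_h / sum(N_k S_k)."""
    weights = [N_h * S_h for N_h, S_h in zip(sizes, sds)]
    total = sum(weights)
    return [n * w / total for w in weights]

exact = neyman_allocation(100, [200, 300, 500], [10, 15, 20])
rounded = [round(x) for x in exact]   # nearest-integer sample sizes
```

For this example the rounded sizes are 12, 27 and 61, which do add up to 100.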

🔄 Systematic & Cluster Sampling

Definition — Systematic Sampling For a population of \(N = nk\) units, select a random start \(r \in \{1, 2, \ldots, k\}\) and include units \(r, r+k, r+2k, \ldots, r+(n-1)k\). The estimator \(\bar{y}_{sys}\) is unbiased for \(\bar{Y}\) but its variance depends on the intraclass correlation: \[V(\bar{y}_{sys}) = \frac{S^2}{n}\left(1 + (n-1)\rho_{sys}\right)\frac{N-1}{N}\] where \(\rho_{sys}\) measures the correlation within systematic samples. When the population is in random order, systematic sampling approximates SRSWOR.
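A minimal Python sketch of the selection rule, assuming \(N\) is an exact multiple of \(k\) and units are labelled \(1, \ldots, N\). The function name `systematic_sample` is hypothetical.

```python
import random

def systematic_sample(N, k, rng=random):
    """Return the unit labels r, r+k, ..., r+(n-1)k for a random start r."""
    if N % k != 0:
        raise ValueError("this sketch assumes N = n * k exactly")
    r = rng.randint(1, k)               # random start in {1, ..., k}
    return list(range(r, N + 1, k))     # every k-th unit from r onward

# Example: N = 20, k = 5 gives a sample of n = 4 units
units = systematic_sample(20, 5, random.Random(0))
```

Only \(k\) distinct samples are possible, which is why the variance depends on how similar the units within each of them are.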

Definition — Cluster Sampling The population is divided into \(M\) clusters. A sample of \(m\) clusters is selected by SRS, and all units within selected clusters are observed. The unbiased estimator of the population total is: \[\hat{Y}_{cl} = \frac{M}{m}\sum_{i=1}^m Y_i\] where \(Y_i\) is the cluster total. Cluster sampling is less efficient than SRS of the same number of units when clusters are internally homogeneous (positive intraclass correlation); it becomes efficient when each cluster is internally heterogeneous.

PPS Sampling & Ratio-Regression Estimation

Unequal probability sampling methods and auxiliary information for improved estimation.

📐 PPS Sampling

Definition — PPS Sampling (With Replacement) Each unit \(i\) is selected with probability \(p_i\) proportional to a known size measure \(z_i\): \(p_i = z_i / \sum z_j\). The Hansen-Hurwitz estimator of the population total \(Y\) based on \(n\) draws with replacement is: \[\hat{Y}_{HH} = \frac{1}{n}\sum_{i=1}^n \frac{y_i}{p_i}, \quad V(\hat{Y}_{HH}) = \frac{1}{n}\sum_{i=1}^N p_i\left(\frac{Y_i}{p_i} - Y\right)^2\]
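The Hansen-Hurwitz formula can be sketched in Python as follows. The function name `hansen_hurwitz_total` and the three draws are hypothetical values used only to show the arithmetic.

```python
def hansen_hurwitz_total(y_drawn, p_drawn):
    """Average of y_i / p_i over the n draws (repeated units count each time)."""
    n = len(y_drawn)
    return sum(y / p for y, p in zip(y_drawn, p_drawn)) / n

# Hypothetical draws: observed values and their selection probabilities
y_drawn = [50, 90, 120]
p_drawn = [0.2, 0.3, 0.4]
hh_total = hansen_hurwitz_total(y_drawn, p_drawn)   # (250 + 300 + 300) / 3
```

When \(y_i\) is nearly proportional to \(p_i\), the ratios \(y_i/p_i\) are nearly constant and the variance is small.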

Theorem — Horvitz-Thompson Estimator For any unequal probability sampling design without replacement, with inclusion probabilities \(\pi_i = P(\text{unit } i \in s) > 0\) for every unit, the Horvitz-Thompson estimator is: \[\hat{Y}_{HT} = \sum_{i \in s} \frac{y_i}{\pi_i}\] It is unbiased: \(E[\hat{Y}_{HT}] = Y\). Its variance is: \[V(\hat{Y}_{HT}) = \sum_{i=1}^N \sum_{j=1}^N (\pi_{ij} - \pi_i\pi_j)\frac{Y_i}{\pi_i}\frac{Y_j}{\pi_j}\] where \(\pi_{ij}\) is the joint inclusion probability of units \(i\) and \(j\), with \(\pi_{ii} = \pi_i\).
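As a small check, the sketch below evaluates the Horvitz-Thompson estimator in Python for the special case of SRSWOR, where \(\pi_i = n/N\) and the estimator reduces to \(N\bar{y}\). The function name `horvitz_thompson_total` and the data are hypothetical.

```python
def horvitz_thompson_total(y_sample, pi_sample):
    """Sum of y_i / pi_i over the distinct sampled units."""
    return sum(y / pi for y, pi in zip(y_sample, pi_sample))

# Hypothetical SRSWOR sample of n = 5 from N = 50, so every pi_i = 5/50
N = 50
sample = [12, 15, 11, 14, 13]
pi = [len(sample) / N] * len(sample)
ht_total = horvitz_thompson_total(sample, pi)   # equals N * ybar = 50 * 13
```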

📈 Ratio & Regression Estimation

Definition — Ratio Estimator When an auxiliary variable \(x\) correlated with \(y\) is available and \(\bar{X}\) is known: \[\hat{\bar{Y}}_R = \frac{\bar{y}}{\bar{x}} \cdot \bar{X} = \hat{R} \cdot \bar{X}\] The ratio estimator is biased but consistent. The approximate variance under SRSWOR is: \[V(\hat{\bar{Y}}_R) \approx \frac{1-f}{n}\left(S_y^2 + R^2 S_x^2 - 2R S_{xy}\right)\] where \(f = n/N\), \(\hat{R} = \bar{y}/\bar{x}\) is the sample ratio, and \(R = \bar{Y}/\bar{X}\) is the population ratio.

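Definition — Regression Estimator The linear regression estimator is: \[\hat{\bar{Y}}_{lr} = \bar{y} + b(\bar{X} - \bar{x})\] where \(b = s_{xy}/s_x^2\) is the sample regression coefficient. The approximate variance is: \[V(\hat{\bar{Y}}_{lr}) \approx \frac{1-f}{n} S_y^2(1 - \rho^2)\] where \(\rho\) is the population correlation between \(x\) and \(y\). To this order of approximation (large samples), the regression estimator is at least as efficient as the ratio estimator, since the difference of the two variances is proportional to \((R S_x - \rho S_y)^2 \ge 0\); the two coincide when \(R = \rho S_y / S_x\), i.e. when the regression line passes through the origin.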

Example

In a survey, \(\bar{y} = 50\), \(\bar{x} = 40\), \(\bar{X} = 42\). Compute the ratio and regression estimates given \(b = 1.1\).

Ratio estimate: \(\hat{\bar{Y}}_R = (50/40) \times 42 = 1.25 \times 42 = 52.5\).
Regression estimate: \(\hat{\bar{Y}}_{lr} = 50 + 1.1(42 - 40) = 50 + 2.2 = 52.2\).
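The two estimates can be reproduced with a few lines of Python. The function names `ratio_estimate` and `regression_estimate` are hypothetical.

```python
def ratio_estimate(ybar, xbar, X_bar):
    """Ratio estimator of the population mean: (ybar / xbar) * X_bar."""
    return (ybar / xbar) * X_bar

def regression_estimate(ybar, xbar, X_bar, b):
    """Linear regression estimator: ybar + b * (X_bar - xbar)."""
    return ybar + b * (X_bar - xbar)

ratio_est = ratio_estimate(50, 40, 42)
reg_est = regression_estimate(50, 40, 42, b=1.1)
```

Both estimators adjust \(\bar{y}\) upward here because the sample mean of \(x\) falls below the known population mean \(\bar{X}\).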

Completely Randomized & Block Designs

Classical experimental designs: CRD, RBD, and Latin Square Design with ANOVA.

🔬 CRD and One-Way ANOVA

Definition — Completely Randomized Design (CRD) In a CRD, \(k\) treatments are randomly assigned to \(N\) experimental units, with treatment \(i\) applied to \(n_i\) units (\(\sum_i n_i = N\)). The linear model is: \[Y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \quad i=1,\ldots,k,\; j=1,\ldots,n_i\] where \(\tau_i\) is the \(i\)-th treatment effect and \(\varepsilon_{ij} \sim N(0, \sigma^2)\) i.i.d.

Theorem — ANOVA F-Test for CRD The ANOVA decomposition is \(SS_T = SS_{Tr} + SS_E\) where: \[SS_{Tr} = \sum_{i=1}^k n_i(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2, \quad SS_E = \sum_{i=1}^k\sum_{j=1}^{n_i}(Y_{ij} - \bar{Y}_{i\cdot})^2\] Under \(H_0: \tau_1 = \cdots = \tau_k = 0\): \[F = \frac{SS_{Tr}/(k-1)}{SS_E/(N-k)} \sim F_{k-1, N-k}\]
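The decomposition can be sketched in plain Python as follows. The function name `one_way_anova_f` and the three small groups are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
def one_way_anova_f(groups):
    """Return (F, df_treatment, df_error) for a one-way layout."""
    k = len(groups)
    N = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / N
    means = [sum(g) / len(g) for g in groups]
    # Between-treatment and within-treatment sums of squares
    ss_tr = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_e = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    f_stat = (ss_tr / (k - 1)) / (ss_e / (N - k))
    return f_stat, k - 1, N - k

# Hypothetical data: group means 2, 3, 6 give SS_Tr = 26 and SS_E = 6
f_stat, df1, df2 = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

For these data \(F = (26/2)/(6/6) = 13\) on \((2, 6)\) degrees of freedom; the critical value would still have to be read from an \(F\) table.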

🧱 RBD and Latin Square Design

Definition — Randomized Block Design (RBD) Experimental units are grouped into \(b\) homogeneous blocks. Each block receives all \(k\) treatments exactly once. The model is: \[Y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij}, \quad i=1,\ldots,k,\; j=1,\ldots,b\] The ANOVA decomposition: \(SS_T = SS_{Tr} + SS_B + SS_E\) with \(df_E = (k-1)(b-1)\).

Theorem — Efficiency of RBD Relative to CRD The relative efficiency of RBD to CRD is: \[RE = \frac{(b-1)MS_B + b(k-1)MS_E}{(bk-1)MS_E}\] When \(MS_B > MS_E\) (blocks are effective), \(RE > 1\) and RBD is more efficient.

Definition — Latin Square Design (LSD) A \(k \times k\) Latin square simultaneously controls two blocking factors (rows and columns). Each treatment appears exactly once in each row and each column. The model is: \[Y_{ij(l)} = \mu + \alpha_i + \beta_j + \tau_l + \varepsilon_{ij(l)}\] where \(\alpha_i\) = row effect, \(\beta_j\) = column effect, \(\tau_l\) = treatment effect. Error df = \((k-1)(k-2)\).

Example

In an RBD with \(k = 4\) treatments and \(b = 5\) blocks, \(SS_{Tr} = 120\), \(SS_B = 80\), \(SS_E = 48\). Test for treatment effects at \(\alpha = 0.05\).

\(MS_{Tr} = 120/3 = 40\), \(MS_E = 48/12 = 4\). \(F = 40/4 = 10\). With \(df = (3, 12)\), \(F_{0.05,3,12} = 3.49\). Since \(10 > 3.49\), we reject \(H_0\) and conclude significant treatment differences.
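The mean squares, the \(F\) statistic, and the relative efficiency formula from the theorem above can be checked with a short Python sketch. The function name `rbd_f_and_efficiency` is hypothetical.

```python
def rbd_f_and_efficiency(ss_tr, ss_b, ss_e, k, b):
    """Return (F for treatments, relative efficiency of RBD to CRD)."""
    ms_tr = ss_tr / (k - 1)
    ms_b = ss_b / (b - 1)
    ms_e = ss_e / ((k - 1) * (b - 1))
    f_stat = ms_tr / ms_e
    rel_eff = ((b - 1) * ms_b + b * (k - 1) * ms_e) / ((b * k - 1) * ms_e)
    return f_stat, rel_eff

f_stat, rel_eff = rbd_f_and_efficiency(120, 80, 48, k=4, b=5)
```

With \(MS_B = 80/4 = 20\), the relative efficiency is \((4 \times 20 + 15 \times 4)/(19 \times 4) = 140/76 \approx 1.84\), so blocking was worthwhile in this example.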

BIBD & Factorial Experiments

Balanced incomplete block designs and \(2^k\) factorial experiments with confounding.

🔷 Balanced Incomplete Block Designs

Definition — BIBD Parameters A Balanced Incomplete Block Design BIBD\((v, b, r, k, \lambda)\) is an arrangement of \(v\) treatments in \(b\) blocks, each of size \(k < v\), such that each treatment appears in \(r\) blocks and every pair of treatments appears together in \(\lambda\) blocks. The fundamental relations are: \[bk = vr, \qquad \lambda(v-1) = r(k-1)\] Fisher's inequality: \(b \ge v\). A BIBD with \(b = v\) (and hence \(r = k\)) is called symmetric.

Theorem — Intrablock Analysis of BIBD The adjusted treatment sum of squares in a BIBD is: \[SS_{Tr(adj)} = \frac{k}{\lambda v}\sum_{i=1}^v Q_i^2\] where \(Q_i = T_i - \frac{1}{k}\sum_{j} n_{ij}B_j\) is the adjusted treatment total (\(T_i\) = treatment total, \(B_j\) = block total, \(n_{ij}\) = incidence). The variance of an adjusted treatment contrast is \(V(\hat{\tau}_i - \hat{\tau}_j) = \frac{2k\sigma^2}{\lambda v}\).

Example

Verify that a design with \(v = 4, b = 6, r = 3, k = 2, \lambda = 1\) satisfies the BIBD conditions.

Check: \(bk = 6 \times 2 = 12 = 4 \times 3 = vr\). Also \(\lambda(v-1) = 1 \times 3 = 3 = 3 \times 1 = r(k-1)\). Fisher's inequality: \(b = 6 \ge 4 = v\). All conditions are satisfied. This is the design with all \(\binom{4}{2} = 6\) pairs as blocks.
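The same checks can be written as a small Python helper. The function name `bibd_necessary_conditions` is hypothetical; it tests only the necessary conditions listed above and does not prove that a design exists.

```python
def bibd_necessary_conditions(v, b, r, k, lam):
    """True if (v, b, r, k, lambda) satisfy the necessary BIBD relations."""
    return (
        k < v                              # blocks are incomplete
        and b * k == v * r                 # count plots two ways
        and lam * (v - 1) == r * (k - 1)   # count pairs containing a treatment
        and b >= v                         # Fisher's inequality
    )

ok = bibd_necessary_conditions(v=4, b=6, r=3, k=2, lam=1)
```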

🧪 \(2^k\) Factorial Experiments

Definition — \(2^k\) Factorial Design A \(2^k\) factorial experiment studies \(k\) factors, each at 2 levels. There are \(2^k\) treatment combinations. For a single replicate, the main effect of factor \(A\) is: \[A = \frac{1}{2^{k-1}}\left[\sum y(\text{A high}) - \sum y(\text{A low})\right]\] With \(r\) replicates and \(y\) denoting treatment totals, the divisor becomes \(r\,2^{k-1}\). The \(AB\) interaction measures the differential effect of \(A\) at the two levels of \(B\).

Definition — Confounding in \(2^k\) Designs When block size is smaller than \(2^k\), certain effects are confounded with blocks. In a \(2^k\) design in \(2^p\) blocks, \(2^p - 1\) effects are confounded: \(p\) of them are chosen independently, and the remaining \(2^p - p - 1\) are their generalized interactions. The confounded effects are usually chosen to be high-order interactions. The defining contrast subgroup determines the block structure.

For a \(2^3\) design in 2 blocks confounding \(ABC\): assign the treatment combinations with an even number of letters, \((1), ab, ac, bc\), to Block 1, and those with an odd number of letters, \(a, b, c, abc\), to Block 2.
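The parity rule can be sketched in Python as follows. The function name `blocks_confounding_abc` is hypothetical, and treatment combinations use the usual lowercase labels with \((1)\) for all factors at the low level.

```python
from itertools import product

def blocks_confounding_abc():
    """Split the 2^3 treatment combinations by parity of the letter count."""
    block1, block2 = [], []
    for levels in product([0, 1], repeat=3):        # levels of (A, B, C)
        label = "".join(f for f, x in zip("abc", levels) if x) or "(1)"
        if sum(levels) % 2 == 0:
            block1.append(label)    # even number of letters
        else:
            block2.append(label)    # odd number of letters
    return block1, block2

block1, block2 = blocks_confounding_abc()
```

Every combination in Block 2 carries a plus sign in the \(ABC\) contrast and every combination in Block 1 a minus sign, so the \(ABC\) estimate cannot be separated from the block difference.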

Example

In a \(2^3\) factorial, the treatment totals are: \((1)=10\), \(a=30\), \(b=20\), \(ab=40\), \(c=15\), \(ac=35\), \(bc=25\), \(abc=45\). Calculate the main effect of \(A\).

Contrast for \(A\): \([A] = a - (1) + ab - b + ac - c + abc - bc = 30 - 10 + 40 - 20 + 35 - 15 + 45 - 25 = 80\). Main effect: \(A = [A] / 2^{k-1} = 80 / 4 = 20\). The effect of changing factor A from low to high increases the response by 20 on average.
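The sign rule behind the contrast generalizes to every effect: a treatment combination gets a plus sign for a factor if its label contains that letter, and the sign for an interaction is the product of the factor signs. A Python sketch under that convention follows; the function name `effect` is hypothetical.

```python
totals = {"(1)": 10, "a": 30, "b": 20, "ab": 40,
          "c": 15, "ac": 35, "bc": 25, "abc": 45}

def effect(name, totals, k=3, replicates=1):
    """Estimate a main effect or interaction, e.g. 'A' or 'AB'."""
    contrast = 0
    for label, total in totals.items():
        sign = 1
        for factor in name.lower():
            sign *= 1 if factor in label else -1   # high level gives +1
        contrast += sign * total
    return contrast / (2 ** (k - 1) * replicates)

effect_a = effect("A", totals)
effect_ab = effect("AB", totals)
```

For these totals the sketch gives \(A = 20\), \(B = 10\), \(C = 5\), and \(AB = 0\), so the data are purely additive in \(A\) and \(B\).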

ANOVA & Multiple Comparisons

Analysis of variance for fixed, random, and mixed models with multiple comparison procedures.

📋 Fixed, Random & Mixed Models

Definition — Fixed vs Random Effects In the one-way model \(Y_{ij} = \mu + \tau_i + \varepsilon_{ij}\):

Fixed effects: \(\tau_i\) are unknown constants with \(\sum \tau_i = 0\). Interest is in specific treatment means.
Random effects: \(\tau_i \sim N(0, \sigma_\tau^2)\) are random variables. Interest is in variance components \(\sigma_\tau^2\).
Mixed models: Some factors are fixed, others random. The expected mean squares determine the correct F-test denominators.

Theorem — Expected Mean Squares In the one-way random effects model with balanced data (\(n\) observations per treatment): \[E[MS_{Tr}] = \sigma^2 + n\sigma_\tau^2, \quad E[MS_E] = \sigma^2\] Hence \(F = MS_{Tr}/MS_E\) tests \(H_0: \sigma_\tau^2 = 0\), and the ANOVA estimator of the variance component is \(\hat{\sigma}_\tau^2 = (MS_{Tr} - MS_E)/n\).
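A minimal Python sketch of the ANOVA estimator follows. The function name `variance_component` and the mean squares are illustrative assumptions, not values from a real data set.

```python
def variance_component(ms_tr, ms_e, n):
    """ANOVA (method-of-moments) estimate of sigma_tau^2 for balanced data.

    The raw estimate can be negative when MS_Tr < MS_E; this sketch
    returns it unchanged rather than truncating at zero.
    """
    return (ms_tr - ms_e) / n

# Illustrative values: MS_Tr = 30, MS_E = 6, n = 4 observations per treatment
sigma_tau2_hat = variance_component(30, 6, 4)
f_stat = 30 / 6    # F statistic for H0: sigma_tau^2 = 0
```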

🔍 Multiple Comparison Procedures

Definition — Tukey's Honestly Significant Difference (HSD) For pairwise comparison of \(k\) treatment means with equal sample sizes \(n\), two means are significantly different if: \[|\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot}| > q_{\alpha}(k, N-k)\sqrt{\frac{MS_E}{n}}\] where \(q_{\alpha}(k, \nu)\) is the upper \(\alpha\) quantile of the Studentized range distribution. Tukey's HSD controls the family-wise error rate at \(\alpha\).

Definition — Duncan's Multiple Range Test Duncan's test uses varying critical values depending on the number of means in the range. For means ranked \(\bar{Y}_{(1)} \le \cdots \le \bar{Y}_{(k)}\), compare: \[\bar{Y}_{(j)} - \bar{Y}_{(i)} \text{ against } r_\alpha(p, \nu)\sqrt{MS_E/n}\] where \(p = j - i + 1\) is the number of ordered means spanned by the comparison. Duncan's test is more liberal than Tukey's HSD: it declares more differences significant and does not hold the family-wise error rate at \(\alpha\).

Example

In a CRD with \(k = 3\) treatments, \(n = 10\) per treatment, \(MS_E = 5.0\), and treatment means \(\bar{Y}_1 = 20\), \(\bar{Y}_2 = 24\), \(\bar{Y}_3 = 22\). Using \(q_{0.05}(3, 27) = 3.506\), perform Tukey's HSD.

HSD \(= 3.506\sqrt{5.0/10} = 3.506 \times 0.707 = 2.479\).
\(|\bar{Y}_2 - \bar{Y}_1| = 4 > 2.479\): significant.
\(|\bar{Y}_3 - \bar{Y}_1| = 2 < 2.479\): not significant.
\(|\bar{Y}_2 - \bar{Y}_3| = 2 < 2.479\): not significant.
Conclusion: Only Treatments 1 and 2 differ significantly; Treatment 3 cannot be distinguished from either.
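The pairwise comparisons can be sketched in Python, taking the critical value \(q_{0.05}(3, 27) = 3.506\) as given from a Studentized range table. The function name `tukey_hsd` is hypothetical.

```python
import math
from itertools import combinations

def tukey_hsd(means, ms_e, n, q_crit):
    """Return (HSD, {(i, j): True if means i and j differ significantly})."""
    hsd = q_crit * math.sqrt(ms_e / n)
    significant = {
        (i, j): abs(means[i] - means[j]) > hsd
        for i, j in combinations(sorted(means), 2)
    }
    return hsd, significant

means = {1: 20, 2: 24, 3: 22}
hsd, significant = tukey_hsd(means, ms_e=5.0, n=10, q_crit=3.506)
```

The dictionary flags only the pair \((1, 2)\), matching the hand calculation.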

Key Takeaways

SRSWOR is at least as efficient as SRSWR (strictly more efficient for \(n > 1\)) because of the finite population correction factor \((1 - n/N)\).
Neyman allocation minimizes variance by sampling more from strata with higher variability.
The Horvitz-Thompson estimator is unbiased under any sampling design with known inclusion probabilities \(\pi_i > 0\) for every unit.
Blocking (RBD, LSD) removes known sources of variation; BIBD handles cases where block size is smaller than the number of treatments.
In \(2^k\) factorial experiments, confounding sacrifices information on high-order interactions to reduce block size.
Tukey's HSD controls the family-wise error rate; use expected mean squares to identify the correct F-test denominator in mixed models.

Practice Problems

Problem 1

A population of \(N = 1000\) is divided into 2 strata with \(N_1 = 400, S_1 = 8\) and \(N_2 = 600, S_2 = 12\). For \(n = 80\), compare the variances under proportional and Neyman allocation.

Solution

Proportional: \(n_1 = 32, n_2 = 48\). \(V_{prop} = \frac{0.92}{80}(0.4 \times 64 + 0.6 \times 144) = \frac{0.92}{80}(25.6 + 86.4) = \frac{0.92 \times 112}{80} = 1.288\).
Neyman: \(\sum N_h S_h = 400(8) + 600(12) = 10400\). \(n_1 = 80 \times 3200/10400 \approx 25, n_2 \approx 55\). Using the exact (unrounded) allocation, \(V_{opt} = \frac{1}{80}(0.4 \times 8 + 0.6 \times 12)^2 - \frac{1}{1000}(0.4 \times 64 + 0.6 \times 144) = \frac{10.4^2}{80} - \frac{112}{1000} = 1.352 - 0.112 = 1.240\). Since \(1.240 < 1.288\), Neyman allocation gives the lower variance; the gain arises because the stratum standard deviations differ.
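Both variance formulas can be evaluated with a short Python sketch. The function names `var_proportional` and `var_neyman` are hypothetical, and the Neyman formula assumes the exact (unrounded) allocation.

```python
def var_proportional(n, sizes, sds):
    """V_prop = (1/n)(1 - n/N) * sum(W_h * S_h^2)."""
    N = sum(sizes)
    return (1 / n) * (1 - n / N) * sum((Nh / N) * Sh**2 for Nh, Sh in zip(sizes, sds))

def var_neyman(n, sizes, sds):
    """V_opt = (1/n)(sum W_h S_h)^2 - (1/N) sum(W_h * S_h^2)."""
    N = sum(sizes)
    first = sum((Nh / N) * Sh for Nh, Sh in zip(sizes, sds)) ** 2 / n
    second = sum((Nh / N) * Sh**2 for Nh, Sh in zip(sizes, sds)) / N
    return first - second

v_prop = var_proportional(80, [400, 600], [8, 12])
v_opt = var_neyman(80, [400, 600], [8, 12])
```

The second term of the Neyman formula, \(112/1000 = 0.112\), must be subtracted before the two allocations are compared.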

Problem 2

In a PPS with-replacement sample of size \(n = 5\) from a population of 4 units with sizes \(z = (10, 20, 30, 40)\), unit 2 is selected on one draw and has value \(y_2 = 50\). What is its contribution to the Hansen-Hurwitz estimator?

Solution

\(p_2 = 20/100 = 0.2\). Each draw of unit 2 contributes \(y_2 / p_2 = 50 / 0.2 = 250\) to the sum, and hence \(250/n = 250/5 = 50\) to the estimate. The HH estimator averages all such contributions: \(\hat{Y}_{HH} = \frac{1}{n}\sum_{i=1}^n y_i/p_i\).

Problem 3

Does a BIBD exist with parameters \(v = 6, k = 3, \lambda = 1\)? If so, find \(b\) and \(r\).

Solution

From \(\lambda(v-1) = r(k-1)\): \(1 \times 5 = r \times 2\), so \(r = 5/2\), which is not an integer. Therefore, no such BIBD exists. The necessary conditions are not sufficient, but here a necessary condition already fails.

Problem 4

In a \(2^2\) factorial experiment with two replicates, the treatment totals are \((1) = 20\), \(a = 40\), \(b = 30\), \(ab = 60\). Find the main effects and interaction.

Solution

Contrasts: \([A] = a - (1) + ab - b = 40 - 20 + 60 - 30 = 50\). \([B] = b - (1) + ab - a = 30 - 20 + 60 - 40 = 30\). \([AB] = (1) - a - b + ab = 20 - 40 - 30 + 60 = 10\).
With \(r = 2\) replicates there are \(2^k \cdot r = 4 \times 2 = 8\) observations in total, and each effect equals contrast / \((2^{k-1} \cdot r)\) = contrast / 4. So \(A = 12.5\), \(B = 7.5\), \(AB = 2.5\).

Problem 5

In a random effects one-way ANOVA with \(k = 5\) groups and \(n = 10\) per group, \(MS_{Tr} = 48\) and \(MS_E = 8\). Estimate the variance component \(\sigma_\tau^2\) and test \(H_0: \sigma_\tau^2 = 0\).

Solution

\(\hat{\sigma}_\tau^2 = (MS_{Tr} - MS_E)/n = (48 - 8)/10 = 4\). \(F = 48/8 = 6\) with \(df = (4, 45)\). Since \(F_{0.05, 4, 45} \approx 2.58\), we reject \(H_0\) and conclude that the random treatment variance component is significantly positive.

Self-Assessment Quiz

1. In SRSWOR, the finite population correction factor is:

A \(n/N\)

B \(1 - n/N\)

C \(N/(N-n)\)

D \((N-1)/N\)

2. Neyman allocation assigns larger samples to strata with:

A Larger population sizes only

B Larger products \(N_h S_h\)

C Higher stratum means

D Equal sample sizes across all strata

3. The Horvitz-Thompson estimator requires knowledge of:

A Population mean

B Population variance

C Inclusion probabilities \(\pi_i\)

D Auxiliary variable values

4. In a BIBD with \(v = 7, k = 3, \lambda = 1\), the number of blocks is:

A 7

B 14

C 21

D 10

5. In a random effects one-way ANOVA, \(E[MS_{Tr}]\) equals:

A \(\sigma^2\)

B \(\sigma^2 + n\sigma_\tau^2\)

C \(\sigma^2 + n\sum\tau_i^2/(k-1)\)

D \(\sigma^2 + (n-1)\sigma_\tau^2\)