Fisher's_noncentral_hypergeometric_distribution

Fisher's noncentral hypergeometric distribution

In probability theory and statistics, Fisher's noncentral hypergeometric distribution is a generalization of the hypergeometric distribution in which sampling probabilities are modified by weight factors. It can also be defined as the conditional distribution of two or more binomially distributed variables, conditional on their fixed sum.

Probability mass function for Fisher's noncentral hypergeometric distribution for different values of the odds ratio ω.
m₁ = 80, m₂ = 60, n = 100, ω = 0.01, ..., 1000

Biologist and statistician Ronald Fisher

The distribution may be illustrated by the following urn model. Assume, for example, that an urn contains m₁ red balls and m₂ white balls, totalling N = m₁ + m₂ balls. Each red ball has the weight ω₁ and each white ball has the weight ω₂. The odds ratio is ω = ω₁ / ω₂. Balls are now taken at random in such a way that the probability of taking a particular ball is proportional to its weight but independent of what happens to the other balls. The number of balls taken of a particular color then follows the binomial distribution. If the total number n of balls taken is known, the conditional distribution of the number of red balls taken, given n, is Fisher's noncentral hypergeometric distribution. To generate this distribution experimentally, the experiment has to be repeated until it happens to give n balls.
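The urn experiment can be imitated with a small rejection simulation. The sketch below is illustrative only: the function names are hypothetical, and it assumes that a ball of weight ω is taken with probability ω / (1 + ω), so that the weight is the odds of being taken (the convention used in the derivation from two binomials). Experiments whose total differs from n are discarded, and the surviving frequencies are printed next to the exact probabilities.

```python
import random
from collections import Counter
from math import comb

def simulate_fisher_urn(m1, m2, n, w1, w2, trials, seed=0):
    """Empirical distribution of red balls taken, keeping only trials with n balls in total."""
    rng = random.Random(seed)
    # Assumption: a weight w is the odds of being taken, so P(taken) = w / (1 + w).
    p1, p2 = w1 / (1 + w1), w2 / (1 + w2)
    counts = Counter()
    for _ in range(trials):
        red = sum(rng.random() < p1 for _ in range(m1))    # each ball decided independently
        white = sum(rng.random() < p2 for _ in range(m2))
        if red + white == n:                               # reject experiments with the wrong total
            counts[red] += 1
    accepted = sum(counts.values())
    return {x: k / accepted for x, k in sorted(counts.items())}

def exact_pmf(x, m1, m2, n, omega):
    """Fisher's noncentral hypergeometric probability by direct summation."""
    lo, hi = max(0, n - m2), min(n, m1)
    p0 = sum(comb(m1, y) * comb(m2, n - y) * omega**y for y in range(lo, hi + 1))
    return comb(m1, x) * comb(m2, n - x) * omega**x / p0

empirical = simulate_fisher_urn(m1=6, m2=5, n=6, w1=2.0, w2=1.0, trials=20000)
for x, freq in empirical.items():
    print(x, round(freq, 3), round(exact_pmf(x, 6, 5, 6, 2.0), 3))
```

Most experiments are thrown away because their total is not n, which is why this scheme is a way of picturing the distribution rather than an efficient sampler.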

If we want to fix the value of n prior to the experiment then we have to take the balls one by one until we have n balls. The balls are therefore no longer independent. This gives a slightly different distribution known as Wallenius' noncentral hypergeometric distribution. It is far from obvious why these two distributions are different. See the entry for noncentral hypergeometric distributions for an explanation of the difference between these two distributions and a discussion of which distribution to use in various situations.

The two distributions are both equal to the (central) hypergeometric distribution when the odds ratio is 1.

Unfortunately, both distributions are known in the literature as "the" noncentral hypergeometric distribution. It is important to be specific about which distribution is meant when using this name.

Fisher's noncentral hypergeometric distribution was first given the name extended hypergeometric distribution (Harkness, 1965), and some authors still use this name today.

Univariate distribution

The probability function, mean and variance are given in the adjacent table.

An alternative expression of the distribution has both the number of balls taken of each color and the number of balls not taken as random variables, whereby the expression for the probability becomes symmetric.

The calculation time for the probability function can be high when the sum in P₀ has many terms. The calculation time can be reduced by calculating the terms in the sum recursively relative to the term for y = x and ignoring negligible terms in the tails (Liao and Rosen, 2001).
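The recursive idea can be sketched as follows. This is not the algorithm of Liao and Rosen as published, only a minimal illustration under stated assumptions: the function name `fnchypg_pmf` and the cutoff `eps` are hypothetical, and the tail is dropped with a simple heuristic once the terms are shrinking and fall below `eps` times the running sum. Every term is expressed relative to the term for y = x, so no binomial coefficient is ever evaluated.

```python
def fnchypg_pmf(x, n, m1, N, omega, eps=1e-12):
    """PMF at x, with the terms of P0 built recursively relative to the term for y = x."""
    m2 = N - m1
    x_min, x_max = max(0, n - m2), min(n, m1)
    if x < x_min or x > x_max:
        return 0.0

    def ratio(y):
        # term(y) / term(y - 1), valid for x_min < y <= x_max
        return (m1 - y + 1) * (n - y + 1) / (y * (m2 - n + y)) * omega

    total = 1.0                      # the term for y = x, scaled to 1
    term = 1.0
    for y in range(x + 1, x_max + 1):        # walk upwards from x
        r = ratio(y)
        term *= r
        total += term
        if r < 1.0 and term < eps * total:   # shrinking and negligible: stop
            break
    term = 1.0
    for y in range(x, x_min, -1):            # walk downwards from x
        r = ratio(y)
        term /= r                            # term(y - 1) = term(y) / ratio(y)
        total += term
        if r > 1.0 and term < eps * total:
            break
    return 1.0 / total               # scaled term for x divided by scaled P0

print(fnchypg_pmf(60, n=100, m1=80, N=140, omega=2.0))
```

Because the terms are scaled to the one at y = x, the result is simply the reciprocal of the scaled sum. For x far out in a tail the scaled terms can become very large, so a production implementation would need additional care.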

The mean can be approximated by:

$\mu \approx {\frac {-2c}{b-{\sqrt {b^{2}-4ac}}}}$,

where $a=\omega -1$, $b=m_{1}+n-N-(m_{1}+n)\omega$, $c=m_{1}n\omega$.

The variance can be approximated by:

$\sigma ^{2}\approx {\frac {N}{N-1}}{\bigg /}\left({\frac {1}{\mu }}+{\frac {1}{m_{1}-\mu }}+{\frac {1}{n-\mu }}+{\frac {1}{\mu +m_{2}-n}}\right)$.
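The two approximations can be compared with the exact moments on a small example. The sketch below uses hypothetical function names and assumes parameters for which the approximate mean lies strictly inside the support (otherwise the variance formula divides by zero). Written in this form, the mean formula also works for ω = 1, where a = 0 and the expression reduces to the central hypergeometric mean m₁n/N.

```python
from math import comb, sqrt

def approx_mean(n, m1, N, omega):
    """Approximate mean from the quadratic with coefficients a, b, c given in the text."""
    a = omega - 1.0
    b = m1 + n - N - (m1 + n) * omega
    c = m1 * n * omega
    return -2.0 * c / (b - sqrt(b * b - 4.0 * a * c))

def approx_variance(n, m1, N, omega):
    """Approximate variance, evaluated at the approximate mean."""
    m2 = N - m1
    mu = approx_mean(n, m1, N, omega)
    return N / (N - 1) / (1 / mu + 1 / (m1 - mu) + 1 / (n - mu) + 1 / (mu + m2 - n))

def exact_moments(n, m1, N, omega):
    """Exact mean P1/P0 and variance P2/P0 - (P1/P0)^2 by direct summation."""
    m2 = N - m1
    ys = range(max(0, n - m2), min(n, m1) + 1)
    w = [comb(m1, y) * comb(m2, n - y) * omega**y for y in ys]
    p0 = sum(w)
    mean = sum(wi * y for wi, y in zip(w, ys)) / p0
    second = sum(wi * y * y for wi, y in zip(w, ys)) / p0
    return mean, second - mean**2

print(approx_mean(100, 80, 140, 2.0), approx_variance(100, 80, 140, 2.0))
print(exact_moments(100, 80, 140, 2.0))
```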

Better approximations to the mean and variance are given by Levin (1984, 1990), McCullagh and Nelder (1989), Liao (1992), and Eisinga and Pelzer (2011). The saddlepoint methods for approximating the mean and the variance suggested by Eisinga and Pelzer (2011) offer extremely accurate results.

Properties

The following symmetry relations apply:

$\operatorname {fnchypg} (x;n,m_{1},N,\omega )=\operatorname {fnchypg} (n-x;n,m_{2},N,1/\omega )$

$\operatorname {fnchypg} (x;n,m_{1},N,\omega )=\operatorname {fnchypg} (x;m_{1},n,N,\omega )$

$\operatorname {fnchypg} (x;n,m_{1},N,\omega )=\operatorname {fnchypg} (m_{1}-x;N-n,m_{1},N,1/\omega )$

Recurrence relation:

$\operatorname {fnchypg} (x;n,m_{1},N,\omega )=\operatorname {fnchypg} (x-1;n,m_{1},N,\omega ){\frac {(m_{1}-x+1)(n-x+1)}{x(m_{2}-n+x)}}\omega$
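The symmetry relations and the recurrence can be checked numerically on a small parameter set. The following sketch uses exact rational arithmetic so that the identities can be compared with `==`; the function name `fnchypg` simply mirrors the notation above and the parameter values are arbitrary.

```python
from fractions import Fraction
from math import comb

def fnchypg(x, n, m1, N, omega):
    """Exact probability mass by direct summation, returned as a Fraction."""
    m2 = N - m1
    lo, hi = max(0, n - m2), min(n, m1)
    if x < lo or x > hi:
        return Fraction(0)
    omega = Fraction(omega)
    p0 = sum(comb(m1, y) * comb(m2, n - y) * omega**y for y in range(lo, hi + 1))
    return comb(m1, x) * comb(m2, n - x) * omega**x / p0

n, m1, N, omega = 7, 6, 14, Fraction(3, 2)
m2 = N - m1
x_min, x_max = max(0, n - m2), min(n, m1)
for x in range(x_min, x_max + 1):
    p = fnchypg(x, n, m1, N, omega)
    assert p == fnchypg(n - x, n, m2, N, 1 / omega)        # swap the two colors
    assert p == fnchypg(x, m1, n, N, omega)                # swap n and m1
    assert p == fnchypg(m1 - x, N - n, m1, N, 1 / omega)   # count the balls not taken
    if x > x_min:
        ratio = Fraction((m1 - x + 1) * (n - x + 1), x * (m2 - n + x)) * omega
        assert p == fnchypg(x - 1, n, m1, N, omega) * ratio  # recurrence relation
print("all identities hold for this parameter set")
```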

The distribution is affectionately called "finchy-pig," based on the abbreviation convention above.

Derivation

The univariate noncentral hypergeometric distribution may be derived alternatively as a conditional distribution in the context of two binomially distributed random variables, for example when considering the response to a particular treatment in two different groups of patients participating in a clinical trial. An important application of the noncentral hypergeometric distribution in this context is the computation of exact confidence intervals for the odds ratio comparing treatment response between the two groups.

Suppose X and Y are binomially distributed random variables counting the number of responders in two corresponding groups of size $m_{X}$ and $m_{Y}$ respectively,

$X\sim \operatorname {Bin} (m_{X},\pi _{X}),\quad Y\sim \operatorname {Bin} (m_{Y},\pi _{Y})$.

Their odds ratio is given as

$\omega ={\frac {\omega _{X}}{\omega _{Y}}}={\frac {\pi _{X}/(1-\pi _{X})}{\pi _{Y}/(1-\pi _{Y})}}$.

The responder prevalence $\pi _{i}$ is fully defined in terms of the odds $\omega _{i}$ , $i\in \{X,Y\}$ , which correspond to the sampling bias in the urn scheme above, i.e.

$\pi _{i}={\frac {\omega _{i}}{1+\omega _{i}}}$.

The trial can be summarized and analyzed in terms of the following contingency table.

Treatment Group	responder	non-responder	Total
X	$x$	.	$m_{X}$
Y	$y$	.	$m_{Y}$
Total	$n$	.	$N$

In the table, $n=x+y$ corresponds to the total number of responders across groups, and N to the total number of patients recruited into the trial. The dots denote corresponding frequency counts of no further relevance.

The sampling distribution of responders in group X conditional upon the trial outcome and prevalences, $Pr(X=x\;|\;X+Y=n,m_{X},m_{Y},\omega _{X},\omega _{Y})$ , is noncentral hypergeometric:

${\begin{aligned}F(X,\omega ):&=Pr(X=x\;|\;X+Y=n,m_{X},m_{Y},\omega _{X},\omega _{Y})\\&={\frac {Pr(X=x,X+Y=n\;|\;m_{X},m_{Y},\omega _{X},\omega _{Y})}{Pr(X+Y=n\;|\;m_{X},m_{Y},\omega _{X},\omega _{Y})}}\\&={\frac {Pr(X=x\;|\;m_{X},\omega _{X})Pr(Y=n-x\;|\;m_{Y},\omega _{Y},X=x)}{Pr(X+Y=n\;|\;m_{X},m_{Y},\omega _{X},\omega _{Y})}}\\&={\frac {{\binom {m_{X}}{x}}\pi _{X}^{x}(1-\pi _{X})^{m_{X}-x}{\binom {m_{Y}}{n-x}}\pi _{Y}^{n-x}(1-\pi _{Y})^{m_{Y}-(n-x)}}{Pr(X+Y=n\;|\;m_{X},m_{Y},\omega _{X},\omega _{Y})}}\\&={\frac {{\binom {m_{X}}{x}}\omega _{X}^{x}(1-\pi _{X})^{m_{X}}{\binom {m_{Y}}{n-x}}\omega _{Y}^{n-x}(1-\pi _{Y})^{m_{Y}}}{Pr(X+Y=n\;|\;m_{X},m_{Y},\omega _{X},\omega _{Y})}}\\&={\frac {{\binom {m_{X}}{x}}{\binom {m_{Y}}{n-x}}\omega ^{x}(1-\pi _{X})^{m_{X}}\omega _{Y}^{n}(1-\pi _{Y})^{m_{Y}}}{(1-\pi _{X})^{m_{X}}\omega _{Y}^{n}(1-\pi _{Y})^{m_{Y}}\sum _{u=\max(0,n-m_{Y})}^{\min(m_{X},n)}{\binom {m_{X}}{u}}{\binom {m_{Y}}{n-u}}\omega ^{u}}}\\&={\frac {{\binom {m_{X}}{x}}{\binom {m_{Y}}{n-x}}\omega ^{x}}{\sum _{u=\max(0,n-m_{Y})}^{\min(m_{X},n)}{\binom {m_{X}}{u}}{\binom {m_{Y}}{n-u}}\omega ^{u}}}\end{aligned}}$

Note that the denominator is essentially just the numerator, summed over all events of the joint sample space $(X,Y)$ for which it holds that $X+Y=n$ . Terms independent of X can be factored out of the sum and cancel out with the numerator.
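The cancellation can be confirmed numerically. The sketch below (function names are hypothetical) computes the conditional probability directly from the two binomial distributions and compares it with the closed form in ω, using exact rational arithmetic. It also illustrates that only the ratio of the two odds matters: the pairs (2, 1) and (6, 3) give the same conditional distribution.

```python
from fractions import Fraction
from math import comb

def binom_pmf(k, m, pi):
    return comb(m, k) * pi**k * (1 - pi)**(m - k)

def conditional_pmf(x, n, m_x, m_y, w_x, w_y):
    """Pr(X = x | X + Y = n), computed directly from the two binomials."""
    pi_x, pi_y = w_x / (1 + w_x), w_y / (1 + w_y)   # prevalence from odds
    lo, hi = max(0, n - m_y), min(m_x, n)

    def joint(u):
        return binom_pmf(u, m_x, pi_x) * binom_pmf(n - u, m_y, pi_y)

    return joint(x) / sum(joint(u) for u in range(lo, hi + 1))

def closed_form(x, n, m_x, m_y, omega):
    """Final line of the derivation: only the odds ratio omega remains."""
    lo, hi = max(0, n - m_y), min(m_x, n)
    p0 = sum(comb(m_x, u) * comb(m_y, n - u) * omega**u for u in range(lo, hi + 1))
    return comb(m_x, x) * comb(m_y, n - x) * omega**x / p0

m_x, m_y, n = 5, 4, 6
for x in range(max(0, n - m_y), min(m_x, n) + 1):
    target = closed_form(x, n, m_x, m_y, Fraction(2))
    assert conditional_pmf(x, n, m_x, m_y, Fraction(2), Fraction(1)) == target
    assert conditional_pmf(x, n, m_x, m_y, Fraction(6), Fraction(3)) == target
    print(x, target)
```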

Multivariate distribution

The distribution can be expanded to any number of colors c of balls in the urn. The multivariate distribution is used when there are more than two colors.

The probability function and a simple approximation to the mean are given to the right. Better approximations to the mean and variance are given by McCullagh and Nelder (1989).
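The simple approximation to the mean sets $\mu _{i}=m_{i}r\omega _{i}/(r\omega _{i}+1)$ and chooses r so that the $\mu _{i}$ sum to n. Since this sum increases with r from 0 towards N, r can be found by bisection. The sketch below is one possible way to do this, not a reference implementation: the function name is hypothetical, and it assumes 0 ≤ n < N and strictly positive weights.

```python
def mfnchypg_mean_approx(n, m, omega):
    """Approximate means mu_i = m_i*r*w_i / (r*w_i + 1), with r chosen so that sum(mu_i) = n."""
    def means(r):
        return [mi * r * wi / (r * wi + 1.0) for mi, wi in zip(m, omega)]

    if n == 0:
        return [0.0] * len(m)
    lo, hi = 0.0, 1.0
    while sum(means(hi)) < n:        # expand the bracket until it contains the root
        hi *= 2.0
    for _ in range(200):             # plain bisection on the increasing function of r
        mid = 0.5 * (lo + hi)
        if sum(means(mid)) < n:
            lo = mid
        else:
            hi = mid
    return means(0.5 * (lo + hi))

print(mfnchypg_mean_approx(6, [5, 4, 3], [1.0, 1.0, 1.0]))
print(mfnchypg_mean_approx(6, [5, 4, 3], [3.0, 1.0, 0.5]))
```

With equal weights the result reduces to $m_{i}n/N$, the means of the central multivariate hypergeometric distribution.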

Properties

The order of the colors is arbitrary so that any colors can be swapped.

The weights can be arbitrarily scaled:

$\operatorname {mfnchypg} (\mathbf {x} ;n,\mathbf {m} ,{\boldsymbol {\omega }})=\operatorname {mfnchypg} (\mathbf {x} ;n,\mathbf {m} ,r{\boldsymbol {\omega }})$ for all $r\in \mathbb {R} _{+}$.

Colors with no balls ($m_{i}=0$) or with zero weight ($\omega _{i}=0$) can be omitted from the equations.

Colors with the same weight can be joined:

${\begin{aligned}&{}\operatorname {mfnchypg} \left(\mathbf {x} ;n,\mathbf {m} ,(\omega _{1},\ldots ,\omega _{c-1},\omega _{c-1})\right)\\&{}=\operatorname {mfnchypg} \left((x_{1},\ldots ,x_{c-1}+x_{c});n,(m_{1},\ldots ,m_{c-1}+m_{c}),(\omega _{1},\ldots ,\omega _{c-1})\right)\,\cdot \\&\qquad \operatorname {hypg} (x_{c};x_{c-1}+x_{c},m_{c},m_{c-1}+m_{c})\end{aligned}}$

where $\operatorname {hypg} (x;n,m,N)$ is the (univariate, central) hypergeometric distribution probability.
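The scaling and joining properties can be verified by brute-force enumeration on a small example. The sketch below uses hypothetical helper names and exact rational arithmetic; the last two colors are given the same weight so that they can be joined.

```python
from fractions import Fraction
from itertools import product
from math import comb, prod

def weight(x, m, omega):
    """Unnormalized probability: product of C(m_i, x_i) * omega_i^x_i."""
    return prod(comb(mi, xi) * wi**xi for mi, xi, wi in zip(m, x, omega))

def support(n, m):
    """All count vectors with 0 <= x_i <= m_i that sum to n."""
    return [x for x in product(*(range(mi + 1) for mi in m)) if sum(x) == n]

def mfnchypg(x, n, m, omega):
    return weight(x, m, omega) / sum(weight(y, m, omega) for y in support(n, m))

def hypg(x, n, m, N):
    """Central hypergeometric probability of x successes in n draws, m successes among N."""
    return Fraction(comb(m, x) * comb(N - m, n - x), comb(N, n))

n, m = 4, (3, 2, 2)
omega = (Fraction(2), Fraction(1, 2), Fraction(1, 2))   # last two weights are equal
for x in support(n, m):
    p = mfnchypg(x, n, m, omega)
    # Scaling all weights by a positive factor leaves the probability unchanged.
    assert p == mfnchypg(x, n, m, tuple(3 * w for w in omega))
    # Join the last two colors, then split their total hypergeometrically.
    joined = mfnchypg((x[0], x[1] + x[2]), n, (m[0], m[1] + m[2]), omega[:2])
    assert p == joined * hypg(x[2], x[1] + x[2], m[2], m[1] + m[2])
print("scaling and joining hold for this parameter set")
```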

This article uses material from the Wikipedia article Fisher's_noncentral_hypergeometric_distribution, and is written by contributors. Text is available under a CC BY-SA 4.0 International License; additional terms may apply. Images, videos and audio are available under their respective licenses.

Univariate Fisher's noncentral hypergeometric distribution
Parameters	$m_{1},m_{2}\in \mathbb {N}$; $N=m_{1}+m_{2}$; $n\in [0,N)$; $\omega \in \mathbb {R} _{+}$
Support	$x\in [x_{\min },x_{\max }]$; $x_{\min }=\max(0,n-m_{2})$; $x_{\max }=\min(n,m_{1})$
PMF	${\frac {{\binom {m_{1}}{x}}{\binom {m_{2}}{n-x}}\omega ^{x}}{P_{0}}}$, where $P_{0}=\sum _{y=x_{\min }}^{x_{\max }}{\binom {m_{1}}{y}}{\binom {m_{2}}{n-y}}\omega ^{y}$
Mean	${\frac {P_{1}}{P_{0}}}$, where $P_{k}=\sum _{y=x_{\min }}^{x_{\max }}{\binom {m_{1}}{y}}{\binom {m_{2}}{n-y}}\omega ^{y}\,y^{k}$
Mode	$\left\lfloor {\frac {-2C}{B-{\sqrt {B^{2}-4AC}}}}\right\rfloor$, where $A=\omega -1$, $B=m_{1}+n-N-(m_{1}+n+2)\omega$, $C=(m_{1}+1)(n+1)\omega$.
Variance	${\frac {P_{2}}{P_{0}}}-\left({\frac {P_{1}}{P_{0}}}\right)^{2}$, where $P_{k}$ is given above.

Multivariate Fisher's noncentral hypergeometric distribution
Parameters	$c\in \mathbb {N}$; $\mathbf {m} =(m_{1},\ldots ,m_{c})\in \mathbb {N} ^{c}$; $N=\sum _{i=1}^{c}m_{i}$; $n\in [0,N)$; ${\boldsymbol {\omega }}=(\omega _{1},\ldots ,\omega _{c})\in \mathbb {R} _{+}^{c}$
Support	$\mathrm {S} =\left\{\mathbf {x} \in \mathbb {Z} _{0+}^{c}\,:\,\sum _{i=1}^{c}x_{i}=n\right\}$
PMF	${\frac {1}{P_{0}}}\prod _{i=1}^{c}{\binom {m_{i}}{x_{i}}}\omega _{i}^{x_{i}}$, where $P_{0}=\sum _{(y_{1},\ldots ,y_{c})\in \mathrm {S} }\prod _{i=1}^{c}{\binom {m_{i}}{y_{i}}}\omega _{i}^{y_{i}}$
Mean	The mean $\mu _{i}$ of $x_{i}$ can be approximated by $\mu _{i}={\frac {m_{i}r\omega _{i}}{r\omega _{i}+1}}$, where r is the unique positive solution to $\sum _{i=1}^{c}\mu _{i}=n$.

Treatment Group	responder	non-responder	Total
X	x	.	m_X
Y	y	.	m_Y
Total	n	.	N