This function runs a Goodness-of-fit test for discrete multivariate data. It is tested if a given observation is likely to have occurred under the assumption of an ab-initio model. A p-value can be calculated using different distance measures between observed and expected frequencies. A Monte Carlo method is provided to make the function capable of solving high-dimensional problems.

`multinomial.test(observed, prob, useChisq = FALSE, MonteCarlo = FALSE, ntrial = 100000, atOnce = 1000000)`

observed

vector describing the observation: contains the *observed numbers* of items in each category.

prob

vector describing the model: contains the *hypothetical probabilities* corresponding to each category.

useChisq

if `TRUE`

, Pearson's chisquare is used as a distance measure between observed and expected frequencies.

MonteCarlo

if `TRUE`

, the Monte Carlo approach is used.

ntrial

number of simulated samples in the Monte Carlo approach.

atOnce

a parameter of more technical nature. Determines how much memory is used for big arrays.

textual description of the method used.

sample size *n*, equals the sum of the components of the vector `observed`

.

number of categories *k* in the experiment, equals the number of components of the vector `observed`

.

textual description of the distance measure used.

vector containing the probabilities (rel. frequencies for the Monte Carlo approach) of all possible outcomes (might be huge for big *n* and *k*).

number of trials if the Monte Carlo approach was used, `NULL`

otherwise.

the calculated p-value rounded to four significant digits.

The Exact Multinomial Test is a Goodness-of-fit test for discrete multivariate data.
It is tested if a given observation is likely to have occurred under the assumption of an ab-initio model.
In the experimental setup belonging to the test, *n* items fall into *k* categories with certain probabilities
(sample size *n* with *k* categories).
The **observation**, described by the vector `observed`

, indicates how many items have been observed in each category.
The **model**, described by the vector `prob`

, assigns to each category the hypothetical probability that an item falls into it.
Now, if the observation is unlikely to have occurred under the assumption of the model, it is advisible to
regard the model as *not* valid. The p-value estimates how likely the observation is, given the model.
In particular, low p-values suggest that the model is *not* valid.
The **default approach** used by `multinomial.test`

obtains the p-values by
calculating the exact probabilities of *all* possible outcomes given `n`

and `k`

,
using the multinomial probability distribution function `dmultinom`

provided by R.
Then, by default, the p-value is obtained by summing the probabilities of all outcomes which are less likely
than the observed outcome (or equally likely as the observed outcome), i.e. by summing all \(p(i) <= p(observed)\)
(distance measure based on probabilities).
Alternatively, the p-value can be obtained by summing the probabilities of all outcomes connected with a chisquare no smaller than
the chisquare connected with the actual observation (distance measure based on chisquare).
The latter is triggered by setting `useChisq = TRUE`

.
Having a sample of size *n* in an experiment with *k* categories, the number of distinct
possible outcomes is the binomial coefficient `choose(n+k-1,k-1)`

. This number grows rapidly with increasing parameters *n* and
*k*. If the parameters grow too big, numerical calculation might fail because of time or
memory limitations.
In this case, usage of the **Monte Carlo approach** provided by `multinomial.test`

is suggested.
The Monte Carlo approach, activated by setting `MonteCarlo = TRUE`

,
simulates withdrawal of *ntrial* samples of size *n* from the hypothetical distribution specified by the vector `prob`

.
The default value for *ntrial* is `100000`

but might be incremented for big *n* and *k*.
The advantage of the Monte Carlo approach is that memory requirements and running time are essentially determined by *ntrial*
but not by *n* or *k*.
By default, the p-value is then obtained by summing the relative frequencies of occurrence of unusual outcomes, i.e. of
outcomes occurring less frequently than the observed one (or equally frequent as the observed one).
Alternatively, as above, Pearson's chisquare can be used as a distance measure by setting `useChisq = TRUE`

.
The parameter *atOnce* is of more technical nature, with a default value of \(1000000\). This value should be decremented
for computers with low memory to avoid overflow, and can be incremented for large-CPU computers to speed up calculations.
The parameter is only effective for Monte Carlo calculations.

H. Bayo Lawal (2003)
*Categorical data analysis with SAS and SPSS applications*, Volume 1, Chapter 3
ISBN: 978-0-8058-4605-8

Read, T. R. C. and Cressie, N. A. C. (1988).
*Goodness-of-fit statistics for discrete multivariate data.*
Springer, New York.

The Multinomial Distribution: `dmultinom`

# NOT RUN { ## Load the EMT package: library(EMT) ## Input data for a three-dimensional case: observed <- c(5,2,1) # observed data: 5 items in category one, 2 items in category two, 1 item in category three prob <- c(0.25, 0.5, 0.25) # model: hypothetical probability that an item falls into category one, two, or three ## Calculate p-value using default options: out <- multinomial.test(observed, prob) # p.value = 0.0767 ## Plot the probabilities for each event: plotMultinom(out) ## Calculate p-value for the same input using Pearson's chisquare as a distance measure: out <- multinomial.test(observed, prob, useChisq = TRUE) # p.value = 0.0596 ; not the same! ## Test the hypothesis that all sides of a dice pop up with the same probability (a 6-dimesional problem): pdice = 1/6 prob <- c(pdice, pdice, pdice, pdice, pdice, pdice) # the model, determined by the hypothetical probabilities observed <- c(4, 5, 2, 7, 0, 1) # the observation consisting of 19 throws ( = sample size) out <- multinomial.test(observed, prob) # p.value = 0.0357 ; better get another dice, this one seems to be biased plotMultinom(out, showmax = 10000) # the same problem using a Monte Carlo approach: # we have about 40.000 outcomes and choose at least 400000 trials (probably to be increased): out <- multinomial.test(observed, prob, MonteCarlo = TRUE, ntrial = 400000) # p.value = 0.0343 ; takes a few minutes on a laptop with 2 GB memory, 1.5 GHz speed plotMultinom(out, showmax = 5000) # }