One of the key tasks
in probabilistic analysis is the selection of an appropriate distribution for each input.
Although this choice is often critical, many analysts are not sufficiently
familiar with how to make it. If you have a large data sample, it is fairly easy
to use goodness-of-fit tests, such as the Chi-Square or Anderson-Darling test, to find
the distribution that best fits the data. However, there are times when you
won’t have data. Even then, a few principles can guide you in selecting the
most appropriate distribution.
The most significant
of these is the central limit theorem. The central limit theorem tells us that
if a large number of independent random variables are added together, the resultant
distribution will be very close to a Normal distribution, provided that no
single variable dominates the sum. In fact, if the
relationship between the inputs and the outputs is close to linear over the
region of random variability, then the output distribution will also tend to be close
to a Normal distribution. This has led to the Normal distribution being a
default assumption for any input distribution. James Siddall, a researcher who
put a lot of focus on the selection of input distributions, argued that because
the Normal distribution is used so frequently, many think that it is indeed normal
to use the Normal distribution. This is not to say that the Normal distribution
should not be used or that it should only be used on special occasions, but it
is probably used too often.
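
The additive effect can be checked with a small simulation. The sketch below is only an illustration under assumed inputs: the exponential input distribution, the choice of 30 terms, the sample size, and the helper names `skewness` and `simulate_sum` are all choices made here, not something prescribed by the theorem. It compares the skewness of a single skewed input with the skewness of a sum of such inputs; a Normal distribution has zero skewness.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: 0 for a symmetric shape such as the Normal."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def simulate_sum(n_terms, n_samples=20000, seed=1):
    """Draw n_samples sums of n_terms independent exponential variables."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(1.0) for _ in range(n_terms))
            for _ in range(n_samples)]


single_skew = skewness(simulate_sum(1))    # one skewed input on its own
summed_skew = skewness(simulate_sum(30))   # thirty such inputs added together

print(f"skewness of a single input: {single_skew:.2f}")
print(f"skewness of a sum of 30:    {summed_skew:.2f}")
```

In theory a single exponential variable has a skewness of 2, while a sum of 30 independent ones has a skewness of 2/sqrt(30), roughly 0.37. The sum is therefore much closer to Normal, although not exactly Normal.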
There is another
element of the central limit theorem that is less commonly known. If a number
of independent, positive variables are multiplied together (or divided by each other), then the outcome is
close to a Lognormal distribution, because the logarithm of a product is the
sum of the logarithms. The Lognormal distribution can look a lot
like a Normal distribution, but it can also be very different. Therefore, given
how common multiplicative phenomena are, there are likely to be many cases
where a Lognormal distribution is the best choice. This was most evident to me
when I met an analyst who frequently dealt with biological processes. In his
experience the Lognormal distribution was very much the safest choice.
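
The multiplicative case can be sketched in the same way. Again, this is a minimal illustration under assumed inputs: the uniform factors between 0.5 and 1.5, the choice of 10 factors, and the helper names are hypothetical. If the product is close to Lognormal, its logarithm should be close to Normal, so the logarithm should show little skewness even when the product itself is strongly skewed.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: 0 for a symmetric shape such as the Normal."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def simulate_product(n_factors, n_samples=20000, seed=2):
    """Draw n_samples products of n_factors independent positive factors."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        product = 1.0
        for _ in range(n_factors):
            product *= rng.uniform(0.5, 1.5)  # assumed factor distribution
        samples.append(product)
    return samples


products = simulate_product(10)
log_products = [math.log(p) for p in products]

product_skew = skewness(products)      # strongly right-skewed
log_skew = skewness(log_products)      # close to symmetric

print(f"skewness of the product:     {product_skew:.2f}")
print(f"skewness of log(product):    {log_skew:.2f}")
```

Taking logarithms turns the product into a sum, which is why the additive argument applies to `log_products` and the Lognormal shape follows for `products`.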
If you can work out
what kind of a process produces the variable you are interested in, you can
potentially determine a suitable distribution for that variable. For example,
casting operations are essentially multiplicative: the dimension of a cast
feature is equal to the respective dimension of the tool, multiplied by the
thermal expansion of the tool at its working temperature, multiplied by the
thermal contraction as the casting cools. Therefore, it is safest to assume that
the dimension of a cast feature is well
represented by a Lognormal distribution. Another example is composite
materials. Some composites are made by gluing sheets together in different
orientations so that the composite effectively has no grain, and the material’s
strength is more uniform. The thickness of such a material is equal to the
sum of the thicknesses of the sheets that make it up. Therefore, the
distribution of the thickness of the composite is likely to be well represented
by a Normal distribution.
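
The two examples can be written down as small process models. Every number below (tool dimension, expansion and contraction factors, ply thickness, number of plies, and their scatter) is a hypothetical value chosen only for illustration, and using Normal inputs for the individual factors is likewise an assumption of this sketch.

```python
import math
import random
import statistics

rng = random.Random(3)
N_SAMPLES = 20000
N_PLIES = 8            # hypothetical number of sheets in the laminate
PLY_MEAN = 0.125       # hypothetical ply thickness, mm
PLY_SD = 0.005         # hypothetical ply thickness scatter, mm


def cast_dimension():
    """Multiplicative process: tool size x tool expansion x casting contraction."""
    tool = rng.gauss(100.0, 0.05)           # hypothetical tool dimension, mm
    expansion = rng.gauss(1.004, 0.0005)    # hypothetical expansion factor
    contraction = rng.gauss(0.985, 0.002)   # hypothetical contraction factor
    return tool * expansion * contraction


def laminate_thickness():
    """Additive process: total thickness is the sum of the ply thicknesses."""
    return sum(rng.gauss(PLY_MEAN, PLY_SD) for _ in range(N_PLIES))


cast = [cast_dimension() for _ in range(N_SAMPLES)]
laminate = [laminate_thickness() for _ in range(N_SAMPLES)]

cast_mean = statistics.fmean(cast)
laminate_mean = statistics.fmean(laminate)
laminate_sd = statistics.pstdev(laminate)
expected_laminate_sd = PLY_SD * math.sqrt(N_PLIES)  # independent plies assumed

print(f"cast feature mean:  {cast_mean:.3f} mm")
print(f"laminate mean:      {laminate_mean:.4f} mm")
print(f"laminate sd:        {laminate_sd:.4f} mm (theory {expected_laminate_sd:.4f})")
```

With scatter as small as assumed here, the Lognormal and Normal shapes for the cast dimension are almost indistinguishable; the distinction matters more as the relative scatter of the factors grows.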
This only covers two
processes: multiplicative and additive. However, these two cover many real-life
processes. Therefore, when you are choosing a distribution for an input
variable, try to think about what kind of operation produces the respective
feature: is it additive or multiplicative? It is not always easy to determine
this, but multiplicative is more common. Still, the operation might be
something different or there might be more to it. Consider the length of the
3D equivalent of a hypotenuse (the diagonal of a box). The three edge lengths are
squared and then added together; by the additive argument, this sum will be
approximately Normal. However, the square root of the sum is then taken;
therefore, the final distribution will be that of the square root of a Normal
variable.
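
When the operation is neither purely additive nor purely multiplicative, simulating it directly is one way to see what the output looks like. The sketch below does this for the three-edge example, under assumed values: the nominal edge lengths of 3, 4 and 12, the Normal scatter of 0.1 on each edge, and the variable names are all hypothetical.

```python
import math
import random
import statistics

rng = random.Random(4)
NOMINAL_EDGES = (3.0, 4.0, 12.0)  # hypothetical edges; nominal diagonal is 13
EDGE_SD = 0.1                     # hypothetical standard deviation of each edge


def sample_diagonal():
    """Square the three edges, add them, then take the square root."""
    edges = [rng.gauss(nominal, EDGE_SD) for nominal in NOMINAL_EDGES]
    return math.sqrt(sum(e * e for e in edges))


diagonals = [sample_diagonal() for _ in range(20000)]
mean_d = statistics.fmean(diagonals)
sd_d = statistics.pstdev(diagonals)
skew_d = sum(((d - mean_d) / sd_d) ** 3 for d in diagonals) / len(diagonals)

print(f"mean:     {mean_d:.3f}")
print(f"sd:       {sd_d:.4f}")
print(f"skewness: {skew_d:.3f}")
```

With scatter this small relative to the edge lengths, the square root is nearly linear over the region of variability, so the simulated diagonal stays close to Normal. With larger scatter the shape can depart from Normal noticeably, which is when thinking about the underlying operation matters most.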