User:Riskanal/sandbox

Three-term metalog distributions
Four-term metalog distribution when

The metalog distribution is a highly flexible continuous probability distribution that has simple closed-form equations, can be parameterized directly by data, and belongs to the class of quantile-parameterized distributions. Together with its transforms, the metalog family of distributions provides an alternative to the Pearson distributions for data-fitting applications. Like the Pearson distributions, the metalogs can represent a wide range of uncertainties, such as those commonly encountered in economics, engineering, business, and science. In contrast to the Pearson distributions, the metalog family offers greater flexibility in shape and bounds, has simpler closed-form equations, and is easier to fit to data and to simulate. The metalog distributions were developed by Tom Keelin and first published in 2016[1]. They are also known as the Keelin distributions.

History

In the Age of Enlightenment, the normal distribution was first published[2], as was Bayes' theorem[3]. The normal distribution laid the foundation for much of the development of classical statistics. In contrast, Bayes' theorem laid the foundation for state-of-information, belief-based probability representations. Because state-of-information probabilities can take on any shape and may have natural bounds, probability distributions flexible enough to accommodate both were needed. Moreover, many empirical and other data sets exhibited shapes that could not be well matched by the normal or other early continuous distributions. So began the search for shape- and bounds-flexible continuous probability distributions.

In the early 20th century, the Pearson[4] family of distributions, which includes the normal, beta, uniform, gamma, Student's t, chi-square, F, and others, emerged as a major advance in shape flexibility. These were shortly followed by the Johnson[5][6] distributions. Both families can represent the first four moments of data (mean, variance, skewness, and kurtosis) with smooth continuous curves. However, they cannot match fifth or higher-order moments, and for a given skewness and kurtosis they offer no choice of bounds. Their equations include intractable integrals and complex statistical functions, and fitting them to data often requires iterative methods.

In the early 21st century, decision analysts seeking continuous probability distributions that would pass exactly through three points on the CDF (e.g. expert-elicited quantiles corresponding to three CDF probabilities, such as 0.1, 0.5, and 0.9) found the Pearson and Johnson distributions generally inadequate for this purpose. In addition, decision analysts sought probability distributions that would be easy to parameterize with data (e.g. by using linear least squares). Introduced in 2011, the class of quantile-parameterized distributions (QPDs) accomplished both. Although this was a significant advance, the QPD used to illustrate the class, the SimpleQNormal distribution[7], had less shape flexibility than the Pearson and Johnson families and lacked boundedness flexibility. Shortly thereafter, Keelin developed the family of metalog distributions, an instance of the QPD class, which is more shape-flexible than the Pearson and Johnson families, offers a choice of boundedness, has closed-form equations that can be fit to data with linear least squares, and has closed-form quantile functions, which facilitate simulation.

Definition and Quantile Function

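The metalog distribution is a generalization of the logistic distribution. The term "metalog" is short for "metalogistic." Starting with the logistic quantile function, $x = Q(y) = \mu + s \ln\frac{y}{1-y}$, Keelin substituted power series expansions in cumulative probability $y = F(x)$ for the $\mu$ and $s$ parameters, which control location and scale respectively[1]:

$$\mu = a_1 + a_4 (y - 0.5) + a_5 (y - 0.5)^2 + a_7 (y - 0.5)^3 + a_9 (y - 0.5)^4 + \cdots$$

$$s = a_2 + a_3 (y - 0.5) + a_6 (y - 0.5)^2 + a_8 (y - 0.5)^3 + a_{10} (y - 0.5)^4 + \cdots$$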

Keelin's rationale for this substitution was fivefold[1]. First, the resulting quantile function would have significant shape flexibility depending on the coefficients $a_i$. Second, it would have a simple closed form that is linear in these coefficients, implying that they could easily be determined from CDF data by linear least squares. Third, the resulting quantile function would be smooth and differentiable so that a closed-form PDF would be available. Fourth, simulation would be facilitated by the resulting closed-form inverse CDF. Fifth, like a Taylor series, any number of terms $k$ could be used depending on the degree of shape flexibility desired and other application needs.

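Rewriting the logistic quantile function to incorporate the above substitutions yields the metalog quantile function for cumulative probability $0 < y < 1$:

$$M_k(y) = \begin{cases}
a_1 + a_2 \ln\frac{y}{1-y} & \text{for } k = 2 \\
a_1 + a_2 \ln\frac{y}{1-y} + a_3 (y - 0.5) \ln\frac{y}{1-y} & \text{for } k = 3 \\
a_1 + a_2 \ln\frac{y}{1-y} + a_3 (y - 0.5) \ln\frac{y}{1-y} + a_4 (y - 0.5) & \text{for } k = 4 \\
M_{k-1}(y) + a_k (y - 0.5)^{(k-1)/2} & \text{for odd } k \ge 5 \\
M_{k-1}(y) + a_k (y - 0.5)^{k/2 - 1} \ln\frac{y}{1-y} & \text{for even } k \ge 6
\end{cases}$$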

Equivalently, the metalog quantile function can be expressed in terms of basis functions: $M_k(y) = \sum_{i=1}^{k} a_i g_i(y)$, where the metalog basis functions are $g_1(y) = 1$, $g_2(y) = \ln\frac{y}{1-y}$, and each subsequent $g_i(y)$ is defined as the expression that is multiplied by $a_i$ above.

Note that coefficient $a_1$ is the median, since all other terms are zero when $y = 0.5$. Special cases of the metalog quantile function are the logistic distribution ($k = 2$) and the uniform distribution ($a_4 > 0$ with $a_2 = a_3 = 0$ and $a_i = 0$ for all $i > 4$).
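The basis-function form lends itself to a short implementation. The following is a minimal sketch in Python, assuming the term ordering described above ($g_1 = 1$, $g_2 = \ln\frac{y}{1-y}$, $g_3 = (y-0.5)\ln\frac{y}{1-y}$, $g_4 = y - 0.5$, then alternating power and power-times-logit terms). The function names `metalog_basis` and `metalog_quantile` are illustrative only and do not refer to any published package.

```python
import math

def metalog_basis(i, y):
    """Basis function g_i(y) for term index i (1-indexed) and 0 < y < 1."""
    logit = math.log(y / (1.0 - y))
    c = y - 0.5
    if i == 1:
        return 1.0
    if i == 2:
        return logit
    if i == 3:
        return c * logit
    if i == 4:
        return c
    if i % 2 == 1:
        # odd i >= 5: pure power of (y - 0.5)
        return c ** ((i - 1) // 2)
    # even i >= 6: power of (y - 0.5) times the logit
    return c ** (i // 2 - 1) * logit

def metalog_quantile(a, y):
    """Unbounded metalog quantile M_k(y) = sum_i a_i * g_i(y), with k = len(a)."""
    return sum(a_i * metalog_basis(i, y) for i, a_i in enumerate(a, start=1))

# At y = 0.5 every basis function except g_1 vanishes, so M_k(0.5) = a_1.
median = metalog_quantile([3.0, 2.0, 0.7, -0.4, 0.2], 0.5)
```

With two coefficients the function reduces to the logistic quantile function, and with only $a_1$ and $a_4$ nonzero it is linear in $y$, which corresponds to a uniform distribution.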

Probability Density Function

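Differentiating $x = M_k(y)$ with respect to $y$ yields the quantile density function $dx/dy$. The reciprocal of this quantity, $dy/dx$, is the probability density function (PDF),

$$m_k(y) = \begin{cases}
\dfrac{y(1-y)}{a_2} & \text{for } k = 2 \\
\left[\dfrac{a_2}{y(1-y)} + a_3 \left(\dfrac{y - 0.5}{y(1-y)} + \ln\dfrac{y}{1-y}\right)\right]^{-1} & \text{for } k = 3 \\
\left[\dfrac{a_2}{y(1-y)} + a_3 \left(\dfrac{y - 0.5}{y(1-y)} + \ln\dfrac{y}{1-y}\right) + a_4\right]^{-1} & \text{for } k = 4 \\
\left[\dfrac{1}{m_{k-1}(y)} + a_k \dfrac{k-1}{2} (y - 0.5)^{(k-3)/2}\right]^{-1} & \text{for odd } k \ge 5 \\
\left[\dfrac{1}{m_{k-1}(y)} + a_k \left(\dfrac{(y - 0.5)^{k/2 - 1}}{y(1-y)} + \left(\dfrac{k}{2} - 1\right)(y - 0.5)^{k/2 - 2} \ln\dfrac{y}{1-y}\right)\right]^{-1} & \text{for even } k \ge 6
\end{cases}$$

which may be equivalently expressed in terms of basis functions as

$$m_k(y) = \left[\sum_{i=1}^{k} a_i \frac{dg_i(y)}{dy}\right]^{-1},$$

where $0 < y < 1$.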

Note that this PDF is expressed as a function of cumulative probability $y$ rather than the variable of interest $x$. To plot it, as shown in the figures, vary $y \in (0, 1)$ parametrically. Plot $M_k(y)$ on the horizontal axis and $m_k(y)$ on the vertical axis.
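The parametric plotting procedure can be sketched as follows. This is a minimal illustration, assuming the basis-function derivatives obtained by differentiating each $g_i(y)$ term by term; the helper names (`basis`, `basis_derivative`, `pdf_curve`) are hypothetical. It returns a list of $(x, \text{density})$ pairs that could be handed to any plotting tool.

```python
import math

def basis(i, y):
    """g_i(y) for the unbounded metalog (1-indexed)."""
    logit, c = math.log(y / (1.0 - y)), y - 0.5
    if i == 1:
        return 1.0
    if i == 2:
        return logit
    if i == 3:
        return c * logit
    if i == 4:
        return c
    if i % 2 == 1:
        return c ** ((i - 1) // 2)
    return c ** (i // 2 - 1) * logit

def basis_derivative(i, y):
    """dg_i/dy, obtained by differentiating each basis function."""
    logit, c = math.log(y / (1.0 - y)), y - 0.5
    r = 1.0 / (y * (1.0 - y))          # derivative of the logit
    if i == 1:
        return 0.0
    if i == 2:
        return r
    if i == 3:
        return c * r + logit
    if i == 4:
        return 1.0
    if i % 2 == 1:
        p = (i - 1) // 2
        return p * c ** (p - 1)
    p = i // 2 - 1
    return c ** p * r + p * c ** (p - 1) * logit

def quantile(a, y):
    return sum(a_i * basis(i, y) for i, a_i in enumerate(a, start=1))

def density(a, y):
    """m_k(y): reciprocal of the quantile density dx/dy."""
    return 1.0 / sum(a_i * basis_derivative(i, y) for i, a_i in enumerate(a, start=1))

def pdf_curve(a, n=99):
    """(x, density) pairs at y = 1/(n+1), ..., n/(n+1) for parametric plotting."""
    ys = [j / (n + 1) for j in range(1, n + 1)]
    return [(quantile(a, y), density(a, y)) for y in ys]

curve = pdf_curve([0.0, 1.0], n=3)   # standard logistic as a two-term metalog
```

The density values are only meaningful when the coefficients are feasible, that is, when the quantile density is positive for every $y$.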

Based on the above equations, the family of metalog distributions comprises unbounded, semibounded, and bounded metalogs along with their SPT special cases.

Unbounded, Semibounded, and Bounded Metalog Distributions

As defined above, the metalog distribution is unbounded, except in the unusual special case where $a_i = 0$ for all terms that contain $\ln\frac{y}{1-y}$. Yet many applications require flexible probability distributions that have a lower bound $b_l$, an upper bound $b_u$, or both. To meet this need, Keelin used transformations to derive semibounded and bounded metalog distributions[1]. Such transformations are governed by a general property of quantile functions: for any quantile function $x = Q(y)$ and increasing function $z(x)$, $x = z^{-1}(Q(y))$ is also a quantile function[8]. For example, the quantile function of the normal distribution is $x = \mu + \sigma \Phi^{-1}(y)$. The natural logarithm, $z(x) = \ln(x)$, is an increasing function, so $x = e^{\mu + \sigma \Phi^{-1}(y)}$ is the quantile function of the lognormal distribution. Analogously applying this property to the metalog quantile function $M_k(y)$ yields the semibounded and bounded members of the metalog family:

  • Log metalog (bounded below): $z(x) = \ln(x - b_l)$, with quantile function $b_l + e^{M_k(y)}$.
  • Negative-log metalog (bounded above): $z(x) = -\ln(b_u - x)$, with quantile function $b_u - e^{-M_k(y)}$.
  • Logit metalog (bounded): $z(x) = \ln\frac{x - b_l}{b_u - x}$, with quantile function $\frac{b_l + b_u e^{M_k(y)}}{1 + e^{M_k(y)}}$.

By considering $z(x)$ to be metalog-distributed, all members of the metalog family meet Keelin and Powley's[7] definition of a quantile-parameterized distribution and thus possess the properties thereof.

Note that the number of shape parameters in the metalog family increases linearly with the number of terms $k$. So, any metalog may have any number of shape parameters. In contrast, the Pearson and Johnson families of distributions are limited to two shape parameters.
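The transformation property can be illustrated with a short sketch. It assumes the usual log, negative-log, and logit transforms of the unbounded metalog quantile function $M_k(y)$ with lower bound `lower` ($b_l$) and upper bound `upper` ($b_u$). For brevity only the first four terms are implemented, and all function names are hypothetical.

```python
import math

def unbounded_quantile(a, y):
    """M_k(y) for k <= 4 terms (a sketch; higher-order terms omitted)."""
    logit, c = math.log(y / (1.0 - y)), y - 0.5
    terms = [1.0, logit, c * logit, c]
    return sum(a_i * g for a_i, g in zip(a, terms))

def log_metalog(a, y, lower):
    """Semibounded: ln(x - lower) is metalog-distributed."""
    return lower + math.exp(unbounded_quantile(a, y))

def neg_log_metalog(a, y, upper):
    """Semibounded: -ln(upper - x) is metalog-distributed."""
    return upper - math.exp(-unbounded_quantile(a, y))

def logit_metalog(a, y, lower, upper):
    """Bounded: ln((x - lower) / (upper - x)) is metalog-distributed."""
    e = math.exp(unbounded_quantile(a, y))
    return (lower + upper * e) / (1.0 + e)

# With a = (0, 1) on (0, 1), the logit metalog reproduces y itself (uniform).
u = logit_metalog([0.0, 1.0], 0.3, 0.0, 1.0)
```

Because each transform is increasing in $M_k(y)$, the same coefficient vector is feasible or infeasible for all of them at once.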

SPT Metalog Distributions

The SPT (symmetric-percentile triplet) metalog distributions are a three-term special case of the unbounded, semibounded, and bounded metalog distributions[9]. These are parameterized by three points on the CDF of the form $(x_\alpha, \alpha)$, $(x_{0.5}, 0.5)$, and $(x_{1-\alpha}, 1-\alpha)$, where $0 < \alpha < 0.5$. SPT metalogs are useful when, for example, quantiles corresponding to CDF probabilities such as 0.1, 0.5, and 0.9 are assessed from an expert and used to parameterize three-term metalog distributions. As noted below, certain mathematical properties are simplified by SPT parameterization.
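For the unbounded case, the three SPT points determine the three coefficients directly. The sketch below solves the three linear equations $M_3(\alpha) = x_\alpha$, $M_3(0.5) = x_{0.5}$, and $M_3(1-\alpha) = x_{1-\alpha}$ by hand; it is a derivation from the three-term quantile function rather than a transcription of the published expressions, and the function names are illustrative.

```python
import math

def spt_coefficients(x_low, x_mid, x_high, alpha=0.1):
    """Coefficients (a1, a2, a3) of the unbounded three-term metalog through
    (x_low, alpha), (x_mid, 0.5), (x_high, 1 - alpha)."""
    ell = math.log((1.0 - alpha) / alpha)       # logit at y = 1 - alpha
    a1 = x_mid                                   # median
    a2 = (x_high - x_low) / (2.0 * ell)
    a3 = (x_high + x_low - 2.0 * x_mid) / ((1.0 - 2.0 * alpha) * ell)
    return a1, a2, a3

def quantile3(a, y):
    """Three-term unbounded metalog quantile function."""
    logit = math.log(y / (1.0 - y))
    return a[0] + a[1] * logit + a[2] * (y - 0.5) * logit

def is_feasible3(a):
    """Three-term feasibility condition: a2 > 0 and |a3| / a2 < 1.66711."""
    return a[1] > 0.0 and abs(a[2]) / a[1] < 1.66711

# Hypothetical expert assessment: 10th, 50th, and 90th percentiles.
coeffs = spt_coefficients(20.0, 35.0, 80.0, alpha=0.1)
```

The resulting quantile function passes through the three assessed points exactly, which can be confirmed by evaluating `quantile3(coeffs, y)` at 0.1, 0.5, and 0.9.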

Properties

All members of the metalog family of distributions share the following properties.

Feasibility

A function of the form of $M_k(y)$ or any of its above transforms is a feasible probability distribution if and only if its PDF is greater than zero for all $y \in (0, 1)$[7]. This implies a feasibility constraint on the set of coefficients $\mathbf{a} = (a_1, \ldots, a_k)$:

$m_k(y) > 0$ for all $y \in (0, 1)$.

In practical applications, feasibility must generally be checked rather than assumed. For $k = 2$, $a_2 > 0$ ensures feasibility. For $k = 3$ (including SPT metalogs), the feasibility condition is $a_2 > 0$ and $|a_3| / a_2 < 1.66711$[9]. For $k = 4$, a similar closed form has been derived. For $k \ge 5$, feasibility is typically checked graphically or numerically.

For a given $k$, the unbounded metalog and its above transforms share the same set of feasible coefficients[10]. So, for a given set of coefficients, confirming that $m_k(y) > 0$ for all $y \in (0, 1)$ is sufficient regardless of the transform or number of coefficients in use.
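A numerical feasibility check can be sketched as below. It evaluates the quantile density $\sum_i a_i \, dg_i/dy$ (the reciprocal of the PDF) on an evenly spaced grid of $y$ values and reports whether it is positive everywhere on that grid. This is an approximation under the assumption that a violation, if any, spans more than one grid cell; a finer grid or a dedicated root search would be needed for a rigorous check. The function names are hypothetical.

```python
import math

def quantile_density(a, y):
    """dx/dy = sum_i a_i * dg_i/dy; the metalog PDF is its reciprocal."""
    logit = math.log(y / (1.0 - y))
    c = y - 0.5
    r = 1.0 / (y * (1.0 - y))
    total = 0.0
    for i, a_i in enumerate(a, start=1):
        if i == 1:
            d = 0.0
        elif i == 2:
            d = r
        elif i == 3:
            d = c * r + logit
        elif i == 4:
            d = 1.0
        elif i % 2 == 1:
            p = (i - 1) // 2
            d = p * c ** (p - 1)
        else:
            p = i // 2 - 1
            d = c ** p * r + p * c ** (p - 1) * logit
        total += a_i * d
    return total

def is_feasible_on_grid(a, n=9999):
    """True if the quantile density is positive at y = 1/(n+1), ..., n/(n+1)."""
    return all(quantile_density(a, j / (n + 1)) > 0.0 for j in range(1, n + 1))

# Three-term examples on either side of the |a3| / a2 < 1.66711 threshold.
inside = is_feasible_on_grid([0.0, 1.0, 1.6])
outside = is_feasible_on_grid([0.0, 1.0, 1.7])
```

For three terms the grid check agrees with the closed-form condition in these two cases; for five or more terms a check of this kind is one way to carry out the numerical verification mentioned above.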

Convexity

The set of feasible metalog coefficients, $\{\mathbf{a} : m_k(y) > 0 \text{ for all } y \in (0, 1)\}$, is convex. Because convex optimization problems require convex feasible sets, this property can simplify optimization problems involving metalogs. Moreover, this property guarantees that any convex combination of the $\mathbf{a}$ vectors of feasible metalogs is feasible, which is useful, for example, when combining the opinions of multiple experts[11] or interpolating among feasible metalogs[12].

Fitting to Data

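The coefficients $\mathbf{a}$ can be determined from data by linear least squares. Given $m$ data points $(x_i, y_i)$ that are intended to characterize a metalog CDF, and an $m \times k$ matrix $\mathbf{Y}$ whose elements consist of the basis functions $g_j(y_i)$, then, so long as $\mathbf{Y}^T \mathbf{Y}$ is invertible, the column vector of coefficients $\mathbf{a}$ can be determined as $\mathbf{a} = (\mathbf{Y}^T \mathbf{Y})^{-1} \mathbf{Y}^T \mathbf{z}$, where $m \ge k$ and column vector $\mathbf{z} = (z_1, \ldots, z_m)$, with $z_i = x_i$ for the unbounded metalog and $z_i = z(x_i)$ for its transforms.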

How metalogs converge to the standard normal distribution as the number of terms $k$ increases from 2 to 10
Weibull distributions (blue) closely approximated by nine-term semi-bounded metalog distributions (dashed, yellow)

If $m = k$, this equation reduces to $\mathbf{a} = \mathbf{Y}^{-1} \mathbf{z}$, where the resulting CDF runs through all data points exactly. For SPT metalogs, it further reduces to expressions in terms of the three points $(x_\alpha, x_{0.5}, x_{1-\alpha})$ directly[9].
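The least-squares fit can be sketched with NumPy as follows. This is an illustration for the unbounded metalog only (so $\mathbf{z} = \mathbf{x}$); for the transformed family members the data would first be mapped through $z(x)$. Instead of forming $(\mathbf{Y}^T \mathbf{Y})^{-1}$ explicitly, the sketch calls `numpy.linalg.lstsq`, which returns the same least-squares solution when $\mathbf{Y}$ has full column rank and is generally better conditioned. The names `basis_matrix` and `fit_metalog` are hypothetical.

```python
import numpy as np

def basis_matrix(y, k):
    """m x k matrix Y with Y[i, j-1] = g_j(y_i) for the unbounded metalog."""
    y = np.asarray(y, dtype=float)
    logit = np.log(y / (1.0 - y))
    c = y - 0.5
    cols = []
    for j in range(1, k + 1):
        if j == 1:
            col = np.ones_like(y)
        elif j == 2:
            col = logit
        elif j == 3:
            col = c * logit
        elif j == 4:
            col = c
        elif j % 2 == 1:
            col = c ** ((j - 1) // 2)
        else:
            col = c ** (j // 2 - 1) * logit
        cols.append(col)
    return np.column_stack(cols)

def fit_metalog(x, y, k):
    """Least-squares coefficients a for data (x_i, y_i), requiring len(x) >= k."""
    Y = basis_matrix(y, k)
    a, *_ = np.linalg.lstsq(Y, np.asarray(x, dtype=float), rcond=None)
    return a

# Synthetic check: data generated from known coefficients are recovered.
a_true = np.array([1.0, 0.5, 0.2, -0.1])
y_data = np.linspace(0.05, 0.95, 19)
x_data = basis_matrix(y_data, 4) @ a_true
a_fit = fit_metalog(x_data, y_data, 4)
```

A fit obtained this way is not guaranteed to be feasible, so the resulting coefficients should still be checked against the feasibility condition.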

An alternate fitting method, implemented as a linear program, determines the coefficients by minimizing the sum of absolute distances between the CDF and the data subject to feasibility constraints[13].

Shape Flexibility

Metalogs are highly shape flexible. In the original paper, Keelin showed that ten-term metalog distributions parameterized by 105 CDF points from 30 traditional source distributions (including normal, Student's t, lognormal, gamma, beta, and extreme value) approximate each such source distribution within a Kolmogorov–Smirnov (K-S) distance of 0.001 or less[14].

The animated figure on the right illustrates this for the standard normal distribution, where metalogs with various numbers of terms are parameterized by the same set of 105 points from the standard normal CDF. The metalog PDF converges to the standard normal PDF as the number of terms increases. With two terms, the metalog approximates the normal with a logistic distribution. With each increment in number of terms, the fit gets closer. With 10 terms, the metalog PDF and standard normal PDF are visually indistinguishable.

Similarly, nine-term semibounded metalog PDFs with lower bound $b_l = 0$ are visually indistinguishable from a range of Weibull distributions. The six cases shown correspond to Weibull shape parameters (0.5, 0.8, 1.0, 1.5, 2, 4). In each case, the metalog is parameterized by nine points from the Weibull CDF taken at the same set of cumulative probabilities.

Such convergence is not unique to the normal and Weibull distributions. Keelin originally showed analogous results for a wide range of distributions[14] and has provided further illustrations.

Moments

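The $m^{\text{th}}$ moment of the unbounded metalog distribution, $E[x^m] = \int_0^1 M_k(y)^m \, dy$, is a special case of the more general formula for QPDs[7]. For the unbounded metalog, such integrals evaluate to closed-form moments that are $m^{\text{th}}$-order polynomials in the coefficients $a_i$. The mean and the second through fourth central moments of the four-term unbounded metalog are:

$$\text{mean} = a_1 + \frac{a_3}{2}$$

$$\text{variance} = \frac{\pi^2 a_2^2}{3} + \frac{a_3^2}{12} + \frac{\pi^2 a_3^2}{36} + a_2 a_4 + \frac{a_4^2}{12}$$

$$\text{third central moment} = \pi^2 a_2^2 a_3 + \frac{\pi^2 a_3^3}{24} + \frac{a_2 a_3 a_4}{2} + \frac{\pi^2 a_2 a_3 a_4}{6} + \frac{a_3 a_4^2}{8}$$

$$\text{fourth central moment} = \frac{7 \pi^4 a_2^4}{15} + \frac{3 \pi^2 a_2^2 a_3^2}{2} + \frac{7 \pi^4 a_2^2 a_3^2}{30} + \frac{a_3^4}{80} + \frac{\pi^2 a_3^4}{24} + \frac{7 \pi^4 a_3^4}{1200} + 2 \pi^2 a_2^3 a_4 + \frac{a_2 a_3^2 a_4}{2} + \frac{2 \pi^2 a_2 a_3^2 a_4}{3} + 2 a_2^2 a_4^2 + \frac{\pi^2 a_2^2 a_4^2}{6} + \frac{a_3^2 a_4^2}{8} + \frac{\pi^2 a_3^2 a_4^2}{40} + \frac{a_2 a_4^3}{3} + \frac{a_4^4}{80}$$

Moments for fewer terms are subsumed in these equations. For example, moments of the three-term metalog are revealed by setting $a_4$ to zero. Moments for more terms and higher-order moments ($m > 4$) are also available. Moments for semibounded and bounded metalogs are not available in closed form.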
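As a cross-check, the moment integral can be evaluated numerically and compared with closed-form polynomials for the four-term case. The sketch below is an illustration, not the published derivation: it substitutes $u = \ln\frac{y}{1-y}$ (so $dy = y(1-y)\,du$), which removes the endpoint singularities, and applies the trapezoid rule. The closed-form expressions coded in `closed_form_moments` were obtained by expanding $\int_0^1 (M_4(y) - \text{mean})^n \, dy$ term by term and should be treated as an assumption to be verified; all function names are hypothetical.

```python
import math

def numeric_moments(a1, a2, a3, a4, half_width=60.0, steps=12000):
    """Mean and 2nd-4th central moments of the four-term unbounded metalog,
    by the trapezoid rule in u = ln(y / (1 - y)), where dy = y (1 - y) du."""
    h = 2.0 * half_width / steps
    nodes = []
    for j in range(steps + 1):
        u = -half_width + j * h
        y = 1.0 / (1.0 + math.exp(-u))
        w = y * (1.0 - y) * h
        if j == 0 or j == steps:
            w *= 0.5                      # trapezoid end weights
        x = a1 + a2 * u + a3 * (y - 0.5) * u + a4 * (y - 0.5)
        nodes.append((x, w))
    mean = sum(x * w for x, w in nodes)
    return [mean] + [sum((x - mean) ** n * w for x, w in nodes) for n in (2, 3, 4)]

def closed_form_moments(a1, a2, a3, a4):
    """Closed-form counterparts (assumed expressions; see lead-in)."""
    p2, p4 = math.pi ** 2, math.pi ** 4
    mean = a1 + a3 / 2
    var = p2 * a2**2 / 3 + a3**2 / 12 + p2 * a3**2 / 36 + a2 * a4 + a4**2 / 12
    third = (p2 * a2**2 * a3 + p2 * a3**3 / 24 + a2 * a3 * a4 / 2
             + p2 * a2 * a3 * a4 / 6 + a3 * a4**2 / 8)
    fourth = (7 * p4 * a2**4 / 15 + 3 * p2 * a2**2 * a3**2 / 2
              + 7 * p4 * a2**2 * a3**2 / 30 + a3**4 / 80 + p2 * a3**4 / 24
              + 7 * p4 * a3**4 / 1200 + 2 * p2 * a2**3 * a4
              + a2 * a3**2 * a4 / 2 + 2 * p2 * a2 * a3**2 * a4 / 3
              + 2 * a2**2 * a4**2 + p2 * a2**2 * a4**2 / 6
              + a3**2 * a4**2 / 8 + p2 * a3**2 * a4**2 / 40
              + a2 * a4**3 / 3 + a4**4 / 80)
    return [mean, var, third, fourth]

numeric = numeric_moments(1.0, 0.5, 0.2, 0.3)
closed = closed_form_moments(1.0, 0.5, 0.2, 0.3)
```

Setting `a3 = a4 = 0` recovers the familiar logistic values (variance $\pi^2 a_2^2 / 3$, fourth central moment $7\pi^4 a_2^4 / 15$), which gives a quick sanity check of both routines.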

Applications

For 3,474 steelhead trout caught and released on the Babine River in British Columbia during 2010-2014, empirical weight data (histogram) and 10-term log metalog PDF (blue curve) fit to this data by least squares.
Metalog panel for steelhead weight data

Metalogs are interdisciplinary. Due to their shape and bounds flexibility, they can be used to represent empirical or other data in a wide range of fields.

  • Astronomy. Metalogs were applied to assess the risks of asteroid impact[15].
  • Cybersecurity. Metalogs were used in cybersecurity risk assessment[13][16].
  • Eliciting and Combining of Expert Opinion. Statistics Canada elicited expert opinion on future Canadian fertility rates from 18 experts, which included the use of spreadsheet-based real-time PDF feedback based on five-term metalogs. The individual expert opinions were weighted and combined into an overall metalog-based forecast[11].
  • Empirical Data Exploration and Visualization. In fish biology, a 10-term log metalog distribution was fit to the weights of 3,474 steelhead trout caught and released on the Babine River in British Columbia during 2010-2014. The bimodality is attributed to the presence of both first-time and second-time spawners in the river, the latter of which tend to weigh more[17].
  • Hydrology. A 10-term semibounded metalog was used to model the probability distribution over annual river gauge height[18].
  • Oil Field Production. Semibounded SPT metalogs were used to analyze biases in projections of oil-field production when compared to observed production after the fact[19].
  • Simulation Input Distributions. Since quantile functions in the metalog family are expressed in closed form, they facilitate Monte Carlo simulation. Substituting uniformly distributed random samples of $y$ produces random samples of $x$ in closed form, thereby eliminating the need to invert a CDF expressed as $y = F(x)$. This approach was used to simulate the total value of a portfolio of 259 financial assets[20].
  • Simulation Output Distributions. Metalogs have also been used to fit output data from simulations in order to represent those outputs (both CDFs and PDFs) as closed-form continuous distributions. Used in this way, they are typically more stable and smoother than histograms[20].
  • Sums of Lognormals. Metalogs enable a closed-form representation of known distributions whose CDFs have no closed-form expression. Keelin et al. (2019)[12] apply this to the sum of independent identically distributed lognormal distributions, where quantiles of the sum can be determined by a large number of simulations. Nine such quantiles are used to parameterize a semi-bounded metalog distribution that runs through each of these nine quantiles exactly. Quantile parameters are stored in a table which can be interpolated for in-between values, which are guaranteed feasible by the metalogs' convexity property.
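The simulation use case described in the list above amounts to inverse-transform sampling with a closed-form inverse CDF. The following is a minimal sketch for an unbounded metalog with up to four terms; the coefficient values are made up for illustration and the function names are hypothetical.

```python
import math
import random

def quantile(a, y):
    """Unbounded metalog quantile M_k(y) for k <= 4 terms."""
    logit, c = math.log(y / (1.0 - y)), y - 0.5
    terms = [1.0, logit, c * logit, c]
    return sum(a_i * g for a_i, g in zip(a, terms))

def sample_metalog(a, n, rng):
    """Draw n samples by feeding uniform(0, 1) draws through the quantile function."""
    samples = []
    while len(samples) < n:
        y = rng.random()          # uniform on [0, 1)
        if y > 0.0:               # skip the measure-zero endpoint y = 0
            samples.append(quantile(a, y))
    return samples

rng = random.Random(12345)        # fixed seed so the run is repeatable
coeffs = [10.0, 1.0, 0.5]         # hypothetical feasible three-term metalog
draws = sample_metalog(coeffs, 20000, rng)
sample_mean = sum(draws) / len(draws)
```

For these coefficients the theoretical mean is $a_1 + a_3 / 2 = 10.25$, so the sample mean should land close to that value. Samples from the semibounded or bounded family members could be obtained by passing each draw through the corresponding transform.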

For a given application and data set, choosing the number of metalog terms $k$ requires judgment. For expert elicitation, 3 to 5 terms are usually sufficient. For data exploration and for matching other probability distributions such as the sum of lognormals, 8 to 12 terms are usually sufficient. A metalog panel, which arrays the metalog PDFs for a range of $k$ for a given data set, may aid this judgment. For example, in the steelhead weight metalog panel above ... . Other tools such as the Akaike information criterion and the Bayesian information criterion may also be useful.

Related distributions

The following distributions are subsumed within the metalog family:

  • The logistic distribution is a special case of the unbounded metalog where $a_i = 0$ for all $i > 2$.
  • The uniform distribution is a special case of: 1) the unbounded metalog where $k \ge 4$, $a_4 > 0$, and $a_i = 0$ for all other $i \ge 2$; and 2) the bounded metalog where $k \ge 2$, $b_l = 0$, $b_u = 1$, $a_2 = 1$, and $a_i = 0$ otherwise.
  • The log-logistic distribution, also known as the Fisk distribution in economics, is a special case of the log metalog where $b_l = 0$, and $a_i = 0$ for all $i > 2$.
  • The log-uniform distribution is a special case of the log metalog where $k \ge 4$, $b_l = 0$, $a_4 > 0$, and $a_i = 0$ for all other $i \ge 2$.
  • The logit-logistic distribution[21] is a special case of the logit metalog where $a_i = 0$ for all $i > 2$.

Software

Freely available

Commercially available

References

  1. ^ a b c d Keelin TW (2016). "The Metalog Distributions." Decision Analysis. 13 (4): 243-277.
  2. ^ De Moivre, A. (1756). The doctrine of chances: or, A method of calculating the probabilities of events in play (Vol. 1). Chelsea Publishing Company.
  3. ^ Bayes, T. (1763). “An essay towards solving a problem in the doctrine of chances.” Philosophical Transactions of the Royal Society of London. 53: 370–418.
  4. ^ Johnson NL, Kotz S, Balakrishnan N. Continuous univariate distributions, Vol 1, Second Edition, John Wiley & Sons, Ltd, 1994, pp. 15-25.
  5. ^ Johnson, N. L. (1949). “Systems of frequency curves generated by methods of translation.” Biometrika. 36 (1/2): 149–176. doi:10.2307/2332539.
  6. ^ Tadikamalla, P.R. and Johnson, N.L. (1982). “Systems of frequency curves generated by transformations of logistic variables.” Biometrika. 69 (2): 461–465.
  7. ^ a b c d Keelin, T.W. and Powley, B.W. (2011). “Quantile-parameterized distributions.” Decision Analysis. 8 (3): 206–219.
  8. ^ Gilchrist, W., 2000. Statistical modelling with quantile functions. CRC Press.
  9. ^ a b c Keelin, T.W. (2016), pp. 269–271.
  10. ^ Powley, B.W. (2013). “Quantile Function Methods for Decision Analysis.” Corollary 12, p. 30. PhD dissertation, Stanford University.
  11. ^ a b Dion, P., Galbraith, N., Sirag, E. (2020). “Using expert elicitation to build long-term projection assumptions.” In Developments in Demographic Forecasting, Chapter 3, pp. 43–62. Springer
  12. ^ a b Keelin, T.W., Chrisman, L. and Savage, S.L. (2019). “The metalog distributions and extremely accurate sums of lognormals in closed form.” WSC '19: Proceedings of the Winter Simulation Conference. 3074–3085.
  13. ^ a b Faber, I.J. (2019). Cyber Risk Management: AI-generated Warnings of Threats (Doctoral dissertation, Stanford University).
  14. ^ a b Keelin, T.W. (2016), Table 8
  15. ^ Reinhardt, J.D., Chen, X., Liu, W., Manchev, P. and Pate-Cornell, M.E. (2016). “Asteroid risk assessment: A probabilistic approach.” Risk Analysis. 36 (2): 244–261
  16. ^ Wang, J., Neil, M. and Fenton, N. (2020). “A Bayesian network approach for cybersecurity risk assessment implementing and extending the FAIR model.” Computers & Security. 89: 101659.
  17. ^ Keelin, T.W. (2016), Section 6.1.1, pp. 266-267.
  18. ^ Keelin, T.W. (2016), Section 6.1.2, pp. 267-268.
  19. ^ Bratvold, R.B., Mohus, E., Petutschnig, D. and Bickel, E. (2020). “Production forecasting: Optimistic and overconfident—Over and over again.” Society of Petroleum Engineers. doi:10.2118/195914-PA.
  20. ^ a b Keelin, T.W. (2016), Section 6.2.2, pp. 271-274.
  21. ^ Wang, M. and Rennolls, K., 2005. Tree diameter distribution modelling: introducing the logit logistic distribution. Canadian Journal of Forest Research, 35(6), pp.1305-1313.


External links

Convex Hull for Feasible Coefficients of Three-Term Metalogs

Feasibility condition for metalogs with $k = 3$ terms: $a_1$ is any real number, $a_2 > 0$, and $|a_3| / a_2 < 1.66711$.

Convex Hull for Feasible Coefficients of Four-Term Metalogs


Feasibility for metalogs with $k = 4$ terms is defined as follows:

  • $a_1$ is any real number, and
  • $a_2 \ge 0$, and
  • If $a_2 = 0$, then $a_3 = 0$ and $a_4 > 0$ (uniform distribution exactly)
  • If $a_2 > 0$, then feasibility conditions are specified numerically
    • For a given $a_3 / a_2$, feasibility requires that $a_4 / a_2 \ge$ number shown.
    • For a given $a_4 / a_2$, feasibility requires that $|a_3 / a_2| \le$ number shown.
    • At the top of this table, the four-term metalog is symmetric and peaked, similar to a Student's t distribution with 3 degrees of freedom.
    • At the bottom of this table, the four-term metalog is a uniform distribution exactly.
    • In between, it has varying degrees of skewness depending on $a_3$. Positive $a_3$ yields right skew. Negative $a_3$ yields left skew. When $a_3 = 0$, the four-term metalog is symmetric.

Convex Hull Equations

The feasible area can be closely approximated by an ellipse (dashed, gray curve), defined by center and semi-axis lengths and . Supplementing this with linear interpolation outside its applicable range, feasibility, given , can be closely approximated: