Statistical Reference Datasets (StRD)

Summary:

The Statistical Reference Datasets (StRD) project is a Web-based service that supplies reference datasets with certified values for a variety of statistical methods.

Description:

With the widespread use and availability of statistical software, concerns about the numerical accuracy of such software are now greater than ever. Despite extensive testing, some of this software can still have numerical accuracy problems, and this has been a continuing cause of concern for statisticians. Many have cited the need for an easily accessible repository of reference datasets, but until now no such collection has been available. In response to concerns of both the statistical community and industrial users, the Statistical Engineering Division, in collaboration with the Mathematical & Computational Sciences Division and the Standard Reference Data Program, has developed a Web-based service that provides reference datasets with certified values for a variety of statistical methods. This service is called Statistical Reference Datasets (StRD).

Currently 58 datasets with certified values are provided for assessing the accuracy of software for univariate statistics, analysis of variance, linear regression, and nonlinear regression. The collection includes both generated and "real-world" data of varying levels of difficulty. Generated datasets are designed to challenge specific computations. These include the classic Wampler datasets for testing linear regression algorithms and the Simon & Lesage datasets for testing analysis of variance algorithms. Real-world data include challenging datasets such as the Longley data for linear regression, and more benign datasets such as the Daniel & Wood data for nonlinear regression.

Certified results for linear procedures were obtained using extended precision software to code simple algorithms for each type of computation. Carrying 500 digits through all of the computations allowed calculation of output unaffected by floating point representation errors. Certified values for nonlinear regression are the "best-available" solutions, obtained using 64-bit precision and confirmed by at least two different algorithms and software packages using analytic derivatives.
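The extended-precision approach described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the software used to produce the certified values: it uses Python's standard `decimal` module, the function names are hypothetical, and the three data points are made up for the example rather than taken from an StRD dataset.

```python
from decimal import Decimal, getcontext

# Carry 500 significant digits through every operation, as described above.
getcontext().prec = 500


def extended_mean_sd(values):
    """Mean and sample standard deviation from a simple two-pass algorithm.

    `values` are decimal strings, so they convert to Decimal exactly.
    (Hypothetical helper, for illustration only.)
    """
    xs = [Decimal(v) for v in values]
    n = Decimal(len(xs))
    mean = sum(xs) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in xs)
    sd = (sum_sq_dev / (n - 1)).sqrt()
    return mean, sd


def naive_float_variance(values):
    """One-pass textbook variance in 64-bit floats.

    Prone to cancellation when the data have a large mean and small spread.
    """
    xs = [float(v) for v in values]
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)


# Made-up data with a large offset and a small spread (not an StRD dataset).
data = ["1000000000000.1", "1000000000000.2", "1000000000000.3"]

mean, sd = extended_mean_sd(data)
print("extended-precision mean:", mean)
print("extended-precision sd:  ", sd)
print("naive float variance:   ", naive_float_variance(data))
```

Comparing the two results shows why a reference value computed with many extra digits is useful: the simple algorithm is trustworthy only because the arithmetic underneath it carries far more precision than the software being tested.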

The team officially released the StRD web service in August 1997 and spent the latter part of that year publicizing it. A special contributed paper session was presented at the 1997 Joint Statistical Meetings in August. Talks were also given at NIST Gaithersburg and Boulder.

With the increased usage of Bayesian methods, the StRD web site was updated in 2003 with six more datasets that serve as reference datasets for Markov Chain Monte Carlo (MCMC). The certified values for these datasets were obtained by adopting a model for which closed-form solutions are known and computing the numeric values of those theoretical solutions using extended precision software.
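As a rough illustration of how a closed-form solution can yield a reference value for checking MCMC output, the sketch below evaluates the posterior of a normal mean with known data variance and a normal prior. The choice of model is an assumption made for this example; the text above does not say which models the StRD datasets use. The function name, the 100-digit precision, and the data are likewise hypothetical.

```python
from decimal import Decimal, getcontext

# Assumed working precision for this sketch.
getcontext().prec = 100


def normal_posterior(data, prior_mean, prior_var, data_var):
    """Closed-form posterior mean and variance for a normal mean.

    Assumed model (illustrative only): x_i ~ N(mu, data_var) with data_var
    known, and prior mu ~ N(prior_mean, prior_var). Inputs are decimal
    strings so that they convert to Decimal exactly.
    """
    xs = [Decimal(v) for v in data]
    n = Decimal(len(xs))
    m0, v0, s2 = Decimal(prior_mean), Decimal(prior_var), Decimal(data_var)
    post_precision = 1 / v0 + n / s2
    post_mean = (m0 / v0 + sum(xs) / s2) / post_precision
    return post_mean, 1 / post_precision


# Made-up data and prior settings, not an StRD dataset.
post_mean, post_var = normal_posterior(["1", "2", "3"], "0", "1", "1")
print("posterior mean:    ", post_mean)
print("posterior variance:", post_var)
```

An MCMC sampler run on the same model and data should produce sample averages that approach these values, so the gap between the two gives a measure of the sampler's numerical and Monte Carlo error.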

Major Accomplishments:

L.M. Gill, Navigating Through the Statistical Reference Dataset (StRD) Website, Shaping Statistics for Success in the 21st Century, Joint Statistical Meetings '97, Anaheim, CA, August 11, 1997.

E.S. Lagergren, Statistical Reference Datasets (StRD) - An Overview, Joint Statistical Meetings, Anaheim, CA, August 11, 1997.

Initial public release, August 1997.

W.F. Guthrie, J.E. Rogers, J.J. Filliben, L.M. Gill, E. Lagergren, M.G. Vangel, Statistical Reference Datasets (StRD) for Assessing the Numerical Accuracy of Statistical Software, 31st Symposium on the Interface: Models, Predictions, and Computing, Schaumburg, IL, June 10, 1999.

W.F. Guthrie, H.-K. Liu, D. Malec, G. Yang, MCMC in StRD, NCSLI, Salt Lake City, UT, July 2004.

H.-K. Liu, W.F. Guthrie, D. Malec, G. Yang, An Update on StRD for MCMC, Joint Statistical Meetings, Toronto, ON, Canada, August 2004.

W.F. Guthrie, H.-K. Liu, D. Malec, G. Yang, Ranking the Sources of Numerical Error in MCMC Computations, Joint Statistical Meetings, Toronto, ON, Canada, August 2004.

H.-K. Liu, W.F. Guthrie, D. Malec, G. Yang, An Update on the NIST Statistical Reference Datasets for MCMC: Ranking the Sources of Numerical Error in MCMC Computations, Washington Statistical Society, Washington, DC, June 2006.

End Date:

Product publicly released.

Lead Organizational Unit:

Information Technology Laboratory (ITL)

Staff:

Carroll Croarkin, James Filliben, Lisa Gill, Will Guthrie, Eric Lagergren, Hung-kung Liu, Don Malec, Mark Vangel, Grace Yang, and Nien-Fan Zhang
Statistical Engineering Division, ITL

Janet Rogers, Bert Rust
Mathematical & Computational Sciences Division, ITL

Phoebe Fagan
Standard Reference Data Program, TS

Contact:

Will Guthrie
301-975-2854
will.guthrie@nist.gov
100 Bureau Drive, M/S 8980
Gaithersburg, MD 20899-8980

Hung-kung Liu
301-975-2718
hung-kung.liu@nist.gov
100 Bureau Drive, M/S 8980
Gaithersburg, MD 20899-8980