• Statistical flaws in Excel (Excel 97 / 2000 / XP)

    Home » Forums » AskWoody support » Productivity software by function » MS Excel and spreadsheet help » Statistical flaws in Excel (Excel 97 / 2000 / XP)

    Author
    Topic
    #389984

    Thanks very much – it is (amazingly) a topic which has been on my mind lately. Just don’t expect me to read it quickly grin

    Viewing 0 reply threads
    Author
    Replies
    • #690979

      Edited by HansV to re-attach the zipped document, it was lost in the server crash of August 2007

      Many people, including myself, use Excel’s statistical functions and the Analysis Toolpak. However, Excel has some flaws that you should know when you are using these tools. I summarized many of these flaws in one document, which I wanted to present to the forum. I would appreciate your comments and suggestions.
      It’s a zipped pdf document to reduce it to the acceptable file size for uploading.

      • #691098

        Thanks for sharing this. It’s useful to have an overview.

      • #691124

        Hi Hans,

        Excellent article!

        Would it be OK if I posted this article in the Microsoft Excel MVP newsgroup?

        A message I sometimes refer to with regards to Excel’s poor Stat qualities:

        ==========
        From: Jerry W. Lewis (JWLewis53@mediaone.net)
        Subject: Re: LINEST with r2 = -1.18 ???
        Newsgroups: microsoft.public.excel.worksheet.functions
        View: Complete Thread (35 articles) | Original Format
        Date: 2001-09-26 06:40:28 PST

        LINEST() (also SLOPE(), INTERCEPT(), VAR(), STDEV(), LOGEST(), TREND(),
        FORECAST(), etc.) uses a numerically unstable algorithm. With
        challenging data (such as yours), rounding error has accumulated to the
        point that none of its calculations (slope, intercept, etc.) can be
        believed. In your case, you were lucky enough to get an impossible R^2,
        so that it was obvious that there was a problem. There may still be a
        problem even with data that give more reasonable R^2 values. These
        problems with Excel’s algorithms have been well documented for years
        (cf. Sawitzki, 1994, “Report on the reliability of data analysis
        systems” Comput. Statist. Data Anal. 18:289-301) yet Microsoft continues
        to ignore them.

        Harlan Grove’s matrix formulation simply recreates the same problem.

        DEVSQ(), COVAR(), and CORREL() are the only 2nd moment functions in
        Excel that are numerically reliable. For simple linear regression, use
        the following formulas instead of LINEST(), SLOPE(), INTERCEPT(), RSQ(),
        STEYX(), etc.

        slope = COVAR(y,x)/DEVSQ(x)*COUNT(y)
        intercept = AVERAGE(y) – slope*AVERAGE(x)
        rsq = CORREL(y,x)^2
        SSreg = rsq*DEVSQ(y)
        SSresid = (1-rsq)*DEVSQ(y)
        df = COUNT(y)-2
        F = SSreg/SSresid*df
        steyx = SQRT(SSresid/df)
        se1 = steyx/SQRT(DEVSQ(x))
        seb = steyx/SQRT(1/COUNT(y)+AVERAGE(x)^2/DEVSQ(x))

        This approach has the added advantage over LINEST that it allows missing
        values in the data range. However that cuts both ways, because they
        will give a wrong answer if there are data pairs where only x or y (but
        not both) are missing.

        Similarly, for univariate statistics use the following formulas instead
        of VAR(), VARP(), STDEV(), and STDEVP()

        var = DEVSQ(x)/(COUNT(x)-1)
        varp = DEVSQ(x)/COUNT(x)
        stdev = SQRT(var)
        stdevp = SQRT(varp)

        Since Microsoft has already programmed routines that would be superior
        to their unstable routines, it is puzzling why they continue to maintin
        redundant inferior code. The unstable formulas that Excel programed are
        mathematically exact (with infinite precision), so my formulas will
        agree with the Excel functions for non-challenging data sets. When they
        disagree, the dedicated Excel functions are wrong.

        There is no DEVSQA function, there is no hel for VARA(), VARPA(),
        STDEVA(), or STDEVPA() other than doing those calculations manually.

        If you are wedded to using LINEST(), then test to see if

        STDEV(x) = SQRT(DEVSQ(x)/COUNT(x))
        STDEV(y) = SQRT(DEVSQ(y)/COUNT(y))
        PEARSON(y,x) = CORREL(y,x)

        If all three of these are approximately true (say to at least 12 figures
        each), then LINEST() can probably be believed for simple linear
        regression. Figuring out when LINEST() can be believed for more complex
        models is not so simple.

        Jerry

        Richard Nolan wrote:

        > Having used LINEST for Linear regression, I think
        > successfully a few times, I now have a data set that
        > returns an r2 value of -1.18, which is not possible. I can
        > look at the data and tell r2 must be +, not negative.
        >
        > Are there two logic problems with LINEST. (a) r2 can never
        > greater than +/- 1, and ( I can see the relationship is
        > +, not -.

        • #691131

          Thanks, Jan Karel for your additional information. Yes, you can post this article in the Microsoft Excel MVP newsgroup. No problem!

      • #1068923

        I’m wondering if Excel 2007 has addressed any of these issues.

        Comments?

      • #1101692

        Hans P.,

        I’m having trouble trying to download your attachment. Maybe it disappeared when the Lounge went down? Would you or someone else please reattach it?

        Thank you!

    Viewing 0 reply threads
    Reply To: Statistical flaws in Excel (Excel 97 / 2000 / XP)

    You can use BBCodes to format your content.
    Your account can't use all available BBCodes, they will be stripped before saving.

    Your information: