Handling loss of precision in subtracting two doubles that are close to each other

Updated: 2023-10-16

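I have a project in which we solve the matrix equation AX=B for X, given that A is a tridiagonal matrix. I did this project in C++ and got the program to produce the right Matrix X, but when I try to report the error A*X-B back to the user, I get an erroneous error!! This is because I am subtracting A*X and B, whose entries are arbitrarily close to each other. I had two ideas on how to handle this, element by element: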

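1. According to this article, http://en.wikipedia.org/wiki/Loss_of_significance, as many as -log2(1-y/x) bits can be lost in the straight subtraction x-y. Let's scale both x and y up by pow(2,bitsLost), subtract the two, and then scale the result back down by dividing by pow(2,bitsLost).
2. Stressed heavily in the numerical methods course this is for: take the arithmetic conjugate! Instead of double difference = x-y; use double difference = (x*x-y*y)/(x+y);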
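The two ideas above, next to plain subtraction, can be sketched roughly as follows. This is only an illustration, not the code behind the ideone link: the function names are hypothetical, and the sketch assumes IEEE 754 doubles. It also rounds bitsLost up to an integer, because pow(2,bitsLost) with a fractional exponent is not an exact power of two.

```cpp
#include <cassert>
#include <cmath>

// Plain subtraction, for comparison.
double directDifference(double x, double y) { return x - y; }

// Idea 1: scale both operands up by 2^k, subtract, then scale back down.
// Assumption: k is bitsLost rounded up, so the scale factor is an exact power of two.
double scaledDifference(double x, double y) {
    assert(x != 0.0 && y / x < 1.0);  // keeps the argument of log2 positive
    const double bitsLost = -std::log2(1.0 - y / x);
    const int k = static_cast<int>(std::ceil(bitsLost));
    const double scaledUp = std::ldexp(x, k) - std::ldexp(y, k);
    return std::ldexp(scaledUp, -k);
}

// Idea 2: the "arithmetic conjugate" form of the same difference.
double conjugateDifference(double x, double y) {
    assert(x + y != 0.0);  // avoid dividing by zero
    return (x * x - y * y) / (x + y);
}
```

Scaling by an exact power of two only changes the exponent, so unless something overflows or underflows, scaledDifference returns the same double as directDifference. The conjugate form rounds x*x, y*y and x+y before dividing, so it can only match plain subtraction or be slightly worse.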

OK, so why don't you just pick one method and move on?

Well, I tried all three methods (including straight subtraction) here: http://ideone.com/wfkEUp. I would like to know two things:

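1. Between the "scaling and descaling" method (for which I intentionally chose a power of two) and the arithmetic conjugate method, which one produces less error (in terms of subtracting large numbers)?
2. Which method is computationally more efficient? /*For this, I was going to say the scaling method would be more efficient, with linear complexity versus the seemingly quadratic complexity of the conjugate method, but I don't know the complexity of log2()*/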

Any help would be welcome!!

P.S.: In the sample code, all three methods seem to return the same double...

Let's see some of your code.

No problem; here is my Matrix.cpp code:

#include "ExceptionType.h"
#include "Matrix.h"
#include "MatrixArithmeticException.h"
#include <iomanip>
#include <iostream>
#include <vector>
Matrix::Matrix()
{
    //default size for Matrix is 1 row and 1 column, whose entry is 0
    std::vector<long double> rowVector(1,0);
    this->matrixData.assign(1, rowVector);
}
Matrix::Matrix(const std::vector<std::vector<long double> >& data)
{
    this->matrixData = data;
    //validate matrixData
    validateData();
}
//getter functions
//Recall that matrixData is a vector of a vector, whose elements should be accessed like matrixData[row][column].
//Each rowVector should have the same size.
unsigned Matrix::getRowCount() const { return matrixData.size(); }
unsigned Matrix::getColumnCount() const { return matrixData[0].size(); }
//matrix validator should just append zeroes into row vectors that are of smaller dimension than they should be...
void Matrix::validateData()
{
    //fetch the size of the largest-dimension rowVector
    unsigned largestSize = 0;
    for (unsigned i = 0; i < getRowCount(); i++)
    {
        if (largestSize < matrixData[i].size())
            largestSize = matrixData[i].size();
    }
    //make sure that all rowVectors are of that dimension
    for (unsigned i = 0; i < getRowCount(); i++)
    {
        //if we find a rowVector where this isn't the case
        if (matrixData[i].size() < largestSize)
        {
            //add zeroes to it so that it becomes the case
            matrixData[i].insert(matrixData[i].end(), largestSize-matrixData[i].size(), 0);
        }
    }
}
//operators
//+ and - operators should check to see if the size of the first matrix is exactly the same size as that of the second matrix
Matrix Matrix::operator+(const Matrix& B)
{
    //if the sizes coincide
    if ((getRowCount() == B.getRowCount()) && (getColumnCount() == B.getColumnCount()))
    {
        //declare the matrixData
        std::vector<std::vector<long double> > summedData = B.matrixData;    //since we are in the scope of the Matrix, we can access private data members
        for (unsigned i = 0; i < getRowCount(); i++)
        {
            for (unsigned j = 0; j < getColumnCount(); j++)
            {
                summedData[i][j] += matrixData[i][j];   //add the elements together
            }
        }
        //return result Matrix
        return Matrix(summedData);
    }
    else
        throw MatrixArithmeticException(DIFFERENT_DIMENSIONS);
}
Matrix Matrix::operator-(const Matrix& B)
{
    //declare negativeB
    Matrix negativeB = B;
    //negate all entries
    for (unsigned i = 0; i < negativeB.getRowCount(); i++)
    {
        for (unsigned j = 0; j < negativeB.getColumnCount(); j++)
        {
            negativeB.matrixData[i][j] = 0-negativeB.matrixData[i][j];
        }
    }
    //simply add the negativeB
    try
    {
        return ((*this)+negativeB);
    }
    catch (MatrixArithmeticException& mistake)
    {
        //report the problem, then rethrow so that control never reaches the end of the function without returning a Matrix
        std::cout << mistake.what() << std::endl;
        throw;
    }
}
Matrix Matrix::operator*(const Matrix& B)
{
    //the columnCount of the left operand must be equal to the rowCount of the right operand
    if (getColumnCount() == B.getRowCount())
    {
        //if it is, declare data with getRowCount() rows and B.getColumnCount() columns
        std::vector<long double> zeroVector(B.getColumnCount(), 0);
        std::vector<std::vector<long double> > data(getRowCount(), zeroVector);
        for (unsigned i = 0; i < getRowCount(); i++)
        {
            for (unsigned j = 0; j < B.getColumnCount(); j++)
            {
                long double sum = 0; //set sum to zero
                for (unsigned k = 0; k < getColumnCount(); k++)
                {
                    //add the product of matrixData[i][k] and B.matrixData[k][j] to sum
                    sum += (matrixData[i][k]*B.matrixData[k][j]);
                }
                data[i][j] = sum;   //assign the sum to data
            }
        }
        return Matrix(data);
    }
    else
    {
        throw MatrixArithmeticException(ROW_COLUMN_MISMATCH); //dimension mismatch
    }
}
std::ostream& operator<<(std::ostream& outputStream, const Matrix& theMatrix)
{
    //Here, you should use the << again, just like you would for ANYTHING ELSE.
    //first, print a newline
    outputStream << "\n";
    //setting precision (optional)
    outputStream.precision(11);
    for (unsigned i = 0; i < theMatrix.getRowCount(); i++)
    {
        //print '['
        outputStream << "[";
        //format stream(optional)
        for (unsigned j = 0; j < theMatrix.getColumnCount(); j++)
        {
            //print numbers
            outputStream << std::setw(17) << theMatrix.matrixData[i][j];
            //print ", "
            if (j < theMatrix.getColumnCount() - 1)
                outputStream << ", ";
        }
        //print ']' and end the row
        outputStream << "]\n";
    }
    return outputStream;
}

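You computed two numbers x and y of a limited-precision floating-point type. That means they have already been rounded somehow, so precision was lost while computing them. If you subtract those numbers afterwards, you compute the difference between two already rounded numbers.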

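The formula you wrote gives the maximum error for computing the difference, but that error is relative to the stored intermediate results x and y (again: rounded). No method other than x-y will give you a "better" result (in terms of the complete computation, not only the difference). In a nutshell: the difference cannot be more accurate using any formula other than x-y.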
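A small sketch may make this concrete. It assumes IEEE 754 doubles with round-to-nearest, and the function name is hypothetical. For operands this close together (y/2 <= x <= 2y), Sterbenz's lemma says the subtraction itself is exact, so whatever error shows up in the difference was already present in the rounded operands.

```cpp
#include <cassert>
#include <cmath>

// Subtracts two doubles that satisfy Sterbenz's condition y/2 <= x <= 2y.
// Under that condition x - y is representable, so no new rounding occurs here.
double closeDifference(double x, double y) {
    assert(y / 2.0 <= x && x <= 2.0 * y);
    return x - y;
}
```

For example, closeDifference(0.1 + 0.2, 0.3) is not zero: the sum 0.1 + 0.2 was rounded to the double just above the double nearest 0.3, and the subtraction faithfully reports that one-unit-in-the-last-place gap of 2^-54.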

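I suggest looking at arbitrary-precision arithmetic libraries such as GMP (Eigen, a linear algebra library, can be combined with a multiprecision scalar type). Use such a library to solve your system of equations; do not use double for the matrix computations. That way you can make sure the intermediate results x and y (or the matrices Ax and B) are as precise as you want them to be, for example 512 bits, which should be enough in most cases.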

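Finite-precision floating-point data types cannot represent all possible real values. There are infinitely many different real values, so it is easy to see that not all of them can be represented in a type of finite size.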

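So it is perfectly plausible that your true solution is a non-representable value. No amount of trickery can give you an exact solution in a finite data type.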

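You need to recalibrate your expectations to match the reality of finite-precision floating-point data types. The starting point is What Every Computer Scientist Should Know About Floating-Point Arithmetic.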

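To everyone answering the question: I knew, and found out by accident, that the cardinality of the set of all possible doubles is finite. I suppose I have no choice but to either try a higher-precision number type or create my own class that represents a HugeDecimal.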

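You cannot expect infinite precision from floating-point numbers. You should consider what precision you need and then choose the simplest method that meets it. So if all methods give you the same result, stick with plain subtraction and use an epsilon, as suggested in V-X's answer.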
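As a rough sketch of that advice applied to the residual A*X-B, the snippet below uses plain subtraction and compares the largest entry against a caller-chosen epsilon. The names Grid, maxAbsResidual and residualIsAcceptable are hypothetical, and the storage layout is assumed to match matrixData (grid[row][column]); it does not use the Matrix class itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Row = std::vector<long double>;
using Grid = std::vector<Row>;  // assumed layout: grid[row][column]

// Largest absolute entry of A*X - B, computed with plain subtraction.
long double maxAbsResidual(const Grid& A, const Grid& X, const Grid& B) {
    assert(!A.empty() && !X.empty() && A[0].size() == X.size());
    assert(B.size() == A.size());
    long double worst = 0.0L;
    for (std::size_t i = 0; i < A.size(); ++i) {
        for (std::size_t j = 0; j < X[0].size(); ++j) {
            long double sum = 0.0L;
            for (std::size_t k = 0; k < X.size(); ++k) {
                sum += A[i][k] * X[k][j];  // entry (i, j) of A*X
            }
            worst = std::max(worst, std::fabs(sum - B[i][j]));
        }
    }
    return worst;
}

// True when every entry of the residual is within epsilon of zero.
bool residualIsAcceptable(const Grid& A, const Grid& X, const Grid& B,
                          long double epsilon) {
    return maxAbsResidual(A, X, B) <= epsilon;
}
```

Which epsilon is sensible depends on the size of the entries and on the precision you actually need; the sketch leaves that choice to the caller.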

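What makes you think the conjugate method has O(n^2) complexity? It uses a fixed set of operations: two multiplications, one subtraction, one addition and one division. Assuming each of these operations is O(1), applying it to n numbers is O(n).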

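While this may not help you choose a method, a while ago I wrote a tool that may help you choose a precision based on the kinds of values you expect: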

http://riot.so/floatprecision.html

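As the other answers say, you cannot expect infinite precision from floating point, but you can use a tool like this to find the smallest increment and decrement of a given number and work out which precision gives you the accuracy you need.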
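The same information can be obtained in C++ itself. The sketch below is not the linked tool, only an illustration using std::nextafter from the standard library; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Gap between x and the next representable double above it.
double gapAbove(double x) {
    assert(std::isfinite(x));
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}

// Gap between x and the next representable double below it.
double gapBelow(double x) {
    assert(std::isfinite(x));
    return x - std::nextafter(x, -std::numeric_limits<double>::infinity());
}
```

For x = 1.0 the gap above is std::numeric_limits<double>::epsilon() and the gap below is half of that; the gaps grow in proportion to the magnitude of x, so differences smaller than the gap cannot be resolved at that magnitude.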

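Replace the equality test with a check for whether the difference is bigger than some given epsilon (a constant that stands for the minimal distinguishable difference).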
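One possible shape for such a check is sketched below. The name nearlyEqual and the choice to scale epsilon by the larger operand (with a floor of 1) are assumptions made for illustration, not something prescribed by the answer.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// True when a and b differ by no more than epsilon, scaled by the larger
// magnitude. The floor of 1.0 makes the comparison absolute near zero.
bool nearlyEqual(double a, double b, double epsilon) {
    assert(epsilon >= 0.0);
    const double scale = std::max({1.0, std::fabs(a), std::fabs(b)});
    return std::fabs(a - b) <= epsilon * scale;
}
```

With this, nearlyEqual(0.1 + 0.2, 0.3, 1e-12) is true even though 0.1 + 0.2 == 0.3 is false for doubles.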