FFT的研究 - 为什么它不快？

study of FFT - Why it's not fast?

本文关键字：为什么 FFT 更新时间：2023-10-16

我不确定这是更多的数学问题还是更多的编程问题。如果是数学，请告诉我。

我知道有很多现成的免费FFT项目。但我试图理解FFT方法。只是为了好玩和学习它。所以我制作了两种算法 - DFT 和 FFT，以比较它们。

但是我的FFT有问题。似乎效率没有太大差异。我的FFT只比DFT快一点点(在某些情况下快两倍，但它是最大加速度(

在大多数关于FFT的文章中，都有关于位反转的内容。但我看不出使用位反转的理由。大概是这样吧。我不明白。请帮助我。我做错了什么？

这是我的代码(你可以在这里复制它，看看它是如何工作的 - 在线编译器(：

#include <complex>
#include <iostream>
#include <math.h>
#include <cmath>
#include <vector>
#include <chrono>
#include <ctime>
float _Pi = 3.14159265;
float sampleRate = 44100;
float resolution = 4;
float _SRrange = sampleRate / resolution; // I devide Sample Rate to make the loop smaller,
                                          //just to perform tests faster
float bufferSize = 512;
// Clock class is for measure time to execute whole loop:
class Clock
{
    public:
        Clock() { start = std::chrono::high_resolution_clock::now(); }
        ~Clock() {}
        float secondsElapsed()
        {
            auto stop = std::chrono::high_resolution_clock::now();
            return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
        }
        void reset() { start = std::chrono::high_resolution_clock::now(); }
    private: 
        std::chrono::time_point<std::chrono::high_resolution_clock> start;
};

// Function to calculate magnitude of complex number:
float _mag_Hf(std::complex<float> sf);
// Function to calculate exp(-j*2*PI*n*k / sampleRate) - where "j" is imaginary number:
std::complex<float> _Wnk_Nc(float n, float k);
// Function to calculate exp(-j*2*PI*k / sampleRate):
std::complex<float> _Wk_Nc(float k);

int main() {
  float scaleFFT = 512; // devide and conquere - if it's "1" then whole algorhitm is just simply DFT
            // I wonder what is the maximum of that value. I alvays thought it should be equal to
            // buffer size (number o samples) but above some value it start to work slower then DFT
  std::vector<float> inputSignal; // array of input signal
  inputSignal.resize(bufferSize); // how many sample we will use to calculate Fourier Transform
  std::vector<std::complex<float>> _Sf; // array to store Fourier Transform value for each measured frequency bin
  _Sf.resize(scaleFFT); // resize it to size which we need.
  std::vector<std::complex<float>> _Hf_Db_vect; //array to store magnitude (in logarythmic dB scale)            
                                                //for each measured frequency bin
  _Hf_Db_vect.resize(_SRrange); //resize it to make it able to store value for each measured freq value
  std::complex<float> _Sf_I_half; // complex to calculate first half of freq range
                                  // from 1 to Nyquist  (sampleRate/2)
  std::complex<float> _Sf_II_half; // complex to calculate second half of freq range
                                   //from Nyquist to sampleRate

        for(int i=0; i<(int)_Sf.size(); i++)
            inputSignal[i]  = cosf((float)i/_Pi); // fill the input signal with some data, no matter

  Clock _time; // Start measure time
for(int freqBinK=0; freqBinK < _SRrange/2; freqBinK++) // start calculate all freq (devide by 2 for two halves)
    {
        for(int i=0; i<(int)_Sf.size(); i++) _Sf[i]  = 0.0f; // clean all values, for next loop we need all values to be zero
        for (int n=0; n<bufferSize/_Sf.size(); ++n) // Here I take all samples in buffer
        {
            std::complex<float> _W = _Wnk_Nc(_Sf.size()*(float)n, freqBinK);
            for(int i=0; i<(int)_Sf.size(); i++) // Finally here is my devide and conquer
                _Sf[i]  += inputSignal[_Sf.size()*n  +i] * _W; // And I see no reason to use any bit reversal, how it shoul be????
        }
        std::complex<float> _Wk = _Wk_Nc(freqBinK);
        _Sf_I_half = 0.0f;
        _Sf_II_half = 0.0f;
        for(int z=0; z<(int)_Sf.size()/2; z++) // here I calculate Fourier transform for each freq
        {
            _Sf_I_half += _Wk_Nc(2.0f * (float)z * freqBinK) * (_Sf[2*z]  + _Wk * _Sf[2*z+1]); // First half - to Nyquist
            _Sf_II_half += _Wk_Nc(2.0f * (float)z *freqBinK) * (_Sf[2*z]  - _Wk * _Sf[2*z+1]); // Second half - to SampleRate
            // also don't see need to use reversal bit, where it shoul be??? :)
        }
        // Calculate magnitude in dB scale
        _Hf_Db_vect[freqBinK] = _mag_Hf(_Sf_I_half); // First half
        _Hf_Db_vect[freqBinK + _SRrange/2] = _mag_Hf(_Sf_II_half); // Second half
    }
  std::cout << _time.secondsElapsed() << std::endl; // time measuer after execution of whole loop
}

float _mag_Hf(std::complex<float> sf)
{
float _Re_2;
float _Im_2;
    _Re_2 = sf.real() * sf.real();
    _Im_2 = sf.imag() * sf.imag();
    return 20*log10(pow(_Re_2 + _Im_2, 0.5f)); //transform magnitude to logarhytmic dB scale
}
std::complex<float> _Wnk_Nc(float n, float k)
{
    std::complex<float> _Wnk_Ncomp;
    _Wnk_Ncomp.real(cosf(-2.0f * _Pi * (float)n * k / sampleRate));
    _Wnk_Ncomp.imag(sinf(-2.0f * _Pi * (float)n * k / sampleRate));
    return _Wnk_Ncomp;
}
std::complex<float> _Wk_Nc(float k)
{
    std::complex<float> _Wk_Ncomp;
    _Wk_Ncomp.real(cosf(-2.0f * _Pi * k / sampleRate));
    _Wk_Ncomp.imag(sinf(-2.0f * _Pi * k / sampleRate));
    return _Wk_Ncomp;
}

你犯的一个巨大错误是计算蝴蝶重量(涉及sin和cos(在飞行中(以_Wnk_Nc()为单位(。 sin和cos通常需要 10 秒到 100 秒的时钟周期，而其他蝴蝶操作只是 mul 和 add，只需要几个周期，因此需要将这些因素分解掉。所有快速FFT实现都将其作为初始化步骤的一部分(通常称为"计划创建"或类似步骤(。例如参见FFTW和KissFFT。

除了上面提到的"预先计算蝴蝶权重"优化之外，大多数FFT实现还使用SIMD指令来矢量化代码。

也不认为需要使用反转位，它应该在哪里？

第一个蝶形循环应该是反向位索引的。这些索引通常在递归内部计算，但对于循环解决方案，计算这些索引的成本也很高，因此最好在计划中预先计算它们。

结合这些优化方法可提高约 100 倍的加速速度

大多数快速的 FFT 实现要么使用预先计算的 twiddle 因子的查找表，要么使用简单的递归来动态旋转 twiddle 因子，而不是在 FFT 内部循环中调用三角数学库函数。

对于大型 FFT，使用 trig 递归公式不太可能破坏现代处理器上的数据缓存。