Parallel program no speed increase vs linear program

Keywords: program, speed, increase, linear, parallel. Updated: 2023-10-16

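I have created a model of a more complex program that will use multithreading and multiple hard drives to increase performance. The data is too large to read into memory all at once, so it is read, processed, and written back in chunks. This test program uses a pipeline design so that it can read, process, and write at the same time on 3 different threads. Because the reads and writes go to different hard drives, reading and writing simultaneously is not a problem. However, the multithreaded program seems to run 2x slower than its linear version (also in the code). I have tried keeping the read and write threads alive after each chunk instead of destroying them, but the synchronization seemed to make it even slower than the current version. I would like to know whether I am doing something wrong and how I can improve this. Thank you very much.

Tested on an i3-2100 @ 3.1 GHz with 16 GB of RAM.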

#include <iostream>
#include <fstream>
#include <ctime>
#include <thread>
#define CHUNKSIZE 8192    //size of each chunk to process
#define DATASIZE 2097152  //total size of data
using namespace std;
int data[3][CHUNKSIZE];
int run = 0;
int totalRun = DATASIZE/CHUNKSIZE;
bool finishRead = false, finishWrite = false;
ifstream infile;
ofstream outfile;
clock_t starttime, endtime;
/*
    Process a chunk of data(simulate only, does not require to sort all data)
*/
void quickSort(int arr[], int left, int right) {
    int i = left, j = right;
    int tmp;
    int pivot = arr[(left + right) / 2];
    while (i <= j) {
        while (arr[i] < pivot) i++;
        while (arr[j] > pivot) j--;
        if (i <= j) {
            tmp = arr[i];
            arr[i] = arr[j];
            arr[j] = tmp;
            i++;
            j--;
        }
    };
    if (left < j) quickSort(arr, left, j);
    if (i < right) quickSort(arr, i, right);
}
/*
    Find runtime
*/
void diffclock(){
    double diff = (endtime - starttime)/(CLOCKS_PER_SEC/1000);
    cout<<"Total run time: "<<diff<<"ms"<<endl;
}
/*
    Read a chunk of data
*/
void readData(){
    for(int i = 0; i < CHUNKSIZE; i++){
        infile>>data[run%3][i];
    }
    finishRead = true;
}
/*
    Write a chunk of data
*/
void writeData(){
    for(int i = 0; i < CHUNKSIZE; i++){
        outfile<<data[(run-2)%3][i]<<endl;
    }
    finishWrite = true;
}
/*
    Pipelines Read, Process, Write using multithread
*/
void threadtransfer(){
    starttime = clock();
    infile.open("/home/pcg/test/iothread/source.txt");
    outfile.open("/media/pcg/Data/test/iothread/ThreadDuplicate.txt");
    thread read, write;
    run = 0;
    readData();
    run = 1;
    readData();
    quickSort(data[(run-1)%3], 0, CHUNKSIZE - 1);
    run = 2;
    while(run < totalRun){
        //cout<<run<<endl;
        finishRead = finishWrite = false;
        read = thread(readData);
        write = thread(writeData);
        read.detach();
        write.detach();
        quickSort(data[(run-1)%3], 0, CHUNKSIZE - 1);
        while(!finishRead||!finishWrite){}  //check if next cycle is ready.
        run++;
    }

    quickSort(data[(run-1)%3], 0, CHUNKSIZE - 1);
    writeData();
    run++;
    writeData();
    infile.close();
    outfile.close();
    endtime = clock();
    diffclock();
}
/*
    Linearly read, sort, and write a chunk and repeat.
*/
void lineartransfer(){
    int totalRun = DATASIZE/CHUNKSIZE;
    int holder[CHUNKSIZE];
    starttime = clock();
    infile.open("/home/pcg/test/iothread/source.txt");
    outfile.open("/media/pcg/Data/test/iothread/Linearduplicate.txt");
    run = 0;
    while(run < totalRun){
        for(int i = 0; i < CHUNKSIZE; i++) infile>>holder[i];
        quickSort(holder, 0, CHUNKSIZE - 1);
        for(int i = 0; i < CHUNKSIZE; i++) outfile<<holder[i]<<endl;
        run++;
    }
    endtime = clock();
    diffclock();
}
/*
    Create large amount of data for testing
*/
void createData(){
    outfile.open("/home/pcg/test/iothread/source.txt");
    for(int i = 0; i < DATASIZE; i++){
        outfile<<rand()<<endl;
    }
    outfile.close();
}

int main(){
    int mode=0;
    cout<<"Number of threads: "<<thread::hardware_concurrency()<<endl;
    cout<<"Enter mode\n1.Create Data\n2.thread copy\n3.linear copy\ninput mode:";
    cin>>mode;
    if(mode == 1) createData();
    else if(mode == 2) threadtransfer();
    else if(mode == 3) lineartransfer();
    return 0;
}

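Don't busy-wait. It wastes precious CPU time and may well slow down everything else. Worse, the flags are plain bool variables, so the compiler cannot guess that another thread will change them and is allowed to optimize the loop into an infinite one; the code is not even correct to begin with. Don't detach() either. Replace both the detach() calls and the busy wait with join():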

while (run < totalRun) {
    read = thread(readData);
    write = thread(writeData);
    quickSort(data[(run-1)%3], 0, CHUNKSIZE - 1);
    read.join();
    write.join();
    run++;
}

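As for the overall design: ignoring the global variables, I suppose it is acceptable as long as you don't expect the processing (quickSort) part to take longer than the read/write time. Personally, I would use message queues to pass the buffers between the threads. That makes it possible to add more processing threads if needed, either doing the same task in parallel or different tasks in sequence. But maybe that is just because I am used to doing it that way.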
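The message-queue idea could be sketched roughly as follows. This is only an illustration under stated assumptions: `BlockingQueue`, `Chunk`, and `runPipeline` are hypothetical names that do not appear in the question, `std::sort` stands in for the `quickSort` step, and in-memory vectors stand in for the two files.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper (not from the question): a minimal blocking queue.
template <typename T>
class BlockingQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(value));
        }
        ready_.notify_one();
    }
    // Tells consumers that no further items will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }
    // Sleeps until an item is available; returns false once closed and drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop();
        return true;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_ = false;
};

using Chunk = std::vector<int>;

// Three long-lived threads; the vectors stand in for the input and output files.
std::vector<Chunk> runPipeline(std::vector<Chunk> input) {
    BlockingQueue<Chunk> toProcess, toWrite;
    std::vector<Chunk> output;

    std::thread reader([&] {
        for (Chunk& chunk : input) toProcess.push(std::move(chunk));
        toProcess.close();
    });
    std::thread processor([&] {
        Chunk chunk;
        while (toProcess.pop(chunk)) {
            std::sort(chunk.begin(), chunk.end());  // stands in for quickSort
            toWrite.push(std::move(chunk));
        }
        toWrite.close();
    });
    std::thread writer([&] {
        Chunk chunk;
        while (toWrite.pop(chunk)) output.push_back(std::move(chunk));
    });

    reader.join();
    processor.join();
    writer.join();
    return output;
}
```

Each thread is created once and sleeps on a condition variable while it has nothing to do, so there is no per-chunk thread creation and no spinning. The queue in this sketch is unbounded; for data that does not fit in memory, a real version would probably cap the queue length so the reader cannot run far ahead of the writer.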

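Since you are measuring time with clock on a Linux machine, you are measuring the CPU time consumed by the whole process, not elapsed time. I would expect the total CPU time to be (roughly) the same whether you run one thread or several.

Perhaps you want to use time myprog instead? Or use gettimeofday to get the time, which gives you a time in seconds + microseconds (although the microseconds may not be "accurate" down to the last digit).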
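To make the difference between CPU time and elapsed time concrete, here is a small sketch that records both around the same piece of work. It swaps in `std::chrono::steady_clock` from the C++11 standard library instead of gettimeofday; `Timing` and `measure` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>

// Hypothetical helper type: both measurements for one piece of work.
struct Timing {
    double wallMs;  // elapsed real time
    double cpuMs;   // processor time as reported by std::clock()
};

// Runs work() once and reports wall-clock time and CPU time side by side.
template <typename Work>
Timing measure(Work work) {
    const auto wallStart = std::chrono::steady_clock::now();
    const std::clock_t cpuStart = std::clock();
    work();
    const std::clock_t cpuEnd = std::clock();
    const auto wallEnd = std::chrono::steady_clock::now();

    Timing t;
    t.wallMs = std::chrono::duration<double, std::milli>(wallEnd - wallStart).count();
    t.cpuMs = 1000.0 * static_cast<double>(cpuEnd - cpuStart) / CLOCKS_PER_SEC;
    return t;
}
```

Wrapping threadtransfer and lineartransfer in something like `measure` and comparing `wallMs` is what would show whether the pipeline actually finishes sooner. On Linux, `cpuMs` adds up the time of every thread in the process, including any that are spinning in a busy wait.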

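Edit: Next, don't use endl when writing to a file. It slows things down a lot, because the C++ runtime goes and flushes the file on every line, and each flush is an operating system call. The stream is almost certainly protected against access from multiple threads in some way, so you end up with three threads doing their writes synchronously, one line at a time. Most likely that takes nearly 3x as long as running a single thread. Also, don't write to the same file from three different threads; that is going to go badly one way or another.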
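A minimal sketch of the endl advice, assuming a hypothetical helper `writeChunk` that takes any `std::ostream` (the question writes to the global `outfile` instead):

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical replacement for the write loop: '\n' ends the line without
// forcing a flush, so the stream buffer decides when to call the OS.
void writeChunk(std::ostream& out, const std::vector<int>& chunk) {
    for (int value : chunk) {
        out << value << '\n';
    }
    out.flush();  // optional: at most one explicit flush per chunk
}
```

The output bytes are identical to the endl version; only the number of flushes changes, from one per line to one per chunk.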

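Correct me if I am wrong, but it seems your threaded function is basically a linear function that does 3 times the work of your linear function?

In a threaded program you would create three threads and run the readData/quickSort functions once on each thread (distributing the workload). In your program, though, the thread simulation seems to just read three times, quicksort three times, and write three times, and then total the time it takes to do all three of each.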
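One way to read "distributing the workload" is to give each thread its own chunk and let the chunks be processed side by side. The sketch below assumes that reading; `sortChunksInParallel` is a hypothetical name, and `std::sort` stands in for the question's `quickSort`.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: sorts every chunk on its own thread, then waits for all.
void sortChunksInParallel(std::vector<std::vector<int>>& chunks) {
    std::vector<std::thread> workers;
    workers.reserve(chunks.size());
    for (std::vector<int>& chunk : chunks) {
        // Each thread touches only its own chunk, so no locking is needed.
        workers.emplace_back([&chunk] { std::sort(chunk.begin(), chunk.end()); });
    }
    for (std::thread& worker : workers) worker.join();
}
```

This only helps if the processing step dominates and there are spare cores. On the two-core i3-2100 mentioned in the question, it would make sense to keep the number of chunks per batch close to `thread::hardware_concurrency()`.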