Plagiarism detection - winnowing fingerprints clash

Keywords: fingerprint, clash, winnowing, detection. Updated: 2023-10-16

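I am writing a plagiarism detection application for large text files. After reading many articles on the subject, I decided to use the Winnowing algorithm (with the Karp-Rabin rolling hash function), but I have run into some problems with it.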

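Data:

I have two simple text files: the first is the larger one, and the second is just a single paragraph taken from the first.

Algorithm used:

This is the algorithm I use to select fingerprints from all the hashes.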

void winnow(int w /*window size*/) {
    // circular buffer implementing window of size w
    hash_t h[w];
    for (int i=0; i<w; ++i) h[i] = INT_MAX;
    int r = 0; // window right end
    int min = 0; // index of minimum hash
    // At the end of each iteration, min holds the
    // position of the rightmost minimal hash in the
    // current window. record(x) is called only the
    // first time an instance of x is selected as the
    // rightmost minimal hash of a window.
    while (true) {
        r = (r + 1) % w; // shift the window by one
        h[r] = next_hash(); // and add one new hash, if hash = -1, then it's end of file
        if(h[r] == -1)
            break;
        if (min == r) {
            // The previous minimum is no longer in this
            // window. Scan h leftward starting from r
            // for the rightmost minimal hash. Note min
            // starts with the index of the rightmost
            // hash.
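            // (r-1+w)%w keeps the index non-negative when r == 0;
            // in C, (r-1)%w would evaluate to -1 there.
            for(int i=(r-1+w)%w; i!=r; i=(i-1+w)%w)
                if (h[i] < h[min]) min = i;
            record(h[min], global_pos(min, r, w));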
        } else {
            // Otherwise, the previous minimum is still in
            // this window. Compare against the new value
            // and update min if necessary.
            if (h[r] <= h[min]) { // (*)
                min = r;
                record(h[min], global_pos(min, r, w));
            }
        }
    }
}
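When the selected fingerprints do not line up, it can help to test the selection step in isolation on a small in-memory array of hashes. The following is a minimal sketch under stated assumptions, not the code from the question: `winnow_array` is a hypothetical helper, it only considers full windows (unlike the streaming version above, which also records minima while the buffer is still filling), and it rescans every window for clarity rather than speed. Because the choice depends only on the contents of a window, two inputs that share a run of at least `w` consecutive hashes should select the same hash value inside that run.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: for every full window of w hashes, select the
 * rightmost minimal hash and store its index in picked[] the first time
 * that index is selected. Returns the number of indices stored.
 * picked must have room for n entries. */
size_t winnow_array(const uint32_t *h, size_t n, size_t w, size_t *picked) {
    size_t count = 0;
    size_t last = (size_t)-1;           /* no index selected yet */
    if (w == 0 || n < w) return 0;      /* no full window available */
    for (size_t start = 0; start + w <= n; ++start) {
        size_t min = start;
        for (size_t i = start + 1; i < start + w; ++i)
            if (h[i] <= h[min]) min = i;    /* <= keeps the rightmost minimum */
        if (min != last) {                  /* selected index only moves right */
            picked[count++] = min;
            last = min;
        }
    }
    return count;
}
```

With the made-up sample `77 74 42 17 98 50 17 98 8 88 67 39 77 74 42 17 98` and `w = 4`, the run `77 74 42 17 98` occurs at both ends, and the value 17 is selected in both occurrences.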

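Next, to detect whether the two files contain the same text, I simply compare the fingerprints from both texts and check for matches. So, to detect plagiarism, the algorithm has to pick hashes that start at exactly the same place in the text, for example: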

Text 1: Running is my check to you.

Text 2: My brother is my coward.

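To get hashes with the same value (which also means the text is the same), the algorithm should take fingerprints from the places I marked with "|" or "^" (I assume 5 characters are used to compute a hash, spaces excluded). It cannot take the hash from "|" in text 1 and from "^" in text 2, because those two hashes differ and the plagiarism would go undetected.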
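To make the alignment point concrete, here is a small sketch, assuming a plain polynomial hash over 5 characters with whitespace removed first. The names `strip_spaces` and `kgram_hash`, the base 257, and the sample strings in the note below are illustrative assumptions, not taken from the question's code. The hash of a k-gram depends only on its characters, not on its offset in the file, so the same 5 characters hash identically in both texts, while a window shifted by one character hashes differently.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

#define KGRAM_LEN  5      /* characters per hash, as assumed in the question */
#define KGRAM_BASE 257u   /* hypothetical base, larger than any byte value */

/* Copy src into dst without whitespace, lower-cased; returns the new length.
 * dst must have room for strlen(src) + 1 bytes. */
size_t strip_spaces(const char *src, char *dst) {
    size_t n = 0;
    for (; *src; ++src)
        if (!isspace((unsigned char)*src))
            dst[n++] = (char)tolower((unsigned char)*src);
    dst[n] = '\0';
    return n;
}

/* Polynomial hash of the KGRAM_LEN characters starting at s.
 * With 64-bit arithmetic and 5 characters nothing overflows, so
 * different 5-grams always produce different values here. */
uint64_t kgram_hash(const char *s) {
    uint64_t h = 0;
    for (int i = 0; i < KGRAM_LEN; ++i)
        h = h * KGRAM_BASE + (unsigned char)s[i];
    return h;
}
```

For example, after stripping, "one two shared words here" has "share" at offset 6 and "zzz shared words" has it at offset 3; `kgram_hash` returns the same value for both, whereas the 5-gram starting one character earlier in the first text gives a different value.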

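Problem:

To detect that the paragraph was copied from text 1, there must be two identical fingerprints somewhere in both texts. The problem is that the fingerprints the algorithm selects do not match each other; they simply miss one another, even in the larger text.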

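Question:

Do you have any ideas? How can I improve this algorithm (which really comes down to choosing fingerprints correctly) so that it is more likely to find plagiarism?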

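My thoughts:

I considered running the winnow function several times with different window sizes (which would cause different hashes to be selected), but for the large texts this program has to handle (around 2 MB of plain text) that would take too much time.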

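If you have a running window over which you compute the hash, you can update the hash value on the fly as the window moves. This method is called the Rabin fingerprint. It should let you compute all fingerprints of size X in O(n) running time, where n is the size of the input document. I guess the paper you cite is some advanced extension of this method, and, if implemented correctly, it should give you a similar running time. The key is to update the hash rather than recompute it.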
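The update-instead-of-recompute idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the paper or from the question: the names `rolling_hashes` and `direct_hash`, the base 256, and the modulus 1000000007 are all chosen here for the example. Each step removes the contribution of the character leaving the window, shifts the remainder by one position, and adds the incoming character, so every window costs a constant amount of work.

```c
#include <stddef.h>
#include <stdint.h>

#define RK_BASE 256u          /* hypothetical base: one "digit" per byte */
#define RK_MOD  1000000007u   /* hypothetical prime modulus */

/* Reference version: hash k characters from scratch, O(k) per call. */
uint32_t direct_hash(const unsigned char *s, size_t k) {
    uint64_t h = 0;
    for (size_t i = 0; i < k; ++i)
        h = (h * RK_BASE + s[i]) % RK_MOD;
    return (uint32_t)h;
}

/* Rolling version: writes the hash of every k-gram of s[0..n) into out,
 * in O(n) total. out must have room for n - k + 1 entries.
 * Returns the number of hashes written (0 if n < k or k == 0). */
size_t rolling_hashes(const unsigned char *s, size_t n, size_t k, uint32_t *out) {
    if (k == 0 || n < k) return 0;

    uint64_t high = 1;                     /* RK_BASE^(k-1) mod RK_MOD */
    for (size_t i = 1; i < k; ++i)
        high = high * RK_BASE % RK_MOD;

    uint64_t h = direct_hash(s, k);        /* first window, computed once */
    out[0] = (uint32_t)h;

    for (size_t i = k; i < n; ++i) {
        /* drop the outgoing character s[i-k]; adding RK_MOD avoids underflow */
        h = (h + RK_MOD - s[i - k] * high % RK_MOD) % RK_MOD;
        /* shift by one position and append the incoming character s[i] */
        h = (h * RK_BASE + s[i]) % RK_MOD;
        out[i - k + 1] = (uint32_t)h;
    }
    return n - k + 1;
}
```

On any input, `rolling_hashes` should produce the same values as calling `direct_hash` at every offset; the difference is only in cost, O(n) versus O(n * k).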