More efficient algorithm to compute an integer mapping for a set of relations

Keywords: integer mapping, efficient algorithm, relations, set, computation. Updated: 2023-10-16

Original question and naive algorithm

Given a set of relations, such as

a < c
b < c
b < d < e

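what is the most efficient algorithm to find a set of integers, starting at 0 (and with as many repeated integers as possible!), that matches the set of relations? In this case: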

a = 0; b = 0; c = 1; d = 1; e = 2
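
To make "matches the set of relations" concrete, here is a minimal sketch of a checker. The name `satisfies` is hypothetical (it does not appear in the original post); it only tests whether a candidate mapping respects every chain, not whether it is the smallest such mapping.

```python
def satisfies(relations, values):
    """Return True if values[r[0]] < values[r[1]] < ... holds for every relation r."""
    for relation in relations:
        # Compare each element with its immediate successor in the chain.
        for left, right in zip(relation, relation[1:]):
            if not values[left] < values[right]:
                return False
    return True


relations = [('a', 'c'), ('b', 'c'), ('b', 'd', 'e')]
values = {'a': 0, 'b': 0, 'c': 1, 'd': 1, 'e': 2}
print(satisfies(relations, values))                 # True
print(satisfies(relations + [('e', 'd')], values))  # False
```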

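The naive algorithm is to iterate over the set of relations repeatedly, increasing values as needed, until convergence is reached, as implemented in Python below: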

relations = [('a', 'c'), ('b', 'c'), ('b', 'd', 'e')]
print(relations)
values = dict.fromkeys(set(sum(relations, ())), 0)
print(values)
converged = False
while not converged:
    converged = True
    for relation in relations:
        for i in range(1,len(relation)):
            if values[relation[i]] <= values[relation[i-1]]:
                converged = False
                values[relation[i]] = values[relation[i-1]] + 1
    print(values)

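Apart from the O(Relations²) complexity (if I'm not mistaken), this algorithm also goes into an infinite loop when given invalid relations (for example, after adding e < d). Detecting such failure cases is not strictly necessary for my use case, but would be a nice bonus.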
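
One way the infinite loop could be avoided, as a sketch rather than part of the original post: with n distinct symbols, a valid minimal assignment never needs a value of n or more, because the longest possible chain has n - 1 steps. The function name `naive_levels` and the bound check are assumptions added for illustration.

```python
def naive_levels(relations):
    """Naive fixed-point iteration with a bound check to detect cycles."""
    values = dict.fromkeys(set(sum(relations, ())), 0)
    limit = len(values)  # a valid minimal assignment never reaches this value
    converged = False
    while not converged:
        converged = True
        for relation in relations:
            for i in range(1, len(relation)):
                prev, cur = relation[i - 1], relation[i]
                if values[cur] <= values[prev]:
                    converged = False
                    values[cur] = values[prev] + 1
                    if values[cur] >= limit:
                        # Only a cycle can push a value this high.
                        raise ValueError("relations contain a cycle")
    return values


print(naive_levels([('a', 'c'), ('b', 'c'), ('b', 'd', 'e')]) ==
      {'a': 0, 'b': 0, 'c': 1, 'd': 1, 'e': 2})  # True
```

The quadratic worst case is unchanged; the check only turns non-termination into an exception.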

Python implementation based on Tim Peters' comment

relations = [('a', 'c'), ('b', 'c'), ('b', 'd'), ('b', 'e'), ('d', 'e')]
symbols = set(sum(relations, ()))
numIncoming = dict.fromkeys(symbols, 0)
values = {}
for rel in relations:
    numIncoming[rel[1]] += 1
k = 0
n = len(symbols)
c = 0
while k < n:
    curs = [sym for sym in symbols if numIncoming[sym] == 0]
    curr = [rel for rel in relations if rel[0] in curs]
    for sym in curs:
        symbols.remove(sym)
        values[sym] = c
    for rel in curr:
        relations.remove(rel)
        numIncoming[rel[1]] -= 1
    c += 1
    k += len(curs)
print(values)

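Currently it requires the relations to be "split" into pairs (for example b < d and d < e instead of b < d < e), but cycle detection is easy: whenever curs is empty while k < n, the remaining symbols are part of a cycle or depend on one.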
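
The cycle check described above might look like the following sketch. It is a restructured variant of the pair-based loop, not the original code: the name `version_b`, the copy of the input, and the single pass that filters the remaining pairs are all assumptions made for illustration.

```python
def version_b(pairs):
    """Level assignment over (smaller, larger) pairs; raises ValueError on cycles."""
    remaining = list(pairs)                  # work on a copy, leave the input intact
    symbols = set(sum(remaining, ()))
    num_incoming = dict.fromkeys(symbols, 0)
    for _, larger in remaining:
        num_incoming[larger] += 1
    values = {}
    level = 0
    while symbols:
        curs = [sym for sym in symbols if num_incoming[sym] == 0]
        if not curs:                         # nothing is free, yet symbols remain
            raise ValueError("cycle among %s" % sorted(symbols))
        for sym in curs:
            symbols.remove(sym)
            values[sym] = level
        free = set(curs)
        still_remaining = []
        for smaller, larger in remaining:
            if smaller in free:
                num_incoming[larger] -= 1    # this constraint is now satisfied
            else:
                still_remaining.append((smaller, larger))
        remaining = still_remaining
        level += 1
    return values


pairs = [('a', 'c'), ('b', 'c'), ('b', 'd'), ('d', 'e')]
print(version_b(pairs) == {'a': 0, 'b': 0, 'c': 1, 'd': 1, 'e': 2})  # True
```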

Worst-case timings (1000 elements, 999 relations, reverse order):

Version A: 0.944926519991
Version B: 0.115537379751

Best-case timings (1000 elements, 999 relations, forward order):

Version A: 0.00497004507556
Version B: 0.102511841589

Average-case timings (1000 elements, 999 relations, random order):

Version A: 0.487685376214
Version B: 0.109792166323

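The test data can be generated with

import random

n = 1000
relations_worst = list((a, b) for a, b in zip(range(n)[::-1][1:], range(n)[::-1]))
relations_best = list(relations_worst[::-1])
relations_avg = list(relations_worst)
random.shuffle(relations_avg)  # shuffle() works in place and returns None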
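
The post does not show how the timings were taken. A minimal harness along these lines could be used; `best_time` is a hypothetical helper, and `sorted` stands in for whichever implementation is being measured. Each run gets a fresh copy because Version B removes entries from the list it is given.

```python
import timeit


def best_time(func, relations, repeat=3):
    """Best wall-clock time of func over `repeat` runs, each on a fresh copy."""
    times = []
    for _ in range(repeat):
        data = list(relations)    # copy: some implementations mutate their input
        start = timeit.default_timer()
        func(data)
        times.append(timeit.default_timer() - start)
    return min(times)


n = 1000
# Same pairs as relations_worst: (998, 999), (997, 998), ..., (0, 1).
relations_worst = [(a, a + 1) for a in range(n - 2, -1, -1)]
print("worst %.6f" % best_time(sorted, relations_worst))  # `sorted` is only a stand-in
```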

C++ implementation based on Tim Peters' answer (simplified for symbols in [0, n))

#include <cassert>
#include <set>
#include <utility>
#include <vector>
using namespace std;

vector<unsigned> chunked_topsort(const vector<vector<unsigned>>& relations, unsigned n)
{
    vector<unsigned> ret(n);
    vector<set<unsigned>> succs(n);
    vector<unsigned> npreds(n);
    set<unsigned> allelts;
    set<unsigned> nopreds;
    for(auto i = n; i--;)
        allelts.insert(i);
    for(const auto& r : relations)
    {
        auto u = r[0];
        if(npreds[u] == 0) nopreds.insert(u);
        for(size_t i = 1; i < r.size(); ++i)
        {
            auto v = r[i];
            if(npreds[v] == 0) nopreds.insert(v);
            if(succs[u].count(v) == 0)
            {
                succs[u].insert(v);
                npreds[v] += 1;
                nopreds.erase(v);
            }
            u = v;
        }
    }
    set<unsigned> next;
    unsigned chunk = 0;
    while(!nopreds.empty())
    {
        next.clear();
        for(const auto& u : nopreds)
        {
            ret[u] = chunk;
            allelts.erase(u);
            for(const auto& v : succs[u])
            {
                npreds[v] -= 1;
                if(npreds[v] == 0)
                    next.insert(v);
            }
        }
        swap(nopreds, next);
        ++chunk;
    }
    assert(allelts.empty());
    return ret;
}

C++ implementation with improved cache locality

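#include <algorithm>
#include <cassert>
#include <numeric>
#include <tuple>
#include <utility>
#include <vector>
using namespace std;

vector<unsigned> chunked_topsort2(const vector<vector<unsigned>>& relations, unsigned n)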
{
    vector<unsigned> ret(n);
    vector<unsigned> npreds(n);
    vector<tuple<unsigned, unsigned>> flat_relations; flat_relations.reserve(relations.size());
    vector<unsigned> relation_offsets(n+1);
    for(const auto& r : relations)
    {
        if(r.size() < 2) continue;
        for(size_t i = 0; i < r.size()-1; ++i)
        {
            assert(r[i] < n && r[i+1] < n);
            flat_relations.emplace_back(r[i], r[i+1]);
            relation_offsets[r[i]+1] += 1;
            npreds[r[i+1]] += 1;
        }
    }
    partial_sum(relation_offsets.begin(), relation_offsets.end(), relation_offsets.begin());
    sort(flat_relations.begin(), flat_relations.end());
    vector<unsigned> nopreds;
    for(unsigned i = 0; i < n; ++i)
        if(npreds[i] == 0)
            nopreds.push_back(i);
    vector<unsigned> next;
    unsigned chunk = 0;
    while(!nopreds.empty())
    {
        next.clear();
        for(const auto& u : nopreds)
        {
            ret[u] = chunk;
            for(unsigned i = relation_offsets[u]; i < relation_offsets[u+1]; ++i)
            {
                auto v = std::get<1>(flat_relations[i]);
                npreds[v] -= 1;
                if(npreds[v] == 0)
                    next.push_back(v);
            }
        }
        swap(nopreds, next);
        ++chunk;
    }
    assert(all_of(npreds.begin(), npreds.end(), [](unsigned i) { return i == 0; }));
    return ret;
}
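
The flat layout used in chunked_topsort2 (count outgoing edges, take prefix sums, sort the pairs) is essentially a compressed adjacency list. For readers more comfortable with Python, the same idea can be sketched as follows; `build_csr` and `levels_csr` are hypothetical names, and the list slice makes a copy that the C++ loop avoids.

```python
from itertools import accumulate


def build_csr(pairs, n):
    """Return (offsets, targets): successors of u are targets[offsets[u]:offsets[u + 1]]."""
    counts = [0] * (n + 1)
    for u, _ in pairs:
        counts[u + 1] += 1
    offsets = list(accumulate(counts))        # prefix sums, like partial_sum
    targets = [v for _, v in sorted(pairs)]   # sorting groups the pairs by source
    return offsets, targets


def levels_csr(pairs, n):
    """Assign each element of range(n) its level; raise ValueError on cycles."""
    offsets, targets = build_csr(pairs, n)
    npreds = [0] * n
    for v in targets:
        npreds[v] += 1
    level = [-1] * n
    current = [u for u in range(n) if npreds[u] == 0]
    chunk = 0
    while current:
        nxt = []
        for u in current:
            level[u] = chunk
            for v in targets[offsets[u]:offsets[u + 1]]:
                npreds[v] -= 1
                if npreds[v] == 0:
                    nxt.append(v)
        current = nxt
        chunk += 1
    if any(npreds):
        raise ValueError("elements in cycles")
    return level


print(levels_csr([(0, 2), (1, 2), (1, 3), (3, 4)], 5))  # [0, 0, 1, 1, 2]
```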

C++ timings: 10000 elements, 9999 relations, averaged over 1000 runs

"Worst case":

chunked_topsort: 4.21345 ms
chunked_topsort2: 1.75062 ms

"Best case":

chunked_topsort: 4.27287 ms
chunked_topsort2: 0.541771 ms

"Average case":

chunked_topsort: 6.44712 ms
chunked_topsort2: 0.955116 ms

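Unlike the Python version, the C++ chunked_topsort depends heavily on the order of the elements. Interestingly, the random-order (average) case is by far the slowest for the set-based chunked_topsort.

Here's the implementation I didn't have time to post earlier: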

def chunked_topsort(relations):
    # `relations` is an iterable producing relations.
    # A relation is a sequence, interpreted to mean
    # relation[0] < relation[1] < relation[2] < ...
    # The result is a list such that
    # result[i] is the set of elements assigned to i.
    from collections import defaultdict
    succs = defaultdict(set)    # new empty set is default
    npreds = defaultdict(int)   # 0 is default
    allelts = set()
    nopreds = set()
    def add_elt(u):
        allelts.add(u)
        if npreds[u] == 0:
            nopreds.add(u)
    for r in relations:
        u = r[0]
        add_elt(u)
        for i in range(1, len(r)):
            v = r[i]
            add_elt(v)
            if v not in succs[u]:
                succs[u].add(v)
                npreds[v] += 1
                nopreds.discard(v)
            u = v
    result = []
    while nopreds:
        result.append(nopreds)
        allelts -= nopreds
        next_nopreds = set()
        for u in nopreds:
            for v in succs[u]:
                npreds[v] -= 1
                assert npreds[v] >= 0
                if npreds[v] == 0:
                    next_nopreds.add(v)
        nopreds = next_nopreds
    if allelts:
        raise ValueError("elements in cycles %s" % allelts)
    return result

Then, for example (the order of elements within each set may vary),

>>> print(chunked_topsort(['ac', 'bc', 'bde', 'be', 'fbcg']))
[{'a', 'f'}, {'b'}, {'c', 'd'}, {'e', 'g'}]

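Hope that helps. Note that there's no searching of any kind here (for example, no conditional list comprehensions). That makes it efficient, in theory ;-)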
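
The question asks for a symbol-to-integer mapping, whereas chunked_topsort() returns a list of sets. A small conversion could bridge the two; `chunks_to_mapping` is a hypothetical helper, not part of the answer.

```python
def chunks_to_mapping(chunks):
    """Turn [set_for_0, set_for_1, ...] into {element: integer}."""
    return {elt: level for level, chunk in enumerate(chunks) for elt in chunk}


chunks = [{'a', 'b'}, {'c', 'd'}, {'e'}]  # the chunk list for a < c, b < c, b < d < e
mapping = chunks_to_mapping(chunks)
print(sorted(mapping.items()))
# [('a', 0), ('b', 0), ('c', 1), ('d', 1), ('e', 2)]
```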

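Later: timings

On the test data generated near the end of the post, chunked_topsort() is nearly insensitive to the ordering of the input. That's not surprising, since the algorithm iterates over the input only once to build its (inherently unordered) dicts and sets. All in all, it's about 15 to 20 times faster than Version B. Typical timing output from 3 runs: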

worst chunked  0.007 B  0.129 B/chunked  19.79
best  chunked  0.007 B  0.110 B/chunked  16.85
avg   chunked  0.006 B  0.118 B/chunked  19.06
worst chunked  0.007 B  0.127 B/chunked  18.25
best  chunked  0.006 B  0.103 B/chunked  17.16
avg   chunked  0.006 B  0.119 B/chunked  18.86
worst chunked  0.007 B  0.132 B/chunked  20.20
best  chunked  0.007 B  0.105 B/chunked  16.04
avg   chunked  0.007 B  0.113 B/chunked  17.32

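Simpler data structures

Given that the problem has changed ;-), here's a rewrite that assumes the inputs are ints in range(n), and that n is also passed. There are no sets, no dicts, and no dynamic allocations after the initial pass over the input relations. In Python, this is about 40% faster than chunked_topsort() on the test data. But I'm too old to wrestle with C++ anymore ;-)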

def ct_special(relations, n):
    # `relations` is an iterable producing relations.
    # A relation is a sequence, interpreted to mean
    # relation[0] < relation[1] < relation[2] < ...
    # All elements are in range(n).
    # The result is a vector of length n such that
    # result[i] is the ordinal assigned to i, or
    # result[i] is -1 if i didn't appear in the relations.
    succs = [[] for i in range(n)]
    npreds = [-1] * n
    nopreds = [-1] * n
    numnopreds = 0
    def add_elt(u):
        if not 0 <= u < n:
            raise ValueError("element %s out of range" % u)
        if npreds[u] < 0:
            npreds[u] = 0
    for r in relations:
        u = r[0]
        add_elt(u)
        for i in range(1, len(r)):
            v = r[i]
            add_elt(v)
            succs[u].append(v)
            npreds[v] += 1
            u = v
    result = [-1] * n
    for u in range(n):
        if npreds[u] == 0:
            nopreds[numnopreds] = u
            numnopreds += 1
    ordinal = nopreds_start = 0
    while nopreds_start < numnopreds:
        next_nopreds_start = numnopreds
        for i in range(nopreds_start, numnopreds):
            u = nopreds[i]
            result[u] = ordinal
            for v in succs[u]:
                npreds[v] -= 1
                assert npreds[v] >= 0
                if npreds[v] == 0:
                    nopreds[numnopreds] = v
                    numnopreds += 1
        nopreds_start = next_nopreds_start
        ordinal += 1
    if any(count > 0 for count in npreds):
        raise ValueError("elements in cycles")
    return result

Again, in Python, this is insensitive to input ordering.
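
Because ct_special() only accepts ints in range(n), symbolic relations such as 'bde' would need to be numbered first. One possible sketch, where `index_relations` is a hypothetical helper and the numbering by order of first appearance is an arbitrary choice:

```python
def index_relations(relations):
    """Map arbitrary hashable symbols to ints 0..n-1 in order of first appearance."""
    index = {}
    int_relations = []
    for relation in relations:
        # setdefault assigns the next free number the first time a symbol is seen.
        int_relations.append([index.setdefault(sym, len(index)) for sym in relation])
    symbols = list(index)  # symbols[i] is the symbol numbered i
    return int_relations, len(symbols), symbols


int_relations, n, symbols = index_relations(['ac', 'bc', 'bde'])
print(int_relations)  # [[0, 1], [2, 1], [2, 3, 4]]
print(n, symbols)     # 5 ['a', 'c', 'b', 'd', 'e']
```

The vector returned by ct_special(int_relations, n) could then be mapped back to symbols with dict(zip(symbols, result)).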