由有限间隔分隔的两个字母出现的次数

number of occurrences of two letters separated by a finite gap

本文关键字：两个分隔更新时间：2023-10-16

我们有一个字符串Seq，大小为N，包含z字母{a,B,C....}。我必须执行一个函数，该函数计算字母i的出现次数，该字母与j的l字母(间隙)分开:例子:

I=A, j=T and gap=10
That have to find the number of occurrence of AT,A-T,A--T,A---T,A----T,,,A---------T(- is any alphabet)

我将此代码用于DNA序列(字母z =4)，并假设它是N(L*N)复杂性:

inline const short int V (const char x){
    switch(x){
    case 'A':case 'a':
      return 0;
      break;
    case 'C':case 'c':
      return 1;
      break;   
    case 'G':case 'g':
      return 2;
      break;
    default:
      return 3;
    break;  
    }
  }
  // [[Rcpp::export]]
  IntegerVector basem(std::string seq ,int gap ,int N){
    IntegerVector res(16);
     short int x=0;
    short int y=0;
    for(int i=0;i<N-gap;i++){
      x=V(seq[i]);
      y=V(seq[i+gap]);
      res[4*x+y]++;
    }
    return res;
  }
  // [[Rcpp::export]]
  List basegap (std::string sequence,int x){
    int n=sequence.size();
    List result ;
    for(int j=1;j<x;j++) {
      result.push_back(basem(sequence,j,n));
    }
    return result;
  }

basem:计算空格出现的次数
basegap:计算从1到X的间隔出现的次数

运行示例:

a
[1] "atgccatcactcagtaaagaagcggccctggttcatgaagcgttagttgcgcgaggactggaaacaccgctgcgcccgcccgtgcatgaaatggataacgaaacgccgaaaagccttattgctggtcatatgaccgaaatcatgcagctgctgaatctcgacctggctgatgacagtttgatggaaacgcggcatcgcatcgctaaaatgtatgtcgatgaaattttctccggtctggattacgccaatttcccgaaaatcaccctcattgaaaacaaaatgaaggtcgatgaaatggtcaccgtccgcgatatcactctgaccagcacctgtgaacaccattttgttaccatcgatggcaaagcgacggtggcctatatcccgaaagattcggtgatcggtctgtcaaaaattaaccgcattgtgcagttctttgcccagcgtccgcaggtgcaggaacgtctgacgcagcaaattcttgttgccgcgctacaaacgctgctgggcaccaataacgtggctgtctcgatcgacgcggtgcattactgcgtgaaggcgcgtggcatccgcgatgcaaccagtgccacgacaacgtcctctcttggtggattgttcaaatccagtcagaatacgcgccacgagtttctgcgcgctgtgcgtcatcacaactga"
 basem(a,5,nchar(a))
 [1] 40 50 42 37 47 46 36 49 45 39 44 39 38 42 45 28 
x<-basegap(a,10)
[[1]]
 [1] 61 38 23 48 48 39 57 35 44 58 28 38 17 44 60 33
[[2]]
 [1] 42 47 43 38 42 49 53 35 44 44 35 44 42 39 37 36
[[3]]
 [1] 50 39 44 37 35 53 44 47 46 41 47 33 39 46 32 36
[[4]]
 [1] 42 54 36 38 54 36 52 36 43 49 41 34 31 39 38 45
[[5]]
 [1] 40 50 42 37 47 46 36 49 45 39 44 39 38 42 45 28
[[6]]
 [1] 42 44 40 42 45 44 44 45 38 44 47 38 44 45 36 28
[[7]]
 [1] 38 44 44 42 44 49 42 42 40 44 42 41 47 40 39 27
[[8]]
 [1] 34 48 47 38 46 49 41 41 47 39 39 42 42 40 40 31
[[9]]
 [1] 43 37 45 42 40 46 56 34 45 57 33 32 40 36 33 44
m<-matrix(unlist(x),nrow=16)
m
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
 [1,]   61   42   50   42   40   42   38   34   43
 [2,]   38   47   39   54   50   44   44   48   37
 [3,]   23   43   44   36   42   40   44   47   45
 [4,]   48   38   37   38   37   42   42   38   42
 [5,]   48   42   35   54   47   45   44   46   40
 [6,]   39   49   53   36   46   44   49   49   46
 [7,]   57   53   44   52   36   44   42   41   56
 [8,]   35   35   47   36   49   45   42   41   34
 [9,]   44   44   46   43   45   38   40   47   45
[10,]   58   44   41   49   39   44   44   39   57
[11,]   28   35   47   41   44   47   42   39   33
[12,]   38   44   33   34   39   38   41   42   32
[13,]   17   42   39   31   38   44   47   42   40
[14,]   44   39   46   39   42   45   40   40   36
[15,]   60   37   32   38   45   36   39   40   33
[16,]   33   36   36   45   28   28   27   31   44

我的问题是:

如何降低这种计算的复杂性?
有什么想法来改进这个代码吗?

你可以在范围<在O(N)时间内使用队列数据结构>

 Run through you string.
     When you meet A, add it's position to queue
     When you meet T, add queue size to result
     If queue head <= current_position - gap
        remove head

这可以在O(N log N).中完成，算法如下:

按排序方式存储每个字母i的索引。O(N)
对于每个包含字母j的索引，在范围内查找包含字母i的索引个数。O(N log N)

在c++中，这可以通过减去迭代器来实现。例如，在第一次迭代中:

vector<int> indices;   <- Store indices of letter i
int index;   <- Index of a letter j
vector<int>:iterator it1 = upper_bound(indices.begin(),indices.end(),index+L);  <- Position just above the last index 
vector<int>:iterator it2 = lower_bound(indices.begin(),indices.end(),index-L);  <- Position at or below the first index
int num = it1 - it2;  <- number of indices with the letter i in the range

对于其他语言，可能有其他的方法来做到这一点，但由于我不是很精通不同的编程语言，我只能给出一个c++的例子。基本上你想用大索引减去小索引。

你可能还想看看二分查找的其他应用。