Direct-coupling analysis

Direct-coupling analysis of residue coevolution captures native contacts across many protein families


Background

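The similarity in the three-dimensional structures of homologous proteins imposes strong constraints on their sequence variability. It has long been suggested that the resulting correlations among amino acid compositions at different sequence positions can be exploited to infer spatial contacts within the tertiary protein structure.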

Crucial to this inference is the ability to disentangle direct and indirect correlations, as accomplished by the recently introduced direct-coupling analysis (DCA).

Here we develop a computationally efficient implementation of DCA, which allows us to evaluate the accuracy of contact prediction by DCA for a large number of protein domains, based purely on sequence information.

The resulting algorithm, mfDCA, is based on the mean-field approximation of DCA.

Methods

Data Extraction

For each family, the protein sequences are collected in one MSA denoted by $\{(A_1^a, A_2^a, \dots, A_L^a) \mid a = 1, 2, \dots, M\}$, where $L$ denotes the number of MSA columns (i.e., the length of the protein domains) and $M$ the number of sequences. Alignments are local alignments to the Pfam HMM; because of the large number of proteins in each MSA, we refrained from refinements using global alignment techniques.
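
As a minimal sketch of how such an MSA might be held in memory, the snippet below maps each aligned sequence to a row of integer codes. The alphabet ordering, the name `encode_msa`, and the choice to treat unknown symbols as gaps are illustrative assumptions, not part of the original method.

```python
# Hypothetical alphabet: the gap symbol plus the 20 standard amino acids.
ALPHABET = "-ACDEFGHIKLMNPQRSTVWY"
SYMBOL_TO_INDEX = {symbol: index for index, symbol in enumerate(ALPHABET)}


def encode_msa(sequences):
    """Turn M aligned strings of length L into an M x L list of integer codes."""
    if not sequences:
        raise ValueError("the alignment is empty")
    length = len(sequences[0])
    encoded = []
    for sequence in sequences:
        if len(sequence) != length:
            raise ValueError("all aligned sequences must have the same length L")
        # Assumption: any symbol outside the alphabet (e.g. X) is coded as a gap.
        encoded.append([SYMBOL_TO_INDEX.get(s, 0) for s in sequence.upper()])
    return encoded


# Toy alignment with M = 3 sequences and L = 4 columns.
msa = encode_msa(["AC-D", "ACWD", "GCWD"])
M, L = len(msa), len(msa[0])
```

Here `msa[a][i]` plays the role of $A_i^a$, with zero-based indices.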

Sequence Statistics and Reweighting

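The main inputs of DCA are reweighted frequency counts for single MSA columns and column pairs.

Reweighting is meant to correct for sampling bias. The weight of sequence $a$ is defined as $1/m^a$, where $m^a$ is the number of sequences in the MSA that are identical to sequence $a$ or share more than 80% sequence identity with it.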

The effective number of sequences $M_{\mathrm{eff}}$ is defined as:

$$M_{\mathrm{eff}} = \sum_{a=1}^{M} \frac{1}{m^a}$$
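
The reweighting step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names are hypothetical, identity is measured as the fraction of matching alignment columns (gap-gap matches included), and the quadratic loop is written for clarity, not speed.

```python
def sequence_identity(seq_a, seq_b):
    """Fraction of aligned positions at which two equal-length sequences agree."""
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
    return matches / len(seq_a)


def sequence_weights(msa, threshold=0.8):
    """Return the weights 1/m^a for every sequence a in the alignment.

    m^a counts the sequences whose identity with a exceeds the threshold;
    a itself is always counted, since its identity with itself is 1.
    """
    weights = []
    for seq_a in msa:
        m_a = sum(1 for seq_b in msa if sequence_identity(seq_a, seq_b) > threshold)
        weights.append(1.0 / m_a)
    return weights


# Toy alignment: the first two sequences are 90% identical, the third is unrelated.
msa = ["ACDEFGHIKL", "ACDEFGHIKV", "WYWYWYWYWY"]
weights = sequence_weights(msa)
m_eff = sum(weights)  # effective number of sequences
```

In this toy case the two near-duplicates each receive weight 0.5, so together they count as a single effective sequence.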

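The frequency counts are defined as:

  • The frequency with which amino acid $\alpha$ appears in column $i$ of the MSA: $$f_i(\alpha) = \frac{1}{M_{\mathrm{eff}} + \lambda} \left( \frac{\lambda}{q} + \sum_{a=1}^{M} \frac{1}{m^a} \, \delta_{\alpha, A_i^a} \right)$$
  • The joint frequency with which $\alpha$ appears in column $i$ and $\beta$ appears in column $j$: $$f_{ij}(\alpha, \beta) = \frac{1}{M_{\mathrm{eff}} + \lambda} \left( \frac{\lambda}{q^2} + \sum_{a=1}^{M} \frac{1}{m^a} \, \delta_{\alpha, A_i^a} \, \delta_{\beta, A_j^a} \right)$$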

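Here $\delta_{A,B}$ equals 1 if $A = B$ and 0 otherwise; $q = 21$ is the number of distinct amino acid symbols, counting the gap; and $\lambda$ is a pseudocount that provides statistical stability.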
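
A small sketch of the frequency counts, assuming the pseudocount is spread uniformly so that every distribution sums to 1 (a share of λ/q per symbol for single columns and λ/q² per symbol pair). The alphabet ordering, the function name, the example weights, and the value λ = 1 are illustrative assumptions; symbols outside the alphabet would raise a `KeyError`.

```python
from itertools import combinations, product

ALPHABET = "-ACDEFGHIKLMNPQRSTVWY"  # assumed symbol set: gap + 20 amino acids
Q = len(ALPHABET)  # q = 21


def reweighted_frequencies(msa, weights, pseudocount):
    """Return (f_i, f_ij) as dicts keyed by (i, alpha) and (i, j, alpha, beta), i < j."""
    length = len(msa[0])
    norm = sum(weights) + pseudocount  # M_eff + lambda

    # Every entry starts at its pseudocount share.
    f_i = {(i, a): pseudocount / Q for i in range(length) for a in ALPHABET}
    f_ij = {
        (i, j, a, b): pseudocount / Q**2
        for i, j in combinations(range(length), 2)
        for a, b in product(ALPHABET, repeat=2)
    }

    # Each sequence adds its weight 1/m^a to the symbols it actually contains.
    for sequence, weight in zip(msa, weights):
        for i, a in enumerate(sequence):
            f_i[(i, a)] += weight
        for i, j in combinations(range(length), 2):
            f_ij[(i, j, sequence[i], sequence[j])] += weight

    f_i = {key: value / norm for key, value in f_i.items()}
    f_ij = {key: value / norm for key, value in f_ij.items()}
    return f_i, f_ij


msa = ["ACD", "ACE", "W-E"]
weights = [0.5, 0.5, 1.0]  # illustrative 1/m^a values, not computed here
f_i, f_ij = reweighted_frequencies(msa, weights, pseudocount=1.0)
```

With this choice of pseudocount shares, summing the pair frequencies over one symbol recovers the single-column frequencies, which is a quick sanity check on any implementation.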

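At this point the mutual information (MI) can be computed directly, following the information-entropy formula:

$$MI_{ij} = \sum_{\alpha, \beta} f_{ij}(\alpha, \beta) \ln \frac{f_{ij}(\alpha, \beta)}{f_i(\alpha) \, f_j(\beta)}$$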
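
To make the MI formula concrete, here is a minimal sketch for a single column pair. It assumes the frequencies are supplied as plain dictionaries; the function name and the two-symbol toy alphabet (used here in place of q = 21 to keep the numbers readable) are illustrative assumptions.

```python
import math


def mutual_information(f_i, f_j, f_ij):
    """MI of one column pair from single and joint frequencies.

    f_i and f_j map a symbol to its frequency; f_ij maps a (symbol, symbol)
    pair to its joint frequency. Zero joint entries contribute nothing.
    """
    mi = 0.0
    for (a, b), joint in f_ij.items():
        if joint > 0.0:
            mi += joint * math.log(joint / (f_i[a] * f_j[b]))
    return mi


single = {"A": 0.5, "C": 0.5}

# Perfectly correlated columns: the two positions always carry the same symbol.
correlated = {("A", "A"): 0.5, ("C", "C"): 0.5, ("A", "C"): 0.0, ("C", "A"): 0.0}
mi_correlated = mutual_information(single, single, correlated)

# Independent columns: the joint frequency factorizes into the marginals.
independent = {(a, b): 0.25 for a in "AC" for b in "AC"}
mi_independent = mutual_information(single, single, independent)
```

The correlated case gives ln 2 and the independent case gives 0, the two extremes for a pair of evenly split two-symbol columns.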

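However, the paper argues that MI does a poor job of separating direct from indirect correlations; a better measure, DI (direct information), is derived in the later steps.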

Figure: comparison of the performance of DI, MI, and the Bayesian approach

To be updated.