Direct-coupling analysis

Direct-coupling analysis of residue coevolution captures native contacts across many protein families

Background

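Homologous proteins share similar three-dimensional structures, and this similarity constrains how their sequences can vary. It has long been suggested that the resulting correlations among amino acid compositions at different sequence positions can be exploited to infer spatial contacts within the tertiary protein structure.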
Crucial to this inference is the ability to disentangle direct and indirect correlations, as accomplished by the recently introduced direct-coupling analysis (DCA).
Here we develop a computationally efficient implementation of DCA, which allows us to evaluate the accuracy of contact prediction by DCA for a large number of protein domains, based purely on sequence information.

Here we will introduce mfDCA, an algorithm based on the mean-field approximation of DCA.

Methods

Data Extraction

For each family, the protein sequences are collected in one MSA denoted by $\{(A_{1}^{a}, A_{2}^{a}, \ldots, A_{L}^{a}) \mid a = 1, 2, \ldots, M\}$, where $L$ denotes the number of MSA columns (i.e., the length of the protein domains) and $M$ the number of sequences. Alignments are local alignments to the Pfam HMM; because of the large number of proteins in each MSA, we refrained from refinements using global alignment techniques.
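The MSA notation above can be made concrete with a small sketch. It assumes the aligned sequences are already available as equal-length strings over the 20 standard amino acids plus the gap symbol; the names `ALPHABET` and `encode_msa` are illustrative and do not come from the paper, and parsing of actual Pfam files is not shown.

```python
# Hypothetical 21-letter alphabet: the gap first, then the 20 standard amino acids.
ALPHABET = "-ACDEFGHIKLMNPQRSTVWY"
SYMBOL_TO_INDEX = {symbol: index for index, symbol in enumerate(ALPHABET)}


def encode_msa(sequences):
    """Map M aligned sequences of length L to an M x L list of integers in 0..20.

    Symbols outside the alphabet are treated as gaps, which is an assumption
    of this sketch rather than a rule taken from the paper.
    """
    if not sequences:
        raise ValueError("empty alignment")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length L")
    return [[SYMBOL_TO_INDEX.get(symbol, 0) for symbol in seq] for seq in sequences]


# Toy alignment with M = 3 sequences and L = 4 columns.
msa = encode_msa(["AC-D", "ACWD", "GCWE"])
M, L = len(msa), len(msa[0])
```

Here `msa[a][i]` plays the role of $A_{i+1}^{a+1}$, since Python indices start at 0.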

Sequence Statistics and Reweighting

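The main inputs of DCA are reweighted frequency counts for single MSA columns and column pairs.

Reweighting corrects for sampling bias. Each sequence $a$ is assigned the weight $\frac{1}{m^{a}}$, where $m^{a}$ is the number of sequences in the MSA that are identical to sequence $a$ or share more than 80% sequence identity with it (sequence $a$ itself included).

The effective number of sequences $M_{eff}$ is defined as:

$$M_{eff} = \sum_{a = 1}^{M} \frac{1}{m^{a}}$$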
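The reweighting step can be sketched in a few lines. This is a minimal, quadratic-time illustration, assuming sequence identity is measured as the fraction of matching alignment columns (gaps included); the function name `sequence_weights` and the toy sequences are made up for this example.

```python
def sequence_weights(msa, identity_threshold=0.8):
    """Return the weights 1/m^a for every sequence a in the alignment.

    m^a counts the sequences whose identity with sequence a exceeds the
    threshold; sequence a always counts itself (identity 100%).
    """
    length = len(msa[0])
    weights = []
    for seq_a in msa:
        neighbours = 0
        for seq_b in msa:
            matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
            if matches > identity_threshold * length:
                neighbours += 1
        weights.append(1.0 / neighbours)
    return weights


# Three near-identical sequences and one unrelated sequence.
msa = ["ACDEFGHIKL", "ACDEFGHIKL", "ACDEFGHIKV", "WYWYWYWYWY"]
weights = sequence_weights(msa)
m_eff = sum(weights)  # effective number of sequences M_eff
```

In this toy case the three similar sequences share one unit of weight between them, so four sequences count as roughly two effective ones.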

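The frequencies are defined as follows.

The frequency of amino acid $α$ in column $i$ of the MSA: $f_{i}(α) = \frac{1}{M_{eff} + λ q^{2}} \left(λ q + \sum_{a = 1}^{M} \frac{1}{m^{a}} δ_{α, A_{i}^{a}}\right)$
The joint frequency of $α$ in column $i$ and $β$ in column $j$: $f_{ij}(α, β) = \frac{1}{M_{eff} + λ q^{2}} \left(λ + \sum_{a = 1}^{M} \frac{1}{m^{a}} δ_{α, A_{i}^{a}} δ_{β, A_{j}^{a}}\right)$

Here $δ_{A, B}$ is the Kronecker delta, equal to 1 if $A = B$ and 0 otherwise; $q = 21$ is the number of distinct amino acid symbols, counting the gap; and $λ$ is a pseudocount that stabilizes the statistics. The pseudocount $λ$ is added to each of the $q^{2}$ entries of $f_{ij}$, so each entry of $f_{i}$ receives $λ q$; this keeps both frequencies normalized to 1 and makes $f_{i}$ the marginal of $f_{ij}$.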
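The frequency counts can be sketched as follows. The code assumes an integer-encoded MSA and precomputed weights $\frac{1}{m^{a}}$, adds the pseudocount to every pair entry (and hence $λ q$ to every single-site entry, so that both frequencies sum to 1), and uses a 3-letter toy alphabet in place of $q = 21$; the function name `reweighted_frequencies` is hypothetical.

```python
def reweighted_frequencies(msa, weights, i, j, q=21, pseudocount=1.0):
    """Return (f_i, f_j, f_ij) for columns i and j of an integer-encoded MSA.

    msa     : list of sequences, each a list of integers in 0..q-1
    weights : the reweighting factors 1/m^a, one per sequence
    """
    m_eff = sum(weights)
    norm = m_eff + pseudocount * q * q
    # lambda goes into every pair cell, hence lambda * q into every single-site cell.
    f_i = [pseudocount * q] * q
    f_j = [pseudocount * q] * q
    f_ij = [[pseudocount] * q for _ in range(q)]
    for seq, weight in zip(msa, weights):
        f_i[seq[i]] += weight
        f_j[seq[j]] += weight
        f_ij[seq[i]][seq[j]] += weight
    f_i = [value / norm for value in f_i]
    f_j = [value / norm for value in f_j]
    f_ij = [[value / norm for value in row] for row in f_ij]
    return f_i, f_j, f_ij


# Toy alignment: 3 sequences, 2 columns, alphabet size 3 instead of 21.
toy_msa = [[0, 1], [0, 2], [1, 1]]
toy_weights = [1.0, 0.5, 0.5]
f_i, f_j, f_ij = reweighted_frequencies(toy_msa, toy_weights, 0, 1, q=3, pseudocount=0.5)
```

Summing a row of `f_ij` reproduces the matching entry of `f_i`, which is a quick sanity check on any implementation of these counts.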

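At this point the mutual information (MI) can be computed directly from these frequencies, following the definition of information entropy:

$$MI_{ij} = \sum_{α, β} f_{ij}(α, β) \ln \frac{f_{ij}(α, β)}{f_{i}(α) f_{j}(β)}$$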
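A short sketch of the MI computation may help. It takes the single-site and pair frequencies of two columns as plain lists, skips zero entries (which contribute nothing in the limit), and uses two hand-made 2-letter distributions as inputs; the function name `mutual_information` is illustrative.

```python
import math


def mutual_information(f_i, f_j, f_ij):
    """Mutual information (natural log) of two columns from their frequencies."""
    mi = 0.0
    for a, row in enumerate(f_ij):
        for b, joint in enumerate(row):
            if joint > 0.0:  # zero entries contribute nothing to the sum
                mi += joint * math.log(joint / (f_i[a] * f_j[b]))
    return mi


# Two perfectly coupled columns: the letter in one fixes the letter in the other.
mi_coupled = mutual_information([0.5, 0.5], [0.5, 0.5], [[0.5, 0.0], [0.0, 0.5]])

# Two independent columns: the joint frequency is the product of the marginals.
mi_independent = mutual_information(
    [0.3, 0.7], [0.6, 0.4], [[0.3 * 0.6, 0.3 * 0.4], [0.7 * 0.6, 0.7 * 0.4]]
)
```

The coupled pair gives $\ln 2$ and the independent pair gives zero up to rounding. With a positive pseudocount all frequencies are strictly positive, so the zero check only matters for unsmoothed counts.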

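However, the paper argues that MI does not separate direct from indirect correlations well; a better measure, the direct information (DI), is derived in the subsequent steps.

To be updated.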
