MODELLER基础教程-单模板的同源模建

数据挖掘 发表于 2013-9-22 17:21:48

MODELLER基础教程-单模板的同源模建

MODELLER背景：我们只有1个蛋白的序列，没有其结构。寻找最好的pdb作为模板进行模建Step1根据它的序列搜索和它序列相似的有结构的蛋白。首先构建modeller的序列文件TvLDH.ali。
>P1;TvLDHsequence:TvLDH:::::::0.00:0.00MSEAAHVLITGAAGQIGYILSHWIASGELYGDRQVYLHLLDIPPAMNRLTALTMELEDCAFPHLAGFVATTDPKAAFKDIDCAFLVASMPLKPGQVRADLISSNSVIFKNTGEYLSKWAKPSVKVLVIGNPDNTNCEIAMLHAKNLKPENFSSLSMLDQNRAYYEVASKLGVDVKDVHDIIVWGNHGESMVADLTQATFTKEGKTQKVVDVLDHDYVFDTFFKKIGHRAWDILEHRGFTSAASPTKAAIQHMKAWLFGTAPGEVLSMGIPVPEGNPYGIKPGVVFSFPCNVDKEGKIHVVEGFKVNDWLREKLDFTEKDLFHEKEIALNHLAQGG*
然后借助于脚本searchSTR.py来寻找模板
frommodeller import *log.verbose()env =environ()#--Prepare the input files#--Read in the sequence databasesdb =sequence_db(env)sdb.read(seq_database_file='pdb_95.pir',seq_database_format='PIR',       chains_list='ALL',minmax_db_seq_len=(30, 4000), clean_sequences=True)#--Write the sequence database in binary formsdb.write(seq_database_file='pdb_95.bin',seq_database_format='BINARY',       chains_list='ALL')#--Now, read in the binary databasesdb.read(seq_database_file='pdb_95.bin',seq_database_format='BINARY',       chains_list='ALL')#--Read in the target sequence/alignmentaln =alignment(env)aln.append(file='TvLDH.ali',alignment_format='PIR', align_codes='ALL')#--Convert the input sequence/alignment into# profile formatprf =aln.to_profile()#--Scan sequence database to pick up homologous sequencesprf.build(sdb,matrix_offset=-450, rr_file='${LIB}/blosum62.sim.mat',       gap_penalties_1d=(-500, -50),n_prof_iterations=1,       check_profile=False,max_aln_evalue=0.01)#--Write out the profile in text formatprf.write(file='build_profile.prf',profile_format='TEXT')#--Convert the profile back to alignment formataln =prf.to_alignment()#--Write out the alignment filealn.write(file='build_profile.ali', alignment_format='PIR')
输入文件：pdb_95.pir #这是一个结构数据库的序列文件，查询的序列文件TvLDH.ali;利用到的氨基酸相似度矩阵：blosum62.sim.mat。输出文件：build_profile.ali,build_profile.prf整个脚本的思路：读结构序列的数据库文件，然后将其转成机器码；读要查询的序列文件（ali）,将其转成Prf格式。设置查询条件（1氨基酸相似度的矩阵blosum62.sim.mat2）‘’’The profile.build()command has many options. In this example rr_file is set to use the BLOSUM62similarity matrix (file "blosum62.sim.mat" provided in the MODELLERdistribution). Accordingly, the parameters matrix_offset and gap_penalties_1d areset to the appropriate values for the BLOSUM62 matrix. For this example, wewill run only one search iteration by setting the parameter n_prof_iterationsequal to 1. Thus, there is no need for checking the profile for deviation (check_profile set toFalse). Finally, the parameter max_aln_evalueis set to 0.01, indicating that only sequences with e-values smaller than orequal to 0.01 will be included in the final profile.‘’’’注意：ali即为PIR格式，序列比对格式Prf序列搜索格式
直接用pythonsearchSTR.py就可以运行脚本，不推荐使用python2.3库的mod9.12解释器，因为这需要另外安装python2.3的库。
Step2挑选合适的模板首先分析build_profile.prf文件，第11列（倒数第二列）是序列的相似度，越大越好；最后一列是显著性，越小越好。第二列是对应的pdbcode（这时一组pdb的代表）。根据显著性来挑选4-10个pdbcode。这里我们首先挑选了显著性为0的6个pdb出来分析（显著性的指标更加好）。基于显著性进行排序。得到(1bdm:A, 5mdh:A, 1b8p:A,1civ:A, 7mdh:A, and 1smk:A).这6个结果。再从这6个结构当中选择最合适的作为模板（6个结构之间结构相似度序列相似度）。借助于脚本compare.pyfrom modellerimport *
env = environ()aln =alignment(env)for (pdb, chain)in (('1b8p', 'A'), ('1bdm', 'A'), ('1civ', 'A'),                   ('5mdh', 'A'), ('7mdh','A'), ('1smk', 'A')): m = model(env, file=pdb,model_segment=('FIRST:'+chain, 'LAST:'+chain)) aln.append_model(m, atom_files=pdb,align_codes=pdb+chain)aln.malign()aln.malign3d()aln.compare_structures()aln.id_table(matrix_file='family.mat')env.dendrogram(matrix_file='family.mat',cluster_cut=-1.0)

在运行这个脚本之前，需要把相关的蛋白下载到你的工作目录中，我利用pymol中的fetch进行下载。这个脚本不能自动产生log文件，所以需要你主动保存为log文件 pythoncompare.py >compare.log
找相近的，分组找代表的。?? 不知道这里的（compare.log）39 50.5 55.37是什么意思？？不清楚，知道的人请和我站内联系，感觉好像是越小越好。一共聚了3类，第一类有3个第二类有2个，第3类有1；所以第一类是优先的，然后在从分辨率上考虑的化选1bdm。这里只挑选了1个模板，所以是基于单模板的同源模建。
step3 比对在进行基于模板的同源模建之前需要进行序列比对。把模板的序列和要模建的序列进行比对。利用脚本 align2d.py
from modellerimport *
env = environ()aln =alignment(env)mdl = model(env,file='1bdm', model_segment=('FIRST:A','LAST:A'))aln.append_model(mdl,align_codes='1bdmA', atom_files='1bdm.pdb')aln.append(file='TvLDH.ali',align_codes='TvLDH')aln.align2d()aln.write(file='TvLDH-1bdmA.ali',alignment_format='PIR')aln.write(file='TvLDH-1bdmA.pap',alignment_format='PAP')

Step4 同源模建，根据得到的比对，进行模建，产生不同的模型；Generatemodel.py
from modellerimport *frommodeller.automodel import *
env = environ()a = automodel(env,alnfile='TvLDH-1bdmA.ali',          knowns='1bdmA', sequence='TvLDH',          assess_methods=(assess.DOPE,assess.GA341))a.starting_model =1a.ending_model = 5a.make()

这里是产生5个模建的结果。注意这里也不会自动产生log文件，需要你主动生成python generatemodel.py >genemodes.log
Step5 对产生的模型进行评价，挑选最好的模型方法1：根据打分进行挑选DOPE打分越低越好，负的越大越好GA341打分越高越好GA341 is not as good asDOPE at distinguishing 'good' models from 'bad' modelsmolPDF是默认打分函数，越小越好
Yun He wrote:Currently, I amworking on a model based on a very low sequence-identity template (about 13%).I use an alignment to produce 30 models, and then I sort these models by theirmolpdf or GA341 or DOPE. However, the sorts turn out very different results,some models with higher GA341 scores takes higher DOPE scores. I am confused,which one should I choose? Nomodel assessment score is perfect, so you should not expect them to beperfectly correlated.
A'good' model will have a GA341 score near 1.0 (larger is better) but a smallDOPE score (DOPE scores, like molpdf, are "energies", so smaller ornegative is better).
Neither DOPE nor GA341 show a lot ofdiscrimination for poor models - e.g. you can't really say that a model with aGA341 score of 0.1 is 'worse' than one with 0.2 (the score was developed totell between good and bad models, not to rank bad models). Anything worse than0.6 or so is a bad model.

对挑选的模板再进行评价；看看那些区域是合理的，那些区域是不合理的。借助于脚本evaluate_mode.py
from modellerimport *frommodeller.scripts import complete_pdb
log.verbose() # request verbose outputenv = environ()env.libs.topology.read(file='$(LIB)/top_heav.lib')# read topologyenv.libs.parameters.read(file='$(LIB)/par.lib')# read parameters
# read model filemdl =complete_pdb(env, 'TvLDH.B99990002.pdb')
# Assess withDOPE:s =selection(mdl) # all atom selections.assess_dope(output='ENERGY_PROFILENO_REPORT', file='TvLDH.profile',          normalize_profile=True,smoothing_window=15)

这里的能量是经过平滑处理的15个氨基酸作为一个窗口。最后再画氨基酸的能量折线图。仅仅是参考。更高级的评价，后面讨论
如有问题，请与我站内联系！
参考1 http://salilab.org/modeller/tutorial/basic.html2 http://www.urlshare.cn/cgi-bin/qzshare/cgi_qzshare_urlcheck?appid=2&rappid=2&url=https%3A%2F%2Fsites.google.com%2Fsite%2Fiiserbioinformatics%2Fassignment%2Fbio455-worksheet(国外网站要用代理看)

woniuzhixing 发表于 2014-11-20 19:53:07

SEQ_DATABASE_FILE       : pdb.bin
SEQ_DATABASE_FORMAT    : BINARY
CHAINS_LIST             : ALL
CLEAN_SEQUENCES       :          T
MINMAX_DB_SEQ_LEN       :          0    999999
Number of sequences    :       19
Number of residues    :       5090
Length of longest sequence:       695

openf___224_> Open       1MUH.ali
read_al_373E> Protein specified in ALIGN_CODES(i) was not found
         in the alignment file; ALIGN_CODES(    1) =ALL

Traceback (most recent call last):
File "C:\basic-example\shyan\build_profile.py", line 23, in <module>
aln.append(file='1MUH.ali', alignment_format='PIR', align_codes='ALL')
File "C:\Modeller9.14\modlib\modeller\alignment.py", line 79, in append
allow_alternates)
ModellerError: read_al_373E> Protein specified in ALIGN_CODES(i) was not found in the alignment file; ALIGN_CODES(    1) =ALL
>>>
换了一个未知序列做，上面错在哪里？还有那个搜索的库.pir怎么生成的，从哪里获得？

puzhongji 发表于 2014-11-19 15:04:38

39 50.5 55.37，感觉这些数字像是进化树上的那些距离，就是离节点的距离。这个问题之前问过群里的药物-design，他在科学网有modeller的教程。还有就是不明白做compare.py的意义。最后选择模板的标准还是分辨率和序列一致度，那么我们为什么要做这一步呢？

MerryKing 发表于 2017-12-20 09:30:50

你好，想请教一下，关于modeller同源建模时，.ali文件中怎么加入核酸（有缺失）、配体、离子呢？相对应的.py文件该怎么设置呢？谢谢大神~

dejunchem 发表于 2013-9-23 20:33:04

哈哈，必须顶~~~~

laoman 发表于 2014-3-6 21:18:51

要开始学modeller了，这篇帖子应该很有用~

zilvbu 发表于 2014-5-9 15:59:15

写得很详细，学习了

puzhongji 发表于 2014-11-19 14:48:23

我记得step2以前能保存compare.log

puzhongji 发表于 2014-11-19 14:50:16

compare.py>compare.log生成log文件

puzhongji 发表于 2014-11-19 15:05:29

弄错了，原来你就是药物-design

火麒麟90 发表于 2015-5-7 23:42:14

谢谢楼主

页: [1] 2

生物分子模拟论坛's Archiver

MODELLER基础教程-单模板的同源模建