matlab实现吉布斯模体取样器

收录时间:2021-09-13
资源分类:Matlab 工具:MATLAB 7.7 (R2008b)

matlab实现吉布斯模体取样器

Gibbs Motif Sampler

MOTIFINFO = GIBBSMOTIFSAMPLER(SEQARRAY,MOTIFVECTOR,ALPHABET,OPTIONS)

 Searches for the motifs in a set of sequences via Gibbs

 sampling.

 

 seqArray is a cell vector of sequence strings.

 

 motifVector is a vector of uniformly-spaced motif widths you wish to

 optimize. The program iterates over these widths, and returns the

 one that is the "best", according the technique described in Lawrence et

 al, Science, Vol. 262, No. 5131 pp. 208-214, 1993.

 

 alphabet is a string that denotes the alphabet you want to use. Use 'DNA'

 to use 'ATCG', or 'Protein' to use 'ARNDCEQGHILKMFPSTWYV'. You can enter

 one in of your choice as well.

 

 Options is the options structure, which has some nifty features. The

 fields are:

 

    nTrials: The number of trials to perform for each motif alignment. The

    more the better, but the longer it takes.

 

    pcMode: A string, either 'Fraction' or 'Addition'. If you choose

    'Fraction', a fractional amount of each residue from the background

    will be used as the pseudocount. If you choose 'Addition', a static

    amount will be used for the pseudocount.

 

    pcVector: This is a scalar or vector quantity. This designates amount

    to add to each residue for the pseudocount. If you enter a scalar, that

    amount will be used for each residue in the alphabet. If you enter a

    vector, the amounts will be dispersed according to the order the

    characters are entered in your alphabet. For example, if you want to

    use a pseudocount of 1% of the counts of all residue sequences, you

    would choose pcMode as 'Fraction', and pcVector = 0.01. If you wanted

    to add 25 to each count, you would enter pcMode as 'Addition', and

    pcVector = 25.

 

    nPhaseTrials: Increasing this quantity helps reduce the chance of being

    caught in a local maximum. The higher it is, the better the search, but

    the longer it takes.

 

    nPhaseBand: This is how wide of a search range the sampler uses when

    doing phase correction (as described in Lawrence's paper). I suggest a

    value of 1/3 to 1/2 the length of your hypothesized motif.

 

    If you omit the Options structure, the program will use these defaults:

 

      nTrials = 500;

      pcVector = 0.01;

      pcMode = 'Fraction';

      relativeTol = 0.01;

      nTrials = 500;

      nPhaseTrials = 50;

      phaseBand = 25;

 

 The program returns a structure called MotifInfo, whose fields are:

 

    MotifSites: This is the vector of motif positions.

    BestMotifWidth: The best motif width.

    MaxS: The maximum value of S; which is the information per parameter.

    SVector: A vector of S values, corresponding to each motif width.

    ExecutionTimes: Time elapsed for each tested motif width.

 

 Example:

 

  sequenceArray =

 

      'ATCTCTGAGTCGCTATATCGCTCTCG'

      'TCTCTGAGATCTCCGCGCTCTCGAGTGCGATATGC'

      'TATCGCGCTATGCGCTATTATATCTCTG'

      'CGGCGCTATCGGATATAGCGTA'

      'TCTCTCGAGATCGATCGT'

      'ATTCTCTATATATCTTATATTCTTATGAGAATCTAG'

 

  MotifInfo =

 

          MotifSites: [6x1 double]

      BestMotifWidth: 5

               Motif: {6x1 cell}

                MaxS: 1.4267

             SVector: [3x1 double]

      ExecutionTimes: [3x1 double]

 

  MotifInfo.MotifSites = [22 1 1 7 1 32]'

  MotifInfo.BestMotifWidth = 5

  MotifInfo.Motif =

 

      'TCTCG'

      'TCTCT'

      'TATCG'

      'TATCG'

      'TCTCT'

      'TCTAG'

 

 MotifInfo.MaxS = 1.4267

 MotifInfo.SVector =

 

         1.4267

         1.3362

         1.1279

 

 (The execution times only really matter for your own computer.)

 

 There is some room for improvement in this program:

 

 1. The code isn't optimized; the execution time could be improve greatly.

 

 2. It is possible to modify the algorithm to search for multiple motifs -

 I am not just sure how to go about doing that.

 

 3. It is also possible to determine whether a detected motif is

 statistically significant or not - again, I'm just not sure how to do it.

 

 (c) May 6, 2010 Bradley James Ridder, University of South Florida, for Dr.

 Xiaoning Qian's Computational Molecular Biology Course, Spring 2010 term.

文件下载列表
gibbsmotifsampler.zip (5.51KB)  
附件内容(只显示2中的1个)
gibbsmotifsampler.m  
更多

目前尚无评论

用户反馈   关于我们
Copyright (©) ZHIHUISHI.COM 2013 All Rights Reserved.
京ICP备18060134号-3