Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAC realignment algorithm damages hits #153

Open
dvs opened this issue May 31, 2019 · 3 comments
Open

MAC realignment algorithm damages hits #153

dvs opened this issue May 31, 2019 · 3 comments

Comments

@dvs
Copy link
Contributor

dvs commented May 31, 2019

Expected Behavior

MAC realignment algorithm is enabled by default in hhsearch/hhblits/hhalign and
it's expected to improve the quality of hits' alignments.

Current Behavior

In some cases due to MAC realignment step a significant part of the local hit is lost while
the probability/p-value/e-value/scores are reported as if the whole hit were preserved.
Moreover, the lost part of the hit might be then wrongly reported in some alternative
hit slot with unrelated probability/p-value/e-value/scores.

Steps to Reproduce

See the test example in the attached ZIP archive where YP_005352862 protein query
is searched in PF09848.8 template with request for no more than five alternative
hits ("-alt 5" option). There are two HHalign output files for default mode
(MAC is used, *.hhr file) and "-norealign" mode (w/o MAC, *-nr.hhr file).

MAC-report.zip

When MAC algorithm is disabled, i.e. "-norealign" option is used, the strongest hit covers
all the template (1-359) with 259 matching columns and significant probability/score (99.1% and 152.1, respectively).

In the default mode, when MAC is enabled, the strongest hit is attributed with the same
probability/score but it covers only part of the template: (2-163), 148 matching columns.
Some fragment of the lost part of the hit is listed below as an alternative hit, see the hit #4: (261-357), 50 matching columns. However, the probability/score of the hit #4 are wrong, they correspond to the alternative hit #5 of the report where MAC is not used. It appears to be some kind of mix-up, see the discussion below.

Discussion

When MAC algorithm is enabled PosteriorDecoder::realign() method is called for each Viterbi local hit. This procedure performs the following steps for improving the hit's alignment:

. . .
forwardAlgorithm(curr_q_hmm, curr_t_hmm, hit, p_mm, viterbi_matrix, shift, 0);
backwardAlgorithm(curr_q_hmm, curr_t_hmm, hit, p_mm, viterbi_matrix, shift, 0);
macAlgorithm(curr_q_hmm, curr_t_hmm, hit, p_mm, viterbi_matrix, mact, 0);
backtraceMAC(curr_q_hmm, curr_t_hmm, p_mm, viterbi_matrix, 0, hit, corr);
. . .

It's implicitly assumed that the MAC local hit produced by macAlgorithm() and backtraceMAC()
corresponds to the initial Viterbi hit and covers the same query/template region.
However it's not necessarily true, as it follows from the test example, one Viterbi hit could be split into several local MAC hits. If this happens only the coordinates of the strongest MAC hit are reported by macAlgorithm().

Also during the work of MAC algorithm viterbi_matrix is masked in a special way
which disables search in the regions that are in conflict with the initial Viterbi hit (and previously generated MAC hits) or diverge too far from the Viterbi hit path. But the mask allows the extension of the Viterbi hit: see PosteriorDecoder::maskViterbiAlignment() method where upper-left and down-right corners are not masked. This feature allows the algorithm to produce MAC hit which is completely outside of the Viterbi hit (i.e. lying both upstream or downstream in query/template space).
It happened in the test example where for the hit #4 the second part of the strongest
hit (261-357) was reported instead of the weak hit (3-47) that should have been reported.
Note that even though this part was eventually reported it happened accidentally, its probability/scores are wrong, and unless "-smin 0" option were used this hit wouldn't be listed.

Unfortunately, there is no apparent way to fix these issues in the code.
The second issue (MAC hit outside Viterbi hit's scope) can be easily detected in backtraceMAC() but at this point only the strongest hit is known,
because it's the only output from macAlgorithm(). Another issue, MAC hit split, would also need careful consideration of several local MAC hits that might comprise one hit (but also could be in conflict with each other).

@martin-steinegger
Copy link
Member

Thank you for reporting this detailed analysis of an issue. One possible fix would be to adjust the masking code to now allow for the corner extension and limit the band only around the viterbi alignment. Disadvantage is that an alignment can then not be global anymore. The best solution would be a banded mact that starts at some reliable point of the viterbi alignment. However we do not have the resources to develop something like this now.

@dvs
Copy link
Contributor Author

dvs commented May 31, 2019

One possible fix would be to adjust the masking code to now allow for the corner extension and limit the band only around the viterbi alignment.

You mean "to not allow for the corner extension" ? This would make new version incompatible with old results produced when extensions were allowed (could be considered as a regression).
I also thought there is already a band around Viterbi alignment (FWD_BKW_PATHWITDH=40; // cell off path width around viterbi alignment).

The best solution would be a banded mact that starts at some reliable point of the viterbi alignment. However we do not have the resources to develop something like this now.

Perhaps, it's not difficult to detect such cases and fall back to Viterbi alignment with appropriate warning message? I mean that the partial hit loss can be detected by significant decrease of number of matched columns and wrong hit location can be detected even more easily.
I think I can test this approach and make a pull request.

Also, User Guide What does the maximum accuracy alignment algorithm do? section warns about possible discrepancies between scores and alignments of MAC hits, do you agree that it would be useful if it warned about possible issues with the current MAC algorithm as well ?

dvs added a commit to dvs/hh-suite that referenced this issue Jul 3, 2019
This commit introduces two checks in the MAC realignment algorithm
in order to alleviate Issue soedinglab#153 ("MAC realignment algorithm damages hits",
soedinglab#153):

1. If MAC hit is significantly shorter than the Viterbi
hit in terms of number of matched columns then the original Viterbi hit
is restored and an appropriate warning is issued.
This happens when MAC algorithm generates several disjoint local
hits or when MAC hit is shorter due to weak score at the start/end
regions of the Viterbi hit.
The threshold on the MAC hit length is defined by a new
option, -mac_min_length (default=0.9, i.e. at least 90% of number of
matched columns in Viterbi hit must be in the corresponding MAC hit).

There is no obvious way to deal with hit split in the current
MAC algorithm aproach, it is expected that the algorithm should
find the same Viterbi hit albeit improved and extended at its borders.
Usually the MAC hit restores its integrity as soon as the -mact
option is decreased. For example, in the test case described in the
issue report one could find the proper MAC hit if "-mact 0.33" is used
(instead of the default "-mact 0.35" value).
Therefore the warning also suggests decreasing -mact option as a viable
solution if MAC alignment is still needed.

Note: It would be nice to have support for "-mact auto" option
that forced the MAC algorithm to decrease -mact to the max. value
that restores the hit integrity (i.e. match the original Viterbi
hit to the degree defined by -mac_min_length option).

2. MAC realignment algorithm also can find a local hit that is
completely disjoint from the original Viterbi hit.
This possibility follows from the fact that viterbi_matrix
used in MAC DP algorithm is masked (see maskViterbiAlignment())
in a way that allows extension of the hit.
This commit adds a check that the found MAC hit has intersection
with Viterbi hit in both, query and target, coordinate spaces.
If the check fails, the alternative MAC hit that satisfies this
condition is sought (in decreasing score order).
When no valid MAC hit is found, the original Viterbi hit is restored
and an appropriate warning is printed.
This error was also observed in the reported issue, see the hit soedinglab#4
(Score=11.3) that was mapped erroneously to the second part of
main hit, template columns [261; 357].

The alternative MAC hits' positions are collected in macAlgorithm()
procedure and later used in backtraceMAC() procedure.
@dvs dvs mentioned this issue Jul 3, 2019
@dvs
Copy link
Contributor Author

dvs commented Jul 3, 2019

I've submitted a pull request for the changes that address this issue.
Note also that the described MAC hit split disappears when -mact 0.33 is used (instead of the default -mact 0.35). The proposed code improvement in this case restores Viterbi hit and also issues a warning that suggests user to lower -mact option value in order to avoid MAC hit degradation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants