UnTangle
Challenge data set for multi-conformer macromolecular model building
The good news is: knowledge of chemistry, combined with R factors, appears to be a powerful indicator of how near a model is to being untangled. What is really exciting is that the genuine, underlying ensemble cannot be tangled. The true ensemble defines the density; it is not being fit to it. The more untangled a model gets the closer it comes to the true ensemble, with deviations from reasonable chemistry becoming easier and easier to detect. In the end, when all alternative hypotheses have been eliminated, the model must match the truth.<p>
To demonstrate, I have created a series of examples that are progressively more difficult to solve, but the ground truth model and density are the same in all cases. Build the right model, and it will not only explain the data to within experimental error and have the best possible validation stats, but it will also reveal the true, underlying cooperative motion of the protein.<p>
Unless, of course, you can prove me wrong?
<p> <H2>The challenge:</h2> The <a href=refme.mtz>density data</a> for this challenge is simulated, but the problem to be solved is very real. Without a well-known <a href=ground_truth.pdb>ground truth</a> model there is no hope of solving it, so we begin with the simplest, easiest possible situation: 1.0 A data, perfect phases, excellent geometry, and the simplest possible ensemble: an ensemble of two. <p> All relevant files are available below, as a <a href=untangle.tar.gz>tarball</a>, and on <a href=https://github.com/jmholton/UnTangle>GitHub</a>. <p>
<H3>Level 0: Cheating</h3> Refine the model in <tt><a href=best.pdb>best.pdb</a></tt> against the data in <tt><a href=refme.mtz>refme.mtz</a></tt>. You will find they agree fantastically: not just R<sub>free</sub>=3.1%, but all validation metrics are excellent. There are three rare rotamers, but all three are clearly supported by the density. I have combined many geometry validation scores into a unified, weighted energy (wE, <a href=#score>described below</a>), which for this model is wE=18.27. There is nothing wrong with this structure. It matches the underlying ground truth, and it also conveys a cooperative whole-molecule transition between the two underlying states.<p>Why can't we do this with real data? Because our models need to be untangled.<p>
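
For reference, the R<sub>free</sub> values quoted at each level follow the conventional crystallographic R factor, R = Σ| |F<sub>obs</sub>| - |F<sub>calc</sub>| | / Σ|F<sub>obs</sub>|, evaluated over the held-out test-set reflections. What follows is only a minimal Python sketch of that formula, not how any refinement program computes it. The function name `r_factor` and the plain-list inputs are illustrative assumptions, and the amplitudes are assumed to be on a common scale already (real programs also fit scale and bulk-solvent terms, which are omitted here).

```python
def r_factor(f_obs, f_calc):
    """Return sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|) for paired amplitudes.

    Illustrative helper (not taken from any refinement package). Assumes both
    sequences are already on the same scale and list the same reflections
    in the same order.
    """
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must have the same length")
    denominator = sum(abs(fo) for fo in f_obs)
    if denominator == 0:
        raise ValueError("sum of |Fobs| is zero; R factor is undefined")
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in zip(f_obs, f_calc))
    return numerator / denominator


# Made-up amplitudes for three reflections, purely for illustration.
example_obs = [100.0, 50.0, 25.0]
example_calc = [97.0, 52.0, 25.0]
print(f"R = {r_factor(example_obs, example_calc):.4f}")
```

Passing only the test-set reflections gives R<sub>free</sub>; passing the working-set reflections gives R<sub>work</sub>.
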
<H3>Level 1: One thing wrong: Ala39 - bond stretch</h3> Now refine <tt><a href=otw39.pdb>otw39.pdb</a></tt> against the same data as before: <tt><a href=refme.mtz>refme.mtz</a></tt>. This model has exactly the same atoms as <tt><a href=best.pdb>best.pdb</a></tt> with exactly the same connectivity, but was refined from a different starting position. The stats are notably worse: R<sub>free</sub>=3.4%, wE=63.5. Several geometry outliers are striking, such as the N-Cα bond of Ala39, which is 11 sigma too long. It may surprise many that refinement does not fix this: not refmac5, not coot, not shelxl, not phenix.refine, nor simulated annealing or any other standard protocol therein can turn <tt><a href=otw39.pdb>otw39.pdb</a></tt> into <tt><a href=best.pdb>best.pdb</a></tt>, even though the two models are only 0.02 A rms apart. <tt><a href=otw39.pdb>otw39.pdb</a></tt> is trapped in a local minimum. That is the animated picture at the top of this page. The good news is: both the difference map and the geometry outliers are screaming about where the problem is, and you only need to do one thing to fix it: the <a href=#weightsnap>weight snap trick</a>. After that, the model refines to a structure essentially identical to <tt><a href=best.pdb>best.pdb</a></tt>, and has wE=18.2 again. <p>
<ul><li>This level of the UNTANGLE challenge is now <b>solved</b>, because the <a href=#weightsnap>weight snap trick</a> is an efficient, automated solution.</li></ul> <p>
<H3>Level 2: One thing wrong: Val1 - clash</h3> The model <tt><a href=otw1.pdb>otw1.pdb</a></tt> is also trapped in a local minimum. Again, it has the same atoms as <tt><a href=best.pdb>best.pdb</a></tt>, and, again, no refinement program can escape the trap. The higher stats (R<sub>free</sub>=3.4%, wE=24.8) arise mainly from a MolProbity clash between Val1 and a nearby water (S56). The ground truth has no clashes. Unlike <tt><a href=otw39.pdb>otw39.pdb</a></tt>, however, the <a href=#weightsnap>weight snap trick</a> does not work.
After a weight snap, this model ends up back where it started. What is needed here is a <a href=#confswap>conformer swap trick</a>. It is tempting to swap water S56, but that leads to two clashes, and the score rises to wE=27. The thing to do here is swap the conformers of the entire Val1 side chain and re-refine. Then there are no clashes, and the angle and torsion stats within Val1, although not bad before, get markedly better after the swap. The result, as above, is identical to <tt><a href=best.pdb>best.pdb</a></tt>. <p>
<ul><li>This level of the UNTANGLE challenge is <b>semi-solved</b>, because applying the <a href=#confswap>conformer swap trick</a> to each side chain, one at a time, is an expensive yet tractable automated solution.</li></ul>
<H3>Level 3: Lots of things wrong</h3> The model <tt><a href=lotswrong.pdb>lotswrong.pdb</a></tt> is trapped in a local minimum similar to those of the one-thing-wrong levels. Again, it has the same atoms as <tt><a href=best.pdb>best.pdb</a></tt>, and, again, no refinement program can escape the trap. The stats (R<sub>free</sub>=4.4% and wE=104) are much higher because, instead of just one group of atoms to swap, there are 129 atoms in 75 residues (43 protein, 32 water) that need swapping. In its current state, this model has six clashes, 6-sigma bond and angle deviations, bad CB deviations, and many other problems. But simply swap the right conformer letters and re-refine, and you get <tt><a href=best.pdb>best.pdb</a></tt>. The challenge is to figure out which ones to swap. I'm not going to tell you. <p>
<ul><li>This level of the UNTANGLE challenge is <b>not solved</b> until someone other than James Holton solves it.</li></ul>
<H3>Level 4: Anisotropic B factors</h3> The ground truth of this challenge has all isotropic B factors, but two conformers of every atom. In many cases these alternates are close together, and the traditional way of modelling that situation is an anisotropic B factor.
This also has the advantage of reducing the number of bonds, angles and other entities that can go wrong. And, perhaps more importantly, the fused alt conformers are intrinsically untangled. By starting with a fully 2-strand model and fusing nearby atoms, I managed to arrive at <tt><a href=fused.pdb>fused.pdb</a></tt>, with R<sub>free</sub>=6.6% and wE=59.8. Winning the challenge from here means splitting all the 1-conformer atoms and guessing right at their A-B assignments. <p>
<ul><li>This level of the UNTANGLE challenge is <b>not solved</b>.</li></ul>
<H3>Level 5: Start with a manually-built model</h3> Starting with the qFit model below, Tom Peat and I did some back-and-forth manual model building and produced <tt><a href=manual_built.pdb>manual_built.pdb</a></tt>. This model is in a state where I think most people would call it "done", with R<sub>free</sub>=4.56% and wE=93.4. As usual, most of the problems are clashes and non-bond interactions, but there are clear bond, angle and torsion deviations, with a twisted peptide and some bad CB deviations. This situation is much more reflective of real-world models than the above Levels. Some might suspect this is not realistic enough because R<sub>free</sub> is so low. The main reasons for this are that the ground-truth bulk solvent is flat, that with just two conformers the model is fairly simple, and that the data go to 1.0 A resolution. I could have made the ground truth more realistic by <a href=#worsedata>messing up the data</a> in various ways, but the traps would still be there. I think this is already hard enough, so I made the maps pretty.
<ul><li>This level of the UNTANGLE challenge is <b>not solved</b>.</li></ul> <p>
<H3>Level 6: Start with <a href=https://pubmed.ncbi.nlm.nih.gov/33210433/>qFit</a></h3> One of the few programs written especially for multi-conformer model building is qFit. The best model I have obtained with it is here: <tt><a href=qfit_best.pdb>qfit_best.pdb</a></tt>, with R<sub>free</sub>=9.3% and wE=97.3.
Clearly some work remains to be done, and qFit is only intended to be a starting point. So, you may want to start here. <ul><li>This level of the UNTANGLE challenge is <b>not solved</b>.</li></ul> <p> <H3>Level 7: Start with phenix.autobuild model</h3> The output of phenix.autobuild <tt><a href=phenix_aut