Created
October 19, 2020 14:02
my thoughts about diffrn-data-set-extension
Dear All,
I have read the proposal carefully. I'm involved in handling SF-mmCIF
from both sides: preparing files for deposition and adding support for
reflection mmCIF files in programs such as Aimless. So I took the time
to think about the proposal, to check the examples, and to check how
unmerged data is handled in imgCIF, how it is currently stored in the
_diffrn_refln category in the 328 PDB entries that use this category,
and how it is stored in the other formats that we will need to convert
between (MTZ, XDS ASCII). To get a wider perspective, over the last
months I also asked various people questions about data deposition.
Currently, the main blockers for depositing more useful data are
(1) that the software used in OneDep supports only part of the current
    specification (essential bits are missing), and
    (1a) it's not documented what exactly is supported,
    (1b) it can't easily be checked by trial and error (but I'm aware
         that the plan is to move sf_convert to a public repository,
         and then this will be possible),
(2) that the unmerged data description in the current spec is also
    missing important things.
The proposal is a complete overhaul of the SF-mmCIF files. It improves
on (2), but it adds a lot of complexity that will slow down (1) and
also hamper the use of SF-mmCIF by other programs.
Overall, adopting the proposal would delay the deposition of (more
meaningful) unmerged data by months or years.
I appreciate that writing the proposal took a great deal of effort. In
every such project the knowledge gained in the process of writing is
more important than the written text. In my opinion, to make the
deposition of unmerged data widespread in a reasonable time, we should
take the knowledge but drop the proposal, and instead focus on gradual
improvements to the current specification.
The best thing about the proposal is that it adds annotations at the
image level (currently, properties such as the wavelength or phi angle
are linked to individual reflections, which is not ideal). But the
same could be done by adding a tag such as _diffrn_refln.frame_id to
the current spec -- that's a tag from imgCIF.
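The image-level idea can be sketched in a few lines of Python. This is
only an illustration: the dictionaries, the helper name and the numeric
values are made up, and the only thing taken from the text above is
that each reflection carries a frame_id pointing at an image.

```python
# Hypothetical per-image table: properties are stored once per image
# (illustrative values, not taken from any real deposition).
frames = {
    1: {"wavelength": 0.9795, "phi": 0.0},
    2: {"wavelength": 0.9795, "phi": 0.5},
}

# Unmerged reflections refer to an image by frame_id instead of
# repeating the wavelength and phi angle on every row.
reflections = [
    {"hkl": (1, 0, 0), "intensity_net": 120.0, "frame_id": 1},
    {"hkl": (1, 0, 0), "intensity_net": 118.0, "frame_id": 2},
]


def with_frame_properties(refl, frames):
    """Return a copy of the reflection joined with its image's properties."""
    return {**refl, **frames[refl["frame_id"]]}


second = with_frame_properties(reflections[1], frames)
```

A reader that needs the phi angle of a reflection would do this lookup
through the frame identifier, while the reflection table itself stays
backward compatible.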
From what I understand, the main intended benefit of the proposal is
what was called "containerization" of the data. Each block is
explicitly marked as
    type_merged='true'/'false'
and
    type_scaled='true'/'false'
and the correspondence between merged and unmerged data is recorded.
The distinction between merged and unmerged data is already clear,
because each is described by a different category.
The scaled/unscaled clarification is indeed
missing in the current spec. Again, a simpler solution could be used:
document _diffrn_refln.intensity_net as scaled (which is how it is
used in most of the PDB entries) and, if needed, add a new tag such as
_diffrn_refln.unscaled_intensity.
The correspondence between datasets should be more explicit,
but this can also be done in a backward-compatible way.
The most important thing for ensuring data consistency would be
validation (software again) that checks whether the unmerged data
corresponds to the merged data.
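A minimal sketch of such a consistency check, in Python. It rests on
several simplifying assumptions that are not from the proposal: the
unmerged intensities are already scaled, the Miller indices are already
mapped to the asymmetric unit, merging is a plain inverse-variance
weighted mean (real merging programs also reject outliers and adjust
error estimates), and the function names and the 3-sigma tolerance are
hypothetical.

```python
from collections import defaultdict


def merge_unmerged(observations):
    """Inverse-variance weighted mean per hkl.

    observations: iterable of ((h, k, l), intensity, sigma).
    Returns {hkl: (mean_intensity, sigma_of_mean)}.
    """
    sums = defaultdict(lambda: [0.0, 0.0])  # hkl -> [sum(w*I), sum(w)]
    for hkl, intensity, sigma in observations:
        w = 1.0 / (sigma * sigma)
        sums[hkl][0] += w * intensity
        sums[hkl][1] += w
    return {hkl: (wi / w, w ** -0.5) for hkl, (wi, w) in sums.items()}


def find_mismatches(merged, observations, n_sigma=3.0):
    """List hkl values whose deposited merged intensity is missing from,
    or differs by more than n_sigma from, the re-merged unmerged data."""
    recomputed = merge_unmerged(observations)
    bad = []
    for hkl, (i_dep, sig_dep) in merged.items():
        if hkl not in recomputed:
            bad.append(hkl)
        elif abs(recomputed[hkl][0] - i_dep) > n_sigma * sig_dep:
            bad.append(hkl)
    return bad
```

The point of the sketch is only that the check is cheap once both
tables are available; the exact merging protocol to reproduce would
have to be agreed on.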
Another change that the proposal introduces is making tags more
descriptive. The reflection tables _refln and _diffrn_refln are renamed
to _pdbx_diffrn_merged_refln and _pdbx_diffrn_unmerged_refln.
I appreciate informative names, but I don't think that their benefit
outweighs backward compatibility. (An extra bonus of the current
naming: it's similar to what is used for small molecules.)
I'm trying to keep this email short, so I focus only on the good parts
of the proposal. There are also questionable things. Some points have
been raised in the Issues section of the proposal, and at every meeting
someone reminds us that these points are still waiting to be addressed.
But what I'm arguing for is changing the approach. Instead of the
long-discussed waterfall change, make smaller, iterative improvements
to the current specification and adapt the OneDep software at the
same time. First remove the roadblocks. This way we will start getting
unmerged depositions quickly, developers will start to use them for
validation and for method development, and we will be able to make
better-informed decisions about the next changes.
The first iteration could look like this:
0a) formally indicate which categories in the spec are for reflection
    files - by moving them into a separate DDL file,
0b) make a list of categories/tags that are never used in the PDB
    archive and are not supported by the software, ask the WG which
    can be useful, remove the rest.
1) add a single new tag for "centroid of image numbers that recorded
   the Bragg peak" (as XDS docs call it).
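The tag inventory in 0b) could start from something as simple as the
Python sketch below. It is a rough approximation, not a real mmCIF
parser: it assumes that tags start a line (as they normally do in
key-value pairs and loop headers) and may miscount lines inside
multi-line text fields. The function name is hypothetical, and the
list of spec tags would have to come from the dictionary.

```python
import re

# Matches a tag such as "_refln.index_h" at the start of a line.
TAG_RE = re.compile(
    r"^\s*(_[A-Za-z0-9_\[\]]+\.[A-Za-z0-9_\[\]\-/]+)", re.MULTILINE
)


def unused_tags(spec_tags, file_texts):
    """Return the spec tags that appear in none of the given file contents.

    Tag names are compared case-insensitively.
    """
    used = set()
    for text in file_texts:
        used.update(tag.lower() for tag in TAG_RE.findall(text))
    return sorted(t for t in spec_tags if t.lower() not in used)
```

Run over the reflection files of the archive, the result would be the
candidate list to show to the WG.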
Kind regards,
Marcin