-
Notifications
You must be signed in to change notification settings - Fork 9
Refactor MzMLWriter: Simplify implementation and improve CV compliance #100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Introduces a new MzMLWriter class to export mass spectrometry data from MSData_Base objects to standard mzML and indexed mzML file formats. Key features include: - Support for both regular and indexed mzML output - Parallel processing of spectrum data for improved performance - Correct implementation of controlled vocabulary parameters with required value attributes - Binary encoding of m/z and intensity arrays with optional zlib compression - Support for centroid and profile spectra - Complete metadata representation including instrument configuration, file description, and processing information - Proper calculation of byte offsets for efficient random access in indexed files - Automatic generation of SHA-1 file checksums - XML formatting according to the PSI standard specification This writer enables interoperability with common MS analysis tools like FragPipe, MSConvert, and other applications that consume mzML format. Added unit tests to validate the writer's functionality and output file structure compliance.
Major refactoring of the MzMLWriter class to create a cleaner, more maintainable implementation: **Core Changes:** - Remove indexing functionality and multiprocessing for simplified base implementation - Add comprehensive type annotations and detailed docstrings throughout - Create cv_constants.py module with all PSI-MS controlled vocabulary terms - Fix CV term mappings to comply with official PSI-MS ontology **CV Compliance Fixes:** - Correct activation method identifiers (ETD: MS:1000598, ECD: MS:1000250, etc.) - Update software identification to MS:1001456 "analysis software" for public alpharaw - Fix integer formatting for MS level and charge state parameters - Add proper analyzer type mappings and scan list CV terms - Ensure all CV terms reference official PSI-MS controlled vocabulary **Code Quality Improvements:** - Extract all CV constants to dedicated module for maintainability - Add comprehensive type hints for better IDE support and static analysis - Improve error handling and edge case management - Consolidate binary array creation logic - Add detailed parameter documentation **Testing:** - Update comprehensive test suite with CV compliance validation - Add tests for binary precision, compression, and integer formatting - Test edge cases including empty spectra handling - Validate XML structure and required mzML elements **Breaking Changes:** - Remove `indexed` parameter from save_mzml method - Remove multiprocessing options for cleaner interface **Compatibility:** - mzML files work when importing into Skyline and searching with MSFragger - these mzML files can be searched with DIANN after conversion to indexed mzML with MSConvert
…-writter-simple
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors the MzMLWriter implementation to improve maintainability and PSI‑MS controlled vocabulary compliance, while also updating tests and consolidating CV constants.
- Core improvements in the MzMLWriter with enhanced type annotations and clearer structure
- CV compliance fixes and extraction of CV constants to a dedicated module
- Updates to the test suite and integration via the MSData_Base class
Reviewed Changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
tests/unit/test_mzml_writter.py | New unit tests for verifying mzML output and CV compliance |
alpharaw/mzml_io/mzml_writer.py | Refactored mzMLWriter implementation with improved structure |
alpharaw/mzml_io/cv_constants.py | Added dedicated module for PSI‑MS controlled vocabulary |
alpharaw/ms_data_base.py | Integrated save_mzml method using the new MzMLWriter |
Comments suppressed due to low confidence (1)
tests/unit/test_mzml_writter.py:1
- The filename 'test_mzml_writter.py' appears to have a typo. Consider renaming it to 'test_mzml_writer.py' for consistency.
import os
spectrum_list.set("count", str(spectrum_count)) | ||
spectrum_list.set("defaultDataProcessingRef", "alpharaw_processing") | ||
|
||
print(f"Writing {spectrum_count} spectra...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Consider replacing the print statement with a logging framework call to better integrate with production logging practices.
Copilot uses AI. Check for mistakes.
@mlorenz49 Do we need mzml reader more than the writer? I mean, what is the use senario of this writer? |
The use case is when we manipulate/filter raw files, e.g. when we're working with hybridDIA data but want to use common tools from the community for down stream processing. In this case, we can't rely solely on an .hdf file format because most software doesn't support this format. mzML files are a good solution here because they are the community standard which most tools can work with and there is a consens on how these files should look like (https://www.psidev.info/mzml). |
I see, thank you for quick response. |
…ML writer Introduce CVTermProcessor class that dynamically loads controlled vocabulary terms from the bundled psi-ms.owl file using owlready2. This ensures mzML output uses official PSI-MS ontology labels rather than hardcoded strings. Key design features: - CVTermProcessor reads psi-ms.owl file from package directory at runtime - Predefined CV_TERM_MAPPING links semantic concepts to MS accession IDs - Dynamic label extraction from OWL ensures accuracy and currency - Fallback handling for missing terms maintains robustness - Generated constants dictionary provides both accessions and official names - Support for complex mappings (activation methods, analyzer types) This architecture separates semantic mapping (what concepts we need) from label resolution (what the official terms are called), enabling the mzML writer to produce standards-compliant output that automatically stays synchronized with PSI-MS vocabulary updates. Code review improvements: - testing module only uses pytest functionality - Make instance variables private for better encapsulation - Add proper type annotations and resolve circular imports - Change default binary precision to 32-bit to match internal data - Remove unused parameters and improve documentation
|
||
CV_TERM_MAPPING = { | ||
# Spectrum types | ||
"CENTROIDED": "MS:1000127", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as discussed: put all strings that we "own" into string constants,
preferably in a class
class CvKeys:
CENTROIDED = "CENTROIDED"
and use them whereever needed
# ============================================================================ | ||
|
||
|
||
class CVTermProcessor: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as discussed: move out the functionality into a dedicated module
label = result.label[0] | ||
|
||
return { | ||
"id": ms_id.replace("_", ":"), # Convert back to MS:1000127 format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
think about introducing string constants for "id" and "label"
h5py==3.11.0 | ||
numba==0.59.1 | ||
pandas==2.2.2 | ||
numpy==1.26.4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on psi-ms.owl:
move to a resources
directory, add a readme file which indicates the origin and version (e.g. commit hash) of it ..
accession, name = self._cv["ACTIVATION_METHODS"][activation_method] | ||
self._add_cv_param(activation, self._cv["CV_MS"], accession, name, "") | ||
elif "ACCESSION_DISSOCIATION_METHOD" in self._cv: | ||
# Fallback for unknown activation methods |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log here?
if cv_term: | ||
result[var_name] = cv_term | ||
else: | ||
# Fallback for missing terms |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise here?
at least log
Overview
Major refactoring of the MzMLWriter class to create a cleaner, more maintainable implementation with improved PSI-MS controlled vocabulary compliance. Note that this PR is supposed to be a replacement of #98.
Key Changes
🔧 Core Improvements
cv_constants.py
module with all PSI-MS controlled vocabulary terms✅ CV Compliance Fixes
📝 Code Quality
🧪 Testing
Breaking Changes
indexed
parameter fromsave_mzml
methodCompatibility:
with MSConvert