Skip to content

Conversation

mlorenz49
Copy link
Contributor

Overview

Major refactoring of the MzMLWriter class to create a cleaner, more maintainable implementation with improved PSI-MS controlled vocabulary compliance. Note that this PR is supposed to be a replacement of #98.

Key Changes

🔧 Core Improvements

  • Remove indexing functionality and multiprocessing for simplified base implementation
  • Add comprehensive type annotations and detailed docstrings throughout
  • Create cv_constants.py module with all PSI-MS controlled vocabulary terms
  • Consolidate binary array creation logic and improve error handling

CV Compliance Fixes

  • Ensure all CV terms reference official PSI-MS controlled vocabulary

📝 Code Quality

  • Extract all CV constants to dedicated module for maintainability
  • Add comprehensive type hints for better IDE support and static analysis
  • Improve parameter documentation with detailed docstrings
  • Better separation of concerns and cleaner method organization

🧪 Testing

  • Update test suite with CV compliance validation
  • Add tests for binary precision, compression, and integer formatting
  • Test edge cases including empty spectra handling
  • Validate XML structure and required mzML elements

Breaking Changes

  • ⚠️ Remove indexed parameter from save_mzml method
  • ⚠️ Remove multiprocessing options for cleaner interface

Compatibility:

  • mzML files work when importing into Skyline and searching with MSFragger
  • these mzML files can be searched with DIANN after conversion to indexed mzML
    with MSConvert

Introduces a new MzMLWriter class to export mass spectrometry data from MSData_Base objects to standard mzML and indexed mzML file formats. Key features include:

- Support for both regular and indexed mzML output
- Parallel processing of spectrum data for improved performance
- Correct implementation of controlled vocabulary parameters with required value attributes
- Binary encoding of m/z and intensity arrays with optional zlib compression
- Support for centroid and profile spectra
- Complete metadata representation including instrument configuration, file description, and processing information
- Proper calculation of byte offsets for efficient random access in indexed files
- Automatic generation of SHA-1 file checksums
- XML formatting according to the PSI standard specification

This writer enables interoperability with common MS analysis tools like FragPipe, MSConvert, and other applications that consume mzML format.

Added unit tests to validate the writer's functionality and output file structure compliance.
Major refactoring of the MzMLWriter class to create a cleaner, more maintainable implementation:

**Core Changes:**
- Remove indexing functionality and multiprocessing for simplified base implementation
- Add comprehensive type annotations and detailed docstrings throughout
- Create cv_constants.py module with all PSI-MS controlled vocabulary terms
- Fix CV term mappings to comply with official PSI-MS ontology

**CV Compliance Fixes:**
- Correct activation method identifiers (ETD: MS:1000598, ECD: MS:1000250, etc.)
- Update software identification to MS:1001456 "analysis software" for public alpharaw
- Fix integer formatting for MS level and charge state parameters
- Add proper analyzer type mappings and scan list CV terms
- Ensure all CV terms reference official PSI-MS controlled vocabulary

**Code Quality Improvements:**
- Extract all CV constants to dedicated module for maintainability
- Add comprehensive type hints for better IDE support and static analysis
- Improve error handling and edge case management
- Consolidate binary array creation logic
- Add detailed parameter documentation

**Testing:**
- Update comprehensive test suite with CV compliance validation
- Add tests for binary precision, compression, and integer formatting
- Test edge cases including empty spectra handling
- Validate XML structure and required mzML elements

**Breaking Changes:**
- Remove `indexed` parameter from save_mzml method
- Remove multiprocessing options for cleaner interface

**Compatibility:**
- mzML files work when importing into Skyline and searching with MSFragger
- these mzML files can be searched with DIANN after conversion to indexed mzML
with MSConvert
@mlorenz49 mlorenz49 self-assigned this Jun 26, 2025
@mlorenz49 mlorenz49 added the enhancement New feature or request label Jun 26, 2025
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the MzMLWriter implementation to improve maintainability and PSI‑MS controlled vocabulary compliance, while also updating tests and consolidating CV constants.

  • Core improvements in the MzMLWriter with enhanced type annotations and clearer structure
  • CV compliance fixes and extraction of CV constants to a dedicated module
  • Updates to the test suite and integration via the MSData_Base class

Reviewed Changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.

File Description
tests/unit/test_mzml_writter.py New unit tests for verifying mzML output and CV compliance
alpharaw/mzml_io/mzml_writer.py Refactored mzMLWriter implementation with improved structure
alpharaw/mzml_io/cv_constants.py Added dedicated module for PSI‑MS controlled vocabulary
alpharaw/ms_data_base.py Integrated save_mzml method using the new MzMLWriter
Comments suppressed due to low confidence (1)

tests/unit/test_mzml_writter.py:1

  • The filename 'test_mzml_writter.py' appears to have a typo. Consider renaming it to 'test_mzml_writer.py' for consistency.
import os

spectrum_list.set("count", str(spectrum_count))
spectrum_list.set("defaultDataProcessingRef", "alpharaw_processing")

print(f"Writing {spectrum_count} spectra...")
Copy link
Preview

Copilot AI Jun 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider replacing the print statement with a logging framework call to better integrate with production logging practices.

Copilot uses AI. Check for mistakes.

@jalew188
Copy link
Collaborator

@mlorenz49 Do we need mzml reader more than the writer? I mean, what is the use senario of this writer?

@mlorenz49
Copy link
Contributor Author

@mlorenz49 Do we need mzml reader more than the writer? I mean, what is the use senario of this writer?

The use case is when we manipulate/filter raw files, e.g. when we're working with hybridDIA data but want to use common tools from the community for down stream processing. In this case, we can't rely solely on an .hdf file format because most software doesn't support this format. mzML files are a good solution here because they are the community standard which most tools can work with and there is a consens on how these files should look like (https://www.psidev.info/mzml).

@jalew188
Copy link
Collaborator

@mlorenz49 Do we need mzml reader more than the writer? I mean, what is the use senario of this writer?

The use case is when we manipulate/filter raw files, e.g. when we're working with hybridDIA data but want to use common tools from the community for down stream processing. In this case, we can't rely solely on an .hdf file format because most software doesn't support this format. mzML files are a good solution here because they are the community standard which most tools can work with and there is a consens on how these files should look like (https://www.psidev.info/mzml).

I see, thank you for quick response.

…ML writer

Introduce CVTermProcessor class that dynamically loads controlled vocabulary
terms from the bundled psi-ms.owl file using owlready2. This ensures mzML
output uses official PSI-MS ontology labels rather than hardcoded strings.

Key design features:
- CVTermProcessor reads psi-ms.owl file from package directory at runtime
- Predefined CV_TERM_MAPPING links semantic concepts to MS accession IDs
- Dynamic label extraction from OWL ensures accuracy and currency
- Fallback handling for missing terms maintains robustness
- Generated constants dictionary provides both accessions and official names
- Support for complex mappings (activation methods, analyzer types)

This architecture separates semantic mapping (what concepts we need) from
label resolution (what the official terms are called), enabling the mzML
writer to produce standards-compliant output that automatically stays
synchronized with PSI-MS vocabulary updates.

Code review improvements:
- testing module only uses pytest functionality
- Make instance variables private for better encapsulation
- Add proper type annotations and resolve circular imports
- Change default binary precision to 32-bit to match internal data
- Remove unused parameters and improve documentation
@mlorenz49 mlorenz49 requested a review from mschwoer July 9, 2025 07:06

CV_TERM_MAPPING = {
# Spectrum types
"CENTROIDED": "MS:1000127",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed: put all strings that we "own" into string constants,
preferably in a class

class CvKeys:
  CENTROIDED = "CENTROIDED"

and use them whereever needed

# ============================================================================


class CVTermProcessor:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed: move out the functionality into a dedicated module

label = result.label[0]

return {
"id": ms_id.replace("_", ":"), # Convert back to MS:1000127 format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

think about introducing string constants for "id" and "label"

h5py==3.11.0
numba==0.59.1
pandas==2.2.2
numpy==1.26.4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on psi-ms.owl:

move to a resources directory, add a readme file which indicates the origin and version (e.g. commit hash) of it ..

accession, name = self._cv["ACTIVATION_METHODS"][activation_method]
self._add_cv_param(activation, self._cv["CV_MS"], accession, name, "")
elif "ACCESSION_DISSOCIATION_METHOD" in self._cv:
# Fallback for unknown activation methods
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log here?

if cv_term:
result[var_name] = cv_term
else:
# Fallback for missing terms
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raise here?

at least log

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants