Skip to content

Add back REF/ALT to csv output in addition to resolved GT fields #83

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 13, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 12 additions & 1 deletion src/genomicsdb_processor.h
Original file line number Diff line number Diff line change
Expand Up @@ -92,15 +92,26 @@ class ColumnarVariantCallProcessor : public GenomicsDBVariantCallProcessor {
const int64_t* coordinates,
const genomic_interval_t& genomic_interval,
const std::vector<genomic_field_t>& genomic_fields);
void process_str_field(const std::string& field_name, PyObject *calls, int dims, npy_intp *sizes) {
auto found = std::find(m_field_names.begin(), m_field_names.end(), field_name);
if (found != m_field_names.end()) {
PyDict_SetItem(calls, PyUnicode_FromString(field_name.c_str()),
PyArray_SimpleNewFromData(dims, sizes, NPY_OBJECT, m_string_fields[field_name].data()));
}
}
PyObject* construct_data_frame() {
int dims = 1;
npy_intp sizes[1] = { static_cast<npy_intp>(m_sample_names.size()) };
PyObject *calls = PyDict_New();
PyDict_SetItem(calls, PyUnicode_FromString("Sample"), PyArray_SimpleNewFromData(dims, sizes, NPY_OBJECT, m_sample_names.data()));
PyDict_SetItem(calls, PyUnicode_FromString("CHR"), PyArray_SimpleNewFromData(dims, sizes, NPY_OBJECT, m_chrom.data()));
PyDict_SetItem(calls, PyUnicode_FromString("POS"), PyArray_SimpleNewFromData(dims, sizes, NPY_INT64, m_pos.data()));

// Process REF, ALT and GT first.
process_str_field("REF", calls, dims, sizes);
process_str_field("ALT", calls, dims, sizes);
process_str_field("GT", calls, dims, sizes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naive question: can you tell me why we process REF, ALT, GT first instead of in the loop?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They could be in any order based on how GenomicsDB processes them. We could let the loop handle it, but sometimes, the order is ALT,REF,..,GT. I just thought it nicer to have REF before ALT, guess it does not matter for GT as much.. But, I can process them in the loop if that is OK too. Let me know.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, can't decide. My initial concern was that this bit of code gets invoked for each variant so trying to minimize any extraneous code. But that's probably micro-optimization...thoughts?

Copy link
Member Author

@nalinigans nalinigans Mar 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bit of code is not invoked per variant call, it happens only once per query - see. The calls themselves are already populated as m_string_fields in the dictionary backed by std::map<std::string, std::vector<PyObject *>> m_string_fields as part of the code here. That said, I could probably order them at initialization.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - thanks for that clarification, I misread this as being part of process somehow. I think this is good to merge the way it is if you want.

for (auto field_name: m_field_names) {
if (field_name == "REF" || field_name == "ALT" || field_name == "GT") continue;
if (m_string_fields.find(field_name) != m_string_fields.end()) {
PyDict_SetItem(calls, PyUnicode_FromString(field_name.c_str()),
PyArray_SimpleNewFromData(dims, sizes, NPY_OBJECT, m_string_fields[field_name].data()));
Expand Down
2 changes: 1 addition & 1 deletion src/genomicsdb_processor_columnar.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,7 @@ void ColumnarVariantCallProcessor::process(const interval_t& interval) {
for (auto& field_type_pair : *genomic_field_types) {
std::string field_name = field_type_pair.first;
genomic_field_type_t field_type = field_type_pair.second;
if (!field_name.compare("END") || !field_name.compare("REF") || !field_name.compare("ALT")) {
if (!field_name.compare("END")) {
continue;
}
m_field_names.push_back(field_name);
Expand Down