Skip to content

Add FORM-based source provider#379

Draft
barnaliy wants to merge 3 commits intomainfrom
barnali/provider-based-on-form
Draft

Add FORM-based source provider#379
barnaliy wants to merge 3 commits intomainfrom
barnali/provider-based-on-form

Conversation

@barnaliy
Copy link
Contributor

@barnaliy barnaliy commented Mar 3, 2026

Adds form_source.cpp implementing a FORM-based provider that retrieves data products from FORM storage into Phlex.
This is the source-side complement to form_module.cpp.

@barnaliy
Copy link
Contributor Author

barnaliy commented Mar 3, 2026

@phlexbot format

@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2026

No automatic header-guards fixes were necessary.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2026

No automatic cmake-format fixes were necessary.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2026

No automatic markdownlint fixes were necessary.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2026

Automatic clang-format fixes pushed (commit 388095a).
⚠️ Note: Some issues may require manual review and fixing.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2026

No automatic jsonnetfmt fixes were necessary.

@barnaliy barnaliy requested review from gemmeren, knoepfel and pcanal March 3, 2026 16:00
@codecov
Copy link

codecov bot commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 0% with 42 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
form/form_source.cpp 0.00% 42 Missing ⚠️

❌ Your patch check has failed because the patch coverage (0.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

@@            Coverage Diff             @@
##             main     #379      +/-   ##
==========================================
- Coverage   85.36%   84.96%   -0.41%     
==========================================
  Files         122      120       -2     
  Lines        2433     2441       +8     
  Branches      389      390       +1     
==========================================
- Hits         2077     2074       -3     
- Misses        233      249      +16     
+ Partials      123      118       -5     
Flag Coverage Δ
unittests 84.96% <0.00%> (-0.41%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
form/form_source.cpp 0.00% <0.00%> (ø)

... and 8 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0435987...24a97c6. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@barnaliy barnaliy requested a review from aolivier23 March 3, 2026 16:17
@gemmeren gemmeren requested a review from wwuoneway March 3, 2026 16:29
// --- Extract configuration ---
std::string const input_file = config.get<std::string>("input_file");
std::string const creator = config.get<std::string>("creator");
std::string const tech_string = config.get<std::string>("technology", "ROOT_TTREE");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Longer-term, you may want to do some rethinking and approach this in terms of major & minor technology.
To open a e.g. ROOT (major technology) file, FORM doesn't need to know the minor technology (e.g. TTree vs RNT). The minor technology can easily be discovered. That would avoid having the user specify the minor technology of the input and make configurations more portable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point — noted for future improvement. For Prototype 0.1, the minor technology is explicitly specified in config to keep things simple. Auto-discovery of minor technology is a good goal for a future iteration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK for this PR and the first version, but maybe we can create a git issue?

// product_query fields must be compile-time _id literals, not runtime strings.
// To add new products, add new s.provide() blocks below following the same pattern.

s.provide("provide_i",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this be needed for every data product we read from input? If so, we may want to think about a way to shorten or auto-generate these functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already noted in the FIXME comment — it's a known limitation of Prototype 0.1 that will be addressed as the type list grows.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too I wouldn't mind a git issue

if (!pb.data) {
throw std::runtime_error("FORM read returned null data for product: i");
}
return *static_cast<int const*>(pb.data);
Copy link
Contributor

@aolivier23 aolivier23 Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this line leak memory? I think the lambda function returns a copy of pb.data (because of the trailing return type) constructed from a reference to pb.data. But I think the memory pointed to by
pb.data itself is never de-allocated because it's stored in a raw pointer.

I'm working on a way to verify my understanding. I think something like valgrind will give us a strong answer if I can get it to work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be some r-value/move-semantics trickery here that I haven't fully wrapped my mind around. Line 66 looks like an r-value to me at first glance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think the memory pointed to by pb.data itself is never de-allocated because it's stored in a raw pointer.

However it comes from the product_with_name where FORM may (or may not) have put a view (pointer to) the data 'owned' by the TTree or RNTuple (either way, this should be documented in product_with_name and if it is not a view but owns the data then it should deleted).

Related (possibly) future point (in this example this is moot since this 'just' an int), the provider should not be doing a copy of the potentially extremely large data but should instead pass ownership to Phlex. [Whether this is 'always' the case or just in case of non-trivial data is yet another question]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the memory in pb.data is allocated here on each call to read(): https://github.com/Framework-R-D/phlex/blob/main/form/root_storage/root_tbranch_container.cpp#L167

There's a similar version that uses TClass::New() for objects that are not fundamental types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a design question. The memory should be freed within FORM, not by the caller. form_source.cpp calls form_interface::read(), making the API user/caller. My understanding is the allocator should be the deallocator. Since FORM allocates the memory inside read(), FORM should also free it — not the API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the memory in pb.data is allocated here on each call to read()

I see. So indeed the product_with_name "owns" the data and should destruct it or, better yet, give ownership to phlex.

Side note that the code at https://github.com/Framework-R-D/phlex/blob/main/form/root_storage/root_tbranch_container.cpp#L167 is missing a reset of the branch address (as is the TTree is holding a reference to the address of a stack variable and thus a reference to invalid address after the end of the function).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly, instead of FORM freeing the memory itself, it should transfer ownership to Phlex — because Phlex knows the product lifecycle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, you actually need both in case the think is read but never past on to Phlex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up on the discussion regarding memory management: I've created an issue to address this here:
https://github.com/Framework-R-D/phlex/issues/392

@barnaliy
Copy link
Contributor Author

barnaliy commented Mar 3, 2026

@phlexbot format

@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2026

No automatic header-guards fixes were necessary.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2026

No automatic jsonnetfmt fixes were necessary.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2026

No automatic markdownlint fixes were necessary.

Copy link
Contributor

@wwuoneway wwuoneway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a structural concern regarding the separation between form_module and form_source, as both currently communicate directly with Phlex. form_source appears to focus on reading input files, while form_module likely handles the integration with Phlex. Should form_source be a standalone module that communicates directly with Phlex, or would it be more appropriate as an internal component used by form_module?

form::experimental::config::output_item_config input_cfg;
form::experimental::config::tech_setting_config tech_cfg;
input_cfg.addItem("i", input_file, technology);
input_cfg.addItem("j", input_file, technology);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of debating what to hardcode ( here, placeholder i and j, or meaningful names), it would be preferable to make the code automatically detect available items from the input file. This would eliminate hardcoding entirely and make the module truly reusable and scalable. A real input file may contain many possible items, and production code should ideally adapt dynamically instead assuming a fixed set. I would suggest moving such hardcoded i and j to test code, while the production code should be intelligent enough to discover available items automatically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current approach is simple and explicit for Prototype 0.1. Auto-discovery is the right long-term goal, but there are two concrete blockers that go beyond the scope of this PR:

  1. product_query fields require compile-time _id literals and do not accept runtime strings, making it impossible to loop over a dynamically discovered list of products
  2. FORM has no enumeration API to discover available items from an input file

I propose keeping the current approach for Prototype 0.1 and tracking auto-discovery as a follow-up issue once these blockers are resolved.

}
return *static_cast<int const*>(pb.data);
})
.output_product(product_query{.creator = "input"_id, .layer = "event"_id, .suffix = "j"_id});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two provider lambdas for i and j are almost identical, with only the product name string differing. Consider extracting the common logic into a factory function or template to reduce code duplication and improve maintainability. This would also make it easier to add more products in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The duplication is not an oversight.
The lambda body duplication is minimal and intentional for clarity in Prototype 0.1
The output_product() call cannot be generalized due to product_query compile-time _id literal constraint — already noted in FIXME.

@barnaliy
Copy link
Contributor Author

barnaliy commented Mar 5, 2026

I have a structural concern regarding the separation between form_module and form_source, as both currently communicate directly with Phlex. form_source appears to focus on reading input files, while form_module likely handles the integration with Phlex. Should form_source be a standalone module that communicates directly with Phlex, or would it be more appropriate as an internal component used by form_module?

The strong reason to keep them separate is that Phlex itself makes this distinction at the framework level — form_module uses PHLEX_REGISTER_ALGORITHMS while form_source uses PHLEX_REGISTER_PROVIDERS. These are fundamentally different constructs with different roles in the framework. Merging them would fight the framework design.

I prefer to keep form_module and form_source separate. That said, what do you think @pcanal and @knoepfel ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants