feat(indexers): add framework for default hooks#140
feat(indexers): add framework for default hooks#140lukeroantreeONS wants to merge 1 commit intomainfrom
Conversation
…ed as part of the package
frayle-ons
left a comment
There was a problem hiding this comment.
Looks really good! I've made a few comments about a few minor changes that might be good but it would also likely be fine as is.
Only one major issue I found was that the default pre-processing capitalisation hook is accepted as a post-processing hook, provided the colname=doc_text (or any legal colname of the SearchOutput df).
Should this be updated somehow so that the vectorstore rejects default hooks of type A being applied at type B hook locations?
The following shows how I produced this:
default_hook_capitalise = CapitalisationStandardisingHook(method="upper", colname="doc_text")
demo_vectorstore = VectorStore(
file_name="./DEMO/data/fake_soc_dataset.csv",
data_type="csv",
vectoriser=demo_vectoriser,
output_dir="./demo_vdb",
overwrite=True,
hooks={"search_postprocess": default_hook_capitalise},
)
Finally, the demo hooks notebook works perfectly well still. Should it be updated to document these changes?
|
|
||
| def __init__(self, **kwargs): | ||
| """Sets any attributes required by the hook.""" | ||
| self.hook_type: str = "generic" # Placeholder for hook type, can be overridden by subclasses |
There was a problem hiding this comment.
couldn't this just be left as a required parameter rather than a have a default value that's outside the range of values?
| self.kwargs = kwargs | ||
| self._setup() | ||
|
|
||
| def _setup(self): # noqa: B027 |
There was a problem hiding this comment.
I can't see why this is needed particularly. See the example pre-processing hook Capitilisation hook, where all setup code is called in the constructor and then the _setup method is called but is blank. What situations would it provide benefit to have this extra method, and would it be more eloquent to just have the constructor and the call method?
There was a problem hiding this comment.
do we need this file now? we don't have any postprocessing hooks
| """Inititialises the hook with the specified method for standardising capitalisation. | ||
|
|
||
| Args: | ||
| method (str): Method for standardisation. Must be one of "lower", "upper", "sentence" |
There was a problem hiding this comment.
from running the demo, its slightly difficult to understand what each of these do - sentence and title in particular. Better documentation? but if they are all str class methods then maybe documenting each one here is not necessary
| class HookBase(ABC): | ||
| """Abstract base class for all post-processing hooks requiring customisation.""" | ||
|
|
||
| def __init__(self, **kwargs): |
There was a problem hiding this comment.
I like the flexibility of this
| class CapitalisationStandardisingHook(HookBase): | ||
| """A pre-processing hook to handle upper-/lower-/sentence-/title-casing.""" | ||
|
|
||
| def __init__(self, method: str = "lower", colname: str = "query"): |
There was a problem hiding this comment.
could this be changed to accept a list of colnames so that it could work on metadata columns as well?
📌 Add framework for flexible, common hooks to be provided as part of the package
✨ Summary
Introduces a
hookssubmodule withinindexersto provide robust, flexible hooks to the user to cover common pre- and post-processing tasks, and provide a base class should an advanced user want to implement custom ones.Note: this does not remove the ability of a user to define a hook function themselves, without relying on the base class, as is currently the recommended approach.
📜 Changes Introduced
hook_factorydefines base classes and shared utilitiesdefault_hookscontains particular flexible pre- or post-processing hooks we offer as part of the package✅ Checklist
terraform fmt&terraform validate)🔍 How to Test
uv builddist/classifai<version>.whlwith any optional dependencies you want to test withtest.pyfile as described below:uv run test.py; confirm it runs successfully, and thequery_textcolumn in the output is all capitalised.methodparameter within theCapitalisationStandardisingHookobject initialisation