Skip to content

feat(indexers): add framework for default hooks#140

Open
lukeroantreeONS wants to merge 1 commit intomainfrom
139-default-hooks
Open

feat(indexers): add framework for default hooks#140
lukeroantreeONS wants to merge 1 commit intomainfrom
139-default-hooks

Conversation

@lukeroantreeONS
Copy link
Collaborator

📌 Add framework for flexible, common hooks to be provided as part of the package

✨ Summary

Introduces a hooks submodule within indexers to provide robust, flexible hooks to the user to cover common pre- and post-processing tasks, and provide a base class should an advanced user want to implement custom ones.

Note: this does not remove the ability of a user to define a hook function themselves, without relying on the base class, as is currently the recommended approach.

📜 Changes Introduced

  • feat(indexers): hook_factory defines base classes and shared utilities
  • feat(indexers): default_hooks contains particular flexible pre- or post-processing hooks we offer as part of the package

✅ Checklist

Please confirm you've completed these checks before requesting a review.

  • Code passes linting with Ruff
  • Security checks pass using Bandit
  • API and Unit tests are written and pass using pytest
  • Terraform files (if applicable) follow best practices and have been validated (terraform fmt & terraform validate)
  • DocStrings follow Google-style and are added as per Pylint recommendations
  • Documentation has been updated if needed

🔍 How to Test

  1. re-build the package with uv build
  2. re-install classifai from dist/classifai<version>.whl with any optional dependencies you want to test with
  3. create a test.py file as described below:
from classifai.indexers import VectorStore
from classifai.indexers.dataclasses import VectorStoreSearchInput
from classifai.indexers.hooks import CapitalisationStandardisingHook
from classifai.vectorisers import HuggingFaceVectoriser

demo_vectoriser = HuggingFaceVectoriser(model_name="sentence-transformers/all-MiniLM-L6-v2")

default_hook_capitalise = CapitalisationStandardisingHook(method="upper")

demo_vectorstore = VectorStore(
    file_name="./data/fake_soc_dataset.csv",
    data_type="csv",
    vectoriser=demo_vectoriser,
    output_dir="./demo_vdb",
    overwrite=True,
    hooks={"search_preprocess": default_hook_capitalise},
)

query_df = VectorStoreSearchInput({"id": [1, 2], "query": ["apple merchant", "pub landlord"]})

results = demo_vectorstore.search(query_df, n_results=5)

print(results)
  1. run uv run test.py; confirm it runs successfully, and the query_text column in the output is all capitalised.
  2. repeat, changing the method parameter within the CapitalisationStandardisingHook object initialisation

@lukeroantreeONS lukeroantreeONS requested a review from a team as a code owner March 5, 2026 15:54
@lukeroantreeONS lukeroantreeONS linked an issue Mar 5, 2026 that may be closed by this pull request
@github-actions github-actions bot added the enhancement New feature or request label Mar 5, 2026
Copy link
Contributor

@frayle-ons frayle-ons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good! I've made a few comments about a few minor changes that might be good but it would also likely be fine as is.

Only one major issue I found was that the default pre-processing capitalisation hook is accepted as a post-processing hook, provided the colname=doc_text (or any legal colname of the SearchOutput df).

Should this be updated somehow so that the vectorstore rejects default hooks of type A being applied at type B hook locations?

The following shows how I produced this:

default_hook_capitalise = CapitalisationStandardisingHook(method="upper", colname="doc_text")

demo_vectorstore = VectorStore(
    file_name="./DEMO/data/fake_soc_dataset.csv",
    data_type="csv",
    vectoriser=demo_vectoriser,
    output_dir="./demo_vdb",
    overwrite=True,
    hooks={"search_postprocess": default_hook_capitalise},
)

Finally, the demo hooks notebook works perfectly well still. Should it be updated to document these changes?


def __init__(self, **kwargs):
"""Sets any attributes required by the hook."""
self.hook_type: str = "generic" # Placeholder for hook type, can be overridden by subclasses
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couldn't this just be left as a required parameter rather than a have a default value that's outside the range of values?

self.kwargs = kwargs
self._setup()

def _setup(self): # noqa: B027
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see why this is needed particularly. See the example pre-processing hook Capitilisation hook, where all setup code is called in the constructor and then the _setup method is called but is blank. What situations would it provide benefit to have this extra method, and would it be more eloquent to just have the constructor and the call method?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this file now? we don't have any postprocessing hooks

"""Inititialises the hook with the specified method for standardising capitalisation.

Args:
method (str): Method for standardisation. Must be one of "lower", "upper", "sentence"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from running the demo, its slightly difficult to understand what each of these do - sentence and title in particular. Better documentation? but if they are all str class methods then maybe documenting each one here is not necessary

class HookBase(ABC):
"""Abstract base class for all post-processing hooks requiring customisation."""

def __init__(self, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the flexibility of this

class CapitalisationStandardisingHook(HookBase):
"""A pre-processing hook to handle upper-/lower-/sentence-/title-casing."""

def __init__(self, method: str = "lower", colname: str = "query"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this be changed to accept a list of colnames so that it could work on metadata columns as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Default Hooks

2 participants