diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index a1af8ed..5ffe4ba 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -48,7 +48,7 @@ repos: name: deptry (uv) language: system pass_filenames: false # deptry expects a project path, not filenames - entry: uv run deptry . + entry: uv run deptry --per-rule-ignores "DEP003=plum,DEP004=quartodoc|numpydoc" . - id: forbid-new-init name: Check if __init__.py is added to the src folder diff --git a/DEMO/README.md b/DEMO/README.md index e14a41e..4533ca7 100644 --- a/DEMO/README.md +++ b/DEMO/README.md @@ -1,33 +1,14 @@ -# Demo for `classifai` +# Overview of Demonstrations & Examples This directory contains a set of Jupyter notebooks designed to help you understand and use `classifai` effectively. -## Prerequisites - -You may wish to download each notebook individually and the demo dataset individually - each notebook contains specific installation instructions on how to set up an environemnt and download the package - -## Running the Demo - -To start the demo, launch Jupyter Notebook or JupyterLab from your terminal in this directory: - -```bash -jupyter notebook -``` - -Or, if you prefer JupyterLab: - -```bash -jupyter lab -``` - -Then, open the notebooks in your browser. -We recommend going through the the `general_workflow_demo.ipynb` notebook for a broad overview of the package before moving onto the `custom_vectoriser.ipynb` notebook, which covers a more advanced use-case. +--- ## Notebooks Overview -This demo includes two Jupyter notebooks: +This demo series includes several Jupyter notebooks: -### 1. `general_workflow_demo.ipynb` +### 1. ✨ ClassifAI Demo - Introduction & Basic Usage ✨ : `general_workflow_demo.ipynb` This introduces the core features of `classifai`. @@ -43,7 +24,7 @@ It covers: This notebook is intended for prospective users to get a quick overview of what the package can do, and as a 'jumping off point' for new projects. -### 2. `custom_vectoriser.ipynb` +### 2. 
Creating Your Own Vectoriser : `custom_vectoriser.ipynb` This notebook demonstrates how to create a new, custom Vectoriser by extending the base `VectoriserBase` class. @@ -55,7 +36,7 @@ It covers: This notebook is for users who want to implement a vectorisation approach not covered by our existing suite of Vectorisers. -### 3. `custom_preprocessing_and_postprocessing_hooks.ipynb` +### 3. VectorStore pre- and post-processing logic with _Hooks_ 🪝 : `custom_preprocessing_and_postprocessing_hooks.ipynb` This notebook demonstrates how to add custom Python code logic to the VectorStore search pipeline, such as performing spell checking on user input, without breaking the data flow of the ClassifAI VectorStore. @@ -70,3 +51,86 @@ It covers: * Examples of different kinds of hooks that can be written - [spellchecking, deduplicating results, adding extra info to results based on result ids] --- +
+## Installation of classifai
+
+#### *0)* [optional] Create and activate a virtual environment from the command line
+
+##### Using pip + venv
+
+Create a virtual environment:
+
+`python -m venv .venv`
+
+##### Using UV
+
+Create a virtual environment:
+
+`uv venv`
+
+##### Activating your environment
+
+Activate it (macOS / Linux):
+
+`source .venv/bin/activate`
+
+Activate it (Windows):
+
+`source .venv/Scripts/activate`
+
+#### *1)* Install the classifai package
+
+##### Using pip
+
+`pip install "https://github.com/datasciencecampus/classifai/releases/download/v0.2.1/classifai-0.2.1-py3-none-any.whl"`
+
+##### Using uv
+
+One-off installation:
+
+`uv pip install "https://github.com/datasciencecampus/classifai/releases/download/v0.2.1/classifai-0.2.1-py3-none-any.whl"`
+
+Add as a project dependency:
+
+`uv add "https://github.com/datasciencecampus/classifai/releases/download/v0.2.1/classifai-0.2.1-py3-none-any.whl"`
+
+
+#### *2)* Install optional dependencies
+
+##### Using pip
+
+`pip install "classifai[<extras>]"`
+
+where `<extras>` is one or more of `huggingface`, `gcp`, `ollama`, or `all` to install all of them. 
+##### Using uv
+
+One-off installation:
+
+`uv pip install "classifai[<extras>]"`
+
+Add as a project dependency:
+
+`uv add "classifai[<extras>]"`
+
+---
+
+## Prerequisites
+
+You may wish to download each notebook and the demo dataset individually - each notebook contains specific installation instructions on how to set up an environment and download the package.
+
+## Running the Demo
+
+To start the demo, launch Jupyter Notebook or JupyterLab from your terminal in this directory:
+
+```bash
+jupyter notebook
+```
+
+Or, if you prefer JupyterLab:
+
+```bash
+jupyter lab
+```
+
+Then, open the notebooks in your browser.
+We recommend going through the `general_workflow_demo.ipynb` notebook for a broad overview of the package before moving on to the `custom_vectoriser.ipynb` notebook, which covers a more advanced use-case. \ No newline at end of file diff --git a/DEMO/custom_preprocessing_and_postprocessing_hooks.ipynb b/DEMO/custom_preprocessing_and_postprocessing_hooks.ipynb index 6b7e200..5e8ae2e 100644 --- a/DEMO/custom_preprocessing_and_postprocessing_hooks.ipynb +++ b/DEMO/custom_preprocessing_and_postprocessing_hooks.ipynb @@ -71,11 +71,12 @@ "The use of these dataclasses both helps the user of the package to understand what data needs to be provided to the Vectorstore and how a user should interact with the objects being returned by these VectorStore functions. Additionally, this ensures robustness of the package by checking that the correct columns are present in the data before operating on it. \n", "\n", "The reverse_search() and embed() VectorStore functions have their own input and output data classes with their own validity column data checks. 
The names of each set are intuitively:\n", - "| **VectorStore Method** | **Input Dataclass** | **Output Dataclass** |\n", + "\n", + "| **VectorStore Method** | **Input Dataclass** | **Output Dataclass** |\n", "|-------------------------------|-----------------------------|-----------------------------|\n", - "| `VectorStore.search()` | `VectorStoreSearchInput` | `VectorStoreSearchOutput` |\n", + "| `VectorStore.search()` | `VectorStoreSearchInput` | `VectorStoreSearchOutput` |\n", "| `VectorStore.reverse_search()` | `VectorStoreReverseSearchInput` | `VectorStoreReverseSearchOutput` |\n", - "| `VectorStore.embed()` | `VectorStoreEmbedInput` | `VectorStoreEmbedOutput` |\n", + "| `VectorStore.embed()` | `VectorStoreEmbedInput` | `VectorStoreEmbedOutput` |\n", "\n", "Users of the package can use the schema of each of these input and output dataclasses to understand how to interface with these main methods of the VectorStore class.\n", "\n" @@ -145,92 +146,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Installation (pre-release)\n", - "\n", - "`Classifai` is currently in **pre-release** and is **not yet published on PyPI**. 
\n", - "This section describes how to install the packaged **wheel** from the project’s public GitHub Releases so that you can follow through this DEMO and try the code yourself.\n", - "\n", - "#### 1) Create and activate a virtual environment in command line\n", - "\n", - "##### Using `pip` + `venv`\n", - "Create a virtual environment:\n", - "\n", - "```bash\n", - "python -m venv .venv\n", - "```\n", - "\n", - "##### Using `UV`\n", - "Create a virtual environment:\n", - "\n", - "```bash\n", - "uv venv\n", - "```\n", - "\n", - "Activate the created environment with \n", - "\n", - "(macOS / Linux):\n", - "```bash\n", - "source .venv/bin/activate\n", - "```\n", - "Activate it (Windows):\n", - "```bash\n", - "source .venv/Scripts/activate\n", - "```\n", - "\n", - "---\n", - "\n", - "#### 2) Install the pre-release wheel\n", - "\n", - "##### Using `pip`\n", - "```bash\n", - "pip install \"https://github.com/datasciencecampus/classifai/releases/download/v0.2.1/classifai-0.2.1-py3-none-any.whl\"\n", - "```\n", - "\n", - "##### Using `uv`\n", - "```bash\n", - "uv pip install \"https://github.com/datasciencecampus/classifai/releases/download/v0.2.1/classifai-0.2.1-py3-none-any.whl\"\n", - "```\n", - "\n", "---\n", "\n", - "#### 3) Install optional dependencies (`[huggingface]`)\n", - "\n", - "Finally, for this demo we will be using the Huggingface Library to download embedding models - we therefore need an optional dependency of the Classifai Pacakge:\n", - "\n", - "##### Using `pip`\n", - "```bash\n", - "pip install \"classifai[huggingface]\"\n", - "```\n", - "\n", - "##### Using `uv pip`\n", - "```bash\n", - "uv pip install \"classifai[huggingface]\"\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Assuming the step one virtual environemnt is set up and actiavted and ready in the terminal, run the following commands to install the classifai package and the huggingface dependencies.\n", - "## 
PIP\n", - "#!pip install \"https://github.com/datasciencecampus/classifai/releases/download/v0.2.1/classifai-0.2.1-py3-none-any.whl\"\n", - "#!pip install \"classifai[huggingface]\"\n", - "\n", - "## UV\n", - "#!uv pip install \"https://github.com/datasciencecampus/classifai/releases/download/v0.2.1/classifai-0.2.1-py3-none-any.whl\"\n", - "#!uv pip install \"classifai[huggingface]\"\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "##### Note! :\n", - "\n", - "You may need to install the ipykernel python package to run Notebook cells with your Python environment" + "#### If you can run the following cell in this notebook, you should be good to go!" ] }, { @@ -239,29 +157,20 @@ "metadata": {}, "outputs": [], "source": [ - "#!pip install ipykernel\n", + "from classifai.vectorisers import HuggingFaceVectoriser\n", "\n", - "#!uv pip install ipykernel" + "print(\"done!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "---\n", - "\n", - "#### If you can run the following cell in this notebook, you should be good to go!" 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from classifai.vectorisers import HuggingFaceVectoriser\n", + "#### Alternatively, to test without running a notebook, run the following from your command line; \n", "\n", - "print(\"done!\")" + "```shell\n", + "python -c \"import classifai\"\n", + "```" ] }, { @@ -761,7 +670,7 @@ ], "metadata": { "kernelspec": { - "display_name": "classifai", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -775,9 +684,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.13.7" + "version": "3.12.10" } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 4 } diff --git a/DEMO/custom_vectoriser.ipynb b/DEMO/custom_vectoriser.ipynb index eeb94f4..41846d9 100644 --- a/DEMO/custom_vectoriser.ipynb +++ b/DEMO/custom_vectoriser.ipynb @@ -36,113 +36,6 @@ "* The custom One-Hot Encoding Vectoriser being used with the Indexer module to create and search a VectorStore" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Installation (pre-release)\n", - "\n", - "`Classifai` is currently in **pre-release** and is **not yet published on PyPI**. 
\n", - "This section describes how to install the packaged **wheel** from the project’s public GitHub Releases so that you can follow through this DEMO and try the code yourself.\n", - "\n", - "### 1) Create and activate a virtual environment in command line\n", - "\n", - "#### Using `pip` + `venv`\n", - "Create a virtual environment:\n", - "\n", - "```bash\n", - "python -m venv .venv\n", - "```\n", - "\n", - "#### Using `UV`\n", - "Create a virtual environment:\n", - "\n", - "```bash\n", - "uv venv\n", - "```\n", - "\n", - "Activate the created environment with \n", - "\n", - "(macOS / Linux):\n", - "```bash\n", - "source .venv/bin/activate\n", - "```\n", - "Activate it (Windows):\n", - "```bash\n", - "source .venv/Scripts/activate\n", - "```\n", - "\n", - "---\n", - "\n", - "### 2) Install the pre-release wheel\n", - "\n", - "#### Using `pip`\n", - "```bash\n", - "pip install \"https://github.com/datasciencecampus/classifai/releases/download/v0.2.1/classifai-0.2.1-py3-none-any.whl\"\n", - "```\n", - "\n", - "#### Using `uv`\n", - "```bash\n", - "uv pip install \"https://github.com/datasciencecampus/classifai/releases/download/v0.2.1/classifai-0.2.1-py3-none-any.whl\"\n", - "```\n", - "\n", - "---" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Assuming the step one virtual environemnt is set up and actiavted and ready in the terminal, run the following commands to install the classifai package\n", - "## PIP\n", - "#!pip install \"https://github.com/datasciencecampus/classifai/releases/download/v0.2.1/classifai-0.2.1-py3-none-any.whl\"\n", - "\n", - "## UV\n", - "#!uv pip install \"https://github.com/datasciencecampus/classifai/releases/download/v0.2.1/classifai-0.2.1-py3-none-any.whl\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Note! 
:\n", - "\n", - "You may need to install the ipykernel python package to run Notebook cells with your Python environment" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#!pip install ipykernel\n", - "\n", - "#!uv pip install ipykernel" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n", - "\n", - "### If you can run the following cell in this notebook, you should be good to go!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from classifai.vectorisers import VectoriserBase\n", - "\n", - "print(\"done!\")" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -211,6 +104,9 @@ "import numpy as np\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", + "# importing the base Vectoriser class, which we will build upon\n", + "from classifai.vectorisers import VectoriserBase\n", + "\n", "\n", "class OneHotVectoriser(VectoriserBase):\n", " def __init__(self, vocabulary: list[str]):\n", @@ -434,7 +330,7 @@ ], "metadata": { "kernelspec": { - "display_name": "classifai", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -448,7 +344,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.13.7" + "version": "3.12.10" } }, "nbformat": 4, diff --git a/DEMO/general_workflow_demo.ipynb b/DEMO/general_workflow_demo.ipynb index 580dc15..a31eca0 100644 --- a/DEMO/general_workflow_demo.ipynb +++ b/DEMO/general_workflow_demo.ipynb @@ -8,7 +8,7 @@ } }, "source": [ - "# ✨ ClassifAI Demo ✨\n", + "# ✨ ClassifAI Demo - Introduction & Basic Usage ✨\n", "\n", "---\n", "\n", @@ -23,130 +23,6 @@ "#### ClassifAI provides three key modules to address these, letting you build Rest-API search systems from your text data" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Installation (pre-release)\n", - "\n", 
- "`Classifai` is currently in **pre-release** and is **not yet published on PyPI**. \n", - "This section describes how to install the packaged **wheel** from the project’s public GitHub Releases so that you can follow through this DEMO and try the code yourself.\n", - "\n", - "### 1) Create and activate a virtual environment in command line\n", - "\n", - "#### Using `pip` + `venv`\n", - "Create a virtual environment:\n", - "\n", - "```bash\n", - "python -m venv .venv\n", - "```\n", - "\n", - "#### Using `UV`\n", - "Create a virtual environment:\n", - "\n", - "```bash\n", - "uv venv\n", - "```\n", - "\n", - "Activate the created environment with \n", - "\n", - "(macOS / Linux):\n", - "```bash\n", - "source .venv/bin/activate\n", - "```\n", - "Activate it (Windows):\n", - "```bash\n", - "source .venv/Scripts/activate\n", - "```\n", - "\n", - "---\n", - "\n", - "### 2) Install the pre-release wheel\n", - "\n", - "#### Using `pip`\n", - "```bash\n", - "pip install \"https://github.com/datasciencecampus/classifai/releases/download/v0.2.1/classifai-0.2.1-py3-none-any.whl\"\n", - "```\n", - "\n", - "#### Using `uv`\n", - "```bash\n", - "uv pip install \"https://github.com/datasciencecampus/classifai/releases/download/v0.2.1/classifai-0.2.1-py3-none-any.whl\"\n", - "```\n", - "\n", - "---\n", - "\n", - "### 3) Install optional dependencies (`[huggingface]`)\n", - "\n", - "Finally, for this demo we will be using the Huggingface Library to download embedding models - we therefore need an optional dependency of the Classifai Pacakge:\n", - "\n", - "#### Using `pip`\n", - "```bash\n", - "pip install \"classifai[huggingface]\"\n", - "```\n", - "\n", - "#### Using `uv pip`\n", - "```bash\n", - "uv pip install \"classifai[huggingface]\"\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Assuming the step one virtual environemnt is set up and actiavted and ready in the terminal, run the following 
commands to install the classifai package and the huggingface dependencies.\n", - "## PIP\n", - "#!pip install \"https://github.com/datasciencecampus/classifai/releases/download/v0.2.1/classifai-0.2.1-py3-none-any.whl\"\n", - "#!pip install \"classifai[huggingface]\"\n", - "\n", - "## UV\n", - "#!uv pip install \"https://github.com/datasciencecampus/classifai/releases/download/v0.2.1/classifai-0.2.1-py3-none-any.whl\"\n", - "#!uv pip install \"classifai[huggingface]\"\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Note! :\n", - "\n", - "You may need to install the ipykernel python package to run Notebook cells with your Python environment\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#!pip install ipykernel\n", - "\n", - "#!uv pip install ipykernel" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n", - "\n", - "### If you can run the following cell in this notebook, you should be good to go!" 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from classifai.vectorisers import HuggingFaceVectoriser\n", - "\n", - "print(\"done!\")" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -179,7 +55,7 @@ "from classifai.vectorisers import VectoriserBase\n", "```\n", "\n", - "There is another DEMO notebook, called `custom_vectoriser.ipynb` which provides a walk through of extending the base class to make a custom TF-IDF Vectoriser model.\n", + "There is another DEMO notebook, called `custom_vectoriser.ipynb` which provides a walk through of extending the base class to make a custom One-Hot Encoder Vectoriser model.\n", "\n", "---" ] @@ -199,6 +75,8 @@ "metadata": {}, "outputs": [], "source": [ + "from classifai.vectorisers import HuggingFaceVectoriser\n", + "\n", "# Our embedding model is pulled down from HuggingFace, or used straight away if previously downloaded\n", "# This also works with many different huggingface models!\n", "vectoriser = HuggingFaceVectoriser(model_name=\"sentence-transformers/all-MiniLM-L6-v2\")\n", diff --git a/README.md b/README.md index d174b9b..e5c0d87 100644 --- a/README.md +++ b/README.md @@ -4,7 +4,7 @@ # ClassifAI -ClassifAI is a Python package that simplifies semantic search and Retrieval Augmented Generation (RAG) pipelines for classification tasks in the production of official statistics. It is designed to help data professionals build applications and pipelines to label new text samples to official statistical classifications, by leveraging (augmented) semantic search over a knowledgebase of previously coded examples. +ClassifAI is a free, open-source (MIT Licence) Python package that simplifies semantic search and Retrieval Augmented Generation (RAG) pipelines for classification tasks in the production of official statistics. 
It is designed to help data professionals build applications and pipelines to label new text samples to official statistical classifications, by leveraging (augmented) semantic search over a knowledgebase of previously coded examples. The Office for National Statistics often needs to classify free-text survey responses or other data to standard statistical classifications. The most well-known examples include the Standard Industrial Classification ([SIC](https://www.gov.uk/government/publications/standard-industrial-classification-of-economic-activities-sic)), the Standard Occupational Classification ([SOC](https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc)), and the Classification Of Individual COnsumption according to Purpose ([COICOP](https://en.wikipedia.org/wiki/Classification_of_Individual_Consumption_According_to_Purpose)), as well as international equivalents such as [ISCO](https://esco.ec.europa.eu/en/about-esco/escopedia/escopedia/international-standard-classification-occupations-isco) and [ISIC](https://en.wikipedia.org/wiki/International_Standard_Industrial_Classification). The ClassifAI package has been designed specifically to help us build applications, pipelines and analyses for this kind of task. @@ -82,11 +82,11 @@ pip install "classifai[] @ https://github.com/datasciencecam ``` ##### Astral UV -one-off add to environment: +One-off add to environment: ```bash uv pip install "classifai[] @ https://github.com/datasciencecampus/classifai/releases/download/v/classifai--py3-none-any.whl" ``` -persist as an environment requirement: +Persist as an environment requirement: ```bash uv add "classifai[] @ https://github.com/datasciencecampus/classifai/releases/download/v/classifai--py3-none-any.whl" ``` @@ -140,12 +140,23 @@ print(results) #### Step 4: Deploy as a REST API -You can use ClassifAI as a local package, or deploy it as an API server, using FastAPI. 
+In addition to using ClassifAI as a local package, you can use it to create or extend a FastAPI REST API.
+You can create a new FastAPI application which you can modify as required, connect it to an existing FastAPI
+application, or deploy it immediately as a REST API service using `uvicorn`. ```python -from classifai.servers import start_api +from classifai.servers import get_server, get_router, run_server -start_api(vector_stores=[vector_store], endpoint_names=["Occupations"], port=8000) +... +
+# Create a new FastAPI application which you can modify as required;
+app = get_server(vector_stores=[vector_store], endpoint_names=["Occupations"])
+
+# Create and expose the API routing to be attached to an existing FastAPI service;
+routing = get_router(vector_stores=[vector_store], endpoint_names=["Occupations"])
+
+# Directly spin up a new locally-hosted FastAPI server with `uvicorn`;
+run_server(vector_stores=[vector_store], endpoint_names=["Occupations"], port=8000) ``` #### Learn more diff --git a/_quarto.yml b/_quarto.yml index 478522e..fa57385 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -18,6 +18,7 @@ project: website: title: "ClassifAI" + favicon: ons_favicon.ico page-navigation: true navbar: background: light diff --git a/ons_favicon.ico b/ons_favicon.ico new file mode 100644 index 0000000..95b6135 Binary files /dev/null and b/ons_favicon.ico differ diff --git a/src/classifai/indexers/__init__.py b/src/classifai/indexers/__init__.py index 4bc8680..c7f6ac0 100644 --- a/src/classifai/indexers/__init__.py +++ b/src/classifai/indexers/__init__.py @@ -1,30 +1,32 @@ # pylint: disable=C0301 -"""This module provides functionality for creating a vector index from a CSV (text) +"""This module provides functionality for creating a `VectorStore` from a CSV (text) file. It defines the `VectorStore` class, which is used to model and create vector databases -from CSV text files using a vectoriser object. +from CSV text files using a `Vectoriser` object. 
-This class requires a Vectoriser object from the vectorisers submodule, +This class requires a `Vectoriser` object from the vectorisers submodule, to convert the CSV's text data into vector embeddings which are then stored in the -VectorStore objects. +`VectorStore` objects. Key Features: -- Batch processing of input files to handle large datasets. -- Support for CSV file format (additional formats may be added in future updates). -- Integration with a custom embedder for generating vector embeddings. -- Logging for tracking progress and handling errors during processing. + + - Batch processing of input files to handle large datasets. + - Support for CSV file format (additional formats may be added in future updates). + - Integration with a custom embedder for generating vector embeddings. + - Logging for tracking progress and handling errors during processing. VectorStore Class: -- The `VectorStore` class is initialized with a vectoriser object and a CSV knowledgebase. -- Additional columns in the CSV may be specified as metadata to be included in the vector database. -- Upon creation, the VectorStore is saved in parquet format for efficient, and quick - reloading via the VectorStore's `.from_filespace()` method. -- A new piece of text data (or label) can be queried against the VectorStore in the following ways: + + - The `VectorStore` class is initialized with a `Vectoriser` object and a CSV knowledgebase. + - Additional columns in the CSV may be specified as metadata to be included in the vector database. + - Upon creation, the `VectorStore` is saved in parquet format for efficient, and quick + reloading via the `VectorStore`'s `.from_filespace()` method. + - A new piece of text data (or label) can be queried against the `VectorStore` in the following ways: - `.search()`: to find the most semantically similar pieces of text in the vector database. - `.reverse_search()`: to find all examples in the knowledgebase that have a given label. 
- `.embed()`: to generate a vector embedding for a given piece of text data. -- 'Hook' methods may be specified to perform pre-processing on input data before embedding, - and post-processing on the output of the search methods. + - 'Hook' methods may be specified to perform pre-processing on input data before embedding, + and post-processing on the output of the search methods. """ from .dataclasses import ( diff --git a/src/classifai/indexers/dataclasses.py b/src/classifai/indexers/dataclasses.py index d3f450b..b57cdff 100644 --- a/src/classifai/indexers/dataclasses.py +++ b/src/classifai/indexers/dataclasses.py @@ -1,3 +1,7 @@ +"""This module defines dataclasses for structuring and validating input and output data for +`VectorStore` search, reverse_search and embedding operations in the ClassifAI framework. +""" + import numpy as np import pandas as pd import pandera.pandas as pa @@ -130,8 +134,7 @@ def score(self) -> pd.Series: class VectorStoreReverseSearchInput(pd.DataFrame): - """DataFrame-like object for forming and validating reverse search query - input data. + """DataFrame-like object for forming and validating reverse search query input data. This class validates and represents input for reverse searches, which find similar documents to a given document in the vector store. @@ -286,8 +289,7 @@ def text(self) -> pd.Series: class VectorStoreEmbedOutput(pd.DataFrame): - """DataFrame-like object for storing and validating embedded vectors and associated - metadata. + """DataFrame-like object for storing and validating embedded vectors and associated metadata. This class represents the output of embedding operations, containing the original text data along with their computed vector embeddings. 
diff --git a/src/classifai/indexers/main.py b/src/classifai/indexers/main.py index 76e2b91..14d64ee 100644 --- a/src/classifai/indexers/main.py +++ b/src/classifai/indexers/main.py @@ -1,29 +1,32 @@ # pylint: disable=C0301 -"""This module provides functionality for creating a vector index from a text file. +"""This module provides functionality for creating a `VectorStore` from a CSV (text) +file. It defines the `VectorStore` class, which is used to model and create vector databases -from CSV text files using a vectoriser object. +from CSV text files using a `Vectoriser` object. -This class interacts with the Vectoriser class from the vectorisers submodule, -expecting that any vector model used to generate embeddings used in the -VectorStore objects is an instance of one of these classes, most notably -that each vectoriser object should have a transform method. +This class requires a `Vectoriser` object from the vectorisers submodule, +to convert the CSV's text data into vector embeddings which are then stored in the +VectorStore objects. Key Features: -- Batch processing of input files to handle large datasets. -- Support for CSV file format (additional formats may be added in future updates). -- Integration with a custom embedder for generating vector embeddings. -- Logging for tracking progress and handling errors during processing. - -Dependencies: -- polars: For handling data in tabular format and saving it as a Parquet file. -- tqdm: For displaying progress bars during batch processing. -- numpy: for vector cosine similarity calculations -- A custom file iterator (`iter_csv`) for reading input files in batches. - -Usage: -This module is intended to be used with the Vectoriers mdodule and the -the servers module from ClassifAI, to created scalable, modular, searchable -vector databases from your own text data. + + - Batch processing of input files to handle large datasets. + - Support for CSV file format (additional formats may be added in future updates). 
+ - Integration with a custom embedder for generating vector embeddings. + - Logging for tracking progress and handling errors during processing. + +VectorStore Class: + + - The `VectorStore` class is initialized with a `Vectoriser` object and a CSV knowledgebase. + - Additional columns in the CSV may be specified as metadata to be included in the vector database. + - Upon creation, the `VectorStore` is saved in parquet format for efficient, and quick + reloading via the `VectorStore`'s `.from_filespace()` method. + - A new piece of text data (or label) can be queried against the `VectorStore` in the following ways: + - `.search()`: to find the most semantically similar pieces of text in the vector database. + - `.reverse_search()`: to find all examples in the knowledgebase that have a given label. + - `.embed()`: to generate a vector embedding for a given piece of text data. + - 'Hook' methods may be specified to perform pre-processing on input data before embedding, + and post-processing on the output of the search methods. """ import json @@ -64,56 +67,57 @@ class VectorStore: - """A class to model and create 'VectorStore' objects for building and searching vector databases from CSV text files. + """A class to model and create `VectorStore` objects for building and searching vector databases from CSV text files. 
Attributes: - file_name (str): the original file with the knowledgebase to build the vector store - data_type (str): the data type of the original file (curently only csv supported) - vectoriser (object): A Vectoriser object from the corresponding ClassifAI Pacakge module + file_name (str): the data file containing the knowledgebase to build the `VectorStore` + data_type (str): the data type of the data file (currently only csv supported) + vectoriser (VectoriserBase): A `Vectoriser` object from the corresponding ClassifAI Package module batch_size (int): the batch size to pass to the vectoriser when embedding meta_data (dict): key-value pairs of metadata to extract from the input file and their corresponding types - output_dir (str): the path to the output directory where the VectorStore will be saved - vectors (np.array): a numpy array of vectors for the vector DB + output_dir (str): the path to the output directory where the `VectorStore` will be saved + vectors (np.array): a numpy array of vectors for the vector database vector_shape (int): the dimension of the vectors - num_vectors (int): how many vectors are in the vector store - vectoriser_class (str): the type of vectoriser used to create embeddings + num_vectors (int): the number of records saved in the `VectorStore` + vectoriser_class (str): the type of `Vectoriser` used to create embeddings hooks (dict): A dictionary of user-defined hooks for preprocessing and postprocessing. 
""" def __init__( # noqa: C901, PLR0912, PLR0913, PLR0915 self, - file_name, - data_type, - vectoriser, - batch_size=8, - meta_data=None, - output_dir=None, - overwrite=False, - hooks=None, + file_name: str, + data_type: str, + vectoriser: VectoriserBase, + batch_size: int = 8, + meta_data: dict | None = None, + output_dir: str | None = None, + overwrite: bool = False, + hooks: dict | None = None, ): - """Initializes the VectorStore object by processing the input CSV file and generating + """Initializes the `VectorStore` object by processing the input CSV file and generating vector embeddings. Args: file_name (str): The name of the input CSV file. data_type (str): The type of input data (currently supports only "csv"). - vectoriser (object): The vectoriser object used to transform text into - vector embeddings. + vectoriser (object): The `Vectoriser` object used to transform text into + vector embeddings. batch_size (int): [optional] The batch size for processing the input file and batching to vectoriser. Defaults to 8. meta_data (dict): [optional] key,value pair metadata column names to extract from the input file and their types. - Defaults to None. - output_dir (str): [optional] The directory where the vector store will be saved. - Defaults to None, where input file name will be used. - overwrite (bool): [optional] If True, allows overwriting existing folders with the same name. Defaults to false to prevent accidental overwrites. - hooks (dict): [optional] A dictionary of user-defined hooks for preprocessing and postprocessing. Defaults to None. + Defaults to `None`. + output_dir (str): [optional] The directory where the `VectorStore` will be saved. + Defaults to `None`, where input file name will be used. + overwrite (bool): [optional] If `True`, allows overwriting existing folders with the same name. + Defaults to `False` to prevent accidental overwrites. + hooks (dict): [optional] A dictionary of user-defined hooks for preprocessing and postprocessing. 
Defaults to `None`. Raises: - ClassifaiError: For any unexpected errors during initialization, with context for debugging. - DataValidationError: If input arguments are invalid or if there are issues with the input file. - ConfigurationError: If there are configuration issues, such as output directory problems. - IndexBuildError: If there are failures during index building or saving outputs. + `ClassifaiError`: For any unexpected errors during initialization, with context for debugging. + `DataValidationError`: If input arguments are invalid or if there are issues with the input file. + `ConfigurationError`: If there are configuration issues, such as output directory problems. + `IndexBuildError`: If there are failures during index building or saving outputs. """ # ---- Input validation (caller mistakes) -> DataValidationError / ConfigurationError if not isinstance(file_name, str) or not file_name.strip(): @@ -218,14 +222,14 @@ def __init__( # noqa: C901, PLR0912, PLR0913, PLR0915 ) from e def _save_metadata(self, path: str): - """Saves metadata about the vector store to a JSON file. + """Saves metadata about the `VectorStore` to a JSON file. Args: path (str): The file path where the metadata JSON file will be saved. Raises: - DataValidationError: If the path argument is invalid. - IndexBuildError: If there are failures during serialization or file writing. + `DataValidationError`: If the path argument is invalid. + `IndexBuildError`: If there are failures during serialization or file writing. """ if not isinstance(path, str) or not path.strip(): raise DataValidationError("path must be a non-empty string.", context={"path": path}) @@ -259,15 +263,15 @@ def _save_metadata(self, path: str): def _create_vector_store_index(self): # noqa: C901 """Processes text strings in batches, generates vector embeddings, and creates the - vector store. + `VectorStore`. Called from the constructor once other metadata has been set.
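The `hooks` argument documented above registers user callables that run before embedding and after search. A stdlib-only sketch of how such a hook pipeline can work — the stage keys (`"preprocess"`, `"postprocess"`) and `run_hooks` helper here are hypothetical, not the classifai hook API:

```python
def run_hooks(hooks: dict, stage: str, data):
    """Apply each registered hook for `stage` to `data`, in order."""
    for hook in hooks.get(stage, []):
        data = hook(data)
    return data

hooks = {
    "preprocess": [str.strip, str.lower],            # e.g. normalise user input
    "postprocess": [lambda results: results[:5]],    # e.g. truncate result lists
}
print(run_hooks(hooks, "preprocess", "  Retail Trade  "))  # "retail trade"
```

The chaining is the important property: each hook receives the previous hook's output, so the data flow into and out of the store is preserved as long as every hook returns data of the expected shape.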
Iterates over data in batches, stores batch data and generated embeddings. Creates a Polars DataFrame with the captured data and embeddings, and saves it as a Parquet file in the output_dir attribute, and stores in the vectors attribute. Raises: - DataValidationError: If there are issues reading or validating the input file. - IndexBuildError: If there are failures during embedding or building the vectors table. + `DataValidationError`: If there are issues reading or validating the input file. + `IndexBuildError`: If there are failures during embedding or building the vectors table. """ # ---- Reading source data (validation/format issues) -> DataValidationError / IndexBuildError try: @@ -353,18 +357,20 @@ def _create_vector_store_index(self): # noqa: C901 ) from e def embed(self, query: VectorStoreEmbedInput) -> VectorStoreEmbedOutput: - """Converts text into vector embeddings using the vectoriser and returns a VectorStoreEmbedOutput dataframe with columns 'id', 'text', and 'embedding'. + """Converts text (provided via a `VectorStoreEmbedInput` object) into vector embeddings using the `Vectoriser` and + returns a `VectorStoreEmbedOutput` dataframe with columns `id`, `text`, and `embedding`. Args: - query (VectorStoreEmbedInput): The VectorStoreEmbedInput object containing the strings to be embedded and their ids. + query (VectorStoreEmbedInput): The `VectorStoreEmbedInput` object containing the strings to be embedded and their ids. Returns: - VectorStoreEmbedOutput: The output object containing the embeddings along with their corresponding ids and texts. + (VectorStoreEmbedOutput): The `VectorStoreEmbedOutput` object containing the embeddings along with their corresponding + ids and texts. Raises: - DataValidationError: Raised if invalid arguments are passed. - HookError: Raised if user-defined hooks fail. - ClassifaiError: Raised if embedding operation fails. + `DataValidationError`: Raised if invalid arguments are passed. 
+ `HookError`: Raised if user-defined hooks fail. + `ClassifaiError`: Raised if embedding operation fails. """ # ---- Validate arguments (caller mistakes) -> DataValidationError if not isinstance(query, VectorStoreEmbedInput): @@ -428,23 +434,26 @@ def embed(self, query: VectorStoreEmbedInput) -> VectorStoreEmbedOutput: def reverse_search( # noqa: C901 self, query: VectorStoreReverseSearchInput, max_n_results: int = 100, partial_match: bool = False ) -> VectorStoreReverseSearchOutput: - """Reverse searches the vector store using a VectorStoreReverseSearchInput object - and returns matched results in VectorStoreReverseSearchOutput object. + """Reverse searches the `VectorStore` using a `VectorStoreReverseSearchInput` object + and returns matched results in a `VectorStoreReverseSearchOutput` object. If using partial matching, matches if document label starts with query label. Args: - query (VectorStoreReverseSearchInput): A VectorStoreReverseSearchInput object containing the text query or list of queries to search for with ids. - max_n_results (int): [optional] Number of top results to return for each query, set to -1 to return all results. Default 100. - partial_match (bool): [optional] Set the search behaviour to use `join_where` to match query checks that document id `startsWith` query. Default False + query (VectorStoreReverseSearchInput): A `VectorStoreReverseSearchInput` object containing the text query or + list of queries to search for with ids. + max_n_results (int): [optional] Number of top results to return for each query, set to -1 to return all results. + Defaults to 100. + partial_match (bool): [optional] If `True`, the search behaviour is set to return results where the `document_id` + is prefixed by the query. Defaults to `False`.
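The `partial_match` behaviour described above — a document matches when its label starts with the query label — reduces to simple prefix matching. A hypothetical stand-in (the real implementation uses a polars `join_where`; the hierarchical-code labels below are invented for illustration):

```python
def reverse_match(query_label: str, doc_labels: list[str], partial: bool = False):
    """Exact-match labels by default; prefix-match when `partial` is True."""
    if partial:
        return [d for d in doc_labels if d.startswith(query_label)]
    return [d for d in doc_labels if d == query_label]

labels = ["47.11", "47.19", "56.10", "47.11.1"]
print(reverse_match("47.1", labels, partial=True))  # all labels under the 47.1 prefix
print(reverse_match("56.10", labels))               # exact match only
```

Prefix matching is useful for hierarchical label schemes, where querying a parent code should surface every document filed under its children.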
Returns: - result_df (VectorStoreReverseSearchOutput): A VectorStoreReverseSearchOutput object containing reverse search results with columns for query ID, query text, - document ID, document text and any associated metadata columns. + (VectorStoreReverseSearchOutput): A `VectorStoreReverseSearchOutput` object containing reverse search + results with columns for `query_id`, `query_text`, `document_id`, `document_text` and any associated metadata columns. Raises: - DataValidationError: Raised if invalid arguments are passed. - HookError: Raised if user-defined hooks fail. - ClassifaiError: Raised if reverse search operation fails. + `DataValidationError`: Raised if invalid arguments are passed. + `HookError`: Raised if user-defined hooks fail. + `ClassifaiError`: Raised if reverse search operation fails. """ # ---- Validate arguments (caller mistakes) -> DataValidationError if not isinstance(query, VectorStoreReverseSearchInput): @@ -537,24 +546,24 @@ def reverse_search( # noqa: C901 return result_df def search(self, query: VectorStoreSearchInput, n_results=10, batch_size=8) -> VectorStoreSearchOutput: # noqa: C901, PLR0912, PLR0915 - """Searches the vector store using queries from a VectorStoreSearchInput object and returns - ranked results in VectorStoreSearchOutput object. In batches, converts users text queries into vector embeddings, + """Searches the `VectorStore` using queries from a `VectorStoreSearchInput` object and returns + ranked results in a `VectorStoreSearchOutput` object. In batches, converts users' text queries into vector embeddings, + computes cosine similarity with stored document vectors, and retrieves the top results. Args: - query (VectorStoreSearchInput): A VectoreStoreSearchInput object containing the text query or list of queries to search for with ids. + query (VectorStoreSearchInput): A `VectorStoreSearchInput` object containing the text query or list of queries to search for with ids.
n_results (int): [optional] Number of top results to return for each query. Default 10. batch_size (int): [optional] The batch size for processing queries. Default 8. Returns: - result_df (VectorStoreSearchOutput): A VectorStoreSearchOutput object containing search results with columns for query ID, query text, - document ID, document text, rank, score, and any associated metadata columns. + (VectorStoreSearchOutput): A `VectorStoreSearchOutput` object containing search results with columns for `query_id`, `query_text`, + `document_id`, `document_text`, `rank`, `score`, and any associated metadata columns. Raises: - DataValidationError: Raised if invalid arguments are passed. - ConfigurationError: Raised if the vector store is not initialized. - HookError: Raised if user-defined hooks fail. - VectorisationError: Raised if embedding queries fails. + `DataValidationError`: Raised if invalid arguments are passed. + `ConfigurationError`: Raised if the vector store is not initialized. + `HookError`: Raised if user-defined hooks fail. + `VectorisationError`: Raised if embedding queries fails. """ # ---- Validate arguments (caller mistakes) -> DataValidationError if not isinstance(query, VectorStoreSearchInput): @@ -718,16 +727,16 @@ def from_filespace(cls, folder_path, vectoriser, hooks: dict | None = None): # Args: folder_path (str): The folder path containing the metadata and Parquet files. - vectoriser (object): The vectoriser object used to transform text into vector embeddings. + vectoriser (object): The `Vectoriser` object used to transform text into vector embeddings. hooks (dict): [optional] A dictionary of user-defined hooks for preprocessing and postprocessing. Defaults to None. Returns: (VectorStore): An instance of the `VectorStore` class. Raises: - DataValidationError: If input arguments are invalid or if there are issues with the metadata or Parquet files. - ConfigurationError: If there are configuration issues, such as vectoriser mismatches. 
- IndexBuildError: If there are failures during loading or parsing the files. + `DataValidationError`: If input arguments are invalid or if there are issues with the metadata or Parquet files. + `ConfigurationError`: If there are configuration issues, such as `Vectoriser` mismatches. + `IndexBuildError`: If there are failures during loading or parsing the files. """ # ---- Validate arguments (caller mistakes) -> DataValidationError / ConfigurationError if not isinstance(folder_path, str) or not folder_path.strip(): diff --git a/src/classifai/servers/__init__.py b/src/classifai/servers/__init__.py index 4ca8968..4441178 100644 --- a/src/classifai/servers/__init__.py +++ b/src/classifai/servers/__init__.py @@ -1,10 +1,16 @@ """This module provides functionality for creating or extending a REST-API service -which allows a user to call the search methods of one or more VectorStore objects, +which allows a user to call the search methods of one or more `VectorStore` objects, from an API endpoint. -These functions interact with the ClassifAI Indexer module's VectorStore objects, +These functions interact with the ClassifAI Indexer module's `VectorStore` objects, such that their `embed`, `search` and `reverse_search` methods are exposed on -REST-API endpoints, via a FastAPI service. +REST-API endpoints, via a FastAPI app. + +Full API documentation for the FastAPI endpoints and Pydantic models +can be found in the autogenerated Swagger docs at `/docs`. +To try this without providing your own data, run the initial demo notebook +(`/DEMO/general_workflow_demo.ipynb`) to build a test `VectorStore`, +then serve it with `/DEMO/general_workflow_serve.py`.
""" from .main import get_router, get_server, make_endpoints, run_server diff --git a/src/classifai/servers/main.py b/src/classifai/servers/main.py index 181a0eb..d31f660 100644 --- a/src/classifai/servers/main.py +++ b/src/classifai/servers/main.py @@ -37,18 +37,18 @@ def get_router(vector_stores: list[VectorStore], endpoint_names: list[str]) -> APIRouter: - """Create and return a FastAPI router with search endpoints. + """Create and return a `FastAPI.APIRouter` with search endpoints. Args: - vector_stores (list[VectorStore]): A list of vector store objects, each responsible for handling embedding and search operations for a specific endpoint. + vector_stores (list[VectorStore]): A list of `VectorStore` objects, each responsible for handling embedding and search operations for a specific endpoint. endpoint_names (list[str]): A list of endpoint names corresponding to the vector stores. Returns: - APIRouter: Router with intialized search endpoints + (APIRouter): Router with intialized search endpoints Raises: - DataValidationError: If the input parameters are invalid. - ConfigurationError: If a vector store is missing required methods. + `DataValidationError`: Raised if the input parameters are invalid. + `ConfigurationError`: Raised if one or more of the `vector_stores` are invalid. """ # ---- Validate startup args -> DataValidationError / ConfigurationError @@ -96,7 +96,7 @@ def docs(): """Redirect users to the API documentation page. Returns: - RedirectResponse: A response object that redirects the user to the `/docs` page. + (RedirectResponse): A response object that redirects the user to the `/docs` page. """ start_page = RedirectResponse(url="/docs") return start_page @@ -105,14 +105,14 @@ def docs(): def get_server(vector_stores: list[VectorStore], endpoint_names: list[str]) -> FastAPI: - """Create and return a FastAPI server with search endpoints. + """Create and return a `FastAPI` server with search endpoints. 
Args: - vector_stores (list[VectorStore]): A list of vector store objects, each responsible for handling embedding and search operations for a specific endpoint. - endpoint_names (list[str]): A list of endpoint names corresponding to the vector stores. + vector_stores (list[VectorStore]): A list of `VectorStore` objects, each responsible for handling embedding and search operations for a specific endpoint. + endpoint_names (list[str]): A list of endpoint names corresponding to the `VectorStore`s to be exposed. Returns: - FastAPI: Server with intialized search endpoints + (FastAPI): Server with initialized search endpoints """ logging.info("Generating ClassifAI API") @@ -123,12 +123,15 @@ def get_server(vector_stores: list[VectorStore], endpoint_names: list[str]) -> F def run_server(vector_stores: list[VectorStore], endpoint_names: list[str], port: int = 8000): - """Create and run a FastAPI server with search endpoints. + """Create and run a `FastAPI` server with search endpoints. Args: - vector_stores (list[VectorStore]): A list of vector store objects, each responsible for handling embedding and search operations for a specific endpoint. - endpoint_names (list[str]): A list of endpoint names corresponding to the vector stores. + vector_stores (list[VectorStore]): A list of `VectorStore` objects, each responsible for handling embedding and search operations for a specific endpoint. + endpoint_names (list[str]): A list of endpoint names corresponding to the `VectorStore`s to be exposed. port (int): [optional] The port on which the API server will run. Defaults to 8000. + + Raises: + `DataValidationError`: Raised if the input parameters are invalid, e.g. `port` value is out of bounds.
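The endpoint wiring described in these docstrings — one set of routes per `VectorStore`, keyed by endpoint name — can be shown with a stdlib-only stand-in. The real package registers POST routes on a FastAPI `APIRouter`; this sketch replaces that with a plain dict, and the `demo_store` callables are invented for illustration:

```python
def make_endpoint_map(stores: dict) -> dict:
    """Map route paths to the callables of each named store (FastAPI stand-in)."""
    routes = {}
    for name, store in stores.items():
        routes[f"/{name}/search"] = store["search"]
        routes[f"/{name}/embed"] = store["embed"]
    return routes

# A toy "store" exposing the two operations a real VectorStore would provide.
demo_store = {
    "search": lambda q: [("doc-1", 0.93)],
    "embed": lambda q: [0.1, 0.2],
}
routes = make_endpoint_map({"sic": demo_store})
print(sorted(routes))
print(routes["/sic/search"]("retail"))
```

The same shape explains why `get_router` takes parallel `vector_stores` and `endpoint_names` lists: each name becomes a URL prefix under which that store's `embed`, `search` and `reverse_search` methods are exposed.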
""" logging.info("Starting ClassifAI API") @@ -158,10 +161,10 @@ def make_endpoints(router: APIRouter | FastAPI, vector_stores_dict: dict[str, Ve def _create_embedding_endpoint(router: APIRouter | FastAPI, endpoint_name: str, vector_store: VectorStore): - """Create and register an embedding endpoint for a specific vector store. + """Create and register an embedding endpoint for a specific `VectorStore`. Args: - router (APIRouter | FastAPI): The FastAPI application instance. + router (APIRouter | FastAPI): The `FastAPI` application instance. endpoint_name (str): The name of the endpoint to be created. vector_store: The vector store object responsible for generating embeddings. @@ -191,12 +194,12 @@ async def embedding_endpoint(data: ClassifaiData) -> EmbeddingsResponseBody: def _create_search_endpoint(router: APIRouter | FastAPI, endpoint_name: str, vector_store: VectorStore): - """Create and register a search endpoint for a specific vector store. + """Create and register a search endpoint for a specific `VectorStore`. Args: - router (APIRouter | FastAPI): The FastAPI application instance. + router (APIRouter | FastAPI): The `FastAPI` application instance. endpoint_name (str): The name of the endpoint to be created. - vector_store: The vector store object responsible for performing search operations. + vector_store: The `VectorStore` object responsible for performing search operations. The created endpoint accepts POST requests with input data and a query parameter specifying the number of results to return. It performs a search operation using @@ -233,9 +236,9 @@ def _create_reverse_search_endpoint(router: APIRouter | FastAPI, endpoint_name: """Create and register a reverse_search endpoint for a specific vector store. Args: - router (APIRouter | FastAPI): The FastAPI application instance. + router (APIRouter | FastAPI): The `FastAPI` application instance. endpoint_name (str): The name of the endpoint to be created. 
- vector_store: The vector store object responsible for performing search operations. + vector_store: The `VectorStore` object responsible for performing search operations. The created endpoint accepts POST requests with input data and a query parameter specifying the number of results to return. It performs a reverse search operation using diff --git a/src/classifai/vectorisers/__init__.py b/src/classifai/vectorisers/__init__.py index 1d99312..e133505 100644 --- a/src/classifai/vectorisers/__init__.py +++ b/src/classifai/vectorisers/__init__.py @@ -1,5 +1,6 @@ # pylint: disable=C0301 -"""This module provides classes for creating and utilizing embedding models from different services. +"""This module provides classes for creating and utilizing embedding models from user-created solutions or +third-party services. The Vectoriser module offers a unified interface to interact with various other ClassifAI Package Modules. Generally Vectorisers are used to convert text data into numerical embeddings that can be used for machine learning tasks. @@ -9,9 +10,10 @@ # Vectoriser Overview In our Package, Vectoriser have a simple role: - - Take in text data (as a string or list of strings) - - Output numerical embeddings (as a numpy array) - - Each Vectortiser should provide a `transform` method to perform this conversion. + + * Take in text data (as a string or list of strings) + * Output numerical embeddings (as a numpy array) + * Each Vectoriser should provide a `transform` method to perform this conversion. It is possible for users to implement their own Vectoriser classes by inheriting from the `VectoriserBase` abstract base class and implementing the `transform` method. @@ -21,14 +23,15 @@ ########################### # Implemented Vectorisers -We provide several quick implementations of Vectorisers that interface with popular services and libraries. +We provide several robust implementations of Vectorisers that interface with popular services and libraries.
The module contains the following 'ready-made' classes: -- `GcpVectoriser`: A class for embedding text using Google Cloud Platform's GenAI API. -- `HuggingFaceVectoriser`: A general wrapper class for Huggingface Transformers -models to generate text embeddings. -- `OllamaVectoriser`: A general wrapper class for using a locally running ollama -server to generate text embeddings. + + * `GcpVectoriser`: A class for embedding text using Google Cloud Platform's GenAI API. + * `HuggingFaceVectoriser`: A general wrapper class for Huggingface Transformers + models to generate text embeddings. + * `OllamaVectoriser`: A general wrapper class for using a locally running ollama + server to generate text embeddings. Each class is designed to interface with a specific service that provides embedding model functionality. @@ -41,6 +44,14 @@ These classes abstract the underlying implementation details, providing a simple and consistent interface for embedding text using different services. + +########################### +########################### +# Further Reading + +The "Creating Your Own Vectoriser" demo notebook (`DEMO/custom_vectoriser.ipynb`) contains detailed +instructions and examples on implementing custom `Vectoriser`s, and using them within `VectorStore` +objects. """ from .base import VectoriserBase diff --git a/src/classifai/vectorisers/gcp.py b/src/classifai/vectorisers/gcp.py index 73735c7..10f55d3 100644 --- a/src/classifai/vectorisers/gcp.py +++ b/src/classifai/vectorisers/gcp.py @@ -61,7 +61,7 @@ def __init__( **client_kwargs: [optional] Additional keyword arguments to pass to the GenAI client. Raises: - ConfigurationError: If the GenAI client fails to initialize. + `ConfigurationError`: If the GenAI client fails to initialize.
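The module docstring's contract — subclass the base, implement `transform`, return a numpy array — can be sketched in isolation. `VectoriserBase` below is a minimal stand-in ABC, not an import of the real classifai class, and `CharCountVectoriser` is a deliberately toy embedding:

```python
from abc import ABC, abstractmethod
import numpy as np

class VectoriserBase(ABC):
    """Stand-in for the classifai abstract base class."""
    @abstractmethod
    def transform(self, texts):
        ...

class CharCountVectoriser(VectoriserBase):
    """Toy vectoriser: fixed-size character-count embedding."""
    def __init__(self, dim: int = 8):
        self.dim = dim

    def transform(self, texts):
        # Accept a single string or a list, as the contract describes.
        if isinstance(texts, str):
            texts = [texts]
        out = np.zeros((len(texts), self.dim))
        for i, t in enumerate(texts):
            for ch in t.lower():
                out[i, ord(ch) % self.dim] += 1.0
        return out

emb = CharCountVectoriser(dim=8).transform(["hello", "world"])
print(emb.shape)  # one row per input text
```

Any class honouring this `transform` signature can be handed to a `VectorStore`, which is what makes user-created Vectorisers interchangeable with the ready-made ones.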
""" check_deps(["google-genai"], extra="gcp") from google import genai # type: ignore @@ -100,8 +100,8 @@ def transform(self, texts: str | list[str]) -> np.ndarray: numpy.ndarray: A 2D array of embeddings, where each row corresponds to an input text. Raises: - ExternalServiceError: If the GenAI API request fails. - VectorisationError: If the response format from the GenAI API is unexpected. + `ExternalServiceError`: If the GenAI API request fails. + `VectorisationError`: If the response format from the GenAI API is unexpected. """ # If a single string is passed as arg to texts, convert to list if isinstance(texts, str): diff --git a/src/classifai/vectorisers/huggingface.py b/src/classifai/vectorisers/huggingface.py index f7c569a..5ac4d66 100644 --- a/src/classifai/vectorisers/huggingface.py +++ b/src/classifai/vectorisers/huggingface.py @@ -11,6 +11,11 @@ class HuggingFaceVectoriser(VectoriserBase): """A general wrapper class for Huggingface Transformers models to generate text embeddings. + The `HuggingFaceVectoriser` accepts most encoder-based models from the Huggingface Transformers library, + and provides a simple interface to generate embeddings from text data. Additional configuration options, + such as `trust_remote` or a HuggingFaceAPI token can be passed via the `tokenizer_kwargs` and `model_kwargs` + parameters. + Attributes: model_name (str): The name of the Huggingface model to use. tokenizer (transformers.PreTrainedTokenizer): The tokenizer for the specified model. @@ -38,8 +43,8 @@ def __init__( model_kwargs (dict): [optional] Additional keyword arguments to pass to the model. Defaults to None. Raises: - ExternalServiceError: If the model or tokenizer cannot be loaded. - ConfigurationError: If the model cannot be initialized on the specified device. + `ExternalServiceError`: If the model or tokenizer cannot be loaded. + `ConfigurationError`: If the model cannot be initialized on the specified device. 
""" check_deps(["transformers", "torch"], extra="huggingface") import torch # type: ignore @@ -100,7 +105,7 @@ def transform(self, texts: str | list[str]) -> np_ndarray: numpy.ndarray: A 2D array of embeddings, where each row corresponds to an input text. Raises: - VectorisationError: If tokenization, model inference, or embedding extraction fails. + `VectorisationError`: If tokenization, model inference, or embedding extraction fails. """ import torch # type: ignore diff --git a/src/classifai/vectorisers/ollama.py b/src/classifai/vectorisers/ollama.py index 983fff7..4202c4a 100644 --- a/src/classifai/vectorisers/ollama.py +++ b/src/classifai/vectorisers/ollama.py @@ -11,6 +11,12 @@ class OllamaVectoriser(VectoriserBase): """A wrapper class allowing a locally-running ollama server to generate text embeddings. + The `OllamaVectoriser` interacts with a locally-running Ollama server, which must be set + up by the user separately. + In general, Ollama can run the same encoder-based models as the `HuggingFaceVectoriser`. + A future goal is to extend the `OllamaVectoriser` to interface with an _external_ Ollama + server, allowing separation of embedding generation from the user's local environment. + Attributes: model_name (str): The name of the local model to use. """ @@ -38,8 +44,8 @@ def transform(self, texts: str | list[str]) -> np.ndarray: numpy.ndarray: A 2D array of embeddings, where each row corresponds to an input text. Raises: - ExternalServiceError: If the Ollama service fails to generate embeddings. - VectorisationError: If embedding extraction from the Ollama response fails. + `ExternalServiceError`: If the Ollama service fails to generate embeddings. + `VectorisationError`: If embedding extraction from the Ollama response fails. """ import ollama # type: ignore