Skip to content

- Adding InferenceService for semantic reranking.#48258

Open
mbhaskar wants to merge 2 commits intoAzure:mainfrom
mbhaskar:semantic-reranking
Open

- Adding InferenceService for semantic reranking.#48258
mbhaskar wants to merge 2 commits intoAzure:mainfrom
mbhaskar:semantic-reranking

Conversation

@mbhaskar
Copy link
Member

@mbhaskar mbhaskar commented Mar 5, 2026

Description

This PR adds semantic reranking functionality to the java SDK

  • new beta API
    public Mono<SemanticRerankResult> semanticRerank(
       String rerankContext,
       List<String> documents,
       SemanticRerankRequestOptions options)
    public SemanticRerankResult semanticRerank(
        String rerankContext,
        List<String> documents,
        SemanticRerankRequestOptions options)

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which have an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@github-actions github-actions bot added the Cosmos label Mar 5, 2026
@mbhaskar mbhaskar marked this pull request as ready for review March 5, 2026 19:02
@mbhaskar mbhaskar requested review from a team and kirankumarkolli as code owners March 5, 2026 19:02
Copilot AI review requested due to automatic review settings March 5, 2026 19:02
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

aayush3011
aayush3011 previously approved these changes Mar 5, 2026
Copy link
Member

@aayush3011 aayush3011 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it looks good, left a few comments.


@JsonProperty("latency")
private SemanticRerankLatency latency;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having this POJO SemanticRerankLatency we should have a Map here, because as discussed with the Inference team before, the result structure for latency can change in the future, and this would create a dependency on the SDK to make those changes. We followed the same approach in .Net SDK as well.

import com.azure.cosmos.models.SqlQuerySpec;
import com.azure.cosmos.models.ThroughputProperties;
import com.azure.cosmos.models.ThroughputResponse;
import com.azure.cosmos.models.*;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we revert this wild card import, we follow doing explicit import everywhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IntelliJ quirk. Will revert

@aayush3011 aayush3011 dismissed their stale review March 5, 2026 20:07

Approved by mistake.

@@ -936,6 +951,13 @@ public <TContext> Iterable<CosmosBulkOperationResponse<TContext>> executeBulkOpe
return this.blockBulkResponse(asyncContainer.executeBulkOperations(Flux.fromIterable(operations), bulkOptions));
}

public SemanticRerankResult semanticRerank(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beta tag missing

@@ -149,6 +99,7 @@ public class CosmosAsyncContainer {
private final AtomicBoolean isInitialized;
private CosmosAsyncScripts scripts;
private IFaultInjectorProvider faultInjectorProvider;
private InferenceService infereceService;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: infereceService → inferenceService

return Mono.error(new IllegalArgumentException("Documents list cannot be empty"));
}

if (this.infereceService == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: infereceService → inferenceService

// Licensed under the MIT License.
package com.azure.cosmos.implementation.inference;

import com.azure.core.credential.*;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Wildcard import

return URI.create(valueFromEnvVariable);
}

return URI.create(DEFAULT_THINCLIENT_ENDPOINT);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong fallback endpoint

return Mono.error(new IllegalArgumentException("Documents list cannot be empty"));
}

if (this.infereceService == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lazy initialization of inferenceService is not thread-safe. Multiple threads calling semanticRerank() concurrently can race past the null check and each create a new InferenceService instance — each one allocating its own HttpClient with a connection pool and event loop threads. This can lead to resource waste and subtle bugs. Consider using synchronized, AtomicReference.compareAndSet, or initializing eagerly in the constructor.

public static final int DEFAULT_MAX_CONNECTION_POOL_SIZE = 5;

private final URI inferenceEndpoint;
private final HttpClient httpClient;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InferenceService creates an HttpClient in its constructor but provides no close() or shutdown() method to release those resources. Combined with the thread-safety issue in CosmosAsyncContainer (which can create multiple instances), connection pools and threads can leak. Consider tying the lifecycle to the parent container or client so it gets cleaned up on client.close().

@@ -105,6 +105,7 @@ public final class CosmosAsyncClient implements Closeable {
private final ConsistencyLevel desiredConsistencyLevel;
private final ReadConsistencyStrategy readConsistencyStrategy;
private final AzureKeyCredential credential;
TokenCredential tokenCredential;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field is package-private (no access modifier) while all other credential/config fields in this class (e.g., credential, clientTelemetryConfig, desiredConsistencyLevel) are declared private. Should be private to stay consistent with the encapsulation pattern of the rest of the class.

@@ -397,13 +393,21 @@ public class Configs {
private static final boolean DEFAULT_CLIENT_LEAK_DETECTION_ENABLED = false;
private static final String CLIENT_LEAK_DETECTION_ENABLED = "COSMOS.CLIENT_LEAK_DETECTION_ENABLED";

// Inference service related configs
private static final String INFERENCE_ENDPOINT_PROPERTY = "AZURE_COSMOS_SEMANTIC_RERANKER_INFERENCE_ENDPOINT";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value of this is duplicated in 3 different places. Should we use a single source of truth and reference it.

INFERENCE_ENDPOINT_ENVIRONMENT_VARIABLE has the same value, and then the same variable is also being defined in the InferenceService.

/**
* Represents the result of a semantic rerank operation.
*/
@Beta(value = Beta.SinceVersion.V4_69_0, warningText = Beta.PREVIEW_SUBJECT_TO_CHANGE_WARNING)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Blocking · Correctness: Wrong Beta Version

V4_69_0 should be V4_78_0

This is a brand-new file introduced in this PR, but the @Beta annotation references V4_69_0:

@Beta(value = Beta.SinceVersion.V4_69_0, warningText = Beta.PREVIEW_SUBJECT_TO_CHANGE_WARNING)

The PR adds V4_78_0 to the Beta.SinceVersion enum specifically for this feature, and CosmosAsyncContainer.semanticRerank() correctly uses V4_78_0. This class (and SemanticRerankScore) should match — otherwise it falsely claims these models have existed since 4.69.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.


if (options != null) {
options.forEach((key, value) -> {
if (value instanceof Boolean) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Blocking · Correctness: Options Silently Dropped

Only Boolean and Integer option values are forwarded — all other types are silently ignored

if (value instanceof Boolean) {
    payload.put(key, (Boolean) value);
} else if (value instanceof Integer) {
    payload.put(key, (Integer) value);
}

This means String values (like document_type, target_paths), Double values, and Long values are silently dropped from the request. Callers passing these options would get no error — the options simply don't reach the inference endpoint.

Cross-SDK comparison: Both .NET and Python correctly forward all option types:

  • .NET: payload.Add(option, options[option]) — adds any type
  • Python: body.update(semantic_reranking_options) — forwards entire dict

Python even documents specific string-typed options (document_type, target_paths) that would be silently dropped by this code.

Suggested fix: Use Jackson's ObjectMapper to serialize values generically, or handle all common types (String, Double, Long, Float, etc.) explicitly.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

return Mono.error(new IllegalArgumentException("Documents list cannot be empty"));
}

if (this.infereceService == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation · Cross-SDK: Missing AAD Auth Enforcement

No validation that AAD auth is being used — will NPE if client uses key-based auth

if (this.infereceService == null) {
    this.infereceService = new InferenceService(this.database.getClient().tokenCredential());
}

If the CosmosClient was built with key-based auth (master key), tokenCredential() returns null. This passes null to InferenceService, which then throws a generic NullPointerException inside the checkNotNull — but with a misleading message ("Token credential is required for semantic reranking") that doesn't tell the user they need to switch to AAD auth.

Cross-SDK comparison:

  • .NET validates at construction: if (client.DocumentClient.cosmosAuthorization.GetType() != typeof(AuthorizationTokenProviderTokenCredential)) throw new InvalidOperationException("InferenceService only supports AAD authentication.")
  • Python checks: if inference_service is None: raise CosmosHttpResponseError(message="Semantic reranking requires AAD credentials")

Consider adding an explicit check before creating InferenceService:

TokenCredential cred = this.database.getClient().tokenCredential();
if (cred == null) {
    return Mono.error(new IllegalStateException(
        "Semantic reranking requires AAD authentication. Use CosmosClientBuilder.credential(TokenCredential) instead of key-based auth."));
}

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

public static final Duration DEFAULT_NETWORK_REQUEST_TIMEOUT = Duration.ofSeconds(5);
public static final Duration DEFAULT_IDLE_CONNECTION_TIMEOUT = Duration.ofSeconds(60);
public static final Duration DEFAULT_CONNECTION_ACQUIRE_TIMEOUT = Duration.ofSeconds(5);
public static final int DEFAULT_MAX_CONNECTION_POOL_SIZE = 5;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation · Cross-SDK: Connection Pool Size

Connection pool hardcoded to 5 — .NET defaults to 50 (10x larger) and makes it configurable

public static final int DEFAULT_MAX_CONNECTION_POOL_SIZE = 5;

The .NET SDK defaults to inferenceServiceDefaultMaxConnectionLimit = 50 and allows overriding via the AZURE_COSMOS_SEMANTIC_RERANKER_INFERENCE_SERVICE_MAX_CONNECTION_LIMIT environment variable. Under concurrent semantic rerank calls, 5 connections would create contention and increased latency from connection acquire timeouts.

Consider matching .NET's default (50) and making this configurable via an environment variable or system property.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

logger.debug("Sending semantic rerank request to: {}", inferenceEndpoint);
}

return httpClient.send(httpRequest, DEFAULT_TIMEOUT)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation · Cross-SDK: No Retry Logic

Zero retry logic — a single transient 429 or 500 fails permanently

return httpClient.send(httpRequest, DEFAULT_TIMEOUT)
    .flatMap(response -> parseResponse(response));

Both peer SDKs implement retry policies for the inference endpoint:

  • Python: Retries on 429/500 with backoff (TOTAL_RETRIES=3, RETRY_BACKOFF_FACTOR=0.8, RETRY_BACKOFF_MAX=120s)
  • .NET: Uses the standard HTTP handler retry pipeline

Inference endpoints are external services subject to transient failures and rate limiting. Without retries, a single 429 (throttle) or 503 (transient) response causes a permanent failure that the caller must handle. Consider adding at least basic retry logic for retryable status codes (429, 500, 502, 503).


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

}

@Beta(value = Beta.SinceVersion.V4_78_0)
public Mono<SemanticRerankResult> semanticRerank(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation · Cross-SDK: No Cancellation Support

No mechanism for callers to cancel or set custom timeouts

public Mono<SemanticRerankResult> semanticRerank(
    String rerankContext,
    List<String> documents,
    Map<String, Object> options)

The .NET API accepts CancellationToken cancellationToken = default, allowing callers to cancel in-flight requests. The Python SDK reads InferenceRequestTimeout from the connection policy, making timeouts configurable.

The Java async API returns Mono<> (callers can use .timeout() downstream), but the sync CosmosContainer.semanticRerank() blocks with a hardcoded 120-second timeout and no way to cancel. Consider either accepting a timeout/cancellation parameter or documenting the 120-second default prominently.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

}

@Beta(value = Beta.SinceVersion.V4_78_0)
public Mono<SemanticRerankResult> semanticRerank(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Suggestion · Cross-SDK: Undocumented Options

The options parameter has no documentation of valid keys

The Python SDK clearly documents all supported option keys in its docstring:

  • return_documents (bool): Whether to return document text in the response
  • top_k (int): Max documents to return
  • batch_size (int): Documents per batch
  • sort (bool): Whether to sort by relevance
  • document_type (str): Type of documents — "string" or "json"
  • target_paths (str): JSON paths for extraction when document_type is "json"

The Java API just says Map<String, Object> options with no indication of what keys are valid. Consider adding Javadoc listing the supported option keys, their types, and defaults.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

private static final Duration DEFAULT_TIMEOUT = Duration.ofSeconds(120);
private static final String INFERENCE_ENDPOINT_PROPERTY = "AZURE_COSMOS_SEMANTIC_RERANKER_INFERENCE_ENDPOINT";
private static final ObjectMapper OBJECT_MAPPER = Utils.getSimpleObjectMapper();
public static final Duration DEFAULT_NETWORK_REQUEST_TIMEOUT = Duration.ofSeconds(5);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💬 Observation · Reliability: Timeout Configuration

Potential conflict between 5-second network timeout and 120-second operation timeout

public static final Duration DEFAULT_NETWORK_REQUEST_TIMEOUT = Duration.ofSeconds(5);

The HttpClientConfig is configured with a 5-second network request timeout, but httpClient.send() is called with DEFAULT_TIMEOUT = Duration.ofSeconds(120). If the network request timeout governs the underlying connection/read timeout, inference calls taking longer than 5 seconds (which is expected for large document sets) would fail before the 120-second operation timeout applies. The Python SDK uses a configurable InferenceRequestTimeout (default 5s) for both connection_timeout and read_timeout — the intent is clearer there.

Worth verifying which timeout actually governs the behavior.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants