- Adding InferenceService for semantic reranking. by mbhaskar · Pull Request #48258 · Azure/azure-sdk-for-java

mbhaskar · 2026-03-05T01:30:10Z

Description

This PR adds semantic reranking functionality to the java SDK

new beta API

    public Mono<SemanticRerankResult> semanticRerank(
       String rerankContext,
       List<String> documents,
       SemanticRerankRequestOptions options)

    public SemanticRerankResult semanticRerank(
        String rerankContext,
        List<String> documents,
        SemanticRerankRequestOptions options)

All SDK Contribution checklist:

The pull request does not introduce [breaking changes]
CHANGELOG is updated for new features, bug fixes or other significant changes.
I have read the contribution guidelines.

General Guidelines and Best Practices

Title of the pull request is clear and informative.
There are a small number of commits, each of which have an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

Pull request includes test coverage for the included changes.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

aayush3011

Overall it looks good, left a few comments.

aayush3011 · 2026-03-05T19:52:22Z

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/SemanticRerankResult.java

+
+    @JsonProperty("latency")
+    private SemanticRerankLatency latency;
+


Instead of having this POJO SemanticRerankLatency we should have a Map here, because as discussed with the Inference team before, the result structure for latency can change in the future, and this would create a dependency on the SDK to make those changes. We followed the same approach in .Net SDK as well.

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/SemanticRerankResult.java

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/SemanticRerankRequestOptions.java

aayush3011 · 2026-03-05T19:57:47Z

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/CosmosAsyncContainer.java

-import com.azure.cosmos.models.SqlQuerySpec;
-import com.azure.cosmos.models.ThroughputProperties;
-import com.azure.cosmos.models.ThroughputResponse;
+import com.azure.cosmos.models.*;


Can we revert this wild card import, we follow doing explicit import everywhere.

IntelliJ quirk. Will revert

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/CosmosAsyncContainer.java

Approved by mistake.

aayush3011 · 2026-03-06T19:06:37Z

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/CosmosContainer.java

@@ -936,6 +951,13 @@ public <TContext> Iterable<CosmosBulkOperationResponse<TContext>> executeBulkOpe
        return this.blockBulkResponse(asyncContainer.executeBulkOperations(Flux.fromIterable(operations), bulkOptions));
    }

+    public SemanticRerankResult semanticRerank(


Beta tag missing

aayush3011 · 2026-03-06T19:08:11Z

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/CosmosAsyncContainer.java

@@ -149,6 +99,7 @@ public class CosmosAsyncContainer {
    private final AtomicBoolean isInitialized;
    private CosmosAsyncScripts scripts;
    private IFaultInjectorProvider faultInjectorProvider;
+    private InferenceService infereceService;


Typo: infereceService → inferenceService

aayush3011 · 2026-03-06T19:08:22Z

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/CosmosAsyncContainer.java

+            return Mono.error(new IllegalArgumentException("Documents list cannot be empty"));
+        }
+
+        if (this.infereceService == null) {


Typo: infereceService → inferenceService

aayush3011 · 2026-03-06T19:09:07Z

...s/azure-cosmos/src/main/java/com/azure/cosmos/implementation/inference/InferenceService.java

+// Licensed under the MIT License.
+package com.azure.cosmos.implementation.inference;
+
+import com.azure.core.credential.*;


Nit: Wildcard import

aayush3011 · 2026-03-06T19:09:41Z

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java

+            return URI.create(valueFromEnvVariable);
+        }
+
+        return URI.create(DEFAULT_THINCLIENT_ENDPOINT);


Wrong fallback endpoint

aayush3011 · 2026-03-06T19:11:26Z

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/CosmosAsyncContainer.java

+            return Mono.error(new IllegalArgumentException("Documents list cannot be empty"));
+        }
+
+        if (this.infereceService == null) {


The lazy initialization of inferenceService is not thread-safe. Multiple threads calling semanticRerank() concurrently can race past the null check and each create a new InferenceService instance — each one allocating its own HttpClient with a connection pool and event loop threads. This can lead to resource waste and subtle bugs. Consider using synchronized, AtomicReference.compareAndSet, or initializing eagerly in the constructor.

aayush3011 · 2026-03-06T19:13:12Z

...s/azure-cosmos/src/main/java/com/azure/cosmos/implementation/inference/InferenceService.java

+    public static final int DEFAULT_MAX_CONNECTION_POOL_SIZE = 5;
+
+    private final URI inferenceEndpoint;
+    private final HttpClient httpClient;


InferenceService creates an HttpClient in its constructor but provides no close() or shutdown() method to release those resources. Combined with the thread-safety issue in CosmosAsyncContainer (which can create multiple instances), connection pools and threads can leak. Consider tying the lifecycle to the parent container or client so it gets cleaned up on client.close().

aayush3011 · 2026-03-06T19:13:58Z

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/CosmosAsyncClient.java

@@ -105,6 +105,7 @@ public final class CosmosAsyncClient implements Closeable {
    private final ConsistencyLevel desiredConsistencyLevel;
    private final ReadConsistencyStrategy readConsistencyStrategy;
    private final AzureKeyCredential credential;
+    TokenCredential tokenCredential;


This field is package-private (no access modifier) while all other credential/config fields in this class (e.g., credential, clientTelemetryConfig, desiredConsistencyLevel) are declared private. Should be private to stay consistent with the encapsulation pattern of the rest of the class.

aayush3011 · 2026-03-06T19:16:48Z

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java

@@ -397,13 +393,21 @@ public class Configs {
    private static final boolean DEFAULT_CLIENT_LEAK_DETECTION_ENABLED = false;
    private static final String CLIENT_LEAK_DETECTION_ENABLED = "COSMOS.CLIENT_LEAK_DETECTION_ENABLED";

+    // Inference service related configs
+    private static final String INFERENCE_ENDPOINT_PROPERTY = "AZURE_COSMOS_SEMANTIC_RERANKER_INFERENCE_ENDPOINT";


The value of this is duplicated in 3 different places. Should we use a single source of truth and reference it.

INFERENCE_ENDPOINT_ENVIRONMENT_VARIABLE has the same value, and then the same variable is also being defined in the InferenceService.

xinlian12 · 2026-03-08T06:39:28Z

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/SemanticRerankResult.java

+/**
+ * Represents the result of a semantic rerank operation.
+ */
+@Beta(value = Beta.SinceVersion.V4_69_0, warningText = Beta.PREVIEW_SUBJECT_TO_CHANGE_WARNING)


🔴 Blocking · Correctness: Wrong Beta Version

V4_69_0 should be V4_78_0

This is a brand-new file introduced in this PR, but the @Beta annotation references V4_69_0:

@Beta(value = Beta.SinceVersion.V4_69_0, warningText = Beta.PREVIEW_SUBJECT_TO_CHANGE_WARNING)

The PR adds V4_78_0 to the Beta.SinceVersion enum specifically for this feature, and CosmosAsyncContainer.semanticRerank() correctly uses V4_78_0. This class (and SemanticRerankScore) should match — otherwise it falsely claims these models have existed since 4.69.

_{⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.}

xinlian12 · 2026-03-08T06:39:28Z

...s/azure-cosmos/src/main/java/com/azure/cosmos/implementation/inference/InferenceService.java

+
+                    if (options != null) {
+                        options.forEach((key, value) -> {
+                            if (value instanceof Boolean) {


🔴 Blocking · Correctness: Options Silently Dropped

Only Boolean and Integer option values are forwarded — all other types are silently ignored

if (value instanceof Boolean) { payload.put(key, (Boolean) value); } else if (value instanceof Integer) { payload.put(key, (Integer) value); }

This means String values (like document_type, target_paths), Double values, and Long values are silently dropped from the request. Callers passing these options would get no error — the options simply don't reach the inference endpoint.

Cross-SDK comparison: Both .NET and Python correctly forward all option types:

.NET: payload.Add(option, options[option]) — adds any type

Python: body.update(semantic_reranking_options) — forwards entire dict

Python even documents specific string-typed options (document_type, target_paths) that would be silently dropped by this code.

Suggested fix: Use Jackson's ObjectMapper to serialize values generically, or handle all common types (String, Double, Long, Float, etc.) explicitly.

_{⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.}

xinlian12 · 2026-03-08T06:39:28Z

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/CosmosAsyncContainer.java

+            return Mono.error(new IllegalArgumentException("Documents list cannot be empty"));
+        }
+
+        if (this.infereceService == null) {


🟡 Recommendation · Cross-SDK: Missing AAD Auth Enforcement

No validation that AAD auth is being used — will NPE if client uses key-based auth

if (this.infereceService == null) { this.infereceService = new InferenceService(this.database.getClient().tokenCredential()); }

If the CosmosClient was built with key-based auth (master key), tokenCredential() returns null. This passes null to InferenceService, which then throws a generic NullPointerException inside the checkNotNull — but with a misleading message ("Token credential is required for semantic reranking") that doesn't tell the user they need to switch to AAD auth.

Cross-SDK comparison:

.NET validates at construction: if (client.DocumentClient.cosmosAuthorization.GetType() != typeof(AuthorizationTokenProviderTokenCredential)) throw new InvalidOperationException("InferenceService only supports AAD authentication.")

Python checks: if inference_service is None: raise CosmosHttpResponseError(message="Semantic reranking requires AAD credentials")

Consider adding an explicit check before creating InferenceService:

TokenCredential cred = this.database.getClient().tokenCredential(); if (cred == null) { return Mono.error(new IllegalStateException( "Semantic reranking requires AAD authentication. Use CosmosClientBuilder.credential(TokenCredential) instead of key-based auth.")); }

_{⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.}

xinlian12 · 2026-03-08T06:39:28Z

...s/azure-cosmos/src/main/java/com/azure/cosmos/implementation/inference/InferenceService.java

+    public static final Duration DEFAULT_NETWORK_REQUEST_TIMEOUT = Duration.ofSeconds(5);
+    public static final Duration DEFAULT_IDLE_CONNECTION_TIMEOUT = Duration.ofSeconds(60);
+    public static final Duration DEFAULT_CONNECTION_ACQUIRE_TIMEOUT = Duration.ofSeconds(5);
+    public static final int DEFAULT_MAX_CONNECTION_POOL_SIZE = 5;


🟡 Recommendation · Cross-SDK: Connection Pool Size

Connection pool hardcoded to 5 — .NET defaults to 50 (10x larger) and makes it configurable

public static final int DEFAULT_MAX_CONNECTION_POOL_SIZE = 5;

The .NET SDK defaults to inferenceServiceDefaultMaxConnectionLimit = 50 and allows overriding via the AZURE_COSMOS_SEMANTIC_RERANKER_INFERENCE_SERVICE_MAX_CONNECTION_LIMIT environment variable. Under concurrent semantic rerank calls, 5 connections would create contention and increased latency from connection acquire timeouts.

Consider matching .NET's default (50) and making this configurable via an environment variable or system property.

_{⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.}

xinlian12 · 2026-03-08T06:39:28Z

...s/azure-cosmos/src/main/java/com/azure/cosmos/implementation/inference/InferenceService.java

+                        logger.debug("Sending semantic rerank request to: {}", inferenceEndpoint);
+                    }
+
+                    return httpClient.send(httpRequest, DEFAULT_TIMEOUT)


🟡 Recommendation · Cross-SDK: No Retry Logic

Zero retry logic — a single transient 429 or 500 fails permanently

return httpClient.send(httpRequest, DEFAULT_TIMEOUT) .flatMap(response -> parseResponse(response));

Both peer SDKs implement retry policies for the inference endpoint:

Python: Retries on 429/500 with backoff (TOTAL_RETRIES=3, RETRY_BACKOFF_FACTOR=0.8, RETRY_BACKOFF_MAX=120s)

.NET: Uses the standard HTTP handler retry pipeline

Inference endpoints are external services subject to transient failures and rate limiting. Without retries, a single 429 (throttle) or 503 (transient) response causes a permanent failure that the caller must handle. Consider adding at least basic retry logic for retryable status codes (429, 500, 502, 503).

_{⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.}

xinlian12 · 2026-03-08T06:39:28Z

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/CosmosAsyncContainer.java

    }

+    @Beta(value = Beta.SinceVersion.V4_78_0)
+    public Mono<SemanticRerankResult> semanticRerank(


🟡 Recommendation · Cross-SDK: No Cancellation Support

No mechanism for callers to cancel or set custom timeouts

public Mono<SemanticRerankResult> semanticRerank( String rerankContext, List<String> documents, Map<String, Object> options)

The .NET API accepts CancellationToken cancellationToken = default, allowing callers to cancel in-flight requests. The Python SDK reads InferenceRequestTimeout from the connection policy, making timeouts configurable.

The Java async API returns Mono<> (callers can use .timeout() downstream), but the sync CosmosContainer.semanticRerank() blocks with a hardcoded 120-second timeout and no way to cancel. Consider either accepting a timeout/cancellation parameter or documenting the 120-second default prominently.

_{⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.}

xinlian12 · 2026-03-08T06:39:28Z

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/CosmosAsyncContainer.java

    }

+    @Beta(value = Beta.SinceVersion.V4_78_0)
+    public Mono<SemanticRerankResult> semanticRerank(


🟢 Suggestion · Cross-SDK: Undocumented Options

The options parameter has no documentation of valid keys

The Python SDK clearly documents all supported option keys in its docstring:

return_documents (bool): Whether to return document text in the response

top_k (int): Max documents to return

batch_size (int): Documents per batch

sort (bool): Whether to sort by relevance

document_type (str): Type of documents — "string" or "json"

target_paths (str): JSON paths for extraction when document_type is "json"

The Java API just says Map<String, Object> options with no indication of what keys are valid. Consider adding Javadoc listing the supported option keys, their types, and defaults.

_{⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.}

xinlian12 · 2026-03-08T06:39:28Z

...s/azure-cosmos/src/main/java/com/azure/cosmos/implementation/inference/InferenceService.java

+    private static final Duration DEFAULT_TIMEOUT = Duration.ofSeconds(120);
+    private static final String INFERENCE_ENDPOINT_PROPERTY = "AZURE_COSMOS_SEMANTIC_RERANKER_INFERENCE_ENDPOINT";
+    private static final ObjectMapper OBJECT_MAPPER = Utils.getSimpleObjectMapper();
+    public static final Duration DEFAULT_NETWORK_REQUEST_TIMEOUT = Duration.ofSeconds(5);


💬 Observation · Reliability: Timeout Configuration

Potential conflict between 5-second network timeout and 120-second operation timeout

public static final Duration DEFAULT_NETWORK_REQUEST_TIMEOUT = Duration.ofSeconds(5);

The HttpClientConfig is configured with a 5-second network request timeout, but httpClient.send() is called with DEFAULT_TIMEOUT = Duration.ofSeconds(120). If the network request timeout governs the underlying connection/read timeout, inference calls taking longer than 5 seconds (which is expected for large document sets) would fail before the 120-second operation timeout applies. The Python SDK uses a configurable InferenceRequestTimeout (default 5s) for both connection_timeout and read_timeout — the intent is clearer there.

Worth verifying which timeout actually governs the behavior.

_{⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.}

- Adding InferenceService for semantic reranking.

5bf8a96

github-actions bot added the Cosmos label Mar 5, 2026

mbhaskar marked this pull request as ready for review March 5, 2026 19:02

mbhaskar requested review from a team and kirankumarkolli as code owners March 5, 2026 19:02

Copilot AI review requested due to automatic review settings March 5, 2026 19:02

Copilot AI reviewed Mar 5, 2026

View reviewed changes

aayush3011 previously approved these changes Mar 5, 2026

View reviewed changes

aayush3011 reviewed Mar 6, 2026

View reviewed changes

aayush3011 requested changes Mar 6, 2026

View reviewed changes

Removing first class objects for options

e536610

xinlian12 reviewed Mar 8, 2026

View reviewed changes


		@JsonProperty("latency")
		private SemanticRerankLatency latency;

Conversation

mbhaskar commented Mar 5, 2026

Description

All SDK Contribution checklist:

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

aayush3011 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants