Adding InferenceService for semantic reranking. #48258
mbhaskar wants to merge 2 commits into Azure:main
Conversation
aayush3011 left a comment:
Overall it looks good, left a few comments.
```java
@JsonProperty("latency")
private SemanticRerankLatency latency;
```

Instead of the POJO SemanticRerankLatency, we should use a Map here: as discussed with the Inference team, the result structure for latency can change in the future, and a POJO would create a dependency on the SDK to make those changes. We followed the same approach in the .NET SDK as well.
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/SemanticRerankResult.java (outdated, resolved)
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/SemanticRerankRequestOptions.java (outdated, resolved)
```java
import com.azure.cosmos.models.SqlQuerySpec;
import com.azure.cosmos.models.ThroughputProperties;
import com.azure.cosmos.models.ThroughputResponse;
import com.azure.cosmos.models.*;
```

Can we revert this wildcard import? We use explicit imports everywhere.

IntelliJ quirk. Will revert.
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/CosmosAsyncContainer.java (outdated, resolved)
```java
@@ -936,6 +951,13 @@ public <TContext> Iterable<CosmosBulkOperationResponse<TContext>> executeBulkOpe
    return this.blockBulkResponse(asyncContainer.executeBulkOperations(Flux.fromIterable(operations), bulkOptions));
}

public SemanticRerankResult semanticRerank(
```

```java
@@ -149,6 +99,7 @@ public class CosmosAsyncContainer {
private final AtomicBoolean isInitialized;
private CosmosAsyncScripts scripts;
private IFaultInjectorProvider faultInjectorProvider;
private InferenceService infereceService;
```

Typo: infereceService → inferenceService
```java
    return Mono.error(new IllegalArgumentException("Documents list cannot be empty"));
}

if (this.infereceService == null) {
```

Typo: infereceService → inferenceService
```java
// Licensed under the MIT License.
package com.azure.cosmos.implementation.inference;

import com.azure.core.credential.*;
```
```java
    return URI.create(valueFromEnvVariable);
}

return URI.create(DEFAULT_THINCLIENT_ENDPOINT);
```
```java
    return Mono.error(new IllegalArgumentException("Documents list cannot be empty"));
}

if (this.infereceService == null) {
```

The lazy initialization of inferenceService is not thread-safe. Multiple threads calling semanticRerank() concurrently can race past the null check and each create a new InferenceService instance, each one allocating its own HttpClient with a connection pool and event loop threads. This can lead to resource waste and subtle bugs. Consider using synchronized, AtomicReference.compareAndSet, or initializing eagerly in the constructor.
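A minimal sketch of the synchronized approach. The class and field names are illustrative and InferenceService is stubbed (the real constructor takes the token credential), but the double-checked locking pattern is what the review is suggesting:

```java
// Hypothetical sketch: double-checked locking so concurrent semanticRerank()
// callers share one InferenceService, and the expensive HttpClient allocation
// inside its constructor happens exactly once.
public class LazyInferenceHolder {
    static class InferenceService { }  // stand-in for the real service

    private volatile InferenceService inferenceService;

    InferenceService getOrCreateInferenceService() {
        InferenceService local = inferenceService;
        if (local == null) {
            synchronized (this) {
                local = inferenceService;
                if (local == null) {
                    // Only the first thread to enter constructs the instance;
                    // later threads see the published value via the volatile read.
                    local = new InferenceService();
                    inferenceService = local;
                }
            }
        }
        return local;
    }
}
```

The volatile field is required for the pattern to be correct under the Java Memory Model; without it, a second thread could observe a partially constructed instance.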
```java
public static final int DEFAULT_MAX_CONNECTION_POOL_SIZE = 5;

private final URI inferenceEndpoint;
private final HttpClient httpClient;
```

InferenceService creates an HttpClient in its constructor but provides no close() or shutdown() method to release those resources. Combined with the thread-safety issue in CosmosAsyncContainer (which can create multiple instances), connection pools and threads can leak. Consider tying the lifecycle to the parent container or client so it gets cleaned up on client.close().
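A sketch of the lifecycle fix, assuming the service can implement Closeable; the class name and the shutdown hook are illustrative stand-ins for releasing the real HttpClient:

```java
import java.io.Closeable;

// Hypothetical sketch: make the service Closeable so the owning client can
// release the pooled HttpClient (and its event loop threads) in its own close().
public class CloseableInferenceService implements Closeable {
    private volatile boolean closed;

    // stands in for shutting down the real HttpClient and connection pool
    private void shutdownHttpClient() {
        closed = true;
    }

    @Override
    public void close() {
        shutdownHttpClient();
    }

    public boolean isClosed() {
        return closed;
    }
}
```

The parent CosmosAsyncClient already implements Closeable, so its close() would simply delegate to this one.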
```java
@@ -105,6 +105,7 @@ public final class CosmosAsyncClient implements Closeable {
private final ConsistencyLevel desiredConsistencyLevel;
private final ReadConsistencyStrategy readConsistencyStrategy;
private final AzureKeyCredential credential;
TokenCredential tokenCredential;
```

This field is package-private (no access modifier) while all other credential/config fields in this class (e.g., credential, clientTelemetryConfig, desiredConsistencyLevel) are declared private. It should be private to stay consistent with the encapsulation pattern of the rest of the class.
```java
@@ -397,13 +393,21 @@ public class Configs {
private static final boolean DEFAULT_CLIENT_LEAK_DETECTION_ENABLED = false;
private static final String CLIENT_LEAK_DETECTION_ENABLED = "COSMOS.CLIENT_LEAK_DETECTION_ENABLED";

// Inference service related configs
private static final String INFERENCE_ENDPOINT_PROPERTY = "AZURE_COSMOS_SEMANTIC_RERANKER_INFERENCE_ENDPOINT";
```

The value of this constant is duplicated in three different places: INFERENCE_ENDPOINT_ENVIRONMENT_VARIABLE has the same value, and the same literal is also defined in InferenceService. Should we use a single source of truth and reference it everywhere?
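A sketch of the single-source-of-truth shape being suggested; the class layout is illustrative, the literal itself comes from the PR:

```java
// Hypothetical sketch: declare the variable name once in Configs and have every
// other site (the env-var lookup, InferenceService) reference the constant
// instead of re-declaring the string literal.
public class Configs {
    public static final String INFERENCE_ENDPOINT_PROPERTY =
        "AZURE_COSMOS_SEMANTIC_RERANKER_INFERENCE_ENDPOINT";

    // lookup goes through the single constant, so a rename touches one line
    public static String inferenceEndpointOverride() {
        return System.getenv(INFERENCE_ENDPOINT_PROPERTY);
    }
}
```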
```java
/**
 * Represents the result of a semantic rerank operation.
 */
@Beta(value = Beta.SinceVersion.V4_69_0, warningText = Beta.PREVIEW_SUBJECT_TO_CHANGE_WARNING)
```

🔴 Blocking · Correctness: Wrong Beta Version

V4_69_0 should be V4_78_0. This is a brand-new file introduced in this PR, but the @Beta annotation references V4_69_0:

```java
@Beta(value = Beta.SinceVersion.V4_69_0, warningText = Beta.PREVIEW_SUBJECT_TO_CHANGE_WARNING)
```

The PR adds V4_78_0 to the Beta.SinceVersion enum specifically for this feature, and CosmosAsyncContainer.semanticRerank() correctly uses V4_78_0. This class (and SemanticRerankScore) should match; otherwise it falsely claims these models have existed since 4.69.
```java
if (options != null) {
    options.forEach((key, value) -> {
        if (value instanceof Boolean) {
```

🔴 Blocking · Correctness: Options Silently Dropped

Only Boolean and Integer option values are forwarded; all other types are silently ignored:

```java
if (value instanceof Boolean) {
    payload.put(key, (Boolean) value);
} else if (value instanceof Integer) {
    payload.put(key, (Integer) value);
}
```

This means String values (like document_type, target_paths), Double values, and Long values are silently dropped from the request. Callers passing these options get no error; the options simply never reach the inference endpoint.

Cross-SDK comparison: both .NET and Python correctly forward all option types:
- .NET: `payload.Add(option, options[option])` adds any type.
- Python: `body.update(semantic_reranking_options)` forwards the entire dict.

Python even documents specific string-typed options (document_type, target_paths) that would be silently dropped by this code.

Suggested fix: use Jackson's ObjectMapper to serialize values generically, or handle all common types (String, Double, Long, Float, etc.) explicitly.
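A sketch of the "handle all common types explicitly" variant, using a plain Map as a stand-in for the real JSON payload; rejecting unknown types loudly is an added assumption, not something the PR does:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: forward every common option value type to the payload
// instead of silently dropping everything that is not Boolean or Integer.
public class OptionsPayload {
    public static Map<String, Object> buildPayload(Map<String, Object> options) {
        Map<String, Object> payload = new HashMap<>();
        if (options != null) {
            options.forEach((key, value) -> {
                // Boolean, String, and every Number (Integer, Long, Double, Float)
                // all serialize cleanly to JSON scalars.
                if (value instanceof Boolean || value instanceof String || value instanceof Number) {
                    payload.put(key, value);
                } else {
                    // fail loudly rather than dropping the option on the floor
                    throw new IllegalArgumentException(
                        "Unsupported semantic rerank option type for key '" + key + "'");
                }
            });
        }
        return payload;
    }
}
```

With this shape, document_type and target_paths (Strings) reach the endpoint, matching the .NET and Python behavior described above.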
```java
    return Mono.error(new IllegalArgumentException("Documents list cannot be empty"));
}

if (this.infereceService == null) {
```

🟡 Recommendation · Cross-SDK: Missing AAD Auth Enforcement

No validation that AAD auth is being used; this will NPE if the client uses key-based auth:

```java
if (this.infereceService == null) {
    this.infereceService = new InferenceService(this.database.getClient().tokenCredential());
}
```

If the CosmosClient was built with key-based auth (master key), tokenCredential() returns null. This passes null to InferenceService, which then throws a generic NullPointerException inside the checkNotNull, with a misleading message ("Token credential is required for semantic reranking") that doesn't tell the user they need to switch to AAD auth.

Cross-SDK comparison:
- .NET validates at construction: `if (client.DocumentClient.cosmosAuthorization.GetType() != typeof(AuthorizationTokenProviderTokenCredential)) throw new InvalidOperationException("InferenceService only supports AAD authentication.")`
- Python checks: `if inference_service is None: raise CosmosHttpResponseError(message="Semantic reranking requires AAD credentials")`

Consider adding an explicit check before creating InferenceService:

```java
TokenCredential cred = this.database.getClient().tokenCredential();
if (cred == null) {
    return Mono.error(new IllegalStateException(
        "Semantic reranking requires AAD authentication. Use CosmosClientBuilder.credential(TokenCredential) instead of key-based auth."));
}
```

```java
public static final Duration DEFAULT_NETWORK_REQUEST_TIMEOUT = Duration.ofSeconds(5);
public static final Duration DEFAULT_IDLE_CONNECTION_TIMEOUT = Duration.ofSeconds(60);
public static final Duration DEFAULT_CONNECTION_ACQUIRE_TIMEOUT = Duration.ofSeconds(5);
public static final int DEFAULT_MAX_CONNECTION_POOL_SIZE = 5;
```

🟡 Recommendation · Cross-SDK: Connection Pool Size

Connection pool hardcoded to 5; .NET defaults to 50 (10x larger) and makes it configurable. The .NET SDK defaults to inferenceServiceDefaultMaxConnectionLimit = 50 and allows overriding via the AZURE_COSMOS_SEMANTIC_RERANKER_INFERENCE_SERVICE_MAX_CONNECTION_LIMIT environment variable. Under concurrent semantic rerank calls, 5 connections would create contention and increased latency from connection acquire timeouts.

Consider matching .NET's default (50) and making this configurable via an environment variable or system property.
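A sketch of the configurable default, assuming the same environment-variable name .NET uses; the class name and fallback-on-malformed-input behavior are illustrative:

```java
// Hypothetical sketch: default the pool to .NET's 50 and allow an override
// through the environment variable .NET already defines.
public class PoolConfig {
    public static final int DEFAULT_MAX_CONNECTION_POOL_SIZE = 50;
    public static final String MAX_CONNECTION_LIMIT_VARIABLE =
        "AZURE_COSMOS_SEMANTIC_RERANKER_INFERENCE_SERVICE_MAX_CONNECTION_LIMIT";

    public static int maxConnectionPoolSize() {
        String override = System.getenv(MAX_CONNECTION_LIMIT_VARIABLE);
        if (override == null) {
            return DEFAULT_MAX_CONNECTION_POOL_SIZE;
        }
        try {
            return Integer.parseInt(override);
        } catch (NumberFormatException e) {
            // fall back to the default rather than failing client construction
            return DEFAULT_MAX_CONNECTION_POOL_SIZE;
        }
    }
}
```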
```java
    logger.debug("Sending semantic rerank request to: {}", inferenceEndpoint);
}

return httpClient.send(httpRequest, DEFAULT_TIMEOUT)
```

🟡 Recommendation · Cross-SDK: No Retry Logic

Zero retry logic; a single transient 429 or 500 fails permanently:

```java
return httpClient.send(httpRequest, DEFAULT_TIMEOUT)
    .flatMap(response -> parseResponse(response));
```

Both peer SDKs implement retry policies for the inference endpoint:
- Python: retries on 429/500 with backoff (TOTAL_RETRIES=3, RETRY_BACKOFF_FACTOR=0.8, RETRY_BACKOFF_MAX=120s).
- .NET: uses the standard HTTP handler retry pipeline.

Inference endpoints are external services subject to transient failures and rate limiting. Without retries, a single 429 (throttle) or 503 (transient) response causes a permanent failure that the caller must handle. Consider adding at least basic retry logic for retryable status codes (429, 500, 502, 503).
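A sketch of basic exponential backoff over the retryable codes, mirroring Python's constants. The HTTP call is abstracted as an IntSupplier returning a status code; in the real reactive pipeline this would be expressed with Reactor's retryWhen instead of a loop:

```java
import java.util.Set;
import java.util.function.IntSupplier;

// Hypothetical sketch: retry 429/500/502/503 with exponential backoff,
// using Python's TOTAL_RETRIES=3, factor 0.8, and 120s cap as assumptions.
public class InferenceRetry {
    static final Set<Integer> RETRYABLE = Set.of(429, 500, 502, 503);
    static final int TOTAL_RETRIES = 3;
    static final double BACKOFF_FACTOR = 0.8;
    static final long BACKOFF_MAX_MILLIS = 120_000;

    public static int sendWithRetry(IntSupplier request) {
        int status = request.getAsInt();
        for (int attempt = 0; attempt < TOTAL_RETRIES && RETRYABLE.contains(status); attempt++) {
            // 0.8s, 1.6s, 3.2s ... capped at 120s
            long backoff = (long) (BACKOFF_FACTOR * Math.pow(2, attempt) * 1000);
            try {
                Thread.sleep(Math.min(backoff, BACKOFF_MAX_MILLIS));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            status = request.getAsInt();  // retry the request
        }
        return status;
    }
}
```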
```java
}

@Beta(value = Beta.SinceVersion.V4_78_0)
public Mono<SemanticRerankResult> semanticRerank(
```

🟡 Recommendation · Cross-SDK: No Cancellation Support

No mechanism for callers to cancel or set custom timeouts:

```java
public Mono<SemanticRerankResult> semanticRerank(
    String rerankContext,
    List<String> documents,
    Map<String, Object> options)
```

The .NET API accepts `CancellationToken cancellationToken = default`, allowing callers to cancel in-flight requests. The Python SDK reads InferenceRequestTimeout from the connection policy, making timeouts configurable.

The Java async API returns Mono<> (callers can use .timeout() downstream), but the sync CosmosContainer.semanticRerank() blocks with a hardcoded 120-second timeout and no way to cancel. Consider either accepting a timeout/cancellation parameter or documenting the 120-second default prominently.
```java
}

@Beta(value = Beta.SinceVersion.V4_78_0)
public Mono<SemanticRerankResult> semanticRerank(
```

🟢 Suggestion · Cross-SDK: Undocumented Options

The options parameter has no documentation of valid keys. The Python SDK clearly documents all supported option keys in its docstring:
- return_documents (bool): whether to return document text in the response
- top_k (int): max documents to return
- batch_size (int): documents per batch
- sort (bool): whether to sort by relevance
- document_type (str): type of documents, "string" or "json"
- target_paths (str): JSON paths for extraction when document_type is "json"

The Java API just says Map<String, Object> options with no indication of what keys are valid. Consider adding Javadoc listing the supported option keys, their types, and defaults.
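A sketch of what that Javadoc could look like, with the key list taken from the Python docstring above; the class is a stub and the real method returns Mono&lt;SemanticRerankResult&gt;:

```java
import java.util.List;
import java.util.Map;

// Hypothetical Javadoc sketch; only the documentation shape matters here.
public class SemanticRerankDocsSketch {
    /**
     * Reranks {@code documents} against {@code rerankContext}.
     *
     * @param rerankContext the query or context to rank against
     * @param documents the documents to rerank
     * @param options optional settings; supported keys (types assumed from the
     *     Python SDK):
     *     {@code return_documents} (Boolean): include document text in the response;
     *     {@code top_k} (Integer): maximum documents to return;
     *     {@code batch_size} (Integer): documents per batch;
     *     {@code sort} (Boolean): sort by relevance;
     *     {@code document_type} (String): "string" or "json";
     *     {@code target_paths} (String): JSON paths when document_type is "json"
     * @return the reranked result (stubbed as String in this sketch)
     */
    public static String semanticRerank(String rerankContext, List<String> documents,
                                        Map<String, Object> options) {
        return "stub"; // placeholder body; the point of the sketch is the Javadoc
    }
}
```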
```java
private static final Duration DEFAULT_TIMEOUT = Duration.ofSeconds(120);
private static final String INFERENCE_ENDPOINT_PROPERTY = "AZURE_COSMOS_SEMANTIC_RERANKER_INFERENCE_ENDPOINT";
private static final ObjectMapper OBJECT_MAPPER = Utils.getSimpleObjectMapper();
public static final Duration DEFAULT_NETWORK_REQUEST_TIMEOUT = Duration.ofSeconds(5);
```

💬 Observation · Reliability: Timeout Configuration

Potential conflict between the 5-second network timeout and the 120-second operation timeout. The HttpClientConfig is configured with a 5-second network request timeout, but httpClient.send() is called with DEFAULT_TIMEOUT = Duration.ofSeconds(120). If the network request timeout governs the underlying connection/read timeout, inference calls taking longer than 5 seconds (which is expected for large document sets) would fail before the 120-second operation timeout ever applies. The Python SDK uses a configurable InferenceRequestTimeout (default 5s) for both connection_timeout and read_timeout; the intent is clearer there.

Worth verifying which timeout actually governs the behavior.
Description
This PR adds semantic reranking functionality to the Java SDK.
All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines