
Enhance deferred fields for production ETL reliability#168

Open
bosd wants to merge 110 commits into master from feature/production-ready-etl

Conversation


@bosd bosd commented Dec 21, 2025

Summary

This PR addresses critical issues with the deferred-fields feature to make odoo-data-flow production-ready for ETL operations.

Key Fixes

  • Fix deferred-fields matching - Handle both field and field/id formats correctly

    • Normalize field names in Pass 1 ignore filtering
    • Normalize field names in Pass 2 data preparation
  • Add XML-ID resolution for non-self-referencing fields - Support fields like responsible_id that reference other models (e.g., res.users)

    • Added _resolve_external_id_for_pass2() helper function
    • Tries multiple XML-ID variations (module.name, export.prefix, etc.)
  • Fix batch rejection error handling - Records no longer inherit the same error message

    • Added _extract_per_row_errors() to parse per-row errors from Odoo's response
    • Falls back to individual processing when batch has multiple failures
    • First failed record gets batch error, subsequent records get reference
  • Add binary field deferral support - Allow deferring image fields like image_1920

    • Non-relational fields are written directly in Pass 2 (base64 data)
  • Add --company-id CLI parameter - Simplify multicompany imports

    • Sets allowed_company_ids and force_company in context
  • Fix CLI deferred-fields parsing - Convert comma-separated string to list
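The deferred-fields matching fix above can be sketched with a small normalization helper. The names here are hypothetical (the PR's actual functions may differ); the point is stripping the `/id` suffix before comparing column names:

```python
def normalize_field(column: str) -> str:
    """Strip the external-ID suffix so 'parent_id/id' matches 'parent_id'."""
    return column[:-3] if column.endswith("/id") else column


def is_deferred(column: str, deferred_fields: list[str]) -> bool:
    """Match a CSV column against the deferred list in either format."""
    deferred = {normalize_field(f) for f in deferred_fields}
    return normalize_field(column) in deferred
```

With this, `is_deferred("parent_id/id", ["parent_id"])` and `is_deferred("parent_id", ["parent_id/id"])` both match, which is the behavior Pass 1 filtering and Pass 2 preparation need.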

Tested With

  • Local Odoo 18 instance
  • Verified Pass 1 correctly excludes deferred fields
  • Verified Pass 2 resolves XML-IDs and updates records
  • All 382 unit tests pass
  • Pre-commit, mypy, typeguard all pass

Test plan

  • Test with existing ETL scripts using --deferred-fields
  • Test fail mode with deferred fields
  • Test multicompany imports with --company-id
  • Test image deferral with --deferred-fields image_1920

bosd and others added 2 commits December 21, 2025 21:20
- Fix deferred-fields matching to handle both 'field' and 'field/id' formats
- Add XML-ID resolution for non-self-referencing deferred fields (e.g., responsible_id)
- Support binary field deferral for image imports (e.g., image_1920)
- Fix batch rejection to not inherit same error message for all records
  - Extract per-row errors from Odoo's response when available
  - Fall back to individual processing when batch has multiple failures
- Add --company-id CLI parameter for multicompany imports
  - Sets allowed_company_ids and force_company in context
- Add _extract_per_row_errors helper for parsing Odoo's error messages
- Add _resolve_external_id_for_pass2 helper for XML-ID resolution

These changes address critical issues with:
- Deferred fields not working in fail mode
- All batch records inheriting the same failure reason
- Cross-company field references causing import failures
- Large image imports overwhelming the server

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Convert deferred_fields CLI parameter from comma-separated string to list
- Fix ignore_list filtering in Pass 1 to handle both 'field' and 'field/id' formats
  - Normalize ignore_set to strip '/id' suffix before matching column names
- Verified working with local Odoo 18 instance:
  - Pass 1 correctly excludes deferred fields from initial import
  - Pass 2 successfully resolves XML-IDs and updates records

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@gemini-code-assist

Summary of Changes

Hello @bosd, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the 'deferred-fields' feature within the Odoo data flow tool, making it more robust and reliable for production ETL operations. The changes focus on improving data integrity, error handling, and usability, particularly for complex scenarios involving cross-model references, binary data, and multicompany setups.

Highlights

  • Deferred Field Matching: Improved handling for deferred fields, correctly parsing both 'field' and 'field/id' formats during data preparation and ignore filtering.
  • XML-ID Resolution: Introduced a mechanism to resolve XML-IDs for non-self-referencing fields (e.g., 'responsible_id' referencing 'res.users'), ensuring proper linking across different models.
  • Enhanced Batch Error Handling: Implemented granular error reporting for batch rejections, allowing individual records to receive specific error messages rather than inheriting a generic batch error, and falling back to individual processing when necessary.
  • Binary Field Deferral: Added support for deferring non-relational fields like binary data (e.g., 'image_1920'), which are now written directly in the second pass.
  • Multicompany CLI Support: Introduced a new '--company-id' CLI parameter to simplify imports in multicompany Odoo environments by setting appropriate context variables.
  • CLI Deferred Fields Parsing: Fixed the command-line interface to correctly parse comma-separated 'deferred-fields' strings into a list.





@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances the "deferred-fields" feature, improving its reliability for production ETL workloads through better field matching, external XML-ID resolution, granular error handling, and binary field support. However, a medium-severity vulnerability was identified in the error handling of the new XML-ID resolution logic, which could lead to incorrect data relationships and potential access control risks under specific failure conditions. Additionally, feedback includes suggestions for moving an import to the top level for style consistency, simplifying a complex connection probing block, and refining exception handling for better specificity and robustness.

Comment on lines +403 to +404
except Exception:  # noqa: S112
    continue


Severity: medium (security)

The broad except Exception: statement within the ID resolution loop is a medium-severity vulnerability. It can lead to incorrect record linking and potential access control issues by silently catching transient errors (e.g., network issues, DB locks) and proceeding with less specific variations. This could result in an external ID resolving to the wrong database record, linking data to an incorrect owner or parent. Beyond the security implications, broad exception handling also hides unexpected bugs and makes debugging difficult. It is recommended to catch more specific exceptions, such as OdooError, KeyError, ValueError, or IndexError, and only catch 'not found' errors, allowing other exceptions to fail loudly to prevent incorrect data mapping and improve debuggability.
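The recommendation above can be sketched as follows. This is an illustrative pattern, not the PR's code: `lookup` stands in for whatever resolves an XML-ID (here raising `KeyError` on a miss), and only "not found" advances to the next variation; everything else propagates:

```python
def resolve_xmlid(lookup, variations):
    """Try each XML-ID variation in order.

    Only a genuine miss (KeyError here) moves on to the next candidate;
    transient failures such as network errors or DB locks propagate
    instead of being silently swallowed.
    """
    for xmlid in variations:
        try:
            return lookup(xmlid)
        except KeyError:
            continue  # not found under this variation: try the next one
    return None
```

A `ConnectionError` raised by `lookup` would escape this loop, so a transient outage cannot silently cause the resolver to fall through to a less specific (and possibly wrong) variation.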

    Returns:
        A dictionary mapping row indices (0-based) to error messages.
    """
    import re


Severity: medium

For better code style and consistency, it's recommended to place all imports at the top of the file, as per PEP 8 guidelines. Moving import re to the top of import_threaded.py will improve readability and avoid re-importing the module on each function call.

Comment on lines +282 to +304
conn = None
for attr in ["connection", "client", "_connection", "_client"]:
    try:
        val = getattr(model_obj, attr, None)
        if val and not callable(val):
            conn = val
            break
        elif val and callable(val) and hasattr(val, "get_model"):
            conn = val
            break
    except Exception:  # noqa: S112
        continue

if conn:
    for method_name in ["model", "get_model"]:
        if hasattr(conn, method_name):
            try:
                method = getattr(conn, method_name)
                ir_model_data_proxy = method("ir.model.data")
                if ir_model_data_proxy:
                    break
            except Exception:  # noqa: S112
                continue


Severity: medium

This block for retrieving the ir.model.data proxy is overly complex and relies on probing several private attributes, which is fragile and can break with library updates. Since odoolib model objects typically store a reference to their connection, you can simplify this logic significantly.

A more direct approach is to access the connection object and call get_model on it. This is more readable, maintainable, and robust.

            conn = getattr(model_obj, "_connection", getattr(model_obj, "connection", None))
            if conn and hasattr(conn, "get_model"):
                try:
                    ir_model_data_proxy = conn.get_model("ir.model.data")
                except Exception:  # noqa: S112
                    pass

bosd and others added 26 commits December 21, 2025 23:15
Adds --auto-defer CLI flag that automatically defers all non-required
many2one fields to Pass 2. This enables progressive import where
records are created first and relational fields are populated
afterwards. Required many2one fields are NOT deferred as they must
succeed in Pass 1.

Usage: odoo-data-flow import --auto-defer --file data.csv --model res.partner
When records are created using the create() method (in fail mode or
when load() falls back to create()), XML IDs were not being persisted
to ir.model.data. This caused XML IDs to be missing after import.

Added _create_xmlid_entry() helper function that:
- Parses module and name from XML ID (uses __import__ for IDs without prefix)
- Creates or updates ir.model.data entry for each created record
- Handles edge cases like existing entries with different res_id

This ensures XML IDs are properly persisted regardless of whether
records are created via load() or create().
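The module/name parsing step described above can be sketched like this (illustrative helper name; it follows the commit's stated rule that IDs without a prefix fall under Odoo's `__import__` module):

```python
def split_xmlid(xmlid: str) -> tuple[str, str]:
    """Split an external ID into (module, name).

    'base.partner_root' -> ('base', 'partner_root'); an ID with no dot
    falls back to the '__import__' pseudo-module, mirroring Odoo's
    convention for records created by imports.
    """
    if "." in xmlid:
        module, name = xmlid.split(".", 1)
        return module, name
    return "__import__", xmlid
```

The resulting pair is what an `ir.model.data` create/update call would use for its `module` and `name` columns.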
…acks

Added new CLI options for better control over import behavior:

--on-missing-ref: Handle missing references per field
  - create: auto-create via name_create
  - skip: skip row (default)
  - empty: set field to False

--auto-create-refs: Auto-create all missing m2o references

--set-empty-on-missing: Set fields to empty on missing refs

--fallback-values: Default values for invalid selection/boolean fields

--tracking-disable/--tracking-enable: Control mail tracking (default: disabled)

--defer-parent-store: Defer parent store computation for hierarchies

These options map to Odoo's native import context parameters:
- name_create_enabled_fields
- import_set_empty_fields
- fallback_values
- defer_parent_store_computation
Performance optimizations:
- Remove hard-coded 4-thread connection cap in RpcThread
  Users can now specify higher --worker values based on server capacity
- Add LRU cache (100k entries) to to_xmlid() function
  Significantly speeds up repeated XML ID sanitizations
- Pre-calculate column filter indices before batch loop
  Ignore set and indices now computed once per batch, not per chunk

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
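The LRU-cache optimization is a standard `functools` pattern. The sanitization rule below is a stand-in (the real `to_xmlid` may normalize differently); what matters is that repeated inputs skip the regex work entirely:

```python
import re
from functools import lru_cache


@lru_cache(maxsize=100_000)
def to_xmlid(name: str) -> str:
    """Sanitize a value into a valid XML-ID fragment (illustrative rule)."""
    return re.sub(r"[^a-zA-Z0-9_]", "_", name.strip())
```

Cache effectiveness can be checked with `to_xmlid.cache_info()`, which reports hits and misses after an import run.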
Add protocol selection to import and export commands:
- --protocol option: xmlrpc, xmlrpcs, jsonrpc, jsonrpcs, json2, json2s
- Can also set protocol in connection config file
- JSON-RPC recommended for Odoo 10-18 (~30% faster than XML-RPC)
- JSON-2 supported for Odoo 19+ (requires API key)

Protocol is passed through odoolib which handles the actual connection.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The --ignore CLI option was not being converted from a comma-separated
string to a list before being passed to run_import(), causing a
TypeError when concatenating with deferred_fields list.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Configuration guide:
- Document all protocol options (xmlrpc, jsonrpc, json2)
- Add JSON-RPC performance recommendation for Odoo 10+
- Document JSON-2 API for Odoo 19+ with API key requirements
- Add CLI --protocol override example

Performance tuning guide:
- Add new "Choosing the Right Protocol" section
- Add protocol comparison table
- Add worker tuning section with db_maxconn formula
- Add warnings about connection pool exhaustion

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Verify that import correctly preserves:
- Unicode characters (Japanese, Chinese, Korean, emojis)
- Multiline values in text fields
- Tab characters
- Quoted strings

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add batch_delay parameter to control the pause between batch submissions
during imports. This helps prevent server overload and 503 errors when
importing large datasets.

- Add --delay CLI option (default: 0, recommended: 0.5-2.0 for busy servers)
- Propagate batch_delay through import_data and _orchestrate_pass_1
- Add delay between batch submissions in _run_threaded_pass
- Fix Python 3.14 compatibility for ValueError message format in test

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When the server returns 502/503 errors indicating overload, the importer
now automatically:
- Detects server overload conditions (502, 503, service unavailable)
- Adds increasing delays (up to 10 seconds) between batch submissions
- Gradually reduces the delay after successful batches
- Combines with user-specified --delay for total throttling

This helps prevent overwhelming busy servers and allows imports to
complete even under high load conditions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The progress bar was shifting because the RichHandler and Progress bar
use separate Console instances that compete for stdout. Added a context
manager `suppress_console_handler()` that temporarily disables the
RichHandler while a Progress bar is active.

Applied to all Progress bars in:
- import_threaded.py
- export_threaded.py
- write_threaded.py
- importer.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Exclude mapper.py (callable objects break introspection)
- Add write_threaded.py and tools.py to compilation
- Add usage documentation to setup.py docstring
- Add *.so to .gitignore

To build with mypyc:
  ODF_COMPILE_MYPYC=1 python setup.py build_ext --inplace

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add comprehensive tests for _extract_per_row_errors function
- Add tests for _filter_ignored_columns edge cases
- Add tests for _execute_write_batch success and failure paths
- Add tests for _execute_load_batch force_create, timeout, and pool errors
- Add tests for _format_odoo_error dict extraction
- Add tests for _create_batch_individually error handling
- Add tests for import_data with dict config
- Add tests for relational_import derivation and query functions
- Add tests for O2M tuple import edge cases
- Add tests for write tuple import edge cases

Coverage improved from 80.65% to 85.28%

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements streaming CSV processing that reads and processes data in
batches without loading the entire file into memory:

- Add _stream_csv_batches() generator that yields batches directly from file
- Add _count_csv_rows() for progress bar initialization
- Add _orchestrate_streaming_pass_1() for streaming import orchestration
- Add --stream CLI flag for enabling streaming mode
- Automatic fallback to standard mode when incompatible options are used
  (o2m, groupby, deferred_fields, force_create)

Streaming mode is ideal for very large CSV files where memory is a concern.
When enabled, the importer processes batches as they are read from disk,
significantly reducing peak memory usage.

Usage:
  odoo-data-flow import conn.conf data.csv --model res.partner --stream

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Checkpoint/Resume Support:
- Add checkpoint module for saving/restoring import progress
- Save checkpoint after Pass 1 completes with id_map
- Resume from checkpoint if Pass 1 was already completed
- Delete checkpoint on successful completion
- File hash check prevents resuming if data file changed
- CLI options: --resume/--no-resume, --no-checkpoint

Multi-Company Support:
- Add --all-companies flag to auto-set allowed_company_ids
- Fetches user's company_ids and sets context automatically
- Mimics Odoo web UI behavior for cross-company imports

Bug Fixes:
- Fix Pass 2 failures not being written to fail file
- Use sanitized IDs in source_data_map to match id_map keys

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add --dry-run option to validate CSV data before importing:
- Checks required fields are populated
- Validates selection field values against allowed values
- Verifies relational references exist in Odoo
- Displays formatted validation results with error summary

New validation module:
- ValidationError and ValidationResult dataclasses
- Reference checking for both external IDs and database IDs
- Caching of reference lookups for performance
- Formatted output with rich panels

Usage: odoo-data-flow import --dry-run --file data.csv --model res.partner

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add --check-refs option to verify relational references before import:
- Scans CSV for all many2one/many2many references
- Batch-checks external IDs and database IDs against Odoo
- Reports missing references with examples

Options:
- --check-refs=fail: Abort import if references missing (strict mode)
- --check-refs=warn: Show warning but continue (default)
- --check-refs=skip: Skip the reference check entirely

This helps catch missing reference data early, avoiding partial
imports that fail mid-way through processing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add intelligent error categorization and retry strategies:

Error Categories:
- Transient: Timeouts, 502/503, deadlocks, connection pool - will retry
- Permanent: Constraint violations, access denied - fail immediately
- Recoverable: Missing references, company issues - suggest alternatives

Features:
- Exponential backoff with configurable base delay and max delay
- Jitter to prevent thundering herd effect
- Retry statistics tracking
- Helper functions for retry decisions
- Recommendations for error handling

Usage:
- categorize_error(error) -> (ErrorCategory, pattern)
- retry_with_backoff(func, config, stats) -> (result, error)
- get_retry_recommendation(error) -> dict with action/message

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
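The backoff-with-jitter strategy described above is the standard "full jitter" pattern; a minimal sketch (the exact knobs of `retry_with_backoff` may differ):

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter.

    The nominal delay grows as base * 2**attempt and is capped; taking a
    uniform random fraction of it spreads retries out so many workers do
    not hammer the server in lockstep (the thundering-herd effect).
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

A caller would sleep for `backoff_delay(attempt)` after each transient failure and re-raise once a maximum attempt count is exceeded.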
Add functionality for skip-unchanged record detection:

Features:
- Normalize values for comparison (handles False, empty strings, m2o tuples)
- Compare source values with existing Odoo records
- Filter out unchanged rows before import
- Track statistics (new, changed, unchanged, skip rate)

Key functions:
- get_existing_records(): Fetch records from Odoo by external ID
- find_unchanged_records(): Identify unchanged records from dict data
- filter_unchanged_rows(): Filter unchanged rows from list data
- display_idempotent_stats(): Show import statistics

This module enables imports to be run multiple times safely, only
importing records that have actually changed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
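The normalization step is the crux of unchanged-record detection: CSV cells, `False`, and many2one `(id, display_name)` tuples must collapse to comparable values. A rough sketch (illustrative rules, not the module's exact ones):

```python
def normalize(value):
    """Collapse Odoo's 'empty' and many2one representations for comparison."""
    if value is None or value is False or value == "":
        return ""
    if (isinstance(value, (list, tuple)) and len(value) == 2
            and isinstance(value[0], int)):
        return value[1]  # many2one reads back as (id, display_name)
    return str(value).strip()


def row_unchanged(source: dict, existing: dict) -> bool:
    """True when every source field matches the existing record after normalization."""
    return all(normalize(existing.get(k)) == normalize(v) for k, v in source.items())
```

Rows for which `row_unchanged` is true are filtered out before import, which is what makes repeated runs effectively idempotent.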
Add adaptive throttling based on server response times:

Server Health Levels:
- HEALTHY: Normal operation, no throttling
- DEGRADED: Slight slowdown, add small delays
- STRESSED: Significant load, reduce batch sizes
- OVERLOADED: Critical, aggressive throttling

Features:
- Rolling average response time monitoring
- Automatic delay adjustment between requests
- Dynamic batch size scaling based on health
- Hysteresis for health recovery (prevents flapping)
- Error recording for server errors (5xx)
- Comprehensive statistics tracking

Configuration:
- Customizable thresholds for each health level
- Configurable delays and batch multipliers
- Aggressive mode for sensitive servers

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
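The health-level classification can be sketched with a rolling average over recent response times. The thresholds below are invented for illustration; the real controller's values are configurable:

```python
from collections import deque


class HealthMonitor:
    """Classify server health from a rolling average of response times."""

    def __init__(self, window: int = 20):
        self.samples: deque = deque(maxlen=window)  # only the last N samples count

    def record(self, seconds: float) -> None:
        self.samples.append(seconds)

    @property
    def level(self) -> str:
        if not self.samples:
            return "HEALTHY"
        avg = sum(self.samples) / len(self.samples)
        if avg < 1.0:
            return "HEALTHY"
        if avg < 3.0:
            return "DEGRADED"
        if avg < 8.0:
            return "STRESSED"
        return "OVERLOADED"
```

The throttle controller would then map each level to a delay and a batch-size multiplier, with hysteresis on the way back down so the level does not flap between batches.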
Complete integration of the remaining 3 stability features:

1. **Smarter Retry Logic** - Integrated into error handling:
   - Uses ErrorCategory enum to classify errors as transient/permanent
   - Exponential backoff with jitter for server overload (502/503)
   - Database serialization conflict handling with backoff

2. **Idempotent Import Mode** (`--skip-unchanged`):
   - Fetches existing records from Odoo before import
   - Compares field values to detect unchanged records
   - Skips records that haven't changed, making imports idempotent
   - Reports skip statistics in final output

3. **Health-Aware Throttling** (`--adaptive-throttle`):
   - ThrottleController monitors server response times
   - Automatically adjusts delays based on server health
   - Records timing after each batch load operation
   - Reports throttle statistics at end of import

All 597 tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This adds a comprehensive workflow for managing VAT validation during
contact imports, addressing VIES API timeouts in large imports.

Features:
- Local VAT format validation with regex patterns for all EU countries
- Checksum validation for BE, DE, NL
- Support for custom validators (e.g., Rust-based via PyO3)
- Save/restore VAT validation settings across companies
- Disable both VIES (online) and stdnum (local) validation
- Batch VIES validation with user notifications

CLI commands:
- vat get-settings: Display current VAT validation settings
- vat disable: Disable VAT validation, save settings to JSON
- vat restore: Restore settings from JSON file
- vat validate: Batch VIES validation with notifications

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add VIES/VAT Manager to API reference (autodoc)
- Add Module Manager to API reference (autodoc)
- Add comprehensive VAT Validation Management guide section
- Include CLI usage examples, programmatic usage, and custom validators

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add return type annotations to test functions
- Fix S110: Add logging to try-except-pass blocks
- Fix C901: Add noqa comments for complex functions
- Fix D417: Add missing docstring parameter descriptions
- Fix E501: Break long lines
- Fix RUF059: Remove/rename unused variables
- Use Optional[str] instead of str | None for Python 3.9 compatibility
- Replace assert type narrowing with conditional checks

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- validation.py: Cast search_count comparisons to bool explicitly
- idempotent.py: Rename loop variable to avoid redefinition
- preflight.py: Cast check_refs comparisons to bool explicitly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enables parsing date/datetime columns with custom formats using Polars'
vectorized str.to_date() and str.to_datetime() for efficient conversion.

Example usage:
    processor = Processor(
        mapping={},
        dataframe=df,
        date_formats={"birth_date": "%d/%m/%Y"},
        datetime_formats={"created_at": "%d/%m/%Y %H:%M:%S"},
    )

This provides an alternative to Polars' automatic date detection
(try_parse_dates=True) for cases where explicit format control is needed,
such as ambiguous date formats (DD/MM vs MM/DD).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Addresses timeout issues when using --move-date with large inventory
imports on production databases:

1. Longer timeout for post-action (10 minutes)
   - Uses socket.setdefaulttimeout() for RPC calls
   - Handles timeout/connection errors gracefully
   - Returns success even on timeout (server may have completed)

2. Extract product IDs before post-action
   - New _get_product_ids_from_quants() helper function
   - Product IDs captured while connection is reliable
   - Allows move identification even after timeout

3. Time window fallback (2 hours)
   - Replaces exact timestamp filtering
   - Finds moves by product + inventory location + recent create_date
   - Handles cases where server completes after client timeout

4. Added diagnostic logging
   - Logs when product IDs are extracted and how many
   - Warns when move date update is skipped (no products or failed post-action)
   - Helps troubleshoot issues in production

5. Comprehensive test coverage
   - Tests for timeout handling
   - Tests for product ID extraction
   - Tests for move date update flow
   - Tests for edge cases (empty products, failed post-action)

6. Updated documentation
   - Explains timeout handling behavior
   - Added troubleshooting entries for timeout scenarios

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@bosd force-pushed the feature/production-ready-etl branch from 1cf2e5b to c2df94b on January 21, 2026, 16:06
bosd and others added 5 commits January 21, 2026 21:14
When reading CSV files with polars, the library infers column types
by examining the first N rows. If a column like 'default_code' has
numeric values in early rows and alphanumeric values later (e.g.,
"eWB0071-ASSY-11"), polars would infer it as integer and fail.

This was causing errors in fail mode imports:
  "Could not read csv header: could not parse: "eWB0071-ASSY-11"
   as dtype `i64` at column `default_code`"

Fixed by adding `infer_schema_length=0` to all pl.read_csv calls
in preflight.py and importer.py. This forces polars to read all
columns as strings, which is the correct behavior for a data
import tool where we don't need type inference.

Files fixed:
- src/odoo_data_flow/lib/preflight.py (3 occurrences)
- src/odoo_data_flow/importer.py (1 occurrence)

Note: sort.py already had this fix.
Addresses stability issues when importing to remote Odoo servers with
limited workers (e.g., single worker hosting):

1. Added new transient error patterns for server crash detection:
   - JSONDecodeError / "expecting value" (empty response)
   - "empty response", "incomplete read", "eof occurred"
   - "broken pipe", "connection aborted", "remotedisconnected"
   - "500" internal server error
   - "server closed connection"

2. Enhanced server overload detection in import_threaded.py:
   - Expanded pattern matching for crash indicators
   - Longer backoff for likely crashes (5s base, up to 120s max)
   - Standard backoff for overload (1s base, up to 60s max)
   - Clear messaging: "Server crash/empty response" vs "Server overload"

3. Added tests for new error patterns:
   - test_categorize_transient_json_decode_error
   - test_categorize_transient_empty_response
   - test_categorize_transient_connection_reset
   - test_categorize_transient_broken_pipe
   - test_categorize_transient_500_error

These changes help the tool automatically recover when the Odoo server
crashes or restarts during large imports, which is common with single
worker configurations.
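The pattern-matching half of this detection is simple substring checks against a lowercased error message. A sketch in that spirit (the real categorizer returns an `ErrorCategory` enum rather than a bool, and its pattern list is longer):

```python
TRANSIENT_PATTERNS = (
    "expecting value",    # JSONDecodeError from an empty response body
    "empty response",
    "incomplete read",
    "eof occurred",
    "broken pipe",
    "connection aborted",
    "remotedisconnected",
    "server closed connection",
    "502",
    "503",
    "500",
)


def is_transient(error_message: str) -> bool:
    """True when the error looks like a crash/overload worth retrying."""
    msg = error_message.lower()
    return any(pattern in msg for pattern in TRANSIENT_PATTERNS)
```

Transient matches feed the backoff logic; anything else fails the batch immediately so real data errors are not retried pointlessly.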
Changed adaptive_throttle default from False to True across:
- CLI (--adaptive-throttle/--no-adaptive-throttle)
- import_threaded.py
- importer.py

Since adaptive throttling only adds delays when server response times
degrade, there's minimal overhead for fast servers. For production
imports to remote servers (especially with limited workers), this
provides automatic protection against server overload.

Users who want maximum speed on local/powerful servers can use
--no-adaptive-throttle to disable it.
- Fix E501 line length issues throughout codebase
- Add noqa: C901 comments for complex functions
- Add missing docstring argument for connection parameter
- Fix test type annotations (Optional[dict] for context param)
- Fix test formatting issues
- Sort __all__ exports alphabetically

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The vies_manager module was incorrectly trying to parse INI-style config
files as YAML. This fix:
- Uses configparser (stdlib) to match conf_lib.py's approach
- Removes the unnecessary pyyaml dependency
- Updates the test to use INI format

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@bosd force-pushed the feature/production-ready-etl branch from 05e031a to 571e559 on January 23, 2026, 14:19
bosd and others added 22 commits January 23, 2026 19:20
- Add explicit `return None` for early returns in run_import()
- Update _get_env_from_config() to accept Optional config parameter

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive tests across multiple modules to reach the 85% coverage
threshold. Key areas covered include checkpoint cleanup, phone normalization,
config file handling, throttle controller, retry logic, validation edge cases,
and various expression functions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Windows defaults to the cp1252 encoding, which cannot handle Cyrillic
characters in the geonames test data. Explicitly specifying UTF-8
encoding in all write_text() calls fixes the UnicodeEncodeError on
Windows CI.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When importing related models like res.partner.bank, tracking_disable
alone doesn't prevent chatter messages on the parent res.partner record.

Added additional Odoo context keys:
- mail_create_nolog: Don't log record creation
- mail_notrack: Don't track field changes
- mail_activity_automation_skip: Skip activity automation

These flags are now set automatically when tracking_disable is True,
ensuring complete suppression of mail/chatter messages during imports.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
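The context keys above can be sketched as a small helper. The flag names are real Odoo context keys; the `build_import_context` helper itself is illustrative, not the project's actual API.

```python
def build_import_context(tracking_disable: bool) -> dict:
    """Sketch of the import context described above; helper name is
    hypothetical, the context keys are standard Odoo mail flags."""
    context: dict = {}
    if tracking_disable:
        context.update(
            tracking_disable=True,               # skip mail.thread tracking
            mail_create_nolog=True,              # no "record created" log
            mail_notrack=True,                   # no field-change tracking
            mail_activity_automation_skip=True,  # skip activity automation
        )
    return context
```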
Previously, post-actions like action_apply_inventory were only executed
when the import was fully successful. This caused stock.quant inventory
adjustments to remain in draft state when any records failed.

Changes:
- run_import now returns the id_map on partial failure instead of None,
  preserving the successfully imported record IDs
- import_cmd now runs the post-action whenever import_result is not None
  (i.e., whenever the import process ran, even with partial failures)
- Only critical failures (process crash) skip the post-action
- Added "Import Partially Complete" panel showing success/failure counts
The export was incorrectly handling many2many fields with /.id format,
returning only the first ID instead of all IDs. This was because both
many2one (id, name) tuples and many2many [id1, id2, ...] lists were
treated identically.

Now properly differentiates:
- many2one: extracts single ID from (id, name) tuple
- many2many/one2many: joins all IDs with comma separator

Also fixes the field type inference to use 'char' for many2many /.id
fields (comma-separated string) vs 'integer' for many2one.
Odoo returns [id, name] lists (not tuples) for many2one fields.
The fix now properly distinguishes:
- many2one: [id, display_name] -> extract just the ID
- many2many/one2many: [id1, id2, ...] -> join with comma
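The distinction can be sketched as follows; the helper name and CSV-string output shape are assumptions, but the input shapes match what Odoo RPC returns: `[id, display_name]` for many2one, `[id1, id2, ...]` for many2many/one2many.

```python
def relation_to_csv(value, relation_type: str) -> str:
    """Illustrative sketch of the many2one vs many2many handling
    described above (helper name is hypothetical)."""
    if not value:  # Odoo returns False for an empty relation
        return ""
    if relation_type == "many2one":
        return str(value[0])                    # [id, display_name] -> "id"
    return ",".join(str(i) for i in value)      # [id1, id2] -> "id1,id2"
```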
The hybrid export mode was only handling many2one fields for XML ID
enrichment. Many2many and one2many fields returned incorrect results
because the code assumed a (id, name) tuple format instead of a list
of IDs.

Changes:
- Store relation_type (many2one/many2many/one2many) in fields_info
- Pass relation_type to enrichment tasks
- Rewrite _enrich_with_xml_ids to handle both field types:
  - many2one: single XML ID from (id, name) tuple
  - many2many/one2many: comma-separated XML IDs from [id1, id2, ...] list
- Records without XML IDs are excluded from the output (not null placeholders)

Added tests:
- test_export_hybrid_mode_many2many_xml_ids: basic many2many /id export
- test_export_hybrid_mode_many2many_partial_xml_ids: some records lack XML IDs
- test_export_hybrid_mode_many2many_empty: empty many2many returns None
- test_export_many2many_xml_ids_to_file: e2e test with file output
- test_export_one2many_xml_ids: one2many field handling
The 'force_company' context key is deprecated in Odoo 18 and causes
warnings in the server logs. The modern approach is to use only
'allowed_company_ids' which is supported in Odoo 13+.

Note: .with_company(ID) is a Python ORM method that cannot be called
via RPC - it internally sets context keys. For RPC calls,
allowed_company_ids is the correct approach.
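A minimal sketch of building such an RPC context, assuming a hypothetical `company_context` helper; `allowed_company_ids` is the real Odoo context key, and its first entry acts as the active company.

```python
from typing import Optional


def company_context(company_id: int, extra: Optional[dict] = None) -> dict:
    """Sketch of a multicompany RPC context using only
    allowed_company_ids (supported in Odoo 13+); helper is illustrative."""
    ctx = dict(extra or {})
    # The first ID in allowed_company_ids is treated as the active company.
    ctx["allowed_company_ids"] = [company_id]
    return ctx
```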
Implement intelligent batch splitting based on estimated payload size
to prevent server timeouts when importing records with large binary
fields like images.

Changes:
- Add _estimate_payload_size() and _estimate_row_size() helper functions
- Add DEFAULT_MAX_BATCH_BYTES constant (5MB default)
- Update _stream_csv_batches() to split batches when size limit exceeded
- Update _orchestrate_pass_2() to use size-based super-batch aggregation
- Add --max-batch-bytes CLI option to import command

Both Pass 1 (load) and Pass 2 (write deferred fields) now respect
the size limit. Batches are split when either the record count OR
the payload size exceeds the configured limits.

This fixes timeouts during product template imports with large images
where a batch of 10 records could result in 50MB+ payloads.
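The dual-limit splitting can be sketched like this. The generator name is hypothetical and the size estimate is a naive sum of string lengths; the real `_estimate_row_size()` differs, but the flush-on-either-limit logic is the behavior described above.

```python
DEFAULT_MAX_BATCH_BYTES = 5 * 1024 * 1024  # 5MB default, per the commit


def split_batches(rows, max_records, max_bytes=DEFAULT_MAX_BATCH_BYTES):
    """Sketch of dual-limit batching: flush when either the record
    count or the estimated payload size would be exceeded."""
    batch, size = [], 0
    for row in rows:
        row_size = sum(len(str(v)) for v in row)  # naive size estimate
        if batch and (len(batch) >= max_records or size + row_size > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(row)
        size += row_size
    if batch:
        yield batch
```

With this scheme, a batch of 10 records each carrying a 5MB base64 image is split into ten single-record payloads instead of one 50MB request.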
- Add max_batch_bytes parameter to _orchestrate_streaming_pass_1
- Add docstring documentation for max_batch_bytes in import_data
- Fix unused variable warnings in tests (prefix with underscore)
- Shorten long docstrings to comply with line length limit
- Add noqa: C901 to complex functions in export_threaded
- Add type annotations to nested test functions
- Add 'assert error is not None' before using error in string operations
- Fix MockColumn dtype annotation to use type[pl.DataType]
- Add type annotation for rows list in test_idempotent
- Change output parameter in run_export to Optional[str]
- Fix typeguard issue by using intermediate Any-typed variable for json.loads
- Import Any in test_vies_manager
The Pass 2 deferred field update was passing single integer IDs for
many2many fields, causing Odoo ValueError. Odoo requires list format
[id] or command format [(6, 0, [ids])] for many2many writes.

Changes:
- Added field type detection using model.fields_get() to identify m2m
- Implemented proper value wrapping with [(6, 0, [ids])] command format
- Added handling for comma-separated multiple values
- Added comprehensive unit tests for m2m Pass 2 handling

This fixes the ValueError: "Wrong value for product.template.accessory_product_ids"
error during product template imports with accessory/optional product relations.
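The value wrapping can be sketched as follows. The helper name is hypothetical; the `(6, 0, [ids])` replace command is Odoo's real write format for many2many fields.

```python
def wrap_for_write(field_type: str, raw):
    """Sketch of the Pass 2 value wrapping described above: many2many
    writes need the (6, 0, [ids]) replace command, many2one takes a
    single integer. `raw` may hold comma-separated database IDs."""
    if field_type in ("many2many", "one2many"):
        ids = [int(v) for v in str(raw).split(",") if v.strip()]
        return [(6, 0, ids)]  # replace the whole relation with `ids`
    return int(raw)
```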
The grouping logic now recursively converts nested lists inside tuples
to tuples so the values are hashable. The reverse conversion was also
improved to properly restore Odoo's m2m command format [(6, 0, [ids])].
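A sketch of that round-trip, with hypothetical helper names; the command-detection heuristic (a 3-tuple whose first element is an int) is an assumption of this sketch, not necessarily how the project detects commands.

```python
def to_hashable(value):
    """Recursively convert lists/tuples to tuples so a value like
    [(6, 0, [ids])] can be used as a dict key for grouping."""
    if isinstance(value, (list, tuple)):
        return tuple(to_hashable(v) for v in value)
    return value


def from_hashable(value):
    """Reverse conversion restoring the m2m command shape: containers
    become lists again, except 3-tuples that look like Odoo commands."""
    if isinstance(value, tuple):
        restored = [from_hashable(v) for v in value]
        # Heuristic: a 3-tuple starting with an int command code is
        # an Odoo relational command and must stay a tuple.
        if len(value) == 3 and isinstance(value[0], int):
            return (restored[0], restored[1], restored[2])
        return restored
    return value
```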
- Track serialization errors in failed_lines instead of silently dropping
- Add logging for malformed rows in streaming mode
- Add reconciliation check comparing total vs (created + failed)
- Display warning panel when records are unaccounted for
- Add failed_records and unaccounted_records to import stats

Also fixes:
- Python 3.9 compatibility in test_geonames.py (Path | None -> Optional)
- Remove broken test file with non-existent function imports
- Update test for serialization error behavior change

Closes #178

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add explicit dict[str, Any] type annotation to fix mypy error where
update_vals holds both int values (many2one) and list[tuple] values
(many2many commands).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When deferred field values are not in id_map (which only contains
records from the current model import), the code now checks if the
value looks like an XML ID (contains a dot separator like module.name)
and tries to resolve it via ir.model.data.

This fixes the issue where cross-model references such as:
- user_id referencing res.users
- state_id referencing res.country.state
- property_purchase_currency_id referencing res.currency

were not being resolved because they were not in the id_map built
during Pass 1 (which only contains res.partner records in this case).

The fix applies to both many2one and many2many deferred fields.

Closes #179

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
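The fallback logic can be sketched as follows. The function name and the injected `xmlid_lookup` callable (standing in for the ir.model.data RPC call) are assumptions of this sketch.

```python
from typing import Callable, Optional


def resolve_deferred_value(
    value: str,
    id_map: dict,
    xmlid_lookup: Callable[[str, str], Optional[int]],
) -> Optional[int]:
    """Sketch of the fallback described above: values missing from the
    current model's id_map are treated as XML IDs when they contain a
    dot, and resolved via ir.model.data (here: `xmlid_lookup`)."""
    if value in id_map:
        return id_map[value]  # record imported in Pass 1 of this model
    if "." in value:
        module, _, name = value.partition(".")  # "module.name" XML ID
        return xmlid_lookup(module, name)
    return None
```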
Add TestPreparePass2DataCrossModelResolution class with 4 tests:
- many2one cross-model reference resolution via ir.model.data
- XML ID resolution for columns without /id suffix
- many2many cross-model reference resolution
- verification that non-XML ID values are used directly

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
#180 - Fix nested fail file directory
When source file is in a directory matching the env_name (e.g.,
data/prod/file.csv with prod_connection.conf), no longer creates
nested data/prod/prod/ directory.

#181 - Better error messages for existing records
Added detection for "already exists" patterns (duplicate key, unique
constraint, circular references). Error messages now suggest using
--skip-existing flag.

#182 - Stop accumulating timestamped fail files
Fail files now always use the same name (model_fail.csv) and get
overwritten instead of creating timestamped copies.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Odoo returns datetime strings in format '2026-02-27 05:38:37' (space
separator), but Polars cast(Datetime, strict=False) cannot parse this
format and silently returns null.

Changed ODOO_TO_POLARS_MAP to keep date/datetime fields as strings,
preserving the values throughout the export process.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
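For reference, the Odoo serialization format is the standard strptime pattern with a space separator; a minimal stdlib sketch (helper name is illustrative):

```python
from datetime import datetime

ODOO_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # space separator, no 'T'


def parse_odoo_datetime(value: str) -> datetime:
    """Parse Odoo's space-separated datetime strings explicitly,
    avoiding parsers that silently fail on this format."""
    return datetime.strptime(value, ODOO_DATETIME_FORMAT)
```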
…187)

Added --sanitize-newlines flag to export command that optionally replaces
embedded newlines in text/char/html fields with a configurable delimiter.
This prevents CSV corruption when text fields contain embedded newlines.

Default behavior: newlines are preserved (no sanitization)
With flag: newlines replaced with specified string (e.g., " | ")

Changes:
- Added sanitize_newlines() function to clean_expr.py
- Added sanitize_newlines parameter to _clean_and_transform_batch()
- Added --sanitize-newlines CLI flag to export command
- Added 15 unit tests for newline sanitization

Usage:
  odoo-data-flow export --sanitize-newlines " | " ...
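The behavior can be sketched like this; the signature and the handling of CR/LF variants are assumptions of this sketch, not necessarily the exact implementation in clean_expr.py.

```python
def sanitize_newlines(text, replacement=" | "):
    """Sketch of the sanitization described above: collapse embedded
    newline runs in a text field into a single delimiter so the value
    stays on one CSV row."""
    if not text:
        return text
    # Normalize CRLF/CR to LF first, then join the non-empty segments.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    parts = [p for p in normalized.split("\n") if p != ""]
    return replacement.join(parts)
```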
- Add explicit type annotations for dict[str, pl.DataType] in test files
  to fix mypy covariance issues with polars DataType classes
- Remove unused imports (datetime) from importer.py and writer.py
- Format test assertions to comply with line length limits