Add Google Voice Takeout import support#225

Closed
sternryan wants to merge 1 commit into wesm:main from sternryan:gvoice-upstream

@sternryan
Contributor

Summary

  • New sync-gvoice command that imports SMS, MMS, and call records from Google Takeout Voice exports
  • Follows the established adapter pattern, implementing the gmail.API interface for plug-and-play integration
  • Deterministic message IDs via SHA-256 for idempotent re-imports
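
As an illustration of the idempotency point, here is a minimal sketch of deriving a message ID from a SHA-256 digest. The function name deterministicID and the choice of hashed fields (thread key, timestamp, body) are assumptions for this example, not the PR's actual scheme.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// deterministicID is a hypothetical helper: it hashes the given fields so
// that importing the same message twice yields the same ID. The fields are
// joined with a NUL byte so that ("ab", "c") and ("a", "bc") hash differently.
func deterministicID(parts ...string) string {
	sum := sha256.Sum256([]byte(strings.Join(parts, "\x00")))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Assumed inputs: a thread key, an RFC 3339 timestamp, and the body text.
	id := deterministicID("+15550100", "2024-01-02T03:04:05Z", "hello")
	fmt.Println(id)
}
```

Because the ID depends only on the message content and metadata, a second run over the same Takeout export can detect and skip rows it has already stored.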

New files

  • internal/gvoice/client.go - gmail.API implementation over Takeout HTML
  • internal/gvoice/parser.go - HTML conversation parser
  • internal/gvoice/models.go - conversation and message types
  • internal/gvoice/parser_test.go - parser tests
  • cmd/msgvault/cmd/sync_gvoice.go - CLI command

Usage

msgvault sync-gvoice --takeout-dir ~/path/to/Takeout/Voice

Performance

  • Indexes ~120k entries from ~50k files in ~6 seconds
  • Full import at ~1,500 messages/sec

Implements sync-gvoice command that imports SMS, MMS, and call records
from a Google Takeout Voice export. Follows the established adapter
pattern from the iMessage integration, implementing gmail.API interface
to plug into the existing sync infrastructure.

Key features:
- Parses HTML conversation files for text messages and call logs
- Handles 1:1 texts, group conversations, and call records
- Deterministic message IDs via SHA-256 for idempotent re-imports
- Indexes ~120k entries from ~50k files in ~6 seconds
- Full import at ~1,500 messages/sec
@roborev-ci

roborev-ci bot commented Mar 26, 2026

roborev: Combined Review (cb5595d)

Verdict: The PR successfully introduces the Google Voice Takeout parser and sync pipeline, but requires crucial fixes for potential panics, data races, and an O(N²) performance bottleneck before merging.

High Severity

  • Location: internal/gvoice/client.go:153 (in buildIndex)
    Problem: The error path dereferences entry.ID after c.indexCallFile(...) returns an error. In that case, entry is nil, so a single malformed or unsupported call HTML file will panic the entire import instead of being skipped.
    Fix: Log the filename/path already in scope instead of entry.ID, or guard the dereference before logging.

  • Location: internal/gvoice/client.go (in GetMessagesRawBatch)
    Problem: Continuing after a GetMessageRaw error leaves nil pointers in the pre-allocated results slice. If callers iterate over the returned slice and assume all items are valid, it will trigger a panic.
    Fix: Initialize results with zero length (make([]*gmail.RawMessage, 0, len(messageIDs))) and append messages on success.

Medium Severity

  • Location: internal/gvoice/client.go (in getCachedMessages and buildIndex)
    Problem: The file cache (lastFilePath, lastMessages) and the lazy index initialization (c.indexBuilt) lack mutex protection. If the syncer fetches message batches concurrently, this will cause a data race.
    Fix: Protect the cache fields with a sync.Mutex and use sync.Once for buildIndex.

  • Location: internal/gvoice/client.go (in GetMessageRaw)
    Problem: A linear scan over c.index is performed for every message fetched. Doing this for a large number of messages yields O(N²) complexity, which will cause severe CPU stalling during large syncs.
    Fix: Populate a map[string]*indexEntry during buildIndex() to allow O(1) lookups by ID.

  • Location: internal/gvoice/client.go:219 and internal/gvoice/client.go:425
    Problem: For 1:1 conversations, the code derives the other participant only by scanning for a non-Me message. If a Takeout thread contains only outbound messages, indexTextFile falls back to the owner's own number for the thread ID and buildTextMessage emits no recipient at all. That can merge unrelated sent-only threads together and lose addressing metadata.
    Fix: Use the conversation/file metadata (contactName, filename, or parsed conversation header) as the fallback participant when no inbound message exists, and add a test for sent-only threads.

  • Location: internal/gvoice/client.go:449 and internal/gvoice/client.go:461
    Problem: MMS attachments are detected during HTML parsing, but buildTextMessage only adds the mms label and generates a plain-text MIME body with no attachment parts. The import drops all image/video content from MMS messages.
    Fix: Build multipart MIME that includes the referenced media files (or persist them through the attachment pipeline) and cover this with tests for image/video MMS messages.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

wesm pushed a commit that referenced this pull request Apr 1, 2026
Import Google Voice history from Google Takeout exports into msgvault.
Parses HTML conversation files, VCF contacts, and call logs from the
Takeout directory structure.

Includes:
- Takeout directory parser for texts, voicemails, and calls
- HTML conversation parser with timestamp and participant extraction
- VCF contact parser for Google Voice number detection
- CLI command (sync-gvoice) with conversation deduplication
- Parser tests for HTML and VCF extraction

Co-Authored-By: Ryan Stern <206953196+vanboompow@users.noreply.github.com>
wesm added a commit that referenced this pull request Apr 1, 2026
Design for merging WhatsApp (#160), iMessage (#224), and Google Voice
(#225) import implementations into a coherent system with shared
phone-based participant model, proper schema usage, and dedicated TUI
Texts mode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@wesm
Owner

wesm commented Apr 1, 2026

Superseded by #238 which squash-merges this PR and refactors it alongside WhatsApp (#160) and iMessage (#224) into a unified text message import system.

@wesm wesm closed this Apr 1, 2026
wesm added a commit that referenced this pull request Apr 2, 2026

## Summary

Unifies three independent text message import implementations into a
coherent system with a shared database schema, phone-based participant
model, and dedicated TUI Texts mode.

Supersedes #160 (WhatsApp), #224 (iMessage), #225 (Google Voice) —
squash-merged all three, refactored for consistency, and built a shared
foundation. Original contributor commits preserved.

## Import commands

```
msgvault import-whatsapp <msgstore.db> --phone +1...
msgvault import-imessage [--me +1...]
msgvault import-gvoice <takeout-dir>
```

Deprecated `import --type whatsapp` alias kept for backward
compatibility.

## Shared foundation

- `NormalizePhone` E.164 utility with international format support
(`00`-prefix, trunk `(0)`, whitespace)
- `EnsureParticipantByPhone(phone, displayName, identifierType)` —
cross-source phone-based participant dedup
- `RecomputeConversationStats` — idempotent post-import stats
recomputation
- `LinkMessageLabel` — single-label convenience wrapper
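
To make the E.164 handling concrete, here is a rough sketch of the three cases listed for `NormalizePhone` (`00` prefix, trunk `(0)`, whitespace). The function `normalizePhoneSketch` is hypothetical and is not msgvault's implementation; it assumes the input already carries a country code and rejects anything that does not.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizePhoneSketch is a hypothetical illustration, not msgvault's
// NormalizePhone. It assumes the input includes a country code, written
// with either a leading "+" or the international "00" prefix.
func normalizePhoneSketch(raw string) (string, error) {
	// Drop a parenthesized trunk zero, as in "+44 (0)20 ...".
	s := strings.ReplaceAll(raw, "(0)", "")

	// Keep digits, plus a "+" only if it is the first kept character.
	// Whitespace, dashes, and parentheses are discarded.
	var b strings.Builder
	for _, r := range s {
		switch {
		case r >= '0' && r <= '9':
			b.WriteRune(r)
		case r == '+' && b.Len() == 0:
			b.WriteRune(r)
		}
	}
	out := b.String()

	// Treat a leading "00" as the international call prefix.
	if strings.HasPrefix(out, "00") {
		out = "+" + out[2:]
	}
	if !strings.HasPrefix(out, "+") {
		return "", fmt.Errorf("no country code in %q", raw)
	}

	// E.164 allows at most 15 digits, and country codes never start with 0.
	digits := out[1:]
	if len(digits) == 0 || len(digits) > 15 || digits[0] == '0' {
		return "", fmt.Errorf("not a valid E.164 number: %q", raw)
	}
	return out, nil
}

func main() {
	for _, in := range []string{"+44 (0)20 7946 0958", "0044 20 7946 0958", "555-1234"} {
		out, err := normalizePhoneSketch(in)
		fmt.Println(out, err)
	}
}
```

Normalizing every source to one canonical string is what lets a phone-keyed lookup such as `EnsureParticipantByPhone` treat the same contact from WhatsApp, iMessage, and Google Voice as a single participant.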

## Importer changes

**iMessage**: Dropped `gmail.API` adapter and synthetic MIME. Reads
`chat.db` directly, writes proper `message_type` (`imessage`/`sms`),
`sender_id`, `conversation_type` (`group_chat`/`direct_chat`),
`message_recipients`, and raw JSON. Handles macOS streamtyped
`attributedBody` format (Ventura+/Sequoia). Group chats detected from
`;+;` GUID prefix with participant-derived titles.

**Google Voice**: Same refactoring pattern. Three `message_type` values
(`google_voice_text`/`call`/`voicemail`), phone-based participants,
labels, raw HTML storage, correct outbound recipients.

**WhatsApp**: Cleaned up to use shared utilities. Skips broken
attachment rows when `--media-dir` not provided.

## Query layer

- `TextEngine` interface (separate from `Engine` to avoid rippling
through remote/API/MCP layers)
- DuckDB and SQLite implementations with conversation-first queries
- Parquet cache extended with `conversation_type`, schema v5
- Email-only filter on existing `Engine` queries (search, stats,
aggregates)
- FTS backfill handles phone-based senders via `sender_id`

## TUI Texts mode

Press `m` to toggle between Email and Texts modes.

- **Conversations view**: sortable by name/count/last message,
content-aware column widths
- **Aggregate views**: Contacts, Contact Names, Sources, Labels, Time
- **Chat timeline**: full message bodies with word wrapping, reverse
sort (`r`), local search (`/`)
- **Navigation**: drill-down with breadcrumbs, consistent keybindings
with Email mode
- Read-only (no deletion staging)

## Schema

- `conversations.conversation_type` column + migration for legacy DBs
- `SQLite MaxOpenConns(4)` for concurrent TUI reads (`:memory:` stays at
1)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Ed Dowding <me@eddowding.com>
Co-authored-by: Ryan Stern <206953196+vanboompow@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>