Fix column discovery for S3 shards #81

Open
jpc wants to merge 9 commits into jpc/async-pupyarrow from jpc/fix-get_columns
Conversation

@jpc jpc commented May 18, 2026

Member

No description provided.

jpc and others added 9 commits May 17, 2026 01:18
- Use start_skip_samples from demuxer for seek adjustment (mp3 encoder
  delay, MP4 edit lists). The demuxer applies this skip at pts=0 but not
  after seeking.
- Read codec_delay from decoder's output stream info for codecs where
  the delay is set by the codec init (wmav2, opus).
- Switch read loop to PTS-based termination instead of sample counting,
  fixing truncation when seek lands far from target.
- Detect when seek lands at stream start (chunk0.pts < margin) and use
  tstart=0 timeline to match ffmpeg CLI behavior.
- Handle negative first-chunk PTS (some AAC files) by clamping to 0.

Test results: 100/100 full load, 93/100 seek (±5 sample tolerance).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
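
The PTS-based termination and the negative-PTS clamp described in the commit above can be illustrated roughly as follows. This is a minimal sketch, not the code from this PR: `trim_by_pts`, the `(pts_seconds, samples)` chunk shape, and the use of plain lists are all assumptions made for illustration.

```python
from typing import Iterable, List, Sequence, Tuple


def trim_by_pts(
    chunks: Iterable[Tuple[float, Sequence[int]]],
    tstart: float,
    tend: float,
    sample_rate: int,
) -> List[int]:
    """Collect samples in [tstart, tend) using each chunk's PTS.

    Hypothetical helper: the loop stops when a chunk's PTS passes `tend`,
    rather than when a running sample count is reached, so a seek that
    lands far before the target does not truncate the output.
    """
    first = round(tstart * sample_rate)
    last = round(tend * sample_rate)
    out: List[int] = []
    for pts, samples in chunks:
        pts = max(pts, 0.0)  # clamp a negative first-chunk PTS to 0
        start = round(pts * sample_rate)  # absolute index of samples[0]
        if start >= last:
            break  # PTS-based termination
        lo = max(first - start, 0)
        hi = min(last - start, len(samples))
        if lo < hi:
            out.extend(samples[lo:hi])
    return out
```

With this shape, samples decoded before `tstart` are simply skipped, however many there are.
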
- For short seeks (tstart < 5s) and unreliable codecs (wmav2/wmapro),
  always read from the start instead of seeking. This avoids the
  seek_landed_at_start heuristic entirely for short seeks and costs
  almost nothing for small files.
- Only apply start_skip_samples seek_adj when actually seeking, not
  when reading from start (where the decoder handles it automatically).
- Initialize seek_adj to 0.0 to avoid UnboundLocalError.

Comprehensive test: 1494/1580 pass (94.6%), up from ~85% before.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
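
The decision logic in the commit above might look something like the sketch below. The function name `plan_read`, its parameters, and the conversion of `start_skip_samples` to seconds are assumptions for illustration; only the 5 s threshold and the wmav2/wmapro codec list come from the commit message.

```python
from typing import Tuple

SHORT_SEEK_SECONDS = 5.0  # threshold from the commit message
UNRELIABLE_SEEK_CODECS = frozenset({"wmav2", "wmapro"})


def plan_read(
    tstart: float,
    codec: str,
    start_skip_samples: int,
    sample_rate: int,
) -> Tuple[bool, float]:
    """Return (should_seek, seek_adj_seconds) for a requested start time.

    Hypothetical helper. seek_adj is initialized up front so that it is
    always bound, whichever branch is taken.
    """
    seek_adj = 0.0
    if tstart < SHORT_SEEK_SECONDS or codec in UNRELIABLE_SEEK_CODECS:
        # Read from the start: the decoder applies the skip itself.
        return False, seek_adj
    # Only a real seek needs the skip applied manually (assumed unit: seconds).
    seek_adj = start_skip_samples / sample_rate
    return True, seek_adj
```
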
Build a packet index (128KB resolution) on first seek and use
seek_to_byte_offset for raw MPEG audio formats. This avoids the
slow sequential scan that ffmpeg's mp3 demuxer does for timestamp
seeks. The index PTS is used directly for trimming since the demuxer
doesn't update PTS after byte seek.

Also: no seek_adj for indexed seeks (index PTS is in raw timeline,
not the skip_samples-adjusted timeline).

Comprehensive test: 1556/1580 pass (98.5%) across 8 seek positions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
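
A coarse packet index of the kind described above could be sketched as follows. The names `build_packet_index` and `lookup` and the `(byte_offset, pts)` tuple shape are assumptions; the real index in this PR may be built and stored differently. Only the 128KB resolution is taken from the commit message.

```python
import bisect
from typing import Iterable, List, Tuple

INDEX_RESOLUTION = 128 * 1024  # keep roughly one entry per 128 KiB


def build_packet_index(
    packets: Iterable[Tuple[int, float]],
) -> List[Tuple[int, float]]:
    """Keep one (byte_offset, pts) entry per INDEX_RESOLUTION bytes.

    `packets` must be in file order, so both fields are non-decreasing.
    """
    index: List[Tuple[int, float]] = []
    next_offset = 0
    for offset, pts in packets:
        if offset >= next_offset:
            index.append((offset, pts))
            next_offset = offset + INDEX_RESOLUTION
    return index


def lookup(index: List[Tuple[int, float]], target_pts: float) -> Tuple[int, float]:
    """Return the last entry whose PTS is <= target_pts (or the first entry)."""
    pts_values = [pts for _, pts in index]
    i = bisect.bisect_right(pts_values, target_pts) - 1
    return index[max(i, 0)]
```

The returned PTS is the one to trim against after a byte seek, since (per the commit message) the demuxer does not update PTS in that case.
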
BlockCache: sorted interval cache that merges overlapping byte ranges.
Designed for caching S3 range-read results.

LazyBuffer:
- enable_cache(readahead): activate block cache with configurable readahead
- prepopulate(ranges): pre-fetch byte ranges into cache (sync)
- async_prepopulate(ranges): parallel async variant
- read_range() checks cache before hitting reader, fetches with
  readahead on miss

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
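
For readers unfamiliar with the idea, a sorted interval cache that merges overlapping byte ranges can be sketched like this. It is an illustrative stand-in, not the `BlockCache` from this PR: the method names `insert` and `get` and the choice to also merge ranges that merely touch are assumptions.

```python
import bisect
from typing import List, Optional


class BlockCache:
    """Sorted, non-overlapping byte blocks keyed by their start offset."""

    def __init__(self) -> None:
        self._starts: List[int] = []
        self._blocks: List[bytes] = []

    def __len__(self) -> int:
        return len(self._starts)

    def insert(self, start: int, data: bytes) -> None:
        """Add data at `start`, merging with any overlapping or adjacent blocks."""
        if not data:
            return
        end = start + len(data)
        i = bisect.bisect_left(self._starts, start)
        # The block just before may reach into (or touch) the new range.
        if i > 0 and self._starts[i - 1] + len(self._blocks[i - 1]) >= start:
            i -= 1
        j = i
        while j < len(self._starts) and self._starts[j] <= end:
            j += 1
        lo, hi = start, end
        if j > i:
            lo = min(lo, self._starts[i])
            hi = max(hi, self._starts[j - 1] + len(self._blocks[j - 1]))
        merged = bytearray(hi - lo)
        for k in range(i, j):
            s = self._starts[k] - lo
            merged[s:s + len(self._blocks[k])] = self._blocks[k]
        merged[start - lo:end - lo] = data  # newest data wins on overlap
        self._starts[i:j] = [lo]
        self._blocks[i:j] = [bytes(merged)]

    def get(self, start: int, end: int) -> Optional[bytes]:
        """Return bytes for [start, end) if fully cached, else None."""
        i = bisect.bisect_right(self._starts, start) - 1
        if i < 0:
            return None
        s, block = self._starts[i], self._blocks[i]
        if end > s + len(block):
            return None
        return block[start - s:end - s]
```

A `read_range()` built on top of this would call `get()` first and, on a miss, fetch the requested range plus readahead and `insert()` the result.
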
@tig888 tig888 force-pushed the jpc/async-pupyarrow branch from 1f82a7a to 942bdec Compare June 1, 2026 09:55
@jpc jpc changed the base branch from jpc/async-pupyarrow to main June 2, 2026 14:46
@jpc jpc changed the base branch from main to jpc/async-pupyarrow June 2, 2026 14:47