Collection of utility scripts for dataset analysis and AI model benchmarking.
Analyzes parquet dataset files to understand their token distribution, ordering, and structure.
- 📊 Comprehensive token distribution analysis
- 🔍 Ordering detection (sorted vs random)
- 📈 Visual histogram of token ranges
- 🎯 Check for specific token length targets
- 💾 Export dataset to JSON format
Required packages: pandas, numpy, pyarrow (for parquet support)

```bash
# Activate virtual environment with required dependencies
source ~/test_foo/python3_virt/bin/activate

# Analyze a parquet file
python3 analyze_parquet_distribution.py /path/to/dataset.parquet

# Analyze and export to JSON
python3 analyze_parquet_distribution.py /path/to/dataset.parquet --output-json output.json

# Analyze the GPT-OSS performance evaluation dataset
source ~/test_foo/python3_virt/bin/activate
cd /Users/USER/workspace/ai-tools
python3 analyze_parquet_distribution.py \
    /Users/USER/rhoai-install-stuff/real_datasets/gpt-oss/perf/perf_eval_ref.parquet \
    --output-json /tmp/perf_eval_dataset.json
```

The script provides:
- Basic Info: Total samples, columns, dataset sources
- Token Distribution: Statistical summary (mean, median, min, max, percentiles)
- Distribution by Ranges: Sample counts in 0-1K, 1K-2K, 2K-4K, 4K-6K, 6K-8K, 8K-10K, 10K+ ranges
- Ordering Analysis: Whether samples are sorted or randomly shuffled
- Visual Histogram: Text-based histogram of token distribution
- Specific Targets: Count of samples near common token lengths (512, 1024, 2048, 4096, 8192)
- Sample Records: First 3 records with full details
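The token-distribution step boils down to a pandas `describe()` over the token-count column. A minimal sketch, assuming the parquet file has a `num_tokens` column as in the example output; `summarize_tokens` is an illustrative helper name, not the script's actual code:

```python
# Sketch of the token-distribution step; assumes a "num_tokens" column.
import pandas as pd


def summarize_tokens(df: pd.DataFrame) -> pd.Series:
    """Statistical summary of token lengths: count, mean, std, percentiles."""
    return df["num_tokens"].describe()


# Tiny in-memory frame standing in for pd.read_parquet(path):
df = pd.DataFrame({"num_tokens": [600, 2296, 4593, 8254, 15330]})
summary = summarize_tokens(df)
print(summary)
```

For a real run you would replace the in-memory frame with `pd.read_parquet(path)`.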
```
================================================================================
Analyzing: /Users/USER/rhoai-install-stuff/real_datasets/gpt-oss/perf/perf_eval_ref.parquet
================================================================================

📊 BASIC INFO
Total samples: 6,396
Columns: ['prompt', 'dataset', 'input_tokens', 'num_tokens', 'text_input']
Datasets: ['pubmed_summarization']

📈 TOKEN LENGTH DISTRIBUTION
count     6396.000000
mean      5010.643684
std       2243.906145
min        600.000000
25%       3492.500000
50%       4593.000000
75%       6046.250000
max      15330.000000
Name: num_tokens, dtype: float64

📊 DISTRIBUTION BY RANGES
0-1K        12 samples (  0.2%)
1K-2K      241 samples (  3.8%)
2K-4K    2,106 samples ( 32.9%)
4K-6K    2,395 samples ( 37.4%)
6K-8K    1,014 samples ( 15.9%)
8K-10K     386 samples (  6.0%)
10K+       242 samples (  3.8%)

🔍 ORDERING ANALYSIS
Sorted ascending (by num_tokens): False
Sorted descending (by num_tokens): False

📊 VISUAL DISTRIBUTION (histogram)
================================================================================
0-1000        12
1000-2000    █████ 241
2000-4000    ███████████████████████████████████████████ 2,106
4000-6000    ██████████████████████████████████████████████████ 2,395
6000-8000    █████████████████████ 1,014
8000-10000   ████████ 386
10000-16000  █████ 242
================================================================================
```
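The range buckets and ordering flags shown above can be approximated in a few lines of pandas. A sketch under the assumption that token counts live in a `num_tokens`-style series; `bucket_and_order` is a hypothetical helper, not the script's actual implementation:

```python
# Sketch of the range-bucketing and ordering checks; names are illustrative.
import pandas as pd


def bucket_and_order(tokens: pd.Series):
    """Count samples per token range and report whether the series is sorted."""
    bins = [0, 1000, 2000, 4000, 6000, 8000, 10000, float("inf")]
    labels = ["0-1K", "1K-2K", "2K-4K", "4K-6K", "6K-8K", "8K-10K", "10K+"]
    counts = pd.cut(tokens, bins=bins, labels=labels).value_counts().sort_index()
    return counts, tokens.is_monotonic_increasing, tokens.is_monotonic_decreasing


tokens = pd.Series([600, 2296, 8254, 4593, 15330])
counts, is_ascending, is_descending = bucket_and_order(tokens)
```

`pd.cut` assigns each sample to a labeled bin, and the two `is_monotonic_*` flags correspond to the "Sorted ascending/descending" lines in the output.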
When using --output-json, the dataset is exported in JSON format:
```json
[
  {
    "prompt": "lepidoptera include agricultural pests...",
    "dataset": "pubmed_summarization",
    "input_tokens": 8254,
    "num_tokens": 8254,
    "text_input": "<|start|>system<|message|>You are ChatGPT..."
  },
  {
    "prompt": "midwife - led primary delivery care...",
    "dataset": "pubmed_summarization",
    "input_tokens": 2296,
    "num_tokens": 2296,
    "text_input": "<|start|>system<|message|>You are ChatGPT..."
  }
]
```

Note: JSON files are typically 3-4x larger than parquet files, since JSON is plain text and lacks parquet's columnar compression.
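An export like the one produced by --output-json can be approximated with pandas' record-oriented JSON writer. This is a sketch, not the script's actual code; `export_to_json` and the sample row are illustrative:

```python
# Sketch of a parquet-to-JSON export producing an array of records.
import json
import tempfile

import pandas as pd


def export_to_json(df: pd.DataFrame, path: str) -> None:
    """Write the dataset as a JSON array of objects, one object per row."""
    df.to_json(path, orient="records", indent=2)


df = pd.DataFrame([
    {"prompt": "example prompt", "dataset": "pubmed_summarization",
     "input_tokens": 2296, "num_tokens": 2296},
])
with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as tmp:
    path = tmp.name
export_to_json(df, path)

with open(path) as fh:
    records = json.load(fh)
```

`orient="records"` is what yields the `[{...}, {...}]` shape shown above.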
```
usage: analyze_parquet_distribution.py [-h] [--output-json FILE] parquet_path

positional arguments:
  parquet_path        Path to the parquet file to analyze

options:
  -h, --help          Show this help message and exit
  --output-json FILE  Save the dataset to JSON format at the specified path
```
- Dataset Discovery: Understand token distribution before running benchmarks
- Benchmark Planning: Identify how many samples fall into specific token ranges
- Data Validation: Verify dataset structure and ordering
- Format Conversion: Convert parquet files to JSON for use with tools that don't support parquet
- GuideLLM Integration: Filter datasets by token length for targeted benchmarking
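For the GuideLLM use case, a pre-filtering step that keeps only samples in a target token window might look like the following. This is a hypothetical sketch; `filter_by_tokens` is an illustrative name and the window is arbitrary:

```python
# Hypothetical pre-filter: keep samples whose num_tokens lies in [lo, hi).
import pandas as pd


def filter_by_tokens(df: pd.DataFrame, lo: int, hi: int) -> pd.DataFrame:
    """Return only the rows whose num_tokens falls in the half-open range [lo, hi)."""
    return df[(df["num_tokens"] >= lo) & (df["num_tokens"] < hi)]


df = pd.DataFrame({"num_tokens": [600, 2296, 4593, 8254, 15330]})
subset = filter_by_tokens(df, 2000, 6000)  # e.g. the 2K-6K window
```

The filtered frame could then be exported (e.g. to JSON) and fed to the benchmark tool.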
- Large Files: Parquet analysis is memory-efficient, but JSON export can create very large files
- Token Ranges: Use the distribution info to select appropriate --data-args filters for GuideLLM
- Random vs Sorted: Random ordering is better for benchmark fairness
- Virtual Environment: Always activate the virtual environment before running
This script is commonly used with:
- GuideLLM: For LLM benchmarking with real datasets
- HuggingFace Datasets: For dataset downloading and preprocessing
- Pandas: For further data analysis and manipulation