Collection of utility scripts for dataset analysis and AI model benchmarking.
Analyzes parquet dataset files to understand their token distribution, ordering, and structure.
- 📊 Comprehensive token distribution analysis
- 🔍 Ordering detection (sorted vs random)
- 📈 Visual histogram of token ranges
- 🎯 Check for specific token length targets
- 💾 Export dataset to JSON format
Required packages: pandas, numpy, pyarrow (for parquet support)

```bash
# Activate virtual environment with required dependencies
source ~/test_foo/python3_virt/bin/activate

# Analyze a parquet file
python3 analyze_parquet_distribution.py /path/to/dataset.parquet

# Analyze and export to JSON
python3 analyze_parquet_distribution.py /path/to/dataset.parquet --output-json output.json

# Analyze the GPT-OSS performance evaluation dataset
source ~/test_foo/python3_virt/bin/activate
cd /Users/USER/workspace/ai-tools
python3 analyze_parquet_distribution.py \
    /Users/USER/rhoai-install-stuff/real_datasets/gpt-oss/perf/perf_eval_ref.parquet \
    --output-json /tmp/perf_eval_dataset.json
```

The script provides:
- Basic Info: Total samples, columns, dataset sources
- Token Distribution: Statistical summary (mean, median, min, max, percentiles)
- Distribution by Ranges: Sample counts in 0-1K, 1K-2K, 2K-4K, 4K-6K, 6K-8K, 8K-10K, 10K+ ranges
- Ordering Analysis: Whether samples are sorted or randomly shuffled
- Visual Histogram: Text-based histogram of token distribution
- Specific Targets: Count of samples near common token lengths (512, 1024, 2048, 4096, 8192)
- Sample Records: First 3 records with full details
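The token-distribution step boils down to a pandas `describe()` over the token-count column. A minimal sketch, assuming the parquet file has a `num_tokens` column as in the example output; `summarize_tokens` is an illustrative helper name, not the script's actual code:

```python
# Sketch of the token-distribution step; assumes a "num_tokens" column.
import pandas as pd


def summarize_tokens(df: pd.DataFrame) -> pd.Series:
    """Statistical summary of token lengths: count, mean, std, percentiles."""
    return df["num_tokens"].describe()


# Tiny in-memory frame standing in for pd.read_parquet(path):
df = pd.DataFrame({"num_tokens": [600, 2296, 4593, 8254, 15330]})
summary = summarize_tokens(df)
print(summary)
```

For a real run you would replace the in-memory frame with `pd.read_parquet(path)`.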
```
================================================================================
Analyzing: /Users/USER/rhoai-install-stuff/real_datasets/gpt-oss/perf/perf_eval_ref.parquet
================================================================================

📊 BASIC INFO
Total samples: 6,396
Columns: ['prompt', 'dataset', 'input_tokens', 'num_tokens', 'text_input']
Datasets: ['pubmed_summarization']

📈 TOKEN LENGTH DISTRIBUTION
count     6396.000000
mean      5010.643684
std       2243.906145
min        600.000000
25%       3492.500000
50%       4593.000000
75%       6046.250000
max      15330.000000
Name: num_tokens, dtype: float64

📊 DISTRIBUTION BY RANGES
0-1K        12 samples (  0.2%)
1K-2K      241 samples (  3.8%)
2K-4K    2,106 samples ( 32.9%)
4K-6K    2,395 samples ( 37.4%)
6K-8K    1,014 samples ( 15.9%)
8K-10K     386 samples (  6.0%)
10K+       242 samples (  3.8%)

🔍 ORDERING ANALYSIS
Sorted ascending (by num_tokens): False
Sorted descending (by num_tokens): False

📊 VISUAL DISTRIBUTION (histogram)
================================================================================
0-1000        12
1000-2000    █████ 241
2000-4000    ███████████████████████████████████████████ 2,106
4000-6000    ██████████████████████████████████████████████████ 2,395
6000-8000    █████████████████████ 1,014
8000-10000   ████████ 386
10000-16000  █████ 242
================================================================================
```
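The range buckets and ordering flags shown above can be approximated in a few lines of pandas. A sketch under the assumption that token counts live in a `num_tokens`-style series; `bucket_and_order` is a hypothetical helper, not the script's actual implementation:

```python
# Sketch of the range-bucketing and ordering checks; names are illustrative.
import pandas as pd


def bucket_and_order(tokens: pd.Series):
    """Count samples per token range and report whether the series is sorted."""
    bins = [0, 1000, 2000, 4000, 6000, 8000, 10000, float("inf")]
    labels = ["0-1K", "1K-2K", "2K-4K", "4K-6K", "6K-8K", "8K-10K", "10K+"]
    counts = pd.cut(tokens, bins=bins, labels=labels).value_counts().sort_index()
    return counts, tokens.is_monotonic_increasing, tokens.is_monotonic_decreasing


tokens = pd.Series([600, 2296, 8254, 4593, 15330])
counts, is_ascending, is_descending = bucket_and_order(tokens)
```

`pd.cut` assigns each sample to a labeled bin, and the two `is_monotonic_*` flags correspond to the "Sorted ascending/descending" lines in the output.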
When using --output-json, the dataset is exported in JSON format:
```json
[
  {
    "prompt": "lepidoptera include agricultural pests...",
    "dataset": "pubmed_summarization",
    "input_tokens": 8254,
    "num_tokens": 8254,
    "text_input": "<|start|>system<|message|>You are ChatGPT..."
  },
  {
    "prompt": "midwife - led primary delivery care...",
    "dataset": "pubmed_summarization",
    "input_tokens": 2296,
    "num_tokens": 2296,
    "text_input": "<|start|>system<|message|>You are ChatGPT..."
  }
]
```

Note: JSON files are typically 3-4x larger than parquet files, since JSON is plain text and lacks parquet's columnar compression.
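An export like the one produced by --output-json can be approximated with pandas' record-oriented JSON writer. This is a sketch, not the script's actual code; `export_to_json` and the sample row are illustrative:

```python
# Sketch of a parquet-to-JSON export producing an array of records.
import json
import tempfile

import pandas as pd


def export_to_json(df: pd.DataFrame, path: str) -> None:
    """Write the dataset as a JSON array of objects, one object per row."""
    df.to_json(path, orient="records", indent=2)


df = pd.DataFrame([
    {"prompt": "example prompt", "dataset": "pubmed_summarization",
     "input_tokens": 2296, "num_tokens": 2296},
])
with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as tmp:
    path = tmp.name
export_to_json(df, path)

with open(path) as fh:
    records = json.load(fh)
```

`orient="records"` is what yields the `[{...}, {...}]` shape shown above.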
```
usage: analyze_parquet_distribution.py [-h] [--output-json FILE] parquet_path

positional arguments:
  parquet_path        Path to the parquet file to analyze

options:
  -h, --help          Show this help message and exit
  --output-json FILE  Save the dataset to JSON format at the specified path
```
- Dataset Discovery: Understand token distribution before running benchmarks
- Benchmark Planning: Identify how many samples fall into specific token ranges
- Data Validation: Verify dataset structure and ordering
- Format Conversion: Convert parquet files to JSON for use with tools that don't support parquet
- GuideLLM Integration: Filter datasets by token length for targeted benchmarking
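For the GuideLLM use case, a pre-filtering step that keeps only samples in a target token window might look like the following. This is a hypothetical sketch; `filter_by_tokens` is an illustrative name and the window is arbitrary:

```python
# Hypothetical pre-filter: keep samples whose num_tokens lies in [lo, hi).
import pandas as pd


def filter_by_tokens(df: pd.DataFrame, lo: int, hi: int) -> pd.DataFrame:
    """Return only the rows whose num_tokens falls in the half-open range [lo, hi)."""
    return df[(df["num_tokens"] >= lo) & (df["num_tokens"] < hi)]


df = pd.DataFrame({"num_tokens": [600, 2296, 4593, 8254, 15330]})
subset = filter_by_tokens(df, 2000, 6000)  # e.g. the 2K-6K window
```

The filtered frame could then be exported (e.g. to JSON) and fed to the benchmark tool.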
- Large Files: Parquet analysis is memory-efficient, but JSON export can create very large files
- Token Ranges: Use the distribution info to select appropriate --data-args filters for GuideLLM
- Random vs Sorted: Random ordering is better for benchmark fairness
- Virtual Environment: Always activate the virtual environment before running
This script is commonly used with:
- GuideLLM: For LLM benchmarking with real datasets
- HuggingFace Datasets: For dataset downloading and preprocessing
- Pandas: For further data analysis and manipulation