API Xena Browser

UCSC Xena Browser API

The Xena Browser module provides a Python API for programmatically downloading and managing TCGA cohort data from the UCSC Xena Browser.

Citation

If you use UCSC Xena Browser data in your research, please cite:

Goldman, M.J., Craft, B., Hastie, M. et al. Visualizing and interpreting cancer genomics data via the Xena platform. Nat Biotechnol (2020). https://doi.org/10.1038/s41587-020-0546-8

Overview

Cohorts are defined in YAML configuration files rather than in Python code. A builder reads these files and creates cohort objects whose datasets can be listed, filtered by category, and downloaded.

Module Structure

src/oncolearn/api/xenabrowser/
├── builder.py               # Builder pattern for creating cohorts from YAML
├── xena_dataset.py          # Dataset class for Xena data
└── download.py              # Download utilities

data/xenabrowser/configs/    # YAML configuration files
├── acc.yaml
├── blca.yaml
├── brca.yaml
└── ... (all TCGA cohorts)

YAML Configuration Format

Each cohort is defined in a YAML file with the following structure:

cohort:
  code: BRCA
  name: TCGA-BRCA
  description: TCGA Breast Cancer cohort with multi-modal genomics data

datasets:
  - name: BRCA Gene Expression (HiSeq)
    description: Illumina HiSeq gene expression (RNAseq) data
    category: mrna_seq
    url: https://tcga.xenahubs.net/download/TCGA.BRCA.sampleMap/HiSeqV2.gz
    filename: HiSeqV2.gz
    default_subdir: TCGA-BRCA/gene_expression
  
  # ... more datasets
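
Because every cohort file follows this shape, a parsed config can be checked before it is handed to the builder. The following is a minimal sketch and not part of the module: `validate_cohort_config` is a hypothetical helper, the required keys are inferred from the example above, and the input is assumed to be the dictionary a YAML parser would produce.

```python
# Hypothetical helper; the required keys are inferred from the example config above.
REQUIRED_COHORT_KEYS = {"code", "name", "description"}
REQUIRED_DATASET_KEYS = {
    "name", "description", "category", "url", "filename", "default_subdir",
}


def validate_cohort_config(config):
    """Return a list of problems found in a parsed cohort config (empty if none)."""
    problems = []

    cohort = config.get("cohort")
    if not isinstance(cohort, dict):
        problems.append("missing 'cohort' mapping")
    else:
        for key in sorted(REQUIRED_COHORT_KEYS - set(cohort)):
            problems.append(f"cohort: missing '{key}'")

    datasets = config.get("datasets")
    if not isinstance(datasets, list) or not datasets:
        problems.append("missing or empty 'datasets' list")
    else:
        for index, entry in enumerate(datasets):
            if not isinstance(entry, dict):
                problems.append(f"datasets[{index}]: not a mapping")
                continue
            for key in sorted(REQUIRED_DATASET_KEYS - set(entry)):
                problems.append(f"datasets[{index}]: missing '{key}'")

    return problems


# The BRCA example from above, as a YAML parser would return it.
example = {
    "cohort": {
        "code": "BRCA",
        "name": "TCGA-BRCA",
        "description": "TCGA Breast Cancer cohort with multi-modal genomics data",
    },
    "datasets": [
        {
            "name": "BRCA Gene Expression (HiSeq)",
            "description": "Illumina HiSeq gene expression (RNAseq) data",
            "category": "mrna_seq",
            "url": "https://tcga.xenahubs.net/download/TCGA.BRCA.sampleMap/HiSeqV2.gz",
            "filename": "HiSeqV2.gz",
            "default_subdir": "TCGA-BRCA/gene_expression",
        }
    ],
}

print(validate_cohort_config(example))  # []
```

Returning a list of problems rather than raising on the first one lets a config author fix every missing field in a single pass.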

API Usage

Basic Usage

from oncolearn.api.xenabrowser import XenaCohortBuilder

# Create a builder
builder = XenaCohortBuilder()

# Build and download a cohort
brca_cohort = builder.build_cohort("BRCA")
brca_cohort.download()  # Downloads all BRCA datasets

# Alternatively, download to a specific directory
brca_cohort.download(output_dir="my_data/brca")

List Available Cohorts

from oncolearn.api.xenabrowser import XenaCohortBuilder

builder = XenaCohortBuilder()
cohorts = builder.list_available_cohorts()
print(cohorts)  # ['ACC', 'BLCA', 'BRCA', ...]

Access Individual Datasets

from oncolearn.api.xenabrowser import XenaCohortBuilder

builder = XenaCohortBuilder()
brca_cohort = builder.build_cohort("BRCA")

# List all datasets
dataset_names = brca_cohort.list_datasets()
print(dataset_names)

# Download a specific dataset
gene_expr = brca_cohort.get_dataset("BRCA Gene Expression (HiSeq)")
gene_expr.download("my_data/brca/gene_expression")

Filter Datasets by Category

from oncolearn.api.xenabrowser import XenaCohortBuilder
from oncolearn.api.dataset import DataCategory

builder = XenaCohortBuilder()
brca_cohort = builder.build_cohort("BRCA")

# Get all clinical datasets
clinical_datasets = brca_cohort.get_datasets_by_category(DataCategory.CLINICAL)

# Get all mutation datasets
mutation_datasets = brca_cohort.get_datasets_by_category(DataCategory.MUTATION)

Data Categories

Available data categories and their subcategories/aliases:

Primary Categories

mrna_seq: mRNA sequencing data
- Aliases: mrna, gene expression rnaseq, gene_expression_rnaseq
dna_seq: DNA sequencing data
- Aliases: dna
mirna_seq: microRNA sequencing data
- Aliases: mirna, stem loop expression, stem_loop_expression
cnv: Copy number variation
- Aliases: copy number, copy_number, copy number (gene-level), copy_number_gene_level
mutation: Somatic mutations
- Aliases: somatic mutation, somatic_mutation, somatic mutation (snps and small indels)
methylation: DNA methylation
- Aliases: dna methylation, dna_methylation
protein: Protein expression
- Aliases: protein expression, protein_expression
clinical: Clinical/phenotype data
- Aliases: phenotype
snp: SNP data
transcriptome: Transcriptome data
genomics: General genomics (includes ATAC-seq)
- Subcategories: atac-seq
metabolomics: Metabolomics data
proteomics: Proteomics data
image: Imaging data
manifest: Manifest files
multimodal: Combined data types
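
To illustrate how aliases relate to primary categories, here is a minimal sketch of an alias lookup. The alias table is copied from the list above; `resolve_category` and its case-insensitive matching are assumptions for illustration, not the module's actual implementation. The `atac-seq` subcategory is left out because the list describes it as a subcategory rather than an alias.

```python
# Alias table copied from the category list above.
# resolve_category and its matching rules are hypothetical.
CATEGORY_ALIASES = {
    "mrna_seq": ["mrna", "gene expression rnaseq", "gene_expression_rnaseq"],
    "dna_seq": ["dna"],
    "mirna_seq": ["mirna", "stem loop expression", "stem_loop_expression"],
    "cnv": [
        "copy number", "copy_number",
        "copy number (gene-level)", "copy_number_gene_level",
    ],
    "mutation": [
        "somatic mutation", "somatic_mutation",
        "somatic mutation (snps and small indels)",
    ],
    "methylation": ["dna methylation", "dna_methylation"],
    "protein": ["protein expression", "protein_expression"],
    "clinical": ["phenotype"],
    # Primary categories listed without aliases.
    "snp": [],
    "transcriptome": [],
    "genomics": [],
    "metabolomics": [],
    "proteomics": [],
    "image": [],
    "manifest": [],
    "multimodal": [],
}

# Invert the table: every alias, and every primary name, maps to its primary category.
_LOOKUP = {}
for primary, aliases in CATEGORY_ALIASES.items():
    _LOOKUP[primary] = primary
    for alias in aliases:
        _LOOKUP[alias] = primary


def resolve_category(label):
    """Map a category label or alias to its primary category, or None if unknown."""
    return _LOOKUP.get(label.strip().lower())


print(resolve_category("phenotype"))         # clinical
print(resolve_category("Copy Number"))       # cnv
print(resolve_category("not a category"))    # None
```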

Adding New Datasets

To add a new dataset to an existing cohort:

1. Open the cohort's YAML file (e.g., data/xenabrowser/configs/brca.yaml).
2. Add a new entry to the datasets list:

  - name: BRCA New Dataset
    description: Description of the new dataset
    category: appropriate_category
    url: https://download.url/dataset.gz
    filename: dataset.gz
    default_subdir: TCGA-BRCA/subdirectory

3. Save the file. No Python code changes are needed.
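
Since datasets are looked up by name (as in `get_dataset("BRCA Gene Expression (HiSeq)")`), it can help to confirm that a new entry does not clash with an existing one before saving. The sketch below is a hypothetical pre-save check on the parsed datasets list. `find_conflicts` is not part of the module, and the rule that two entries should not share the same `default_subdir` and `filename` is an assumption.

```python
def find_conflicts(datasets, new_entry):
    """Return reasons why new_entry would clash with the existing dataset entries."""
    conflicts = []
    new_target = (new_entry["default_subdir"], new_entry["filename"])
    for existing in datasets:
        # Names are used as lookup keys, so a repeated name would be ambiguous.
        if existing["name"] == new_entry["name"]:
            conflicts.append(f"duplicate name: {existing['name']}")
        # Assumption: two entries writing to the same subdir/filename would collide.
        if (existing["default_subdir"], existing["filename"]) == new_target:
            conflicts.append(f"same target file as: {existing['name']}")
    return conflicts


existing = [
    {
        "name": "BRCA Gene Expression (HiSeq)",
        "filename": "HiSeqV2.gz",
        "default_subdir": "TCGA-BRCA/gene_expression",
    }
]
candidate = {
    "name": "BRCA New Dataset",
    "filename": "dataset.gz",
    "default_subdir": "TCGA-BRCA/subdirectory",
}

print(find_conflicts(existing, candidate))  # []
```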

Adding New Cohorts

To add a completely new cohort:

1. Create a new YAML file in data/xenabrowser/configs/ (e.g., newcohort.yaml).
2. Follow the YAML structure shown in the YAML Configuration Format section.
3. The cohort is then automatically available via the builder.
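
One way such automatic availability could work is for the builder to scan the config directory for YAML files. The sketch below only illustrates that idea: `discover_cohort_codes` is a hypothetical function, and deriving the cohort code from the file name (rather than from the `cohort.code` field) is an assumption, not a description of how `XenaCohortBuilder` is implemented.

```python
import tempfile
from pathlib import Path


def discover_cohort_codes(config_dir):
    """List cohort codes by upper-casing the stem of each *.yaml file in config_dir."""
    return sorted(path.stem.upper() for path in Path(config_dir).glob("*.yaml"))


# Demonstration against a throwaway directory instead of the real configs folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("acc.yaml", "brca.yaml", "notes.txt"):
        (Path(tmp) / name).touch()
    print(discover_cohort_codes(tmp))  # ['ACC', 'BRCA']
```

Under this scheme, dropping a new file such as newcohort.yaml into the directory is enough for it to appear in the listing.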

OncoLearn | A comprehensive toolkit for cancer genomics analysis and biomarker discovery.
