API Xena Browser

andrewscouten edited this page Mar 11, 2026 · 1 revision

UCSC Xena Browser API

The Xena Browser module provides a Python API for programmatically downloading and managing TCGA cohort data from the UCSC Xena Browser.

Citation

If you use UCSC Xena Browser data in your research, please cite:

Goldman, M.J., Craft, B., Hastie, M. et al. Visualizing and interpreting cancer genomics data via the Xena platform. Nat Biotechnol (2020). https://doi.org/10.1038/s41587-020-0546-8

Overview

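Each TCGA cohort is described by a YAML configuration file that lists its datasets and their download URLs. A builder reads these files and creates cohort objects, so datasets and cohorts can be added or changed without modifying Python code.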

Module Structure

src/oncolearn/api/xenabrowser/
├── builder.py               # Builder pattern for creating cohorts from YAML
├── xena_dataset.py          # Dataset class for Xena data
└── download.py              # Download utilities

data/xenabrowser/configs/    # YAML configuration files
├── acc.yaml
├── blca.yaml
├── brca.yaml
└── ... (all TCGA cohorts)

YAML Configuration Format

Each cohort is defined in a YAML file with the following structure:

cohort:
  code: BRCA
  name: TCGA-BRCA
  description: TCGA Breast Cancer cohort with multi-modal genomics data

datasets:
  - name: BRCA Gene Expression (HiSeq)
    description: Illumina HiSeq gene expression (RNAseq) data
    category: mrna_seq
    url: https://tcga.xenahubs.net/download/TCGA.BRCA.sampleMap/HiSeqV2.gz
    filename: HiSeqV2.gz
    default_subdir: TCGA-BRCA/gene_expression
  
  # ... more datasets
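
To make the shape of this file concrete, here is a minimal sketch of how the parsed YAML could be mapped onto typed objects. It assumes the file has already been loaded into a plain dict (for example with PyYAML's `yaml.safe_load`). The names `DatasetConfig`, `CohortConfig`, `parse_cohort_config`, and `target_path` are hypothetical and are not part of the module's documented API, and the `<output_dir>/<default_subdir>/<filename>` layout is an assumption about how `default_subdir` is used.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DatasetConfig:
    """One entry of the `datasets` list (hypothetical helper type)."""
    name: str
    description: str
    category: str
    url: str
    filename: str
    default_subdir: str

    def target_path(self, output_dir: str) -> Path:
        # Assumption: files land in <output_dir>/<default_subdir>/<filename>.
        return Path(output_dir) / self.default_subdir / self.filename


@dataclass(frozen=True)
class CohortConfig:
    """The `cohort` block plus its datasets (hypothetical helper type)."""
    code: str
    name: str
    description: str
    datasets: tuple


def parse_cohort_config(raw: dict) -> CohortConfig:
    """Convert a parsed YAML mapping into typed objects."""
    datasets = tuple(DatasetConfig(**entry) for entry in raw.get("datasets", []))
    return CohortConfig(datasets=datasets, **raw["cohort"])


# The BRCA example above, as it would look after yaml.safe_load().
raw_brca = {
    "cohort": {
        "code": "BRCA",
        "name": "TCGA-BRCA",
        "description": "TCGA Breast Cancer cohort with multi-modal genomics data",
    },
    "datasets": [
        {
            "name": "BRCA Gene Expression (HiSeq)",
            "description": "Illumina HiSeq gene expression (RNAseq) data",
            "category": "mrna_seq",
            "url": "https://tcga.xenahubs.net/download/TCGA.BRCA.sampleMap/HiSeqV2.gz",
            "filename": "HiSeqV2.gz",
            "default_subdir": "TCGA-BRCA/gene_expression",
        }
    ],
}

config = parse_cohort_config(raw_brca)
print(config.code, len(config.datasets))
print(config.datasets[0].target_path("data").as_posix())
```

Because `DatasetConfig(**entry)` rejects unknown or missing keys, a typo in a YAML field name would surface as a `TypeError` at parse time rather than during a download.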

API Usage

Basic Usage

from oncolearn.api.xenabrowser import XenaCohortBuilder

# Create a builder
builder = XenaCohortBuilder()

# Build a cohort and download all of its datasets
brca_cohort = builder.build_cohort("BRCA")
brca_cohort.download()  # Downloads all BRCA datasets

# Alternatively, download to a specific directory
brca_cohort.download(output_dir="my_data/brca")

List Available Cohorts

from oncolearn.api.xenabrowser import XenaCohortBuilder

builder = XenaCohortBuilder()
cohorts = builder.list_available_cohorts()
print(cohorts)  # ['ACC', 'BLCA', 'BRCA', ...]

Access Individual Datasets

from oncolearn.api.xenabrowser import XenaCohortBuilder

builder = XenaCohortBuilder()
brca_cohort = builder.build_cohort("BRCA")

# List all datasets
dataset_names = brca_cohort.list_datasets()
print(dataset_names)

# Download a specific dataset
gene_expr = brca_cohort.get_dataset("BRCA Gene Expression (HiSeq)")
gene_expr.download("my_data/brca/gene_expression")

Filter Datasets by Category

from oncolearn.api.xenabrowser import XenaCohortBuilder
from oncolearn.api.dataset import DataCategory

builder = XenaCohortBuilder()
brca_cohort = builder.build_cohort("BRCA")

# Get all clinical datasets
clinical_datasets = brca_cohort.get_datasets_by_category(DataCategory.CLINICAL)

# Get all mutation datasets
mutation_datasets = brca_cohort.get_datasets_by_category(DataCategory.MUTATION)

Data Categories

Available data categories and their subcategories/aliases:

Primary Categories

  • mrna_seq: mRNA sequencing data

    • Aliases: mrna, gene expression rnaseq, gene_expression_rnaseq
  • dna_seq: DNA sequencing data

    • Aliases: dna
  • mirna_seq: microRNA sequencing data

    • Aliases: mirna, stem loop expression, stem_loop_expression
  • cnv: Copy number variation

    • Aliases: copy number, copy_number, copy number (gene-level), copy_number_gene_level
  • mutation: Somatic mutations

    • Aliases: somatic mutation, somatic_mutation, somatic mutation (snps and small indels)
  • methylation: DNA methylation

    • Aliases: dna methylation, dna_methylation
  • protein: Protein expression

    • Aliases: protein expression, protein_expression
  • clinical: Clinical/phenotype data

    • Aliases: phenotype
  • snp: SNP data

  • transcriptome: Transcriptome data

  • genomics: General genomics (includes ATAC-seq)

    • Subcategories: atac-seq
  • metabolomics: Metabolomics data

  • proteomics: Proteomics data

  • image: Imaging data

  • manifest: Manifest files

  • multimodal: Combined data types
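
The alias list above amounts to a lookup from free-form labels to primary categories. The following is a minimal sketch of such a lookup, not the module's actual implementation: `CATEGORY_ALIASES`, `UNALIASED_CATEGORIES`, and `normalize_category` are hypothetical names, and case-insensitive matching is an assumption.

```python
# Primary categories and their aliases, copied from the list above.
CATEGORY_ALIASES = {
    "mrna_seq": ["mrna", "gene expression rnaseq", "gene_expression_rnaseq"],
    "dna_seq": ["dna"],
    "mirna_seq": ["mirna", "stem loop expression", "stem_loop_expression"],
    "cnv": ["copy number", "copy_number", "copy number (gene-level)",
            "copy_number_gene_level"],
    "mutation": ["somatic mutation", "somatic_mutation",
                 "somatic mutation (snps and small indels)"],
    "methylation": ["dna methylation", "dna_methylation"],
    "protein": ["protein expression", "protein_expression"],
    "clinical": ["phenotype"],
}

# Primary categories that are listed without aliases.
UNALIASED_CATEGORIES = [
    "snp", "transcriptome", "genomics", "metabolomics",
    "proteomics", "image", "manifest", "multimodal",
]

# Flatten everything into one label -> primary category table.
_LOOKUP = {name: name for name in UNALIASED_CATEGORIES}
for primary, aliases in CATEGORY_ALIASES.items():
    _LOOKUP[primary] = primary
    for alias in aliases:
        _LOOKUP[alias] = primary


def normalize_category(label: str) -> str:
    """Map a label or alias to its primary category name.

    Assumption: matching ignores case and surrounding whitespace.
    """
    key = label.strip().lower()
    if key not in _LOOKUP:
        raise ValueError(f"Unknown data category: {label!r}")
    return _LOOKUP[key]


print(normalize_category("Gene Expression RNAseq"))
print(normalize_category("phenotype"))
```

The `atac-seq` subcategory is left out of the table on purpose, since the list describes it as a subcategory of `genomics` rather than an alias.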

Adding New Datasets

To add a new dataset to an existing cohort:

  1. Open the cohort's YAML file (e.g., data/xenabrowser/configs/brca.yaml)
  2. Add a new entry to the datasets list:
  - name: BRCA New Dataset
    description: Description of the new dataset
    category: appropriate_category
    url: https://download.url/dataset.gz
    filename: dataset.gz
    default_subdir: TCGA-BRCA/subdirectory
  3. Save the file - no Python code changes needed!
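
Before saving, a new entry can be sanity-checked by hand or with a small script. The sketch below is one possible check, assuming the entry has been parsed into a dict. `check_dataset_entry` and `REQUIRED_KEYS` are hypothetical names, and the rules it applies (all six fields present, unique `name`, http(s) URL, `filename` matching the last URL segment) are assumptions drawn from the example entries, not documented requirements of the module.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Fields used by every example entry in this page.
REQUIRED_KEYS = ("name", "description", "category", "url", "filename", "default_subdir")


def check_dataset_entry(entry: dict, existing_names: set) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []

    missing = [key for key in REQUIRED_KEYS if not entry.get(key)]
    if missing:
        problems.append(f"missing keys: {', '.join(missing)}")

    if entry.get("name") in existing_names:
        problems.append(f"duplicate dataset name: {entry['name']}")

    url = entry.get("url", "")
    if url:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            problems.append("url must start with http:// or https://")
        # Consistency hint only (assumption): in the examples, the filename
        # equals the last path segment of the URL.
        filename = entry.get("filename")
        if filename and PurePosixPath(parsed.path).name != filename:
            problems.append("filename differs from the last segment of url")

    return problems


new_entry = {
    "name": "BRCA New Dataset",
    "description": "Description of the new dataset",
    "category": "clinical",
    "url": "https://download.url/dataset.gz",
    "filename": "dataset.gz",
    "default_subdir": "TCGA-BRCA/subdirectory",
}
existing = {"BRCA Gene Expression (HiSeq)"}

print(check_dataset_entry(new_entry, existing))
```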

Adding New Cohorts

To add a completely new cohort:

  1. Create a new YAML file in data/xenabrowser/configs/ (e.g., newcohort.yaml)
  2. Follow the structure described under YAML Configuration Format
  3. The cohort will automatically be available via the builder
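
Automatic availability of this kind is typically achieved by scanning the configuration directory for YAML files. The sketch below illustrates that idea under stated assumptions and is not the builder's actual code: `discover_cohort_codes` is a hypothetical helper, and deriving the cohort code from the upper-cased file stem is an assumption based on the `acc.yaml` / `'ACC'` pairing shown earlier.

```python
import tempfile
from pathlib import Path


def discover_cohort_codes(config_dir) -> list[str]:
    """List cohort codes by scanning a directory for *.yaml files.

    Assumption: the cohort code is the upper-cased file stem.
    """
    return sorted(path.stem.upper() for path in Path(config_dir).glob("*.yaml"))


# Demonstrate with a throwaway directory instead of the real configs folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("acc.yaml", "brca.yaml", "newcohort.yaml", "notes.txt"):
        (Path(tmp) / name).touch()
    codes = discover_cohort_codes(tmp)

print(codes)  # ['ACC', 'BRCA', 'NEWCOHORT']
```

Under this scheme, dropping `newcohort.yaml` into the directory is enough for the new code to appear in the listing, and non-YAML files are ignored.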
