Introduced in Spark 4.x, the Python Data Source API lets you build custom PySpark data sources that leverage long-standing Python libraries for handling unique file types or specialized interfaces, and plug them into Spark's read, readStream, write, and writeStream APIs.
| Data Source Name | Purpose |
|---|---|
| mcap | Read MCAP (ROS 2 bag) files |
| mqtt | Stream data from MQTT brokers |
| zipdcm | Read DICOM files from Zip file archives |
Install the base package:

```shell
pip install python-data-sources
```

Install with support for specific data sources:

```shell
# Install with MCAP support
pip install python-data-sources[mcap]

# Install with MQTT support
pip install python-data-sources[mqtt]

# Install with ZipDCM support
pip install python-data-sources[zipdcm]

# Install with all data sources
pip install python-data-sources[all]
```

Read MCAP (ROS 2 bag) files:

```python
from pyspark.sql import SparkSession

from python_data_sources.mcap import MCAPDataSource

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(MCAPDataSource)

df = (
    spark.read.format("mcap")
    .option("path", "/path/to/data.mcap")
    .load()
)
```

Stream data from an MQTT broker:

```python
from pyspark.sql import SparkSession

from python_data_sources.mqtt import MqttDataSource

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(MqttDataSource)

df = (
    spark.readStream.format("mqtt_pub_sub")
    .option("broker_address", "mqtt.example.com")
    .option("topic", "sensors/#")
    .load()
)
```

Read DICOM files from a zip archive:

```python
from pyspark.sql import SparkSession

from python_data_sources.zipdcm import ZipDCMDataSource

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(ZipDCMDataSource)

df = (
    spark.read.format("zipdcm")
    .load("/path/to/dicom_files.zip")
)
```

This project uses uv for build and environment management.
```shell
# Install uv
brew install uv

# Create the development environment
make dev

# Run tests for a specific submodule
make test-module MODULE=mcap
make test-module MODULE=mqtt
make test-module MODULE=zipdcm
```

Project layout:

```
python-data-sources/
├── pyproject.toml               # Unified build configuration
├── src/
│   └── python_data_sources/     # Main package
│       ├── mcap/                # MCAP data source
│       ├── mqtt/                # MQTT streaming data source
│       └── zipdcm/              # ZipDCM data source
├── tests/
│   └── unit/
│       ├── common/              # Common module tests
│       ├── mcap/                # MCAP tests
│       ├── mqtt/                # MQTT tests
│       └── zipdcm/              # ZipDCM tests
└── .github/workflows/
    └── test.yml                 # CI/CD workflow
```
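When experimenting with the zipdcm source or writing unit tests for it, a zip archive fixture can be assembled with Python's standard-library `zipfile` module. The sketch below is illustrative: the member names are arbitrary and the payloads are placeholder bytes, not valid DICOM, so real tests would embed genuine `.dcm` content instead.

```python
import io
import zipfile

# Assemble an in-memory zip archive; the .dcm members here are placeholder
# bytes that only demonstrate the archive layout expected by a zip reader.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("study1/image1.dcm", b"\x00" * 128 + b"DICM")
    zf.writestr("study1/image2.dcm", b"\x00" * 128 + b"DICM")

# Write the fixture to disk so its path can be passed to
# spark.read.format("zipdcm").load(...)
with open("dicom_fixture.zip", "wb") as f:
    f.write(buf.getvalue())
```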
Refer to the python-data-sources documentation for detailed information on how to use the supplied Python data sources, their features, and their configuration options.
See CONTRIBUTING.md for detailed information about contributing to the Python Data Sources library.
© 2025 Databricks, Inc. All rights reserved. The source in this project is provided subject to the Databricks License [https://databricks.com/db-license-source]. All included or referenced third party libraries are subject to the licenses set forth below.
| Datasource | Package | Purpose | License | Source |
|---|---|---|---|---|
| mcap | mcap | Python API for MCAP files | MIT | https://github.com/foxglove/mcap |
| mcap | mcap-protobuf-support | Protobuf schema support | MIT | https://github.com/foxglove/mcap |
| mqtt | paho-mqtt | MQTT client library | EPL-2.0 / EDL-1.0 (BSD-3) | https://github.com/eclipse/paho.mqtt.python |
| zipdcm | pydicom | Python API for DICOM files | MIT | https://github.com/pydicom/pydicom |
| zipdcm | pylibjpeg | Decoding / Encoding pixel formats | MIT | https://github.com/pydicom/pylibjpeg |
| zipdcm | pylibjpeg-openjpeg | J2K, JP2, and HTJ2K plugin for pylibjpeg | MIT | https://github.com/pydicom/pylibjpeg-openjpeg |
| zipdcm | pylibjpeg-rle | RLE plugin for pylibjpeg | MIT | https://github.com/pydicom/pylibjpeg-rle |