Merged
11 changes: 11 additions & 0 deletions docs/.markdownlint.jsonc
@@ -0,0 +1,11 @@
{
// Rules tuned for Docusaurus-flavoured Markdown imported from upstream PXF docs.
// We skip rules that conflict with how the upstream content is structured today.
"default": true,
"MD013": false, // line length: many existing lines exceed sensible limits
"MD024": false, // duplicate headings: tolerated for parallel structure
"MD033": false, // inline HTML: required for legacy <a id>, <dt>, <dd>, etc.
"MD034": false, // bare URLs: appear in Bookbinder-era references
"MD041": false, // first line h1: frontmatter-driven titles cover this
"MD046": { "style": "fenced" }
}
62 changes: 31 additions & 31 deletions docs/README.md
@@ -1,33 +1,33 @@
# PXF Documentation

This directory contains the book and markdown source for the PXF docs. You can build the markdown into HTML output using [Bookbinder](https://github.com/cloudfoundry-incubator/bookbinder).

Bookbinder is a Ruby gem that binds together a unified documentation web application from markdown, html, and/or DITA source material. The source material for bookbinder must be stored either in local directories or in GitHub repositories. Bookbinder runs [middleman](http://middlemanapp.com/) to produce a Rackup app that can be deployed locally or as a Web application.

This document provides instructions for building the PXF documentation on your local system. It includes the sections:

* [About Bookbinder](#about)
* [Prerequisites](#prereq)
* [Building the Documentation](#building)
* [Getting More Information](#moreinfo)


<a name="about"></a>
## About Bookbinder

You use bookbinder from within a project called a **book**. The book includes a configuration file named `config.yml` that specifies the documentation repositories/directories to use as source material. Bookbinder provides a set of scripts to aggregate those repositories and publish them to various locations in your final web application.

PXF provides a preconfigured **book** in the `docs/book` directory of this repo. You can use this configuration to build HTML for the PXF docs on your local system.

### Building the Documentation Using Docker

1. You can use the Docker environment in `ci/docker/pxf-cbdb-dev/ubuntu` which contains the necessary tools for development.

2. A local version of the documentation should be available for viewing at [http://localhost:9292](http://localhost:9292)


<a name="moreinfo"></a>
## Getting More Information

Bookbinder provides additional functionality to construct books from multiple Github repos, to perform variable substitution, and also to automatically build documentation in a continuous integration pipeline. For more information, see [https://github.com/pivotal-cf/bookbinder](https://github.com/pivotal-cf/bookbinder).

The Markdown source in this directory is published as part of the Apache
Cloudberry website at <https://cloudberry.apache.org/docs>. The site build is
owned by [`apache/cloudberry-site`](https://github.com/apache/cloudberry-site)
(Docusaurus-based); this repository keeps the source-of-truth Markdown so
documentation can be edited alongside the PXF code it describes.

## Authoring guidelines

- Files use plain Markdown with Docusaurus-flavoured frontmatter:

```markdown
---
title: Reading and Writing HDFS Parquet Data
description: Read and write Parquet data in HDFS via PXF.
sidebar_position: 6
---
```

- Internal links use relative `.md` paths (for example,
`./hdfs_text.md` or `../administering/cfg_server.md#about-the-pxffsbasepath-property`).
Heading anchors are auto-generated by Docusaurus from the heading text.
- Images live under [`graphics/`](graphics/) and are referenced with relative
paths (for example, `../graphics/pxfarch.png`).
- The directory layout mirrors the website sidebar. Each sub-directory carries a
`_category_.json` describing its label and position; new categories should
follow the same pattern.
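
To illustrate the frontmatter convention above, here is a minimal sketch in Python of how a pre-commit check might read the leading block of a page. The function name `read_frontmatter` is hypothetical and not part of this repository or of Docusaurus. The sketch only handles flat `key: value` lines like those in the example above.

```python
def read_frontmatter(markdown_text):
    """Return the flat key/value pairs from a leading '---' frontmatter block.

    Hypothetical helper: it handles only simple `key: value` lines, not full YAML.
    Returns an empty dict when the block is missing or never closed.
    """
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # closing delimiter reached
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # frontmatter was opened but never closed


page = """---
title: Reading and Writing HDFS Parquet Data
description: Read and write Parquet data in HDFS via PXF.
sidebar_position: 6
---
Body text.
"""
missing = {"title", "description"} - read_frontmatter(page).keys()
print(sorted(missing))
```

An empty result from the last line means both the `title` and `description` keys are present.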

## Local preview

Clone `apache/cloudberry-site`, point its sync script at this checkout, and run
`npm start`. See the cloudberry-site repository for details.
10 changes: 10 additions & 0 deletions docs/access-hadoop/_category_.json
@@ -0,0 +1,10 @@
{
"label": "Accessing Hadoop with PXF",
"link": {
"type": "doc",
"id": "access-hadoop/access_hdfs"
},
"position": 30,
"collapsible": true,
"collapsed": true
}
@@ -1,5 +1,7 @@
---
title: Accessing Hadoop
description: Accessing Hadoop services with PXF.
sidebar_position: 1
---

<!--
@@ -23,33 +25,33 @@ under the License.

PXF is compatible with Cloudera, Hortonworks Data Platform, and generic Apache Hadoop distributions. PXF is installed with HDFS, Hive, and HBase connectors. You use these connectors to access varied formats of data from these Hadoop distributions.

## <a id="hdfs_arch"></a>Architecture
## Architecture

HDFS is the primary distributed storage mechanism used by Apache Hadoop. When a user or application performs a query on a PXF external table that references an HDFS file, the Apache Cloudberry coordinator host dispatches the query to all segment instances. Each segment instance contacts the PXF Service running on its host. When it receives the request from a segment instance, the PXF Service:

1. Allocates a worker thread to serve the request from the segment instance.
2. Invokes the HDFS Java API to request metadata information for the HDFS file from the HDFS NameNode.

<span class="figtitleprefix">Figure: </span>PXF-to-Hadoop Architecture
<span className="figtitleprefix">Figure: </span>PXF-to-Hadoop Architecture

![Greenplum Platform Extenstion Framework to Hadoop Architecture](graphics/pxfarch.png "Apache Cloudberry Platform Extension Framework-to-Hadoop Architecture")
![Apache Cloudberry Platform Extension Framework to Hadoop Architecture](../graphics/pxfarch.png "Apache Cloudberry Platform Extension Framework-to-Hadoop Architecture")

A PXF worker thread works on behalf of a segment instance. A worker thread uses its Apache Cloudberry `gp_segment_id` and the file block information described in the metadata to assign itself a specific portion of the query data. This data may reside on one or more HDFS DataNodes.
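
The self-assignment described above can be sketched in Python. This is an illustration only: it assumes a plain round-robin split by list index, and the `Fragment` type and `fragments_for_segment` function are hypothetical names, not PXF APIs. The policy PXF actually applies may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Hypothetical stand-in for one block-sized piece of an HDFS file."""
    path: str
    offset: int
    length: int


def fragments_for_segment(fragments, gp_segment_id, segment_count):
    """Return the fragments that the worker for gp_segment_id would read.

    Assumption: round-robin by position in the metadata list, so every
    fragment is claimed by exactly one segment.
    """
    if segment_count <= 0:
        raise ValueError("segment_count must be positive")
    return [
        frag
        for index, frag in enumerate(fragments)
        if index % segment_count == gp_segment_id
    ]


# Five 128-byte fragments shared among three segment instances.
blocks = [Fragment("/data/pxf_examples/example.txt", i * 128, 128) for i in range(5)]
for seg_id in range(3):
    print(seg_id, [f.offset for f in fragments_for_segment(blocks, seg_id, 3)])
```

Because each worker derives its share from the same metadata and its own segment ID, the workers need no coordination among themselves to avoid reading the same data twice.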

The PXF worker thread invokes the HDFS Java API to read the data and delivers it to the segment instance. The segment instance delivers its portion of the data to the Apache Cloudberry coordinator host. This communication occurs across segment hosts and segment instances in parallel.


## <a id="hadoop_prereq"></a>Prerequisites
## Prerequisites

Before working with Hadoop data using PXF, ensure that:

- You have configured PXF, and PXF is running on each Apache Cloudberry host. See [Configuring PXF](instcfg_pxf.html) for additional information.
- You have configured the PXF Hadoop Connectors that you plan to use. Refer to [Configuring PXF Hadoop Connectors](client_instcfg.html) for instructions. If you plan to access JSON-formatted data stored in a Cloudera Hadoop cluster, PXF requires a Cloudera version 5.8 or later Hadoop distribution.
- You have configured PXF, and PXF is running on each Apache Cloudberry host. See [Configuring PXF](../administering/configuring/instcfg_pxf.md) for additional information.
- You have configured the PXF Hadoop Connectors that you plan to use. Refer to [Configuring PXF Hadoop Connectors](../administering/configuring/hadoop-connectors/client_instcfg.md) for instructions. If you plan to access JSON-formatted data stored in a Cloudera Hadoop cluster, PXF requires a Cloudera version 5.8 or later Hadoop distribution.
- If user impersonation is enabled (the default), ensure that you have granted read (and write as appropriate) permission to the HDFS files and directories that will be accessed as external tables in Apache Cloudberry to each Apache Cloudberry user/role name that will access the HDFS files and directories. If user impersonation is not enabled, you must grant this permission to the `gpadmin` user.
- Time is synchronized between the Apache Cloudberry hosts and the external Hadoop systems.


## <a id="hdfs_cmdline"></a>HDFS Shell Command Primer
## HDFS Shell Command Primer
Examples in the PXF Hadoop topics access files on HDFS. You can choose to access files that already exist in your HDFS cluster. Or, you can follow the steps in the examples to create new files.

A Hadoop installation includes command-line tools that interact directly with your HDFS file system. These tools support typical file system operations that include copying and listing files, changing file permissions, and so forth. You run these tools on a system with a Hadoop client installation. By default, Apache Cloudberry hosts do not
@@ -87,7 +89,7 @@ Display the contents of a text file located in HDFS:
$ hdfs dfs -cat /data/pxf_examples/example.txt
```

## <a id="hadoop_connectors"></a>Connectors, Data Formats, and Profiles
## Connectors, Data Formats, and Profiles

The PXF Hadoop connectors provide built-in profiles to support the following data formats:

@@ -105,26 +107,26 @@ The PXF Hadoop connectors expose the following profiles to read, and in many cas

| Data Source | Data Format | Profile Name(s) | Foreign Data Wrapper format | Supported Operations |
|-------------|------|---------|-----|-----|
| HDFS | delimited single line [text](hdfs_text.html#profile_text) | hdfs:text | text | Read, Write |
| HDFS | delimited single line comma-separated values of [text](hdfs_text.html#profile_text) | hdfs:csv | csv | Read, Write |
| HDFS | multi-byte or multi-character delimited single line [csv](hdfs_text.html#multibyte_delim) | hdfs:csv | csv | Read |
| HDFS | fixed width single line [text](hdfs_fixedwidth.html) | hdfs:fixedwidth | | Read, Write |
| HDFS | delimited [text with quoted linefeeds](hdfs_text.html#profile_textmulti) | hdfs:text:multi | text:multi | Read |
| HDFS | [Avro](hdfs_avro.html) | hdfs:avro | avro | Read, Write |
| HDFS | [JSON](hdfs_json.html) | hdfs:json | json | Read, Write |
| HDFS | [ORC](hdfs_orc.html) | hdfs:orc | orc | Read, Write |
| HDFS | [Parquet](hdfs_parquet.html) | hdfs:parquet | parquet | Read, Write |
| HDFS | delimited single line [text](./hdfs_text.md#reading-text-data) | hdfs:text | text | Read, Write |
| HDFS | delimited single line comma-separated values of [text](./hdfs_text.md#reading-text-data) | hdfs:csv | csv | Read, Write |
| HDFS | multi-byte or multi-character delimited single line [csv](./hdfs_text.md#about-reading-data-containing-multi-byte-or-multi-character-delimiters) | hdfs:csv | csv | Read |
| HDFS | fixed width single line [text](./hdfs_fixedwidth.md) | hdfs:fixedwidth | | Read, Write |
| HDFS | delimited [text with quoted linefeeds](./hdfs_text.md#reading-text-data-with-quoted-linefeeds) | hdfs:text:multi | text:multi | Read |
| HDFS | [Avro](./hdfs_avro.md) | hdfs:avro | avro | Read, Write |
| HDFS | [JSON](./hdfs_json.md) | hdfs:json | json | Read, Write |
| HDFS | [ORC](./hdfs_orc.md) | hdfs:orc | orc | Read, Write |
| HDFS | [Parquet](./hdfs_parquet.md) | hdfs:parquet | parquet | Read, Write |
| HDFS | AvroSequenceFile | hdfs:AvroSequenceFile | AvroSequenceFile | Read, Write |
| HDFS | [SequenceFile](hdfs_seqfile.html) | hdfs:SequenceFile | SequenceFile | Read, Write |
| [Hive](hive_pxf.html) | stored as TextFile | hive, [hive:text](hive_pxf.html#hive_text) | | Read |
| [Hive](hive_pxf.html) | stored as SequenceFile | hive | | Read |
| [Hive](hive_pxf.html) | stored as RCFile | hive, [hive:rc](hive_pxf.html#hive_hiverc) | | Read |
| [Hive](hive_pxf.html) | stored as ORC | hive, [hive:orc](hive_pxf.html#hive_orc) | orc | Read |
| [Hive](hive_pxf.html) | stored as Parquet | hive | | Read |
| [Hive](hive_pxf.html) | stored as Avro | hive | | Read |
| [HBase](hbase_pxf.html) | Any | hbase | - | Read |
| HDFS | [SequenceFile](./hdfs_seqfile.md) | hdfs:SequenceFile | SequenceFile | Read, Write |
| [Hive](./hive_pxf.md) | stored as TextFile | hive, [hive:text](./hive_pxf.md#accessing-textfile-format-hive-tables) | | Read |
| [Hive](./hive_pxf.md) | stored as SequenceFile | hive | | Read |
| [Hive](./hive_pxf.md) | stored as RCFile | hive, [hive:rc](./hive_pxf.md#accessing-rcfile-format-hive-tables) | | Read |
| [Hive](./hive_pxf.md) | stored as ORC | hive, [hive:orc](./hive_pxf.md#accessing-orc-format-hive-tables) | orc | Read |
| [Hive](./hive_pxf.md) | stored as Parquet | hive | | Read |
| [Hive](./hive_pxf.md) | stored as Avro | hive | | Read |
| [HBase](./hbase_pxf.md) | Any | hbase | - | Read |

### <a id="choose_profile"></a>Choosing the Profile
### Choosing the Profile

PXF provides more than one profile to access text and Parquet data on Hadoop. Here are some things to consider as you determine which profile to choose.

@@ -143,17 +145,17 @@ When accessing ORC-format data:
Choose the `hdfs:parquet` profile when the file is Parquet, you know the location of the file in the HDFS file system, and you want to take advantage of extended filter pushdown support for additional data types and operators.


### <a id="specify_profile"></a>Specifying the Profile for External Tables
### Specifying the Profile for External Tables

You must provide the profile name when you specify the `pxf` protocol in a `CREATE EXTERNAL TABLE` command to create a Apache Cloudberry external table that references a Hadoop file or directory, HBase table, or Hive table. For example, the following command creates an external table that uses the default server and specifies the profile named `hdfs:text` to access the HDFS file `/data/pxf_examples/pxf_hdfs_simple.txt`:
You must provide the profile name when you specify the `pxf` protocol in a `CREATE EXTERNAL TABLE` command to create an Apache Cloudberry external table that references a Hadoop file or directory, HBase table, or Hive table. For example, the following command creates an external table that uses the default server and specifies the profile named `hdfs:text` to access the HDFS file `/data/pxf_examples/pxf_hdfs_simple.txt`:

``` sql
CREATE EXTERNAL TABLE pxf_hdfs_text(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=hdfs:text')
FORMAT 'TEXT' (delimiter=E',');
```
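
The `LOCATION` string in the command above follows a fixed layout: the `pxf://` scheme, the HDFS path without its leading slash, and a `PROFILE` option. As a rough sketch, a script that generates such DDL might assemble the string as shown below. The helper name `pxf_location` is hypothetical, and only the layout visible in the example is assumed.

```python
def pxf_location(path, profile):
    """Build a LOCATION URI shaped like the one in the example above.

    Hypothetical helper: it covers only the path and PROFILE option shown
    in the example, and does no escaping of special characters.
    """
    if not profile:
        raise ValueError("a profile name is required")
    # The example URI omits the leading slash of the HDFS path.
    return f"pxf://{path.lstrip('/')}?PROFILE={profile}"


print(pxf_location("/data/pxf_examples/pxf_hdfs_simple.txt", "hdfs:text"))
```

The function rejects an empty profile because the profile name is mandatory when using the `pxf` protocol.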

### <a id="specify_fdw_profile"></a>Specifying the Profile for Foreign Tables
### Specifying the Profile for Foreign Tables

When you use the `hdfs_pxf_fdw`, `hive_pxf_fdw`, or `hbase_pxf_fdw` foreign data wrapper in a `CREATE FOREIGN TABLE` command, you must specify a server name that you configured in the Prerequisites section above. The foreign table can reference a Hadoop file or directory, an HBase table, or a Hive table. For example, the following commands create a foreign server named `hadoop_server` with the `hdfs_pxf_fdw` foreign data wrapper, then create a foreign table that uses the `text` format to access the HDFS file `data/pxf_examples/pxf_hdfs_simple.txt`:
