Developer Guide

All developer resources related to Parquet.

1: Sub-Projects
2: Building Parquet
3: Contributing to Parquet-Java
4: Releasing Parquet-Java

This section contains the developer-specific documentation related to Parquet.

1 - Sub-Projects

The parquet-format project contains format specifications and Thrift definitions of metadata required to properly read Parquet files.

The parquet-java project is a Java library to read and write Parquet files. It consists of multiple sub-modules, which implement the core components of reading and writing a nested, column-oriented data stream, to and from the Parquet format, along with Hadoop Input/Output Formats, Pig loaders, and other Java-based utilities for interacting with Parquet.

The parquet-cpp project is a C++ library to read and write Parquet files. It is part of the Apache Arrow C++ implementation, with bindings to Python, R, Ruby and C/GLib.

The parquet-rs project is a Rust library to read and write Parquet files.

The parquet-go project is a Go library to read and write Parquet files. It is part of the Apache Arrow Go implementation.

The parquet-compatibility project (deprecated) contains compatibility tests that can be used to verify that implementations in different languages can read and write each other’s files. As of January 2022, compatibility tests only exist up to version 1.2.0.

2 - Building Parquet

How to build Parquet

Java resources can be built using mvn package. The current stable version should always be available from Maven Central.

C++ Thrift resources can be generated via make.

The Thrift definitions can also be compiled into any other Thrift-supported language.

3 - Contributing to Parquet-Java

How to contribute to Parquet-Java

Pull Requests

We prefer to receive contributions in the form of GitHub pull requests. Please send pull requests against the github.com/apache/parquet-java repository. If you’ve previously forked Parquet from its old location, you will need to add a remote or update your origin remote to https://github.com/apache/parquet-java.git. Here are a few tips to get your contribution in:

Break your work into small, single-purpose patches if possible. It’s much harder to merge in a large change with a lot of disjoint features.
Create an issue in the Parquet-Java issue tracker.
Submit the patch as a GitHub pull request against the master branch. For a tutorial, see the GitHub guides on forking a repo and sending a pull request. Prefix your pull request title with the issue number, such as GH-2935: (ex: https://github.com/apache/parquet-java/pull/2951).
Make sure that your code passes the unit tests. You can run the tests with mvn test in the root directory.
Add new unit tests for your code.
All Pull Requests are tested automatically on GitHub Actions.
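
The title convention from the tips above can be checked locally before opening a pull request. The following is a minimal sketch: check_pr_title is a hypothetical helper that is not part of the Parquet-Java repository, and the pattern assumes the prefix is always GH-, the issue number, a colon, and a space.

```shell
# Hypothetical helper (not part of parquet-java): succeeds when a PR title
# starts with an issue prefix such as "GH-2935: " followed by some text.
check_pr_title() {
  printf '%s\n' "$1" | grep -Eq '^GH-[0-9]+: .+'
}

# Example usage
if check_pr_title "GH-2935: Describe the change here"; then
  echo "title ok"
else
  echo "title is missing the GH-<issue>: prefix"
fi
```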

If you’d like to report a bug but don’t have time to fix it, you can still raise an issue, or email the mailing list (dev@parquet.apache.org).

Committers

Merging a Pull Request

Merging a pull request requires being a committer on the project and approval of the PR by a committer who is not the author.

A pull request can be merged through the GitHub UI. By default, only squash and merge is enabled on the project.

When the PR solves an existing issue, ensure that it references the issue using the Closes #1234 line from the pull request template. This way the issue is linked to the PR, and GitHub will automatically close the relevant issue when the PR is merged.

Semantic versioning

Parquet-Java leverages semantic versioning to ensure compatibility for developers and users of the libraries as APIs and implementations evolve. The Maven plugin japicmp enforces this, and fails the build when an API is changed without going through the correct deprecation cycle. This applies to all modules except: parquet-benchmarks, parquet-cli, parquet-tools, parquet-format-structures, parquet-hadoop-bundle and parquet-pig-bundle.

All interfaces, classes, and methods targeted for deprecation must include the following:

@Deprecated annotation on the appropriate element
@deprecated Javadoc comment including: the version for removal, the appropriate alternative for usage
Replacement of existing code paths that use the deprecated behavior

/**
 * @param c the current class
 * @return the corresponding logger
 * @deprecated will be removed in 2.0.0; use org.slf4j.LoggerFactory instead.
 */
@Deprecated
public static Log getLog(Class<?> c) {
    return new Log(c);
}

Checking for API violations can be done by running mvn verify -Dmaven.test.skip=true japicmp:cmp.

Tracking issues using Milestones

When a PR fixes a bug or adds a feature that you want to target at a certain version, make sure to attach a milestone. This way other committers can track certain versions, and see what is still pending. For information on the actual release, please check the release page.

Maintenance branches

Once a PR has been merged to master, the commit may need to be backported to a maintenance branch (ex: parquet-1.14.x). The easiest way to do this is locally:

Make sure that the remote is set up correctly:

git remote add github-apache git@github.com:apache/parquet-java.git

Now you can cherry-pick a PR to a previous branch:

git fetch --all
git checkout parquet-1.14.x
git reset --hard github-apache/parquet-1.14.x
git cherry-pick <hash-from-the-commit>
git push github-apache parquet-1.14.x

Website

Release Documentation

To create documentation for a new release of parquet-format, create a new .md file under content/en/blog/parquet-format. Please see existing files in that directory as an example.

To create documentation for a new release of parquet-java, create a new .md file under content/en/blog/parquet-java. Please see existing files in that directory as an example.

Website development and deployment

Staging

To make a change to the staging version of the website:

Make a PR against the staging branch in the repository
Once the PR is merged, the Build and Deploy Parquet Site job in the deployment workflow will be run, populating the asf-staging branch on this repo with the necessary files.

Do not directly edit the asf-staging branch of this repo

Production

To make a change to the production version of the website:

Make a PR against the production branch in the repository
Once the PR is merged, the Build and Deploy Parquet Site job in the deployment workflow will be run, populating the asf-site branch on this repo with the necessary files.

Do not directly edit the asf-site branch of this repo

4 - Releasing Parquet-Java

How to release Parquet-Java

Setup

N.B. The mechanics of releasing parquet-format are the same (e.g. setting up keys, branching, votes, etc.).

You will need:

PGP code signing keys, published in KEYS.
Permission to stage artifacts in Nexus.

Make sure you have permission to deploy Parquet artifacts to Nexus by pushing a snapshot:

mvn deploy

If you have problems, read the publishing Maven artifacts documentation.

Release process

Parquet uses the maven-release-plugin to tag a release and push binary artifacts to staging in Nexus. Once Maven completes the release, the official source tarball is built from the tag.

0. Before you start the release process

Verify that the release is finished (no planned Issues/PRs are pending on the milestone)
Build and test the project
Create a new branch for the release if this is a new minor version. For example, if the new minor version is 1.14.0, create a new branch parquet-1.14.x
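
The branch name in the last step can be derived from the version number. This is a minimal sketch, assuming versions always have the form major.minor.patch; release_branch is a hypothetical helper, not a script shipped in dev/, and cutting the branch from master is an assumption as well.

```shell
# Hypothetical helper: map a release version to its maintenance branch name,
# e.g. 1.14.0 -> parquet-1.14.x. Assumes a major.minor.patch version string.
release_branch() {
  # ${1%.*} strips the last dot-separated component (the patch number).
  printf 'parquet-%s.x\n' "${1%.*}"
}

release_branch 1.14.0   # prints "parquet-1.14.x"

# Creating and publishing the branch could then look like this, using the
# "github-apache" remote described in the contributing guide:
#   git checkout -b "$(release_branch 1.14.0)" github-apache/master
#   git push github-apache "$(release_branch 1.14.0)"
```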

1. Run the prepare script

./dev/prepare-release.sh <version> <rc-number>

This runs Maven’s release prepare with a consistent tag name. After this step, the release tag will exist in the git repository.
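
Judging by the examples elsewhere in this guide (such as apache-parquet-1.15.0-rc0), the tag follows the pattern apache-parquet-<version>-rc<rc-number>. The sketch below builds that name and checks whether the tag exists in the local repository. Both rc_tag and tag_exists are hypothetical helpers, and the pattern is inferred from this guide rather than read from the prepare script.

```shell
# Hypothetical helper: build the release candidate tag name used in this guide.
rc_tag() {
  printf 'apache-parquet-%s-rc%s\n' "$1" "$2"
}

# Hypothetical helper: succeed when the given tag exists in the current repository.
tag_exists() {
  git rev-parse --quiet --verify "refs/tags/$1" >/dev/null 2>&1
}

tag=$(rc_tag 1.15.0 0)
if tag_exists "$tag"; then
  echo "found $tag"
else
  echo "missing $tag"
fi
```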

If the prepare step fails, you can roll back the changes by running these commands:

find ./ -type f -name '*.releaseBackup' -exec rm {} \;
find ./ -type f -name 'pom.xml' -exec git checkout {} \;

2. Run release:perform to stage binaries

Upload binary artifacts for the release tag to Nexus:

mvn release:perform -DskipTests -Darguments=-DskipTests

3. In Nexus, close the staging repository

Closing a staging repository makes the binaries available in staging, but does not publish them.

Go to Nexus.
In the menu on the left, choose “Staging Repositories”.
Select the Parquet repository.
At the top, click “Close” and follow the instructions. For the comment use “Apache Parquet [Format] ”.

4. Run the source tarball script

dev/source-release.sh <version> <rc-number>

This script builds the source tarball from the release tag’s SHA1, signs it, and uploads the necessary files with SVN.

The source release is pushed to https://dist.apache.org/repos/dist/dev/parquet/

The last message from the script is the release commit’s SHA1 hash and URL for the VOTE e-mail.

5. Prepare the pre-release

Creating the pre-release on GitHub gives users the changelog so they can see whether they need to validate certain functionality. First select the newly created RC tag (ex: apache-parquet-1.15.0-rc0), and then the previous release (ex: apache-parquet-1.14.1). Hit the Generate release notes button to auto-generate the notes. You can curate the notes a bit by removing unrelated changes (whitespace, test-only changes) and sorting them to make them easier to digest. Make sure to check the Set as pre-release checkbox, as this is a release candidate.

6. Send a VOTE e-mail to dev@parquet.apache.org

Here is a template you can use. Make sure everything applies to your release.

Subject: [VOTE] Release Apache Parquet <VERSION> RC<NUM>


Hi everyone,

I propose the following RC to be released as the official Apache Parquet <VERSION> release.

The commit id is <SHA1>
* This corresponds to the tag: apache-parquet-<VERSION>-rc<NUM>
* https://github.com/apache/parquet-java/tree/<SHA1>

The release tarball, signature, and checksums are here:
* https://dist.apache.org/repos/dist/dev/parquet/<PATH>

You can find the KEYS file here:
* https://downloads.apache.org/parquet/KEYS

You can find the changelog here:
https://github.com/apache/parquet-java/releases/tag/apache-parquet-<VERSION>-rc<NUM>

Binary artifacts are staged in Nexus here:
* https://repository.apache.org/content/groups/staging/org/apache/parquet/

This release includes important changes that I should have summarized here, but I'm lazy.

Please download, verify, and test.

Please vote in the next 72 hours.

[ ] +1 Release this as Apache Parquet <VERSION>
[ ] +0
[ ] -1 Do not release this because...
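
For reviewers responding to the vote, the "download, verify, and test" request usually means checking the signature and checksum of the source tarball. The following is a sketch under stated assumptions: the file names are placeholders, the checksum file is assumed to begin with the lowercase hex digest (the format sha512sum writes), and sha512sum from GNU coreutils is assumed to be installed (on macOS, shasum -a 512 is the usual substitute). verify_sha512 is a hypothetical helper.

```shell
# Hypothetical helper: succeed when the SHA-512 digest of FILE ($1) matches
# the first whitespace-separated field of CHECKSUM_FILE ($2).
verify_sha512() {
  expected=$(awk '{print $1; exit}' "$2")
  actual=$(sha512sum "$1" | awk '{print $1}')
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Typical use after downloading the candidate files and the KEYS file
# (the file names below are placeholders, not actual artifact names):
#   gpg --import KEYS
#   gpg --verify release.tar.gz.asc release.tar.gz
#   verify_sha512 release.tar.gz release.tar.gz.sha512 && echo "checksum ok"
```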

Publishing after the vote passes

After a release candidate passes a vote, the candidate needs to be published as the final release.

1. Tag final release and set development version

./dev/finalize-release <release-version> <rc-num> <new-development-version-without-SNAPSHOT-suffix>

This adds the final release tag to the RC tag and sets the new development version in the pom files. If everything is fine, push the changes and the new tag to GitHub: git push --follow-tags

2. Release the binary repository in Nexus

Releasing a binary repository publishes the binaries to public.

Go to Nexus.
In the menu on the left, choose “Staging Repositories”.
Select the Parquet repository.
At the top, click Release and follow the instructions. For the comment use “Apache Parquet [Format] ”.

3. Copy the release artifacts in SVN into releases

Move the release candidate from the candidates (dev) location to the releases location in SVN:

svn mv https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-<VERSION>-rcN/ https://dist.apache.org/repos/dist/release/parquet/apache-parquet-<VERSION> -m "Parquet: Add release <VERSION>"
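
Because the two URLs in this command differ only in the dev versus release path and the RC suffix, it can help to build the command from the version and RC number and print it for review before running it. A minimal sketch: promote_cmd is a hypothetical helper, and it only prints the command rather than executing it.

```shell
# Hypothetical helper: print (do not run) the svn mv command that promotes a
# release candidate from dist/dev to dist/release.
promote_cmd() {
  version=$1
  rc=$2
  base=https://dist.apache.org/repos/dist
  printf 'svn mv %s/dev/parquet/apache-parquet-%s-rc%s/ %s/release/parquet/apache-parquet-%s -m "Parquet: Add release %s"\n' \
    "$base" "$version" "$rc" "$base" "$version" "$version"
}

# Review the printed command, then copy and run it manually.
promote_cmd 1.15.0 0
```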

4. Update parquet.apache.org

Update the downloads page on parquet.apache.org. Instructions for updating the site are on the contribution page.

5. Add the release to GitHub

Add a new release to GitHub. First select the new tag (ex: apache-parquet-1.15.0), and then the previous release (ex: apache-parquet-1.14.1). You can copy the release notes from the RC that passed the vote.

6. Send an ANNOUNCE e-mail to announce@apache.org and the dev list

[ANNOUNCE] Apache Parquet release <VERSION>


I'm pleased to announce the release of Parquet <VERSION>!

Parquet is a general-purpose columnar file format for nested data. It uses
space-efficient encodings and a compressed and splittable structure for
processing frameworks like Hadoop.

Changes are listed at: https://github.com/apache/parquet-java/releases/tag/apache-parquet-<VERSION>

This release can be downloaded from: https://parquet.apache.org/downloads/

Java artifacts are available from Maven Central.

Thanks to everyone for contributing!

7. Update parquet-format with feature enablement guidance

Recommendations for enabling new features are generally tied to releases of parquet-java (details are in the parquet-format repo). As releases are made, the specification should be updated to indicate the recommended dates for when a new feature may be enabled.

Release Cadence

Provided enough volunteers are available, the Parquet community aims to have releases on a quarterly basis (target months are January, April, July and October). If a new major version is necessary, it will be targeted for the October release.