pyspark-style-guide by palantir

Description: A guide to PySpark code style presenting common situations and the associated best practices, based on the most frequently recurring topics across the PySpark repos we've encountered.


Summary Information

Updated 2 hours ago
Added to GitGenius on September 5th, 2025
Created on October 15th, 2020
Open Issues/Pull Requests: 6
Number of forks: 160
Total Stargazers: 1,226
Total Subscribers: 248
Detailed Description

The Palantir PySpark Style Guide, hosted at https://github.com/palantir/pyspark-style-guide, is a comprehensive document detailing best practices for writing maintainable, readable, and performant PySpark code. It’s not merely a linter configuration, but a philosophy emphasizing clarity and avoiding common pitfalls when working with distributed data processing. The guide aims to standardize PySpark development within Palantir, and is released publicly to benefit the wider community. It’s particularly valuable for teams working on large-scale data pipelines and applications.

A core tenet of the guide is prioritizing readability. It strongly advocates for explicit code over cleverness, favoring longer, more descriptive names for variables and functions over concise but ambiguous ones. This extends to avoiding overly complex one-liners and breaking down operations into smaller, named steps. The guide emphasizes the importance of comments, not to explain *what* the code does (which should be self-evident), but *why* it does it, particularly when dealing with complex business logic or data transformations. Consistent formatting, as enforced by tools like Black and isort, is also crucial for readability and is a non-negotiable aspect of the style.

The guide provides detailed recommendations on specific PySpark idioms and anti-patterns. It discourages calling `collect()` on large datasets, which can cause out-of-memory errors on the driver and degrade performance; instead, it promotes `take()` for sampling, or writing results to storage rather than pulling them back to the driver. It also cautions against `foreach()` and `foreachPartition()`, which offer no guarantees about execution order and invite side effects. The guide strongly favors Spark’s built-in functions and transformations over user-defined functions (UDFs) wherever possible, since UDFs hinder Catalyst optimization and add serialization overhead. When UDFs are unavoidable, it recommends Pandas UDFs (vectorized UDFs) for improved performance.

Regarding data handling, the style guide stresses the importance of schema definition. Explicitly defining schemas using `StructType` and `StructField` improves data quality, enables Spark’s Catalyst optimizer to work more effectively, and prevents runtime errors caused by schema mismatches. It also advocates for using appropriate data types to minimize storage and processing costs. The guide provides guidance on handling null values and avoiding common pitfalls related to data type conversions. It also recommends using Spark’s partitioning features to optimize data distribution and parallelism.

Finally, the guide covers testing and debugging strategies. It encourages writing unit tests to verify the correctness of individual transformations and functions. It also recommends using Spark’s logging framework for debugging and monitoring data pipelines. The guide acknowledges the challenges of debugging distributed applications and provides tips for identifying and resolving performance bottlenecks. The repository includes a configuration file for popular linters (like flake8 and pylint) to help automate style enforcement, and links to resources for setting up a development environment compliant with the guide.
