pyjanitor by pyjanitor-devs

Description: Clean APIs for data cleaning. Python implementation of R package Janitor

Summary Information

Updated 1 hour ago

Added to GitGenius on September 24th, 2025

Created on March 4th, 2018

Open Issues & Pull Requests: 111 (+0)

Number of forks: 186

Total Stargazers: 1,500 (+0)

Total Subscribers: 15 (+0)

Issue Activity (beta)

Open issues: 67

New in 7 days: 0

Closed in 7 days: 0

Avg open age: 1,792 days

Stale 30+ days: 65

Stale 90+ days: 58

Recent activity

Opened in 7 days: 0

Closed in 7 days: 0

Comments in 7 days: 0

Events in 7 days: 0

Top labels

good first issue (77)
enhancement (72)
docfix (68)
being worked on (62)
good intermediate issue (50)
available for hacking (49)
bug (18)
infrastructure (15)

Most active issues this week

No issue events were indexed in the last 7 days.

Repository Insights (GitGenius)

Median issue/PR response: 37.2 hours

Mean response time: 255.0 days

90th percentile: 932.0 days

Tracked items: 79

Most active contributors

samukweku - 108 events, 60 issues
ericmjl - 51 events, 30 issues
wjandrea - 18 events, 3 issues
raffaelemancuso - 10 events, 2 issues
jbbqqf - 7 events, 7 issues

Detailed Description

The pyjanitor repository hosts a Python library that streamlines and standardizes data cleaning and preprocessing in the pandas ecosystem. It extends pandas DataFrames with a set of cleaning methods that make common data wrangling operations more concise, readable, and reproducible. Its core idea is a "fluent" API: users chain cleaning steps together in a clear sequence, which reduces the boilerplate usually associated with data preparation.

pyjanitor targets messy, inconsistent, or incomplete data, and offers functions for a range of cleaning scenarios. For instance, `clean_names()` standardizes column names by lowercasing them and replacing spaces with underscores, with options for stripping special characters and applying other case styles, so that names stay consistent across datasets. `remove_empty()` drops rows and columns that are entirely empty, and `coalesce()` takes the first non-missing value across a sequence of columns. The library also includes reshaping functions, `pivot_longer()` and `pivot_wider()`, inspired by R's `tidyr` package, for moving data between long and wide layouts.

Beyond basic cleaning, pyjanitor provides utilities for checking and preparing data. `get_dupes()` returns the rows that are duplicated across specified columns, which helps when verifying data integrity. For categorical data, `factorize_columns()` encodes string categories as integer codes, which many machine learning models require. The library also offers `fill_direction()` for filling missing values downward or upward within a column, and `expand_grid()` for generating all combinations of its inputs, which is useful for building lookup tables or test scenarios.

The library integrates with pandas by registering its functions directly as DataFrame methods (via the pandas-flavor package) rather than under a separate accessor namespace. After `import janitor`, methods are called as `df.<method_name>()`, for example `df.clean_names()`. Because each method takes and returns an ordinary DataFrame, pyjanitor calls can be interspersed with standard pandas operations in a single chain. Method chaining improves readability and encourages a more declarative style, in which the sequence of transformations is laid out explicitly.

In summary, pyjanitor extends pandas with a suite of chainable methods for data cleaning and preprocessing. Its focus on standardization, readability, and reproducibility makes it useful across the workflow, from initial data ingestion to preparing datasets for analytics and machine learning. The repository is a community-driven effort aimed at one of the most time-consuming parts of the data science lifecycle.
