The sebastianruder/nlp-progress repository tracks and documents progress in Natural Language Processing (NLP). Its primary purpose is to give an up-to-date overview of state-of-the-art (SOTA) results across a wide range of NLP tasks, together with the datasets used for evaluation. By aggregating benchmark results and dataset information in one place, it serves as a reference point for researchers, practitioners, and anyone following advances and current best practices in NLP.
The repository is organized by language and task, with English receiving the most extensive coverage. For English, it lists dozens of NLP tasks, including automatic speech recognition, constituency and dependency parsing, coreference resolution, information extraction, machine translation, sentiment analysis, question answering, and summarization. Each task links to a dedicated markdown file that details the relevant datasets, the best-performing models, their scores, and references to the original papers or sources. Where available, the files also link to code implementations and mark each link as official or unofficial.
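Because each task file presents its results as a markdown table, the entries can be read programmatically. The following is a minimal Python sketch of that idea, not code from the repository. The sample table, its column names, the model names, scores, and URLs are all placeholders assumed for illustration, and `split_row`, `parse_table`, and `extract_link` are hypothetical helper names.

```python
import re

# Hypothetical excerpt in the style of a task file. The models, scores,
# and URLs are placeholders, not real results from the repository.
SAMPLE = """\
| Model | F1 | Paper / Source | Code |
| ----- | :---: | --- | --- |
| Model A | 93.5 | [Paper A](https://example.org/a) | [Official](https://example.org/a-code) |
| Model B | 92.1 | [Paper B](https://example.org/b) | |
"""

# Matches a markdown link of the form [text](url).
LINK = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def split_row(line):
    """Split one pipe-delimited table line into stripped cell strings."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def parse_table(markdown):
    """Return one dict per data row, keyed by the header cells."""
    lines = [ln for ln in markdown.splitlines() if ln.strip().startswith("|")]
    if len(lines) < 2:
        return []
    header = split_row(lines[0])
    # lines[1] is the |---|---| separator row, so data starts at index 2.
    return [dict(zip(header, split_row(ln))) for ln in lines[2:]]


def extract_link(cell):
    """Return (text, url) for the first markdown link in a cell, or None."""
    match = LINK.search(cell)
    return (match.group(1), match.group(2)) if match else None


rows = parse_table(SAMPLE)
best = max(rows, key=lambda row: float(row["F1"]))
```

Here `best` holds the row with the highest score, and `extract_link(best["Paper / Source"])` recovers the paper title and URL. The sketch assumes no cell contains a literal pipe character; real task files may need more careful handling.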
Beyond English, the repository includes sections for other languages, including Vietnamese, Hindi, Chinese, French, Russian, Spanish, Portuguese, Korean, Nepali, Bengali, Persian, Turkish, German, and Arabic. For each language, it highlights the most prominent NLP tasks and provides the same kind of information about datasets and SOTA results. In some cases, it also points to regularly maintained external resources or leaderboards, where users can find more current results.
A key feature of the repository is its collaborative, community-driven approach. Contributors are encouraged to add new results, datasets, or tasks by editing the relevant markdown files directly on GitHub. The contribution guidelines ask for results from published papers and widely used datasets, and they explain how to add code links, describe datasets, and keep the documentation consistent in structure and quality.
The repository maintains a wish list of tasks and datasets that are not yet covered and invites the community to help fill these gaps. It also offers instructions for exporting the data into a structured, machine-readable JSON format, which is useful for further analysis or integration into other tools. For those who want to build the site locally, it explains how to generate the website from the markdown files with Jekyll.
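To illustrate why a machine-readable export is useful, the sketch below nests a few result rows under a task and dataset, serializes them to JSON, and reads them back to find the top-scoring entry. It is an illustration only: the schema (`task`, `datasets`, `metric`, `results`, and so on) is assumed for this example and is not the format produced by the repository's own export instructions, and the rows are placeholder data.

```python
import json


def to_structured(task, dataset, metric, rows):
    """Nest flat result rows under a task/dataset record.

    The field names here are a hypothetical schema chosen for this sketch.
    """
    return {
        "task": task,
        "datasets": [
            {
                "dataset": dataset,
                "metric": metric,
                "results": [
                    {
                        "model": row["model"],
                        "score": float(row["score"]),
                        # Missing links are stored as null in the JSON output.
                        "paper_url": row.get("paper_url"),
                    }
                    for row in rows
                ],
            }
        ],
    }


# Placeholder rows, not real benchmark results.
rows = [
    {"model": "Model A", "score": "93.5", "paper_url": "https://example.org/a"},
    {"model": "Model B", "score": "92.1"},
]

record = to_structured("Example task", "Example dataset", "F1", rows)
exported = json.dumps([record], indent=2)

# A downstream tool can load the export and query it without parsing markdown.
loaded = json.loads(exported)
results = loaded[0]["datasets"][0]["results"]
best = max(results, key=lambda entry: entry["score"])
```

Once the data is in this shape, comparing models across datasets becomes a matter of ordinary list and dictionary operations rather than text scraping.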
Overall, sebastianruder/nlp-progress offers a centralized, well-organized overview of progress in NLP that aims to stay current. It supports research and development by making it easy to find benchmark datasets, compare SOTA results, and locate relevant code, which in turn promotes transparency in the field.