Data registries, such as the T1D Exchange, advance data-driven innovation by compiling data from centers across the nation and using them to answer complex questions and develop strategies to improve patient outcomes. Contributing to a registry requires member institutions to map, transform, and validate their data – a challenging and detail-oriented task. Our institution reduced submission errors and improved data quality by using the open-source framework Pandera (Bantilan, 2020) to validate data before submission.
Pandera is a lightweight schema and data validation framework built in Python (3.7, 3.8, 3.9). It allows users to define a schema for their data and to specify a wide variety of data quality checks. Our team translated the mapping documentation provided by the T1D Exchange into data tests written in Python with Pandera. This validation step was added after data extraction and mapping, before any data were submitted to the T1D Exchange.
In the submission prior to using Pandera, the T1D Exchange reported 26 data schema and validation errors back to our institution. In the first monthly submission after adding Pandera schema validation to the workflow, only one error was reported.
Adding a Pandera schema and data validation step to the data extraction and mapping pipeline reduced the number of errors reported per submission from 26 to one. It also provides a flexible framework that can be adapted as the T1D Exchange changes its requirements. Because it is built with open-source tools, it can be easily shared with other member institutions.
Data Quality; Data Processing, Automatic; Data Sharing
Bantilan, N. (2020). pandera: Statistical Data Validation of Pandas Dataframes. Proceedings of the 19th Python in Science Conference, 116–124. https://doi.org/10.25080/MAJORA-342D178E-010