"What knowledge is lacking to act effectively against a surge in global air pollution?" - that is the question that kicked off a research project a while back and, for me personally, lead to a long-lasting struggle revolving around the accessibility and usability of data. With more and more research being done, it also becomes more and more challenging to obtain the "right" data for a project. In the field of global air pollution, there are vast amounts of data available. Finding the relevant data, selecting it, and then integrating it were the main hurdles that had to be overcome. Here, I want to share three of my insights in the process of developing an own dataset on air pollution from global coal power generation.
1. Document data quality - really!
Finding data is often difficult. For our air pollution project, that was not the case. Numerous data sources along the whole coal power supply chain had to be combined, so data selection became the true challenge. As the raw data showed overlaps, it was important to know the quality of the data sources. Uncertainties for pollutant measurements, for example, were often lower than those for modeled pollutant releases, so it made sense to prioritize the less uncertain data. Unfortunately, though, documentation of dataset quality was missing in several cases. A full uncertainty assessment may be outside the scope of many studies, but I realized that even a qualitative indication of data quality can be extremely helpful when comparing or combining data from different sources.
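To illustrate how even a coarse, qualitative quality flag can drive the merging of overlapping sources, here is a minimal sketch. The quality tiers, field names, and example records are illustrative assumptions, not part of the original dataset:

```python
# Sketch: merging overlapping records using a qualitative quality flag.
# Quality tiers, field names, and example values are illustrative assumptions.

QUALITY_RANK = {"measured": 0, "reported": 1, "modeled": 2}  # lower = better

def merge_by_quality(records):
    """For each plant ID, keep the record from the best-quality source."""
    best = {}
    for rec in records:
        plant = rec["plant_id"]
        if (plant not in best
                or QUALITY_RANK[rec["quality"]] < QUALITY_RANK[best[plant]["quality"]]):
            best[plant] = rec
    return best

records = [
    {"plant_id": "A", "so2_t": 120.0, "quality": "modeled"},
    {"plant_id": "A", "so2_t": 95.0, "quality": "measured"},
    {"plant_id": "B", "so2_t": 40.0, "quality": "modeled"},
]
merged = merge_by_quality(records)  # plant A keeps the measured value
```

Without the quality flag in the source documentation, such a prioritization rule cannot be applied at all, and the overlap has to be resolved by guesswork.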
2. Provide unnecessary details
During the project, my colleagues and I had to find creative ways to deal with data gaps. Only once our basic air pollution model was set up did it become fully clear how important each data point was for its outcomes. For example, a central parameter was fuel consumption, which allowed us to fill gaps where electricity generation was not reported. Fuel consumption data, in turn, was often also unavailable, but it could be back-calculated from reported carbon dioxide emissions via a carbon balance. This is one case where a single data point per power plant improved our result quality dramatically in an unforeseen way. Benefiting from this kind of proxy data is only possible when researchers publish the detailed data they have collected - even when its usefulness for others, as well as its direct relevance for their own research outcomes, may not be clear immediately.
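The carbon-balance back-calculation mentioned above can be sketched as follows. The coefficients (carbon fraction of the coal, oxidation factor) are illustrative assumptions, not values from our study - in practice they depend on the fuel type and combustion conditions:

```python
# Sketch: back-calculating fuel consumption from reported CO2 emissions
# via a carbon balance. Coefficients are illustrative assumptions.

CO2_PER_C = 44.0 / 12.0  # molar mass ratio of CO2 to carbon

def fuel_from_co2(co2_tonnes, carbon_fraction=0.6, oxidation_factor=0.99):
    """Estimate fuel burned (tonnes of coal) from CO2 emitted (tonnes).

    carbon_fraction: mass fraction of carbon in the coal (fuel-dependent)
    oxidation_factor: share of fuel carbon actually oxidized to CO2
    """
    carbon_emitted = co2_tonnes / CO2_PER_C  # tonnes of carbon in the CO2
    return carbon_emitted / (carbon_fraction * oxidation_factor)

# Example: a plant reporting 1,000,000 t of CO2 per year
fuel = fuel_from_co2(1_000_000)  # roughly 459,000 t of coal under these assumptions
```

The estimated fuel consumption can then serve as a proxy to fill gaps in electricity generation data, which is exactly the kind of chain that only works when the underlying data points are published.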
3. Share ugly code
When reading articles about power plant models, I struggled to translate model descriptions into code despite clear and transparent documentation, while a few code snippets here and there sped up the model development a lot. And when my colleagues and I had to describe our own model code, we realized how difficult it is to compromise between level of detail and understandability in such a case. A good general solution could be to provide the model source code as extended documentation for a paper - even when the code is written in a pragmatic way without the help of a professional programmer. At least in the field of sustainability research, this is still the exception rather than the rule. In my experience, though, the benefits of sharing code are numerous: research becomes more transparent, feedback can help to improve the models, and duplicated work can be avoided. Why not give it a try?
Finally, I want to encourage you, the reader, to share your views. Do you agree or disagree with these suggestions? Have you had similar or different experiences? Based on what we learned during our project, my colleagues and I have tried to provide a large amount of data in the supporting information of our paper, and we make additional data and code available externally. Feel free to have a look and tell us what to improve. Constructive criticism is what makes us learn most.
Title photo: kamilpetran/ iStock