From botched dates to missing records and more, poor spreadsheet management has led to major problems in scientific research. The prestigious journal Nature last week shared six tips for better spreadsheets.
And to address the DOn’t UsE sprEAdshEEts, UsE PythOn peanut gallery … from the article: “Data scientists express strong feelings about using spreadsheets for data analysis. On the whole, they prefer programming languages such as R and Python, in which analyses are more easily documented and more reproducible. But many researchers are more comfortable with spreadsheets, and being shamed for using them is counterproductive, says Labou.”
What principles would you add for a scientist looking to avert a spreadsheet-induced calamity?
🍅 Keep raw data raw: It's tempting to ride roughshod over the original data and make ad-hoc changes, but those changes are very difficult to audit and cause an even bigger mess later on if they don't serve the purpose of the analysis.
💻 Make data machine-readable: Rather than organize your data with borders, fill colors, or other formatting, encode everything needed to interpret the file in the cell values themselves, so that it would still be usable if converted to CSV, for example.
🔁 Be consistent: Use the same value to represent missing values throughout. Don’t label something as Ohio in one place and OH in another.
📄 Document your work: Have a log of exactly what was done to the data, in what order.
✌ Cross-check your data: Include error checks that audit whether row counts, column totals, and so forth have changed.
💭 Think ahead: The best cure for dirty data is prevention. Put processes in place so that data is collected and maintained in a usable format.
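The first and fourth principles above, keeping raw data untouched and logging exactly what was done, can be sketched in a few lines of Python. This is a minimal illustration, not anything from the Nature article; the function and step names are made up for the example.

```python
import copy

def apply_steps(raw_rows, steps):
    """Apply each (name, function) step to a deep copy of raw_rows,
    leaving the raw data untouched and logging what was done, in order."""
    data = copy.deepcopy(raw_rows)  # never mutate the raw data
    log = []
    for name, fn in steps:
        data = fn(data)
        log.append(f"{name}: {len(data)} rows after step")
    return data, log

raw = [["Ohio", "42"], ["OH", ""], ["Texas", "7"]]
steps = [
    ("drop_blank_values", lambda rows: [r for r in rows if r[1] != ""]),
    ("uppercase_states", lambda rows: [[r[0].upper(), r[1]] for r in rows]),
]
cleaned, log = apply_steps(raw, steps)
# raw is unchanged; the log doubles as the "what was done, in what order" record
```

The payoff is that the log is generated by the same code that does the work, so it can never drift out of sync with the data the way a hand-maintained changelog can.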
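The consistency and machine-readability principles can also be made concrete. The sketch below, again just an illustration with made-up alias and missing-value tables, collapses ad-hoc missing markers and inconsistent labels (Ohio vs. OH) to one canonical form, then round-trips the result through plain CSV:

```python
import csv, io

MISSING_CODES = {"", "NA", "N/A", "null", "-999"}   # ad-hoc markers seen in the wild
STATE_ALIASES = {"Ohio": "OH", "ohio": "OH"}        # illustrative alias table

def normalize_cell(value):
    """Collapse every missing-value marker to one canonical token and
    map known aliases (e.g. 'Ohio' vs 'OH') to a single spelling."""
    value = value.strip()
    if value in MISSING_CODES:
        return "NA"                      # one missing marker throughout
    return STATE_ALIASES.get(value, value)

rows = [["Ohio", "42"], ["OH", "N/A"], ["ohio", "-999"]]
cleaned = [[normalize_cell(v) for v in row] for row in rows]

# Because nothing depends on formatting, the result survives a round
# trip through plain CSV -- the machine-readability test in practice.
buf = io.StringIO()
csv.writer(buf).writerows(cleaned)
```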
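And the cross-checking principle amounts to comparing a few summary numbers before and after any change. A hypothetical audit helper, not from the article, might look like this:

```python
def audit(before_rows, after_rows, numeric_col):
    """Error-check a transformation: flag any change in row count
    or in the total of one numeric column."""
    problems = []
    if len(before_rows) != len(after_rows):
        problems.append("row count changed")
    before_total = sum(float(r[numeric_col]) for r in before_rows)
    after_total = sum(float(r[numeric_col]) for r in after_rows)
    if abs(before_total - after_total) > 1e-9:
        problems.append("column total changed")
    return problems

before = [["OH", "10"], ["TX", "5"]]
after = [["OH", "10"], ["TX", "4"]]   # a value was silently edited
# audit(before, after, 1) reports the discrepancy
```

The same idea translates directly to a spreadsheet: a cell holding `=SUM(...)` over a column, compared against a frozen copy of the original total.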
What stood out to me is how much Power Query in particular can do to solve so many of these woes, whether it's the need for an applied-steps log or a consistent NULL value. That said, Power Query is no substitute for sound data collection methods.