Earlier this week I posted a new article in the journal Nature titled “Six tips for better spreadsheets.”
Today I’d like to share a 2018 article in The American Statistician along the same (grid)lines… “Data Organization in Spreadsheets.” You’ll find repeated themes here, such as the principles of consistency, keeping a raw data source, and so forth.
This article goes into a bit more depth with some important topics like dates and missing values. Each concept:
- Be consistent
- Choose good names for things
- Write dates as YYYY-MM-DD
- No empty cells
- Put just one thing in a cell
- Make it a rectangle
- Create a data dictionary
- No calculations in the raw data filters
- Do not use font color or highlighting as data
- Make backups
- Use data validation to avoid errors
- Save the data in plain text files
Similar to the Nature article, I noticed that Power Query enforces many of these good habits such as keeping a raw data source and using a consistent value for nulls.
My personal favorite rule is to “make it a rectangle.” The article does a great job showing how data should be laid out in a spreadsheet for ease of storage and analysis. To the untrained eye, it looks repetitive and clunky! But the analyst knows a long and narrow rectangle beats a short and wide layered square.
What do you think of these principles and the article? Anything to add to this list?
Leave a Reply