And now for something completely different. Reflecting on my past year at UBC, one of the most valuable experiences I've gained was having the privilege of working with the amazing UBC BizTech team as the Creative Director.
One moment that particularly stood out to me was trying to answer a simple question posed to us by a potential sponsor:
"What is the composition of your club's membership?"
I was tasked to create a simple infographic visually demonstrating the demographics of our club. "Oh, this will be easy!" I thought. "Everything's all just in our form response spreadsheet so I'll just throw it onto Excel and let PivotTables work their magic!"
PivotTables could not save me from the data hell I was about to encounter. Just kidding, it wasn't actually that bad, but it was unexpected to say the least.
Above was our humble membership signup form. Innocuous enough to the untrained eye, it seemed like it would do its job fine. It was only when we were asked about our membership composition when we realized the troubles we've caused ourselves. Additionally, our membership went from around 60 from the prior year to about 300 members by the end of the year, which made this an added challenge (#humblebrag).
So you may be wondering what unrealized issues we designed our membership form with. It was something we threw together quickly just before the madness that is Business Week, and we were stuck with it the entire year.
Just a quick scroll through the raw response data reveals one of the big mistakes we made: keeping the "Year" field unstructured (leading to a mix of formats) and somewhat ambiguous for club membership newbies; I saw a few people fill in the current year (2015 or 2016) under that field.
Putting the data into a PivotTable further revealed the flaws in our form design.
A few observations I noticed:
- People aren't good at reading a form when you're watching them as they fill it in. Some people assumed the "Year" field was "Last Name" or "Student Number". A remedy for this would be to use unstructured fields as sparingly as possible, ratio boxes or a dropdown menu definitely would have worked better here. This would have also solved the "1" and "1st" issue I mentioned before.
- Having an "Other" field which allows for text entry means some people may go straight to the field without reading through the options. If cases outside your existing options isn't important, perhaps just having an "Other" ratio box would be better.
At first I considered omitting the data that didn't fit neatly into the categories we expected to see, but a significant amount of responses were under that category. I turned to a tool called Open Refine, which I learned about during a visual analytics course held by the Vancouver Institute for Visual Analytics, to clean the messy dataset.
Using Open Refine's text facet and clustering functions, I was able to easily combine groups of data together (turning all fields as "3rd" into "3" for example) to make the data much easier to analyze. My usage of it only scratches the surface of what it's capable of doing, as it's essentially a makeup kit for all kinds of data.
Corny metaphors aside, here's what the PivotTable of the cleaned set looks like after a few minutes in Open Refine.
Right away, we could easily make a stacked bar chart to visualize the data now that it's been prettied up:
Of course, this isn't going to be a good answer to the initial question: "What is the composition of your club's membership?" While it does answer it, not many people would be happy with a diagram with so many things going on, so I made a few additional charts and counts in Excel, brought them into InDesign to pretty them up further, and came up with this to hopefully answer that question.
The lessons here:
- think carefully when you're creating a membership form for your club. You'll likely be stuck with it for the entire year, so get a few people to test it before releasing it so that you're able to iron out the kinks before clubs week.
- Structured responses are almost always better than unstructured ones, unless things which fall outside of predefined categories are very important.
- PivotTables are amazing, but they're only as useful as your data is accurate/clean.
This is just the idea I came up with to clean the data, I'm sure there's probably hundreds of others out there, but hopefully this was helpful to you.