When I took an empirical finance course in graduate school [redacted] years ago, it was conducted in SAS. I’ve never used SAS since: it’s too expensive!
That same course today is conducted in Python, a free and open source tool. Even the starched-collar, highly regulated finance industry has opened up to open source.
But I propose that companies go a step further by not just consuming open source data tools, but producing them too. Here’s an example of what I mean:
Case study: Goldman Sachs
The screenshot below is from a CNBC article about Goldman Sachs “giving some of its most valuable software to Wall Street for free.”
Goldman Sachs isn’t known as the most generous firm, so what gives?
Let’s read between the lines on the third “key point” to understand: “We’re using Alloy because it radically reduces the costs of wrangling disparate datasets and sources of data together,” the quote says. If you’ve worked with data, that line probably hit you in the fillings: wrangling disparate sources of data is what you do as a data cruncher, and it kinda sucks.
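The article doesn’t show what Alloy actually does under the hood, but for readers who haven’t felt this pain firsthand, here is a minimal, purely illustrative Python sketch of what “wrangling disparate sources of data” looks like in practice: two hypothetical feeds describing the same securities, with different field names and formats, mapped onto one shared schema (all names here are made up for the example).

```python
import csv
import io

# Two hypothetical "disparate" sources describing the same securities:
# a CSV export and a list of dicts from an API, with different field names.
csv_feed = io.StringIO(
    "ticker,price_usd\n"
    "AAPL,190.5\n"
    "GS,440.0\n"
)
api_feed = [
    {"symbol": "AAPL", "px": 190.5},
    {"symbol": "MSFT", "px": 410.0},
]

def normalize(records):
    """Map each source's field names onto one shared schema."""
    for rec in records:
        yield {
            "ticker": rec.get("ticker") or rec.get("symbol"),
            "price": float(rec.get("price_usd") or rec.get("px")),
        }

merged = {}
for rec in normalize(csv.DictReader(csv_feed)):
    merged[rec["ticker"]] = rec
for rec in normalize(api_feed):
    merged.setdefault(rec["ticker"], rec)  # keep the first source on conflict

print(sorted(merged))  # ['AAPL', 'GS', 'MSFT']
```

Multiply this by hundreds of feeds, each with its own quirks, and you can see why a tool that standardizes the mapping step would save real money.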
If Goldman has found a way to radically reduce the cost of doing this, it seems like they’d want to keep it to themselves, right? But they are doing the opposite! The reasons why reveal the power of open sourcing data tools and processes as an organization.
Of course, I don’t expect every organization to be Goldman Sachs, but I do think the most sophisticated data strategies will include some open source component. Why not? It’s relatively cheap, and it’s great content marketing.
Benefits of open sourcing your data tools
Goldman isn’t alone in open sourcing its data software: Google did it with TensorFlow, and Facebook with PyTorch. I know, I know: these are giant (and kinda sketchy) companies. But regardless of the size and aims of your organization, open sourcing elements of how you work with data has numerous benefits.
It’s good content marketing
I suggest every data analyst learn a bit of content marketing, and this case study shows why. Goldman has turned its code base into a source of marketing. It didn’t need to run ads or create a new institute: it simply had to package the software it already had. Content marketing is an effective, practical way of getting the word out about what matters to you and what you can do, whether through open sourcing software like this, writing general how-to blog posts, and so forth.
Regardless of the medium, content marketing does best when it’s aimed at a particular audience; and there’s no better audience for a data-cleaning code base than wrangling-weary data jockeys.
It signals to data candidates
A data candidate’s worst fear is to land at a company with terrible data quality and processes. It’s such a grim prospect that I encourage analysts to ask about it straight away in the interview.
Interviews carry an incredible amount of asymmetric information. Wouldn’t it be great if, instead of hearing secondhand about the organization’s data processes from a manager with a vested interest, you could take a look at them before you even landed in the recruiting pipeline?
That transparency is better for the candidate, at least, and for the on-the-ball organization too. Candidates get a clearer view of what they’ll actually be tasked with, which makes accepting the offer less of a gamble.
It’s the highest on the pyramid
It sounds hokey to bring up “creativity” in a case study on Goldman Sachs, but don’t you think its analysts may need to learn a thing or two to do their job well?
Bloom’s Taxonomy holds that the highest form of learning is creation. Data jockeys prove their mastery by creating models and code bases. The Goldman example takes it a step further by releasing that software to the broader community of financial analysts. This is great marketing and a great signal to candidates, and it also makes learning a social process.
Releasing that code base isn’t just good content marketing, it’s good for the social well-being of the organization too.
Taking your repos public
Data is private and proprietary, but that doesn’t mean the tools used to analyze that data should be.
Most organizations have already moved from expensive proprietary tools like SAS to free, open source ones like R. The advanced organizations, however, have even moved to producing open source tools, not just consuming them.
It’s another angle to consider in the quest for data up-skilling: cultivate top talent, release solid tools for free, and watch more top talent take notice.