More and more startups are looking to hire data scientists who can work autonomously to derive valuable insights from data. In principle, this sounds great: engineers and designers build the product, while data scientists crunch the numbers to gain insights. In practice, finding these data scientists and enabling them to be productive are very challenging tasks.
Before diving further, it's useful to note a few trends in data and product development that have emerged over the past decade:
Companies such as Google, Amazon and Netflix have shown that proper storage and analysis of data can lead to tremendous competitive advantages.
It’s now feasible for startups to instrument and collect vast amounts of usage data. Mobile is ubiquitous, and apps are constantly emitting data. Big data infrastructure has matured, which means large-scale data storage and analysis are affordable.
The widely adopted lean startup philosophy has shifted product development to be much more data-driven. Startups now face the challenges of defining Key Performance Indicators (KPI), designing and implementing A/B tests, understanding growth and engagement funnel conversion, building machine learning models, etc.
Because of these trends, startups are eager to develop in-house data science capabilities. Unfortunately many of them have the wrong ideas about how to build such a team. Let me describe three popular misconceptions.
Misconception #1: It's okay to compromise the engineering bar for statistical skills.
For data scientists to work productively and independently, they must be able to navigate the entire technical stack and work effectively with existing systems to extract relevant data. The only exception is if a startup has already built out its data infrastructure. But in reality, very few startups have their infrastructure in place before building a data science team. In these cases, a data science team without strong engineering skills or engineering support will have a hard time doing their job. At best, they will produce suboptimal solution that will be rewritten by another team for production.
To illustrate this, take the example of building the KPI dashboard at Codecademy. Before visualizing the data in d3, I had to extract and join (a) user data from MongoDB, (b) cohort data from Redis, and (c) pageview data from Google Analytics. The data collection alone would've been near impossible without an engineering background, let alone making the dashboard real-time, modularized and reusable.
Misconception #2: It's okay to compromise the statistics bar for engineering skills.
Proper interpretation of data is not easy, and misinterpreted data can do more damage than data that's not interpreted at all (check out "Statistics done wrong"). Building useful machine learning (ML) models is trickier than most people expect. A popular but misguided view holds that ML problems can be solved either by applying some black box algorithm (a.k.a magic), or by hiring interns who are PhD students. In practice, hundreds of decisions and tradeoffs are made in solving such problems, and knowing which decisions to make requires a lot of experience. (I’ll expand upon this more in a future post titled “Machine learning done wrong”.) For a given ML problem, there are tens if not hundreds of way to solve it. Each solution makes different assumptions and it's not obvious how to navigate and identify which assumptions are reasonable and which model should be used. Some would argue: why not just try all different approaches, cross validate, and see which one works the best? In reality, you never have the bandwidth to do so. A strong data scientist might be able to produce a sensible model right off the bat, while a weak one might spend months optimizing the wrong model without knowing where the problem is.
Misconception #3: It's okay to hire data scientists who lack product thinking.
Imagine asking someone who doesn't have a holistic view of the product to optimize your business KPIs. They may prematurely optimize the sign up funnel before making sure the product has reasonable retention, which would lead to more unretained users. Some think data-driven product development is a local optimization. This criticism is only correct when those who drive product development with data fail to think about the product holistically.
To sum up, a productive data science team requires data scientists that are strong in engineering, statistics, and product thinking. It's hard. And it becomes even harder to look for the first data science hire who will be spearheading data efforts in a startup. For startups that don't have the luxury to wait and hire these rare data scientists, it's important to be aware of the compromises made especially in terms of the hiring bar. Before the data team is strong enough across all three areas, make sure they have strong support for the skills they lack, and don't expect them to work autonomously.
If you like the post, you can follow me (@chengtao_chu) on Twitter or subscribe to "ML in the Valley". Also, special thanks Ian Wong (@ihat), Leo Polovets, and Bob Ren (@bobrenjc93) for reading a draft of this.