Investing in data science skills for the long run

Advice
Author: Paul Simmering

Published: January 9, 2023

Data science is a field that is constantly evolving and requires a lot of practice to master. Picking the right skills to focus on is critical for career development.

The first distinction I see is between gig skills that are useful for a single project or job and long-term skills that benefit you for your whole career. Telling the two types apart will let you invest your time more effectively.

Long-term skills are generally a better investment than gig skills. Focusing too much on gig skills can turn you into a perpetual beginner: in every new job or project, you start from scratch, learning skills that lose their value quickly. But investing in selected ephemeral skills can still be smart. It is often necessary to learn the particularities of a software system you’re working with to be effective.

Specialists turn gig skills into long-term skills

A skill that is a gig skill for a generalist may be a long-term skill for a specialist. Imagine a data scientist who wants to provision resources on an AWS account. Their company uses AWS CDK for infrastructure as code. Learning the AWS CDK is a gig skill for the data scientist, because the next job may be at a company that uses Azure or Google Cloud. But for a dedicated AWS cloud data engineer, learning the AWS CDK is a long-term skill that pays off over and over.

Golden long-term skills

A few skills stand out as eternally valuable for anyone in data science. They’re likely to pay off for a whole career.

  1. Descriptive statistics and probability distributions: Understanding variance, quantiles, conditional probability and hypothesis tests is essential. These building blocks of statistics won’t change.
  2. Linear regression and its variants: These can answer many questions by themselves and also serve as a benchmark for more sophisticated machine learning algorithms. Understanding variants like logistic regression and regularized least squares widens the range of questions you can answer and deepens your understanding of machine learning.
  3. Data visualization principles: A visual expression of data makes the information more accessible. Knowing which visualization is suitable for different types of data and questions makes you a more competent communicator and multiplies the impact your analyses have. Note that I’m only referring to the principles as long-term skills. The plotting libraries come and go.
  4. Effective writing: Whether it’s a report, a proposal, a support ticket or an email, writing with clarity boosts anyone’s effectiveness.
  5. SQL: This is the only language on the list. SQL is ubiquitous and has been in use for almost 50 years. Being able to access data at the source is essential for anyone working in the data industry. The SQL standard changes very slowly. Different databases implement variants of the language, but the core commands work everywhere.
  6. Git: Using version control is non-negotiable when working in a team, and Git is the de facto standard.

These are practical skills and not overly difficult to get started with. They provide a great foundation that makes a data scientist useful in almost any project. If you are interested in research, go deeper and follow Yann LeCun’s advice:

You should study very basic things that have a long shelf life - mathematics, physics, basic computer science, applied mathematics. Those are things that would be necessary to understand and develop the next generation of AI system

Yann LeCun, Chief AI Scientist at Meta

Pick-one skills

This is a class of skills that are required in a wide range of data science projects, but each has many implementations, of which a project typically uses only one.

  1. A programming language: While it’s possible to analyze data entirely within a GUI, doing so severely limits what you can build. R and Python are the two top choices for data analysis in code. You can also use Julia, Java and many other languages, but R and Python have the most package libraries and widest support.
  2. A charting library: Creating visualizations with code makes them reusable, reproducible and, with some practice, quicker to make. There are endless charting libraries. Some popular examples are: ggplot2, matplotlib, seaborn, echarts.
  3. A machine learning framework: Examples are: scikit-learn, tidymodels, caret, mlr3. When using neural networks, one of PyTorch or TensorFlow is typically required.
  4. A data quality testing library: Examples are: pointblank, Great Expectations. You could also use constraints in a SQL database.
  5. A package manager: Examples are: renv, pip, poetry, conda.
  6. An orchestration platform: Examples are Airflow, Dagster, drake, dbt. Data scientists often don’t have to set these up and maintain them themselves, but need to know how to submit and monitor jobs running on them.
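The SQL-constraint option mentioned in item 4 can be sketched with Python’s built-in `sqlite3` module. The `measurements` table, its columns, the temperature range and the `try_insert` helper are all hypothetical, chosen only to show a database rejecting bad rows at write time.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical schema: the constraints encode the data quality rules.
con.execute(
    """
    CREATE TABLE measurements (
        sensor_id TEXT NOT NULL,
        temperature_c REAL NOT NULL CHECK (temperature_c BETWEEN -90 AND 60)
    )
    """
)


def try_insert(sensor_id, temperature_c):
    """Return True if the row satisfies the constraints, False if the database rejects it."""
    try:
        with con:  # commits on success, rolls back on error
            con.execute(
                "INSERT INTO measurements VALUES (?, ?)",
                (sensor_id, temperature_c),
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = [
    try_insert("a", 21.5),   # valid row
    try_insert("b", 999.0),  # violates the CHECK range
    try_insert(None, 20.0),  # violates NOT NULL
]
row_count = con.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
print(accepted, row_count)
```

Dedicated libraries such as pointblank or Great Expectations offer richer reporting, but the underlying idea of declaring rules and checking every row against them is the same.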

These can end up as gig skills. Throughout your career, you’ll likely have to switch between them, either because a new library outshines the older ones or because you join a team that uses a different one. Each switch incurs a cost of relearning. Thankfully, the different implementations often share principles. Learning your third visualization library will be much faster than the first. Smart hiring managers understand that these can be picked up on the job.

Vendor- and project-specific skills

  • Fine details of cloud platforms
  • Proprietary software that isn’t widely used
  • Internal tools not available to the public

These are the most likely to become gig skills, unless you make a career choice to specialize in them.

Domain knowledge

Knowing more about the subject matter behind the data you’re analyzing lets you ask better questions and avoid silly mistakes.

If you switch industries, previous domain knowledge loses its value. As an example, I’m currently not using any of the domain knowledge I acquired studying economics, while the statistical methods continue to be useful.

Some fields of data science require deep industry specialization and corresponding certifications, for example in health, biology, accounting, insurance and other highly regulated industries. This is a form of specialization in domain knowledge.

Thanks for reading! Do you agree with my skill categorization? Let me know on Twitter.

Photo by Nina Luong on Unsplash