Thinking about learning data science but not sure where to start? A question we hear a lot is: “What data science languages should I learn first?” We spoke to Sophie, a Curriculum Developer here at Codecademy, to answer this question.
A good first language will help you get started with learning the foundations of data science. “Learn a language to understand the concepts and core data science skills, like aggregating data, running hypothesis tests, and working with machine learning models. Once you know the core concepts, you can easily pick up other languages,” says Sophie.
In this article, we’ll look at data science languages that are commonly used today. We’ll do our best to set you up with what you need to know to choose the right data science language for yourself.
Finding the best data science language for your goals
Before we dive in, there are a few questions you’ll want to consider. What projects do you want to work on? What topics are you interested in? What industries do you want to go into?
Your answers to these questions can determine what language you’ll need to learn. Certain industries — such as healthcare or the government — can require you to know a specific language. That’s because you might be working with industry-vetted data science models or tech stacks that are built out using a certain language.
A good first step might be reaching out to folks in the industry that you’re interested in, and asking what languages they use and what they recommend starting with.
If you want to go with one of the more popular data science languages, we suggest checking out Python, R, and SQL. These are the most recommended languages for aspiring data scientists to learn first, based on the experience and research of our curriculum team.
It’s important to know that you can’t really go wrong in choosing your first data science language. “The key is not getting locked into one language. Once you know how to extract insights and values from data using one language, you can use that knowledge to easily learn another language. Being able to move between languages will help you become a versatile data scientist,” Sophie says.
Popular data science languages to choose from
Here’s a closer look at the most popular data science languages and what they’re used for.
Python is a versatile, general-purpose programming language. It’s a favorite among programmers for its concise and easy-to-read syntax. With tons of powerful libraries and packages, Python can implement many of the statistical models and calculations required for data science. It’s also one of the best languages for scraping data off the web.
“Python is a good choice for data science if you’re already familiar with the language,” Sophie tells us. Many companies use Python for data science because their programmers are already using the language for other purposes. Python also uses intuitive and simple syntax, so it is beginner-friendly for learning important general programming concepts such as loops and functions.
One slight downside of Python as a first data science language is that the base installation only includes basic descriptive statistics (in its built-in statistics module), so you’ll need to install separate packages before you can do most data science work. But once you get set up, Python can be an easy language to learn.
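To give a sense of what base Python can do on its own, here is a minimal sketch that uses only the standard library's statistics module. The page view numbers are made up for illustration. Anything beyond simple summaries like these (hypothesis tests, model fitting, plotting) is where the separate packages come in.

```python
import statistics

# Hypothetical sample data: daily page views for one week (made up for illustration).
page_views = [120, 135, 128, 150, 142, 98, 105]

# Basic descriptive statistics available without installing anything.
mean_views = statistics.mean(page_views)
median_views = statistics.median(page_views)
stdev_views = statistics.stdev(page_views)  # sample standard deviation

print(f"mean={mean_views:.1f}, median={median_views}, stdev={stdev_views:.1f}")
```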
Some good data science packages to know for Python are:
- Data manipulation: pandas and NumPy
- Visualizations: Matplotlib and seaborn
- Hypothesis testing and model fitting: SciPy, scikit-learn, and statsmodels
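As a rough illustration of how these packages fit together, the sketch below uses pandas to aggregate a small table and SciPy to run a two-sample t-test. The data, column names, and group labels are hypothetical, and the example assumes pandas and SciPy are already installed.

```python
import pandas as pd
from scipy import stats

# Hypothetical data: test scores for two study groups (made up for illustration).
scores = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "score": [82, 90, 78, 86, 70, 75, 68, 79],
})

# Data manipulation with pandas: mean score and row count per group.
summary = scores.groupby("group")["score"].agg(["mean", "count"])
print(summary)

# Hypothesis testing with SciPy: two-sample t-test comparing the groups.
group_a = scores.loc[scores["group"] == "A", "score"]
group_b = scores.loc[scores["group"] == "B", "score"]
result = stats.ttest_ind(group_a, group_b)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")
```

With only four values per group, the test result here is just a demonstration of the workflow, not a meaningful finding.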
R is a statistical programming language built for data analysis, data visualization, and data science. It comes with a comprehensive set of built-in statistical functions and methods.
“R is a good choice if you’re new to data science, but already have some basic understanding of general programming concepts,” Sophie tells us. The data structures, variable types, and analysis tools in R are straightforward and built specifically for data science. You don’t have to get bogged down with syntax or multiple different libraries when you’re just getting started.
With the base installation of R, you’ll be able to access many data science functions, like linear regressions or t-tests, and create beautiful graphics and visualizations. R also pairs well with RStudio — an integrated development environment (IDE) — which makes it easy to run R code and inspect the output.
Some useful packages to know about as you're first learning R include:
- Collection of data science packages: tidyverse
- Data manipulation: dplyr (also contained in tidyverse)
- Visualizations: ggplot2 (also contained in tidyverse)
- Classification and regression: caret
SQL (pronounced “sequel”) is a language that allows programmers to communicate with databases to manage the data they contain. It’s commonly used to query and edit the data stored in a relational database.
Typically, data scientists will extract data from a database using SQL and then import that data into R or Python for analysis. “No matter what language you learn for data analysis, SQL is important to learn if you want to pull data out from databases,” Sophie says.
Knowing SQL allows you to work with PostgreSQL, SQLite, MySQL, and other relational databases. The syntax for basic queries is similar across different databases, making SQL a versatile language for this purpose. To learn more about the different types of databases, check out our article on Relational Database Management Systems.
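The "query with SQL, then analyze in Python" workflow described above can be sketched with Python's built-in sqlite3 module. The orders table, its columns, and its rows are all hypothetical, and an in-memory SQLite database stands in for whatever database you would actually connect to.

```python
import sqlite3
import statistics

# In-memory SQLite database with a hypothetical "orders" table (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("ana", 20.0), ("ana", 35.0), ("ben", 15.0), ("ben", 50.0), ("cy", 40.0)],
)

# Let SQL do the aggregation: total spend per customer.
totals = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

# Hand the query result to Python for further analysis.
mean_total = statistics.mean(total for _, total in totals)
conn.close()

print(totals)
print(f"average total per customer: {mean_total:.2f}")
```

The same SELECT ... GROUP BY query should look very similar on PostgreSQL or MySQL; mainly the connection code would change.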
More data science languages
Depending on what industry you go into, you might need to learn a specific language for data science. Check out the following list of languages to learn more about what they’re used for.
- C/C++: Both C and C++ require a strong understanding of coding fundamentals, and can take more time to learn. When combined with Python or R, C/C++ can be used to perform computations on datasets with more speed and efficiency.
- Java: Many enterprise systems are built on Java back ends. If you’re already working with Java, you can integrate data science methods right into your existing codebase.
- MATLAB: Ideal for advanced numerical computation and for tackling complex mathematical and statistical problems. MATLAB is widely used in academia for teaching mathematics, physics, and engineering.
- SAS: Built for advanced analytics, business intelligence, and predictive analytics. SAS is commonly used in the health sciences, banking, and insurance.
- Stata: Used in economics research, public policy, and the social sciences. Stata is designed for anything from simple descriptive analysis to complex statistical modeling.
- Scala: A powerful language able to handle large amounts of data. Scala runs on the Java Virtual Machine, which means it integrates well with Java programs.
- Julia: A newer programming language designed for numerical analysis and scientific computing. It’s useful for applications in physics, chemistry, astronomy, engineering, bioinformatics, and more.
Getting started in data science
Ready to start your journey into data science?
Our Data Scientist Career Path will take you through everything you need to know to start a career as a Data Scientist, including how to use Python and SQL to analyze data, communicate your findings, and draw predictions using machine learning.
Our Data Analyst Career Path will set you up with the tools you need to become a Data Analyst, including how to use Python and SQL to acquire, clean, and analyze data, plus communicate your findings.
If you have a specific language that you want to get started with, check out our Skill Paths.
Whichever language you end up choosing, we’re excited you’re getting started with data science and we wish you all the best on your journey!