Data engineering is a fast-growing field in the world of AI and data. But you might be wondering, what exactly does a Data Engineer do? Here, we’ll shine a spotlight on the role of Data Engineer, based on information shared by industry coaches Nana Essuman and Femi Anthony during the Black and Brilliant AI Accelerator program. Nana is the Director of Data Engineering at Condé Nast, and Femi Anthony is a Lead Data Engineer at Capital One.
At a high level, Data Engineers play an important role in helping companies make data-driven decisions by collecting, transforming, and publishing data. Data Engineers work behind the scenes to create the databases that house a company’s data. They build pipelines that transform raw data into formats that are useful for Data Scientists. And they create the infrastructure that automates model building for machine learning and analytics.
“We work with Data Scientists to understand what they’re trying to build. And we help them set up the foundations. Then we work with them — when they’ve actually built the models — to take what they’ve built and deploy it,” Nana explains.
What is data engineering?
Data engineering involves creating the systems and maintaining the databases that store the data required for data science and analysis. It also means using software engineering practices to automate data cleaning, normalization, and model building so the data is ready to use.
Femi explains one of the key differences between data engineering and data science: “It’s one thing to create the model. But to actually get it up and running in production, and doing what we want, and getting the reliability that we need — that’s where we need Data Engineers to scale and get the model running every day with very little trouble.”
“Data Engineers are really out there to automate and scale things, to essentially help take this to the next level,” Femi says.
What skills do Data Engineers need?
If you’re interested in a career in Data Engineering, you might be wondering what kind of skills you need to excel. Below are some of the specific skills that you’ll use in your work as a Data Engineer.
“These topics are what you would call a Data Engineer’s toolkit, in terms of which areas you want to get some sort of skill or mastery in,” Femi tells us. He adds that Data Engineers who are just starting out should not worry about mastering every skill, but that they “should have experience dabbling in all of them.”
Data processing
As a Data Engineer, a big part of what you’ll be doing is sourcing and processing data. This can include filtering for the right data sets, checking data formats, and bringing in data through processing methods like batch or stream. “This essentially is the meat of a data engineering system,” Femi says.
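To make the batch versus stream distinction concrete, here is a minimal sketch in plain Python. The records, field names, and function names are all made up for illustration; real pipelines usually rely on dedicated processing tools rather than hand-written loops like these.

```python
from typing import Iterable, Iterator, Optional

# Hypothetical raw records; the field names are assumptions for this example.
RAW_EVENTS = [
    {"user_id": "1", "amount": "19.99"},
    {"user_id": "2", "amount": "not-a-number"},  # bad format, gets filtered out
    {"user_id": "3", "amount": "5.00"},
]


def clean(record: dict) -> Optional[dict]:
    """Check the format of one record and convert its types.

    Returns None when the record cannot be used.
    """
    try:
        return {"user_id": int(record["user_id"]), "amount": float(record["amount"])}
    except (KeyError, ValueError):
        return None


def process_batch(records: list) -> list:
    """Batch: the whole data set is available up front and processed in one go."""
    return [row for row in map(clean, records) if row is not None]


def process_stream(records: Iterable[dict]) -> Iterator[dict]:
    """Stream: each record is handled on its own, as soon as it arrives."""
    for record in records:
        row = clean(record)
        if row is not None:
            yield row


batch_result = process_batch(RAW_EVENTS)
stream_result = list(process_stream(iter(RAW_EVENTS)))
print(batch_result)
```

Both functions apply the same filtering and format checks. The difference is when the work happens: the batch version needs the full list before it starts, while the stream version can emit clean rows one by one.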
Data storage
Making sure your data is formatted the right way, and stored in the right place, can impact how quickly someone can access or read your data. “Storage and formats can make or break how your downstream system utilizes your data,” Nana says. He adds that for aspiring Data Engineers, knowing the pros and cons of different file formats, based on how people are going to access or query that data, is a great place to start.
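As a small illustration of how a file format affects downstream users, the sketch below writes the same made-up rows as CSV and as JSON Lines using only Python's standard library. These two formats are our choice for the example, not ones Nana named, and the column names are hypothetical.

```python
import csv
import io
import json

# Hypothetical model scores; the column names are assumptions for this example.
rows = [
    {"customer_id": 1, "score": 0.91},
    {"customer_id": 2, "score": 0.42},
]

# CSV: compact and widely supported, but every value is read back as a string.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["customer_id", "score"])
writer.writeheader()
writer.writerows(rows)
csv_buffer.seek(0)
csv_rows = list(csv.DictReader(csv_buffer))

# JSON Lines: more verbose (keys repeat on every line), but numbers stay numbers.
jsonl_text = "\n".join(json.dumps(row) for row in rows)
jsonl_rows = [json.loads(line) for line in jsonl_text.splitlines()]

print(csv_rows[0])    # values are strings
print(jsonl_rows[0])  # values keep their numeric types
```

Anyone reading the CSV version has to convert the types themselves, while the JSON Lines version costs more space. Trade-offs like this are the kind of pros and cons worth learning for each format you work with.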
Databases
Depending on where you work, you might use SQL databases (like PostgreSQL and MySQL) or NoSQL databases, which are becoming ever more popular for their scalability. Nana explains: “Some companies like to store model scores in NoSQL databases because if they need to add features about a customer, or more scoring details about a customer, they can do that much faster.”
As a Data Engineer, you may also work with data warehouses, which are special databases geared towards analytic queries, the kind commonly used by Data Scientists and Data Analysts.
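Here is a rough sketch of the flexibility Nana describes when adding a new detail about a customer. It uses Python's built-in sqlite3 module as a stand-in for a SQL database and a plain dictionary as a stand-in for a NoSQL document store. The table, column, and field names are hypothetical, and the example only shows the difference in schema handling, not in speed.

```python
import sqlite3

# Relational side: sqlite3 stands in for a SQL database such as PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (customer_id INTEGER PRIMARY KEY, score REAL)")
conn.execute("INSERT INTO scores VALUES (?, ?)", (1, 0.91))

# Adding a new feature means changing the table's schema first.
conn.execute("ALTER TABLE scores ADD COLUMN churn_risk TEXT")
conn.execute("UPDATE scores SET churn_risk = ? WHERE customer_id = ?", ("low", 1))
sql_row = conn.execute(
    "SELECT customer_id, score, churn_risk FROM scores WHERE customer_id = ?", (1,)
).fetchone()
conn.close()

# Document side: a dict keyed by customer ID stands in for a NoSQL store.
document_store = {1: {"score": 0.91}}
document_store[1]["churn_risk"] = "low"  # no schema change needed

print(sql_row)
print(document_store[1])
```

In the relational version, the new column applies to every row in the table. In the document version, each record can carry its own set of fields, which is convenient for fast-changing features but leaves it to your code to handle records that lack a field.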
Containerization
Nana shares that when it comes to containerization, “there’s one word, which is reusability.” As a Data Engineer, you’ll want to know about containers, which allow you to package what you build and move it from one environment to another. You’ll also work with tools that help you deploy these containers and scale out your infrastructure.
Caching
If you’re working with systems that need to be highly responsive — like Uber, which shows real-time pricing data — you’ll want to know about caching data in memory.
Femi explains caching in more detail: “You don’t want to go to the database every time because it’s slow. You can take the most frequently used items or pieces of data and put it in memory. You can query that cache, and the response time would be much faster.”
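The idea Femi describes can be sketched in a few lines with `functools.lru_cache` from Python's standard library. The `query_database` function below is a hypothetical stand-in for a slow database call, and the prices it returns are made up.

```python
import time
from functools import lru_cache

database_calls = 0  # counts how often the slow path is taken


def query_database(item_id: int) -> float:
    """Hypothetical stand-in for a slow database lookup."""
    global database_calls
    database_calls += 1
    time.sleep(0.01)  # simulate network and disk latency
    return 10.0 + item_id


@lru_cache(maxsize=128)
def get_price(item_id: int) -> float:
    """Keep up to 128 recently used prices in memory."""
    return query_database(item_id)


first = get_price(7)   # cache miss: goes to the "database"
second = get_price(7)  # cache hit: answered from memory
print(first, second, database_calls)
```

Only the first lookup reaches the database; the repeat is served from memory. Note that `lru_cache` never expires entries based on time, so a system showing real-time data would also need a way to refresh or invalidate cached values.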
Machine learning frameworks
“This is where you have an overlap between the Data Engineering world and the Data Science world,” Femi tells us. He adds that if you want to be able to work with Data Scientists to help productionize their models, or to provide a platform where they can train and scale their models well, you’ll want to become familiar with machine learning frameworks.
Femi shares that some of the most popular machine learning frameworks today include TensorFlow and Scikit-Learn. “If you start off with Scikit-Learn, and you pick up TensorFlow later, I think you’ll be in good shape,” he says.
How much do Data Engineers make?
Data Engineers play a huge role in helping companies make the most of their data, which is one of the reasons they’re in high demand, ranking 7th in Glassdoor’s list of the best jobs in 2022. They’re also paid pretty well, earning $126,625 on average in the U.S. — not including bonuses, stock options, or other perks.
Getting started as a Data Engineer
For aspiring Data Engineers who are just starting out, Femi shares this advice: “I would learn Python, take some introduction to data science courses, and make sure you have a GitHub presence. There are lots of open data sets out there. Use the skills that you’ve learned to come up with some sort of project that you can put on GitHub and advertise your skills. That way, you’re reinforcing what you’re learning. And because you’re learning by doing, you get the actual practical experience.”
If you’re leaning towards becoming a Data Engineer, we recommend checking out Learn SQL for learning how to query databases effectively, Learn Python to start building pipelines for your data, and Design Databases with PostgreSQL to create your own databases from scratch. We also recommend the Data Scientist career path, which is designed to give you all the skills you need to work in the world of data.