8 skills you need to become a Data Engineer
Would you like to become a Data Engineer? Here are 8 important skill sets that you need to develop.
The world needs more Data Engineers.
With companies slowly realizing the value of data and finding new ways to acquire and use data to build strategies to improve their business, there is a huge demand for Data Engineers and Analysts. And this will keep increasing.
If you want to pursue a career in the field of Data, here are the 8 important skills that you need to develop.
A programming language
It goes without saying that learning a programming language is essential for working in any software development-related field. With a good foundation, you will learn to build processes, automate tasks, add tests, and debug your code efficiently.
If you have not settled on a programming language yet, there are a few points to consider before picking one. In a Data Engineering position, you will typically work with many kinds of file formats, such as CSV, Excel, Google Sheets, XML, JSON, pickle files, and more. You will also want to fetch data from various APIs, scrape web pages, and use that data for your analysis.
My recommendation would be to choose Python. Python has many standard and third-party libraries that help you work with data from various sources and automate a lot of the trivial stuff.
Python is also in heavy demand among employers, so learning it gives you a good chance of building a successful career in this field.
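To make this concrete, here is a minimal sketch that reads records from two of the formats mentioned above (CSV and JSON) using only the standard library and combines them. The sample data and the function names `load_csv_rows` and `load_json_rows` are made up for illustration; real data would come from files or an API.

```python
import csv
import io
import json

# Hypothetical sample data; in practice this would come from files or an API.
CSV_TEXT = "id,name,amount\n1,Alice,10.5\n2,Bob,20.0\n"
JSON_TEXT = '[{"id": 3, "name": "Carol", "amount": 7.25}]'


def load_csv_rows(text):
    """Parse CSV text into a list of dicts with typed values."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"id": int(row["id"]), "name": row["name"], "amount": float(row["amount"])}
        for row in reader
    ]


def load_json_rows(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


# Combine both sources into one list of records and aggregate a column.
rows = load_csv_rows(CSV_TEXT) + load_json_rows(JSON_TEXT)
total = sum(row["amount"] for row in rows)
print(len(rows), total)
```

Once every source is converted into the same list-of-dicts shape, the rest of the code no longer needs to care where a record came from.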
I highly recommend the book Automate the Boring Stuff with Python if you are serious about learning Python.
You can also check out our Data Engineering playlist to learn more about using Python for Data Engineering tasks.
SQL
SQL stands for Structured Query Language. It is the language used to interact with relational databases. If you work for a company as a Data Engineer, you will use SQL pretty much every day of your career. Building a solid foundation in SQL is important because using its features efficiently can save you a lot of time and effort.
Now a lot of database engines have their own versions of SQL. But the good news is, if you learn the basic SQL syntax, it would still be super useful to work with most of the database engines. You might also want to learn NoSQL, as it is one of the booming technologies of our times. But if you only have time for one, go with SQL.
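As a small illustration of that basic syntax, the sketch below runs a typical aggregation query (`SELECT`, `SUM`, `GROUP BY`, `ORDER BY`) against an in-memory SQLite database through Python's built-in `sqlite3` module. The `orders` table and its rows are invented for the example; the same query should work with little or no change on most relational engines.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0)],
)

# Total revenue per customer, highest first.
revenue_by_customer = conn.execute(
    """
    SELECT customer, SUM(amount) AS revenue
    FROM orders
    GROUP BY customer
    ORDER BY revenue DESC
    """
).fetchall()
conn.close()
print(revenue_by_customer)
```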
You can learn SQL from Mode. They have a really good tutorial explaining every aspect of SQL in a structured manner. They also have an online database for you to practice what you learn. Check out their tutorial here.
Database administration
Some common tasks for Data Engineers include database design, monitoring performance, security, troubleshooting errors, backing up databases, and recovery of lost data.
When you work with a database all day, knowing the basics of database administration comes in really handy. It helps you take the correct measures to maintain the integrity of a database. Understanding concepts like distribution keys, sort keys, primary keys, foreign keys, and referential integrity can help you build an efficient database that serves various stakeholders in your company.
According to DB-Engines, Oracle, MySQL, and Microsoft SQL Server were the top 3 most popular databases in 2019. However, databases like PostgreSQL and MongoDB have gained a lot of popularity over the last couple of years. Pick one of the above and try to learn the basics of how the database engine works.
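Here is a minimal sketch of primary keys, foreign keys, and referential integrity in action, again using SQLite through Python's `sqlite3` module. The `customers` and `orders` tables are hypothetical. Distribution keys and sort keys are specific to certain data warehouses and are not covered here. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`; other engines behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this is switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL
    )
    """
)

conn.execute("INSERT INTO customers (id, name) VALUES (1, 'alice')")
conn.execute("INSERT INTO orders (customer_id, amount) VALUES (1, 30.0)")

# An order pointing at a customer that does not exist violates referential
# integrity, so the database rejects it.
try:
    conn.execute("INSERT INTO orders (customer_id, amount) VALUES (99, 5.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
conn.close()
print(orphan_rejected, order_count)
```

Letting the database enforce the relationship means no application bug can leave orders without a matching customer.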
Cloud computing and serverless functions
A lot of companies are moving to cloud infrastructure these days. Providers like AWS, Google Cloud, and Azure are being used by companies for a lot of their BI and Machine Learning tasks. When you work with cloud providers, learning to use serverless functions can be highly useful.
Serverless functions are application entities that are invoked only when required. You don’t need to manage any servers to maintain or scale them.
From building POCs (proofs of concept) to deploying a production-grade ETL pipeline, serverless functions can be used in a lot of ways. All major cloud providers, including AWS, Azure, and Google Cloud, provide the infrastructure to build and deploy them.
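To give a feel for what such a function looks like, here is a rough sketch of a small transformation step written as a handler. The `(event, context)` signature follows the convention used for Python functions on AWS Lambda; the shape of the event (a `records` list) and the `transform` helper are assumptions made up for this example, and a real deployment would depend on the trigger you configure.

```python
import json


def transform(record):
    """Hypothetical transformation step: normalise one input record."""
    return {
        "name": record["name"].strip().lower(),
        "amount": float(record["amount"]),
    }


def handler(event, context):
    """Entry point the platform would call once per invocation.

    The "records" key in the event is an assumption for this sketch.
    """
    records = [transform(r) for r in event.get("records", [])]
    return {
        "statusCode": 200,
        "body": json.dumps({"count": len(records), "records": records}),
    }


# Local invocation with a fake event; no cloud account is needed to try this.
response = handler({"records": [{"name": " Alice ", "amount": "10.5"}]}, None)
print(response["body"])
```

Because the handler is a plain function, you can call it locally with a fake event, as above, before deploying anything.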
To learn how to build serverless functions in AWS, check out this playlist here.
Version Control (Git)
It is hard to think of a software development process that doesn’t involve Version Control. There are several tools, such as Git, Subversion, and Mercurial. Git is probably the most widely used of the three.
When you work in a team, each of you will be working on a different feature. To get your changes into production, your code needs to be reviewed by other team members. Version control tools let you keep working on your feature without getting in each other’s way. They also keep a history of your work in the form of commits, which you can refer to later when in doubt.
You can learn Git using this interactive tool.
Continuous Integration/Continuous Delivery (CI/CD)
This is another benefit of working with Version Control. CI/CD allows you to deploy your changes to production just by pushing them to a repository. A CI/CD pipeline consists of build steps and tests. Every time you push a new feature or a bug fix to your branch, all those build steps and tests are run. Once the entire process is green, you can merge your changes and ship them to production.
Nowadays, all the major version control platforms come with CI/CD features that let you deploy your changes to production in a hassle-free way. Some well-known CI/CD tools are Jenkins, GitLab Runner, and Google Cloud Build with GitHub.
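As an illustration of the kind of tests such a pipeline runs, here is a minimal sketch using Python's built-in `unittest` module. The `clean_amount` function is a hypothetical pipeline step invented for the example. A real CI job would usually invoke a test runner from its configuration file and treat a failing test as a failed build.

```python
import unittest


def clean_amount(raw):
    """Hypothetical pipeline step: turn '1,234.50' style strings into floats."""
    return float(raw.replace(",", "").strip())


class CleanAmountTest(unittest.TestCase):
    def test_plain_number(self):
        self.assertEqual(clean_amount("12.5"), 12.5)

    def test_thousands_separator(self):
        self.assertEqual(clean_amount(" 1,234.50 "), 1234.5)

    def test_rejects_garbage(self):
        # float() raises ValueError for text that is not a number.
        with self.assertRaises(ValueError):
            clean_amount("n/a")


# Run the tests programmatically; a CI job would fail the build if this
# result were not successful.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleanAmountTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("green" if result.wasSuccessful() else "red")
```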
Business Intelligence Tool
The ultimate goal of any data engineering task in a company is to provide business-relevant data to various stakeholders like Management, Sales, Product owners, Data Scientists, and customer support. Those stakeholders are not going to run SQL queries on the database. They will access the data through a BI tool like Tableau, Looker, Periscope, Power BI, QlikView, Sisense, etc.
The landscape of data analysis has changed. Data discovery and anomaly detection are no longer done only by data engineering specialists. Any business stakeholder can do them. This is the advantage that comes with using a BI tool.
Ultimately, all your data engineering tasks are going to be judged on the way your data shows up in the BI tool. Therefore it is of paramount importance to learn its specifics. You have to learn to create explorers, views to store and update business information, join relevant tables from your database to build a coherent business story, and finally maintain the various resources in your BI tool.
So pick any tool of your choice. If you want to go by popularity, pick either Tableau or Looker or Power BI.
MS Excel/Google Sheets
Now BI tools have their own place. But that doesn’t mean tools like MS Excel or Google Sheets are going to become irrelevant or unnecessary.
My experience has shown me that a lot of stakeholders still feel comfortable working with such tools because they provide flexibility and control over data entry. BI tools do not offer this flexibility; they ingest transformed data provided by the data engineering pipeline.
Knowing one of these tools well might even earn you brownie points with your managers!
So these are the 8 essential skill sets required for you to get into the field of Data Engineering. Don’t be intimidated by the list. Take one step at a time and take it often. Keep learning new things and implement your knowledge on various practical projects.
Do you believe this list misses out on certain other relevant skills? Do let me know on Twitter.