How to Become a Data Engineer

There seems to be a slew of blog posts on the internet about how to become a Data Engineer, but I have yet to see one from an actual person working as a Data Engineer.

To back up what I am saying, here is a Google search I did recently for this post:

All of the results come from either massive open online courses trying to make money off people who want to become Data Engineers or random companies that seem to have Data Engineering services for hire. Well, I think it's time to shine some light on the secret of how to become a Data Engineer.

Ready? Alright then, without any further ado:

There is no single best path, or one and only path, to becoming a Data Engineer!

Yes... you read that right. There is no single best path, or one and only path, to becoming a Data Engineer. One reason is how new the field is compared to other areas of computer science, which is itself still a young field compared to other engineering disciplines (electrical, mechanical, etc.). Another reason becoming a Data Engineer is not straightforward is that the needed versus wanted skillset is hard to pinpoint: most companies have different needs and wants but do not know how to hire for them. As a result, many job descriptions are actually copies of another company's job advertisement that a hiring manager or human resources specialist found online and modified to fit the specific needs and wants of their own team or company.

I think I just gave away management's secret on how to search for Data Engineering talent...

Though the previous paragraph might leave you confused about how to become a Data Engineer, companies copying their job ads from one another does produce a consensus. The skills most job ads ask for include:

  • Experience with programming languages and frameworks like Python, Spark, SQL, or Scala
  • Knowledge of or experience with cloud services like AWS, Azure, and Google Cloud
  • Knowledge of data warehousing, database optimization, schema creation, and ETL tools like Apache Airflow

Nice to haves:

  • Can build systems on cloud services using infrastructure-as-code tools like Terraform or YAML templates
  • Statistics knowledge or training

Of course, the points above are generalized and some job postings might differ in the tools they want, but most job ads revolve around these bullet points.

But wait... I work as a Data Engineer and I do not have knowledge or experience in statistics, and frankly I do not think you need it, because all we do at my job is spin up systems, move data into and out of them, and make sure other people can use them to create ETL jobs or reports.

Or:

I work as a Data Engineer and all I do is write Python scripts and SQL every day to create tables, reports, or dashboards, and I help data analysts and scientists get their data so they can carry out their job tasks. Plus, I am in charge of the databases and their performance.

-Random Data Engineers reading this post

Because Data Engineering is so new, and because companies think differently about what skills they want or need on their team when copying job ads from one another, there seem to be two major areas of focus. The first is the Data Warehouse focused or Business Intelligence Reporting focused Data Engineer, previously better known as a Database Architect. The second is the Cloud Infrastructure focused Data Engineer, AKA a Cloud Architect with data manipulation experience. Does this mean a DBA or analyst can change their title to Data Engineer and reap the benefits of a new field and a higher salary? No, not for most people. In essence, Data Engineers should strive to be well-rounded individuals who can perform multiple tasks in multiple areas of computer science.

Thus the saying of:

Jack of all trades. Master of none but better than a master of one.

Fits Data Engineers perfectly. 

So then, how do I become a Data Engineer? Well, to start, Data Engineers should have fundamental knowledge of computer science at a bachelor's or master's level. They should be experienced in two or more programming languages: first a multipurpose language like Python, then SQL, followed by Apache Spark (yes, in that order). They should have been exposed, personally or through work, to cloud services like AWS or Azure, and know how to spin up and use systems securely while keeping them fast and costs low. They should be aware of database technologies and how to optimize for them. Moreover, they should automate as much of their work as they can in code instead of manually entering things into systems, and they should be able to jump in and help analysts, scientists, and SREs when something goes wrong with their data pipelines.

TL;DR

To become a Data Engineer you should have knowledge and experience in:

  1. Computer science fundamentals
  2. Programming languages
    1. Python
    2. SQL
    3. Spark
    4. Others
  3. Cloud services
    1. AWS
    2. Azure
    3. Google Cloud
    4. Others
  4. ETL Tools
    1. Apache Spark
    2. Apache Airflow
    3. Other ETL tools
  5. Query optimization in:
    1. Postgres
    2. Hive
    3. MySQL
    4. MSSQL
    5. Others
  6. Coursework or experience in
    1. Cleaning data for statistical computations
    2. MapReduce
    3. Statistical modeling
    4. Machine Learning modeling
    5. Hyperparameter estimation
    6. Explaining how algorithms work to stakeholders, or understanding what your analyst/scientist is saying
  7. Knowledge of cyber security best practices
  8. Experience in tools like Terraform to spin up cloud services
  9. Monitoring and alerting of production systems

To elaborate a little more on my recommendations: I say everyone should have computer science fundamentals because with them you will be able to optimize things better and understand the complexities involved when the business wants to make sure your data pipeline can respond to 5,000 requests per second, or why the SQL script you created for the report is taking longer every day.
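As a rough illustration of the "report taking longer every day" problem, here is a minimal sketch using Python's built-in sqlite3 module. The events table, its columns, and the idx_events_date index are made-up names for this example and do not come from any real system. It asks SQLite for its query plan before and after adding an index on the filtered column. The exact wording of the plan text varies between SQLite versions, so treat the output as indicative only.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the readable 'detail' text of SQLite's plan for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The detail text is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


# Hypothetical table that grows a little every day.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, event_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO events (event_date, amount) VALUES (?, ?)",
    [("2024-01-%02d" % (i % 28 + 1), float(i)) for i in range(1000)],
)

report_sql = "SELECT SUM(amount) FROM events WHERE event_date = ?"

# Without an index, the daily report has to scan every row ever inserted.
plan_before = query_plan(conn, report_sql, ("2024-01-15",))

# With an index on the filter column, it can look up just the matching rows.
conn.execute("CREATE INDEX idx_events_date ON events (event_date)")
plan_after = query_plan(conn, report_sql, ("2024-01-15",))

print("before:", plan_before)
print("after: ", plan_after)
```

The first plan should describe a scan of the whole table, which is work that grows with every day of data, while the second should mention the index. Knowing why that difference matters is the kind of thing the fundamentals give you.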

For programming, having a multipurpose language like Python will save you when the client has a weird requirement in their data and you have to code around it, SQL will just not cut it, and Spark was not made for that type of problem (e.g. fixed-width files with slow change data capture). I recommend you learn the languages in that particular order, since most people using Spark actually code in Spark SQL.
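To make the fixed-width case concrete, here is a small sketch in plain Python. The column layout, the field names, and the "trailing minus sign means negative" quirk are all hypothetical, invented for this example. The point is only that this kind of odd client rule takes a few lines in a general-purpose language.

```python
# Hypothetical layout: (field name, start offset, end offset) per column.
LAYOUT = [
    ("customer_id", 0, 6),
    ("name", 6, 26),
    ("balance_cents", 26, 36),
]


def parse_fixed_width(lines):
    """Yield one dict per non-blank line, slicing each field by position."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = {name: line[start:end].strip() for name, start, end in LAYOUT}
        # Made-up client quirk: amounts are zero-padded, and a trailing '-'
        # marks a negative value.
        raw = record["balance_cents"]
        sign = -1 if raw.endswith("-") else 1
        record["balance_cents"] = sign * int(raw.rstrip("-"))
        yield record


# Build sample lines with ljust so the column widths are exact.
sample = [
    "000042" + "Ada Lovelace".ljust(20) + "0000012500",
    "000043" + "Grace Hopper".ljust(20) + "000000300-",
]

rows = list(parse_fixed_width(sample))
for row in rows:
    print(row)
```

Once the records are clean dicts like this, loading them into a table and doing the rest in SQL or Spark is the easy part.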

For the coursework, I say get a master's degree or take a MOOC (not likely to be good) that can teach data science concepts in detail. A lot of software engineers and analysts think that by just importing scikit-learn or using a deep learning model they can solve all business problems thrown at them, with the added benefit that it all sounds fancy enough that stakeholders will be happy to boast about the buzzwords. But in time, the big data algorithm will become more difficult to maintain and its output harder to explain. After a while, stakeholders will lose confidence in your convoluted answers when they ask how it works and why you chose that algorithm over another fancy model that they learned about in another blog post. Getting a higher-level degree in analytics will teach you that the simpler the math that accurately predicts or explains a problem, the more powerful that mathematical model is.

I think I just saved you thousands of dollars in tuition...

Learning these skills in this particular order should make you a well-rounded Data Engineer. Do not worry if you do not have a certain skillset now. Remember that having a career means constant learning, and this constant learning is what we must always strive for in order to work in the smartest and most efficient way possible.

In all, the path to becoming a Data Engineer is a long one if you are starting from scratch, but it is doable. If you are already working as one or plan to transition into a data engineering role, I hope this article helps you find what you need to learn in order to be an awesome Data Engineer.