A beginner’s road map to learning data science (from the perspective of an ex-school teacher with an MSc in Data Science)
In 2014, I became a computer science teacher in secondary and sixth form schools. Teaching the students new skills was fantastic, and seeing them go from no coding experience to being confident programmers was indeed a spectacle to behold. However, I found that my own growth within the computing world had become very stagnant. Because of the time restrictions of a typical teaching environment, I was not truly able to develop my own skills past what was expected at GCSE and A-Level. That led me to set goals for the summer holidays to try something new within the computing world. One summer, as a bit of an Apple fanboy, I learnt Apple’s Swift programming language, and the following summer I decided to look into machine learning (ML). I knew Python was one of the main languages powering it, and as I was using the language daily in the day job, I thought it would be as straightforward as importing a library. Why not? How wrong I was.
I soon discovered that ML was a part of programming that came under the umbrella of data science. Still, when I started taking online courses on it, I was baffled about why everything was based around statistics and probability. I thought I could just import ML into a script and away I would go, a very naive, black-box view. I soon realised that ML is actually a toolbox, and the art of using ML is knowing which tool to use for which job. Uncovering this led me to leave teaching and pursue a master’s in advanced computer science, which focused heavily on data science and ML. This has now led to me doing a PhD in human-centred artificial intelligence, which I would never have thought would happen, but data science is an exciting topic.
So this article is for people in the position I was in before I started my MSc: wanting to start doing ML but a bit overwhelmed by all the advice and not knowing where to start. It is a list of the critical concepts you need to know to get started, and it doesn’t involve the maths. In my experience, a lot of the “learn data science in X days” guides focus heavily on the maths behind ML. While this is helpful to know, I believe it is not needed to get started. To master ML, however, it most definitely is.
There is a lot of debate about which programming language people should learn. Is it R or Python? In my opinion, it shouldn’t be one or the other; it should be both, as each serves its own purpose. However, for starting out, I would strongly recommend Python. This is not only because of its friendly and clear syntax, which is the reason often given, but also because of its versatility. By learning Python, you can also make desktop applications or web-based apps. It can indeed do a lot.
If you are starting with no programming experience, I suggest you begin with the basics: sequence (code running line by line), iteration (loops) and conditions (if statements). You should also get to know the different data types and structures. These include strings, integers, floats, lists, tuples and dictionaries. I would leave dictionaries until last: they are very helpful once you start handling large datasets, but they are probably the trickiest to begin with.
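To make these basics concrete, here is a minimal sketch that touches each of them once. The names and values (a student called Ada and her scores) are made up purely for illustration.

```python
# Sequence: statements run one after another, top to bottom.
name = "Ada"        # string
age = 36            # integer
height_m = 1.65     # float

# Lists are ordered and can be changed; tuples are ordered and fixed.
scores = [72, 85, 90]     # list
point = (51.5, -0.12)     # tuple

# Iteration: the for loop visits each score in turn.
# Conditions: the if statement keeps only scores of 80 or more.
high_scores = []
for score in scores:
    if score >= 80:
        high_scores.append(score)

# Dictionaries map keys to values, a bit like one row of a dataset.
student = {"name": name, "age": age, "scores": scores}
average = sum(student["scores"]) / len(student["scores"])

print(high_scores)
print(round(average, 2))
```

Once dictionaries feel comfortable, it is a small step to think of a dataset as many such rows that share the same keys.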
Once you have mastered the previous suggestions, I would strongly recommend understanding functions and procedures. These features are critical for improving your programming ability and the readability and reusability of your code, allowing you to create cleaner and more flowing code. Once you have understood functions and procedures, I would recommend learning object-oriented programming (OOP) concepts. While I wouldn’t say it’s essential to be able to confidently create OOP applications, I think it is vital that you understand the theory of what OOP is doing. This will help a lot when you start looking at the big libraries that are listed next.
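As a rough illustration of the difference, the sketch below shows a function (returns a value), a procedure (does something but returns nothing) and a small class that bundles data with the operations on it. The names `average`, `print_report` and `Dataset` are hypothetical and chosen only for this example.

```python
def average(values):
    """Function: takes input and returns a result."""
    return sum(values) / len(values)


def print_report(label, values):
    """Procedure: performs an action (printing) and returns nothing."""
    print(f"{label}: mean = {average(values):.1f}")


class Dataset:
    """A tiny class that keeps some values together with their name."""

    def __init__(self, name, values):
        self.name = name
        self.values = list(values)

    def add(self, value):
        # A method changes or reads the object's own data via self.
        self.values.append(value)

    def mean(self):
        return average(self.values)


temps = Dataset("temperatures", [18.0, 21.0])
temps.add(24.0)
print_report(temps.name, temps.values)
```

The big data science libraries follow the same pattern: you create an object, then call its methods, so recognising this shape makes their documentation much easier to read.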
The first library I would suggest you explore is Pandas. It will allow you to load datasets, manipulate them and visualise them. While other libraries for creating visualisations, such as Matplotlib and Altair, offer more functionality, Pandas will allow you to create, amend, delete and import your data while also visualising it. I feel it is a great place to start, and an excellent book to help you with this is Python for Data Analysis. The next library I recommend you look into is scikit-learn. This is where you move into the realms of ML. I would suggest looking into clustering, such as K-Means and Gaussian Mixture Models (GMM), and regression, such as linear and logistic. At this stage, I feel you do not need to know exactly how everything works under the hood, but it is good to understand in theory what the algorithms are doing. An excellent place to learn this is the YouTube channel StatQuest.
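To give a feel for how the two libraries fit together, here is a minimal sketch, assuming pandas and scikit-learn are installed. The dataset is made up and deliberately tidy (scores rise by exactly 5 per hour studied), so it only shows the shape of the workflow, not a realistic analysis.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Made-up data: hours studied and the exam score achieved.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 8, 9, 10],
    "exam_score": [35, 40, 45, 70, 75, 80],
})

# Pandas: amend the data by adding a column, then summarise it.
df["passed"] = df["exam_score"] >= 50
print(df.describe())

# Clustering: ask K-Means to split the students into two groups.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(df[["hours_studied", "exam_score"]])

# Regression: fit a straight line from hours studied to exam score.
model = LinearRegression()
model.fit(df[["hours_studied"]], df["exam_score"])
predicted = model.predict(pd.DataFrame({"hours_studied": [5]}))[0]

print(df)
print(round(predicted, 1))
```

Notice that both models are used the same way: create the object, call `fit`, then ask it for predictions. To visualise the data from Pandas, something like `df.plot.scatter(x="hours_studied", y="exam_score")` works once Matplotlib is also installed.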
Once you have mastered these, it is worth looking into TensorFlow and learning about the different types of neural network. However, getting to this stage will take some time. I would suggest at least 6 months, but remember, it’s a marathon and not a sprint. Make sure you understand the concepts well and feel comfortable working within these environments. While there is no need to remember everything off the top of your head, as that’s what Google is for, it is recommended that you reach a level where you can read someone else’s code and get the information you need from it.
- Functions/Procedures
- Concept of OOP
- scikit-learn