- Learn To Code
- Get Familiar With Git
- Agile Development
- Software Engineering Culture
- Learn how a Computer Works
- Data Network Transmission
- Security and Privacy
- Linux
- Docker
- The Cloud
- Security Zone Design
Why this is important: Without coding you cannot do much in data engineering. I cannot count the number of times I needed a quick Java hack.
The possibilities are endless:
-
Writing or quickly getting some data out of a SQL DB
-
Testing to produce messages to a Kafka topic
-
Understanding the source code of a Java Webservice
-
Reading counter statistics out of a HBase key value store
So, which language do I recommend then?
I highly recommend Java. It's everywhere!
When you are getting into data processing with Spark you should use Scala. But, after learning Java this is easy to do.
Also Python is a great choice. It is super versatile.
Personally however, I am not that big into Python. But I am going to look into it
Where to Learn? There's a Java Course on Udemy you could look at: https://www.udemy.com/java-programming-tutorial-for-beginners
-
OOP Object oriented programming
-
What are Unit tests to make sure what you code is working
-
Functional Programming
-
How to use build management tools like Maven
-
Resilient testing (?)
I talked about the importance of learning by doing in this podcast: https://anchor.fm/andreaskayy/episodes/Learning-By-Doing-Is-The-Best-Thing-Ever---PoDS-035-e25g44
Why this is important: One of the major problems with coding is to keep track of changes. It is also almost impossible to maintain a program you have multiple versions of.
Another problem is the topic of collaboration and documentation, which is super important.
Let's say you work on a Spark application and your colleagues need to make changes while you are on holiday. Without some code management they are in huge trouble:
Where is the code? What have you changed last? Where is the documentation? How do we mark what we have changed?
But if you put your code on GitHub your colleagues can find your code. They can understand it through your documentation (please also have in-line comments)
Developers can pull your code, make a new branch and do the changes. After your holiday you can inspect what they have done and merge it with your original code and you end up having only one application.
Where to learn: Check out the GitHub Guides page where you can learn all the basics: https://guides.github.com/introduction/flow/
This great GitHub commands cheat sheet saved my butt multiple times: https://www.atlassian.com/git/tutorials/atlassian-git-cheatsheet
Also look into:
-
Pull
-
Push
-
Branching
-
Forking
GitHub uses markdown to write pages. A super simple language that is actually a lot of fun to write. Here's a markdown cheat cheatsheet: https://www.markdownguide.org/cheat-sheet/
Pandoc is a great tool to convert any text file from and to markdown: https://pandoc.org
Agility, the ability to adapt quickly to changing circumstances.
These days everyone wants to be agile. Big or small company people are looking for the "startup mentality".
Many think it's the corporate culture. Others think it's the process how we create things that matters.
In this article I am going to talk about agility and self-reliance. About how you can incorporate agility in your professional career.
Historically, development is practiced as an explicitly defined process. You think of something, specify it, have it developed and then built in mass production.
It's a bit of an arrogant process. You assume that you already know exactly what a customer wants. Or how a product has to look and how everything works out.
The problem is that the world does not work this way!
Often times the circumstances change because of internal factors.
Sometimes things just do not work out as planned or stuff is harder than you think.
You need to adapt.
Other times you find out that you build something customers do not like and need to be changed.
You need to adapt.
That's why people jump on the Scrum train. Because Scrum is the definition of agile development, right?
Yes, Scrum or Google's OKR can help to be more agile. The secret to being agile however, is not only how you create.
What makes me cringe is people trying to tell you that being agile starts in your head. So, the problem is you?
No!
The biggest lesson I have learned over the past years is this: Agility goes down the drain when you outsource work.
I know on paper outsourcing seems like a no-brainer: Development costs against the fixed costs.
It is expensive to bind existing resources on a task. It is even more expensive if you need to hire new employees.
The problem with outsourcing is that you pay someone to build stuff for you.
It does not matter who you pay to do something for you. He needs to make money.
His agenda will be to spend as little time as possible on your work. That is why outsourcing requires contracts, detailed specifications, timetables and delivery dates.
He doesn't want to spend additional time on a project, only because you want changes in the middle. Every unplanned change costs him time and therefore money.
If so, you need to make another detailed specification and a contract change.
He is not going to put his mind into improving the product while developing. Firstly because he does not have the big picture. Secondly because he does not want to.
He is doing as he is told.
Who can blame him? If I was the subcontractor I would do exactly the same!
Does this sound agile to you?
Doing everything in house, that's why startups are so productive. No time is wasted on waiting for someone else.
If something does not work, or needs to be changed, there is someone in the team who can do it right away.
One very prominent example who follows this strategy is Elon Musk.
Tesla's Gigafactories are designed to get raw materials in on one side and spit out cars on the other. Why do you think Tesla is building Gigafactories who cost a lot of money?
Why is SpaceX building its one space engines? Clearly there are other, older, companies who could do that for them.
Why is Elon building tunnel boring machines at his new boring company?
At first glance this makes no sense!
If you look closer it all comes down to control and knowledge. You, your team, your company, needs to do as much as possible on your own. Self-reliance is king.
Build up your knowledge and therefore the teams knowledge. When you have the ability to do everything yourself, you are in full control.
You can build electric cars, rocket engines or bore tunnels.
Don't largely rely on others and be confident to just do stuff on your own.
Dream big and JUST DO IT!
PS. Don't get me wrong. You can still outsource work. Just do it in a smart way by outsourcing small independent parts.
There's an interesting Scrum Medium publication with a lot of details about Scrum: https://medium.com/serious-scrum
Also this scrum guide webpage has good infos about Scrum: https://www.scrumguides.org/scrum-guide.html
I personally love OKR, and have been using it for years. Especially for smaller teams, OKR is great. You don't have a lot of overhead and get work done. It helps you stay focused and look at the bigger picture.
I recommend to do a sync meeting every Monday. There you talk about what happened last week and what you are going to work on this week.
I talked about this in this Podcast: https://anchor.fm/andreaskayy/embed/episodes/Agile-Development-Is-Important-But-Please-Dont-Do-Scrum--PoDS-041-e2e2j4
This is also this awesome 1,5 hours startup guide from Google: https://youtu.be/mJB83EZtAjc I really love this video, I rewatched it multiple times.
The software engineering and development culture is super important. How does a company handle product development with hundreds of developers. Check out this podcast:
Podcast Episode: #070 Engineering Culture At Spotify |
---|
In this podcast we look at the engineering culture at Spotify, my favorite music streaming service. The process behind the development of Spotify is really awesome. |
Watch on YouTube \ Listen on Anchor |
Some interesting slides:
https://labs.spotify.com/2014/03/27/spotify-engineering-culture-part-1/
https://labs.spotify.com/2014/09/20/spotify-engineering-culture-part-2/
I talked about computer hardware and GPU processing in this podcast: https://anchor.fm/andreaskayy/embed/episodes/Why-the-hardware-and-the-GPU-is-super-important--PoDS-030-e23rig
The OSI Model describes how data is flowing through the network. It consists of layers starting from physical layers, basically how the data is transmitted over the line or optic fiber.
Cisco page that shows the layers of the OSI model and how it works: https://learningnetwork.cisco.com/docs/DOC-30382
Check out this page: https://www.studytonight.com/computer-networks/complete-osi-model
The Wikipedia page is also very good: https://en.wikipedia.org/wiki/OSI_model
Check out this network protocol map. Unfortunately it is really hard to find it theses days: https://www.blackmagicboxes.com/wp-content/uploads/2016/12/Network-Protocols-Map-Poster.jpg
Check out this IP Address and Subnet guide from Cisco: https://www.cisco.com/c/en/us/support/docs/ip/routing-information-protocol-rip/13788-3.html
A calculator for Subnets: https://www.calculator.net/ip-subnet-calculator.html
For an introduction to how Ethernet went from broadcasts, to bridges, to Ethernet MAC switching, to Ethernet & IP (layer 3) switching, to software defined networking, and to programmable data plane that can switch on any packet field and perform complex packet processing, see this video: https://youtu.be/E0zt_ZdnTcM?t=144
I talked about Network Infrastructure and Techniques in this podcast: https://anchor.fm/andreaskayy/embed/episodes/IT-Networking-Infrastructure-and-Linux-031-PoDS-e242bh
Link to the Wiki page: https://en.wikipedia.org/wiki/JSON_Web_Token
The EU created the GDPR "General Data Protection Regulation" to protect your personal data like: Your name, age, where you live and so on.
It's huge and quite complicated. If you want to do online business in the EU you need to apply these rules. The GDPR is applicable since May 25th 2018. So, if you haven't looked into it, now is the time.
The penalties can be crazy high if you do mistakes here.
Check out the full GDPR regulation here: https://gdpr-info.eu
By the way, if you do profiling or in general analyse big data, look into it. There are some important regulations. Unfortunately.
I spend months with GDPR compliance. Super fun. Not! Hahaha
When should you look into privacy regulations and solutions?
Creating the product or service first and then bolting on the privacy is a bad choice. The best way is to start implementing privacy right away in the engineering phase.
This is called privacy by design. Privacy as an integral part of your business, not just something optional.
Check out the Wikipedia page to get a feeling of the important principles: https://en.wikipedia.org/wiki/Privacy_by_design
Linux is very important to learn, at least the basics. Most Big Data tools or NoSQL databases are running on Linux.
From time to time you need to modify stuff through the operation system. Especially if you run an infrastructure as a service solution like Cloudera CDH, Hortonworks or a MapR Hadoop distribution.
Show all historic commands
history | grep docker
Ah, creating shell scripts in 2019? Believe it or not scripting in the command line is still important.
Start a process, automatically rename, move or do a quick compaction of log files. It still makes a lot of sense.
Check out this cheat sheet to get started with scripting in Linux: https://devhints.io/bash
There's also this Medium article with a super simple example for beginners: https://medium.com/@saswat.sipun/shell-scripting-cheat-sheet-c0ecfb80391
Cron jobs are super important to automate simple processes or jobs in Linux. You need this here and there I promise. Check out this three guides:
https://linuxconfig.org/linux-crontab-reference-guide
https://www.ostechnix.com/a-beginners-guide-to-cron-jobs/
And of course Wikipedia, which is surprisingly good: https://en.wikipedia.org/wiki/Cron
Pro tip: Don't forget to end your cron files with an empty line or a comment, otherwise it will not work.
Linux Tips are the second part of this podcast: https://anchor.fm/andreaskayy/embed/episodes/IT-Networking-Infrastructure-and-Linux-031-PoDS-e242bh
Have you played around with Docker yet? If you're a data science learner or a data scientist you need to check it out!
It's awesome because it simplifies the way you can set up development environments for data science. If you want to set up a dev environment you usually have to install a lot of packages and tools.
What this does is you basically mess up your operating system. If you're just starting out, you don't know which packages you need to install. You don't know which tools you need to install.
If you want to for instance start with Jupyter notebooks you need to install that on your PC somehow. Or you need to start installing tools like PyCharm or Anaconda.
All that gets added to your system and so you mess up your system more and more and more. What Docker brings you, especially if you're on a Mac or a Linux system is simplicity.
Because it is so easy to install on those systems. Another cool thing about docker images is you can just search them in the Docker store, download them and install them on your system.
Running them in a completely pre-configured environment. You don't need to think about stuff, you go to the Docker library you search for Deep Learning, GPU and Python.
You get a list of images you can download. You download one, start it up, you go to the browser hit up the URL and just start coding.
Start doing the work. The only other thing you need to do is bind some drives to that instance so you can exchange files. And then that's it!
There is no way that you can crash or mess up your system. It's all encapsulated into Docker. Why this works is because Docker has native access to your hardware.
It's not a completely virtualized environment like a VirtualBox. An image has the upside that you can take it wherever you want. So if you're on your PC at home use that there.
Make a quick build, take the image and go somewhere else. Install the image which is usually quite fast and just use it like you're at home.
It's that awesome!
I am getting into Docker a lot more myself. For a bit different reasons.
What I'm looking for is using Docker with Kubernetes. With Kubernetes you can automate the whole container deployment process.
The idea with is that you have a cluster of machines. Lets say you have a 10 server cluster and you run Kubernetes on it.
Kubernetes lets you spin up Docker containers on-demand to execute tasks. You can set up how much resources like CPU, RAM, Network, Docker container can use.
You can basically spin up containers, on the cluster on demand. When ever you need to do a analytics task.
Perfect for Data Science.
Podcast about how data science learners use Docker (for data scientists): https://anchor.fm/andreaskayy/embed/episodes/Learn-Data-Science-Go-Docker-e10n7u
Create a container:
docker run CONTAINER --network NETWORK
Start a stopped container:
docker start CONTAINER NAME
Stop a running container:
docker stop
List all running containers
docker ps
List all containers including stopped ones
docker ps -a
Inspect the container configuration. For instance network settings and so on:
docker inspect CONTAINER
List all available virtual networks:
docker network ls
Create a new network:
docker network create NETWORK --driver bridge
Connect a running container to a network
docker network connect NETWORK CONTAINER
Disconnect a running container from a network
docker network disconnect NETWORK CONTAINER
Remove a network
docker network rm NETWORK
Check out this Podcast it will help you understand where's the difference and how to decide on what you are going to use.
Podcast Episode: #082 Reading Tweets With Apache Nifi & IaaS vs PaaS vs SaaS |
---|
In this episode we are talking about the differences between infrastructure as a service, platform as a service and application as a service. Then we install the Nifi docker container and look into how we can extract the twitter data. |
Watch on YouTube \ Listen on Anchor |
Each of these have there own answer to IaaS, Paas and SaaS. Pricing and pricing models vary greatly between each provider. Likewise each provider's service may have limitations and strengths.
Google's offerings referred to as Google Cloud Platform provides wide variety of services that is ever evolving. List of GCP services with brief description. In recent years documentation and tutorials have com a long way to help getting started with GCP. You can start with a free account but to use many of the services you will need to turn on billing. Once you do enable billing always remember to turn off services that you have spun up for learning purposes. It is also a good idea to turn on billing limits and alerts.
Podcast Episode: #076 Cloud vs On-Premise |
---|
How do you choose between Cloud vs On-Premises, pros and cons and what you have to think about. Because there are good reasons to not go cloud. Also thoughts on how to choose between the cloud providers by just comparing instance prices. Otherwise the comparison will drive you insane. My suggestion: Basically use them as IaaS and something like Cloudera as PaaS. Then build your solution on top of that. |
Watch on YouTube \ Listen on Anchor |
Listen to a few thoughts about the cloud in this podcast: https://anchor.fm/andreaskayy/embed/episodes/Dont-Be-Arrogant-The-Cloud-is-Safer-Then-Your-On-Premise-e16k9s
Hybrid clouds are a mixture of on-premises and cloud deployment. A very interesting example for this is Google Anthos:
https://cloud.google.com/anthos/
(UI in different zone then SQL DB)
I talked about security zone design and lambda architecture in this podcast: https://anchor.fm/andreaskayy/embed/episodes/How-to-Design-Security-Zones-and-Lambda-Architecture--PoDS-032-e248q2