Just a few years ago, there was no position in technology called a Data Scientist. In the next decade, this could be the most important position in IT. After all, we need someone to make sense of the massive amount of data being created. The Data Scientist could become the most sought-after and highly compensated position in new tech.
Position Created From Growth of Data
Data is being accumulated at an incredible rate. Experts suggest that by the end of 2010, 1.2 Zettabytes of data will exist in the world, which represents an increase of 50% over 2009. What's a Zettabyte, you ask? A Zettabyte is equal to 1 billion terabytes.
To give you a scale of how much data this is, consider these examples:
- Mark Liberman has stated that the storage requirements for all human speech for all of time are 42 Zettabytes. This assumes 16 kHz 16-bit audio encoding.
- If every man, woman and child everywhere in the world tweeted continuously for 100 years, that would amount to 1 Zettabyte of data.
- 1.2 Zettabytes is the amount of data held on 75 billion 16 GB Apple iPads (full).
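The unit arithmetic behind these comparisons is easy to check. Here is a minimal sketch, assuming decimal (SI) units where 1 GB is 10^9 bytes; the variable names are purely illustrative.

```python
# Decimal (SI) storage units, expressed in bytes (an assumption;
# binary units such as GiB would give slightly different numbers).
GB = 10**9
TB = 10**12
ZB = 10**21

# How many terabytes fit in one Zettabyte?
terabytes_per_zettabyte = ZB // TB

# Capacity of 75 billion full 16 GB tablets, in Zettabytes.
ipad_bytes = 75 * 10**9 * 16 * GB
ipad_zettabytes = ipad_bytes / ZB

# If 1.2 ZB is a 50% increase over 2009, the 2009 total is 1.2 / 1.5.
implied_2009_zettabytes = 1.2 / 1.5

print(f"1 ZB = {terabytes_per_zettabyte:,} TB")
print(f"75 billion 16 GB iPads hold {ipad_zettabytes:.1f} ZB")
print(f"Implied 2009 total: {implied_2009_zettabytes:.1f} ZB")
```

Under these assumptions, the tablet comparison works out to 1.2 Zettabytes, the same figure quoted for the end of 2010, and the implied 2009 total is about 0.8 Zettabytes.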
I think you get the point. There is a lot of data in the world today. We generally do a poor job dealing with data at our companies. Imagine trying to mine the world's data and trying to make sense of it. How about trying to monetize it? We should be thankful that someone is taking on that challenge.
What is a Data Scientist?
A Data Scientist is a computer-oriented scientist with experience in data mining, analysis, visualization and predictive modeling as they relate to very large volumes of data. The primary job of a Data Scientist is to bring together related and unrelated data and turn it into information to support decisions and exploit opportunities. Data Science will create opportunities for start-ups, and most people feel we are at the beginning of the wave.
We are beginning to see Data Scientists at emerging tech companies in addition to the Fortune 500. Advances in data warehousing, cloud computing and software have made it possible to mine massive amounts of data using desktop tools.
Companies such as Facebook, Twitter and LinkedIn are sitting on mountains of data. We are beginning to see visualizations showing how things are connected and related. Take a look at the Facebook visualization created by Paul Butler showing the connected world of Facebook users. It's amazing. B2B and B2C companies are also accumulating volumes of data on their customers and users. In addition, there are public suppliers of data that have been emerging in the last couple of years.
The Data Scientist is tasked with digging into all of that data, creating relationships to connect it and turning it into actionable information. More importantly, the Data Scientist is trying to figure out how their companies can make money from the information provided.
Skills Required to be a Data Scientist
What makes a Data Scientist valuable is the combination of skills required to do the job. Each of the skills mentioned here can be a career all by itself. With all these skills, you will be in high demand and working on the cutting edge of new technology.
- Computer Science:
A solid background in programming and computer science is required to be a Data Scientist. One of the principles of data science is that data lives in many different sources. A Data Scientist needs to be able to extract data from databases such as SQL Server and Oracle, as well as from NoSQL platforms. In addition, they must be able to programmatically consume web services and parse data from XML or text files. Without this skill, accumulating data would be virtually impossible.
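To make that concrete, here is a minimal sketch of pulling records from a database and from an XML document and then combining them. Everything in it is an assumption for illustration: an in-memory SQLite database stands in for a production system such as SQL Server or Oracle, the XML string stands in for a web service response, and the table, fields and function names are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML payload, standing in for a web service response.
ORDERS_XML = """<orders>
  <order customer_id="1" total="250.00"/>
  <order customer_id="2" total="40.00"/>
  <order customer_id="1" total="99.50"/>
</orders>"""


def load_customers():
    """Read customer rows from a database (in-memory SQLite as a stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers (id, name) VALUES (?, ?)",
        [(1, "Acme Corp"), (2, "Globex")],
    )
    rows = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
    conn.close()
    return [{"id": row[0], "name": row[1]} for row in rows]


def load_orders(xml_text):
    """Parse order records out of an XML document."""
    root = ET.fromstring(xml_text)
    return [
        {
            "customer_id": int(order.get("customer_id")),
            "total": float(order.get("total")),
        }
        for order in root.findall("order")
    ]


def total_by_customer(customers, orders):
    """Join the two sources on customer id and sum the order totals."""
    totals = {customer["id"]: 0.0 for customer in customers}
    for order in orders:
        if order["customer_id"] in totals:
            totals[order["customer_id"]] += order["total"]
    return {customer["name"]: totals[customer["id"]] for customer in customers}


if __name__ == "__main__":
    print(total_by_customer(load_customers(), load_orders(ORDERS_XML)))
```

In practice the connection would point at a real database and the XML would arrive over HTTP, but the pattern of extracting, parsing and joining stays the same.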
- Mathematics, Statistics, & Data Mining:
This is pretty obvious, but a Data Scientist must have a background in statistics and mathematics. In addition, they need to be able to work with popular statistical software such as SPSS and SAS.
- Graphic Design & Visualization:
The ability to create visualizations that tell the story you are trying to convey is critical to the Data Scientist. The importance of graphics, maps, dashboards and other visualizations is evident in the road maps of all the major database providers. There are also a number of companies whose entire portfolio of products is based solely on data visualizations.
Being creative is essential to today's Data Scientist. The ability to combine disparate data without keys or to scrape a website requires skills common among hackers.
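As one illustration of combining data without keys, here is a rough sketch that links company names from two sources by fuzzy string matching, using Python's standard difflib module. The sample names, the normalize and match_names helpers, and the 0.6 similarity cutoff are all assumptions chosen for the example, not a recommended recipe.

```python
import difflib
import re


def normalize(name):
    """Lowercase a name, drop punctuation and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


def match_names(names, reference_names, cutoff=0.6):
    """Map each name to its closest reference name, or None if nothing is close.

    The cutoff is a similarity ratio between 0 and 1; 0.6 is an arbitrary
    starting point that would need tuning on real data.
    """
    # Index the reference names by normalized form. If two reference names
    # normalize to the same string, the later one wins in this sketch.
    by_normalized = {normalize(ref): ref for ref in reference_names}
    matches = {}
    for name in names:
        close = difflib.get_close_matches(
            normalize(name), list(by_normalized), n=1, cutoff=cutoff
        )
        matches[name] = by_normalized[close[0]] if close else None
    return matches


if __name__ == "__main__":
    # Hypothetical data: names scraped from the web vs. names in a CRM.
    scraped = ["ACME Corp", "globex inc", "Umbrella Group"]
    crm = ["Acme Corporation", "Globex Inc.", "Initech LLC"]
    for scraped_name, crm_name in match_names(scraped, crm).items():
        print(f"{scraped_name!r} -> {crm_name!r}")
```

Fuzzy matches like these are guesses rather than facts, so in a real project the results would need to be reviewed before being trusted.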