Different Types of Data Scientists
- Strong in statistics: they sometimes develop new statistical theories for big data. They are experts in statistical modeling, experimental design, sampling, clustering, data reduction, confidence intervals, testing, predictive modeling, and other related techniques.
- Strong in mathematics: NSA (National Security Agency) or defense/military people working on big data, astronomers, and operations research people doing analytic business optimization (inventory management and forecasting, pricing optimization, supply chain, quality control, yield optimization) as they collect, analyze, and extract value out of data.
- Strong in data engineering, Hadoop, database/memory/file systems optimization and architecture, APIs, Analytics as a Service, optimization of data flows, data plumbing.
- Strong in machine learning/computer science (algorithms, computational complexity)
- Strong in business, ROI optimization, decision sciences (dashboard design, metric mix selection and metric definitions, high-level database design)
- Strong in code development, software engineering (they know a few programming languages)
- Strong in visualization
- Strong in GIS, spatial data, data modeled by graphs, graph databases
- Strong in a few of the above. After 20 years of experience across many industries, big and small companies (and lots of training),
A different categorization would be creative versus mundane. The “creative” category has a better future, as mundane work can be outsourced (anything published in textbooks or on the web can be automated or outsourced; job security is based on how much you know that no one else knows or can easily learn). Along the same lines, we have science users (those using science, that is, practitioners; often they do not have a PhD), innovators (those creating new science, called researchers), and hybrids. Most data scientists, like geologists helping predict earthquakes, or chemists designing new molecules for big pharma, are scientists, and they belong to the user category.
Data Scientist versus Business Analyst: Business analysts focus on database design (database modeling, at a high level, including defining metrics, dashboard design, retrieving and producing executive reports and designing alarm systems), ROI assessment on various business projects and expenditures, and budget issues. Some work on marketing or finance planning and optimization, and risk management. Many work on high-level project management, reporting directly to executives.
Some of these tasks are sometimes performed by data scientists as well, particularly in smaller companies: metric creation and definition, high-level database design (which data should be collected, and how), or computational marketing, even growth hacking (a word recently coined to describe the art of growing Internet traffic exponentially fast, which can involve engineering and analytic skills).
There is also room for data scientists to help with the business analyst’s job, for instance by helping automate the production of reports and making data extraction much faster. You can teach a business analyst FTP and fundamental UNIX commands: ls -l, rm -i, head, tail, cat, cp, mv, sort, grep, uniq -c, and the pipe and redirect operators (|, >). Then you write and install a piece of code on the database server (the server traditionally accessed by the business analyst via a browser or via tools such as Toad or Brio) to retrieve data. Then, all the business analyst has to do is
- create a SQL query (even with visual tools) and save it as a SQL text file,
- upload it to the server and run your program (for instance a Python script that reads the SQL file, executes it, retrieves the data, and stores the results in a CSV file),
- then transfer the output (CSV file) to their machine for further analysis.
Such collaboration is win-win for the business analyst and the data scientist. In practice it has helped business analysts extract data sets 100 times bigger than what they are used to, and 10 times faster.
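As an illustration of the kind of program described above, here is a minimal sketch of a script that reads a SQL file, executes it, and stores the results in a CSV file. It is not the author's actual tool: the function name run_sql_to_csv is hypothetical, and Python's built-in sqlite3 module stands in for whatever database driver the production server would really use.

```python
import csv
import sqlite3


def run_sql_to_csv(conn, sql_path, csv_path):
    """Read one SQL query from sql_path, run it, and write the rows to csv_path.

    conn is an open DB-API connection (sqlite3 here, as a stand-in).
    Returns the number of data rows written.
    """
    with open(sql_path, encoding="utf-8") as f:
        query = f.read()

    cursor = conn.execute(query)
    # cursor.description holds one tuple per output column; item 0 is its name.
    header = [column[0] for column in cursor.description]

    n_rows = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        for row in cursor:  # stream rows instead of loading them all at once
            writer.writerow(row)
            n_rows += 1
    return n_rows
```

With such a function installed on the server, the analyst's step would reduce to one call, for example run_sql_to_csv(conn, "query.sql", "output.csv"), where both file names are placeholders.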
Conclusion: Data scientists are not business analysts, but they can greatly help them, including by automating the business analyst’s tasks. Also, a data scientist who can bring the extra value and experience described here might find it easier to get a job, especially in a company where there is a budget for one position only and the employer is unsure whether to hire a business analyst (carrying out all analytic and data tasks) or a data scientist (who is business savvy and can perform some of the tasks traditionally assigned to business analysts). In general, business analysts are hired first, and if data and algorithms become too complex, a data scientist is brought in. If you create your own startup, you need to wear both hats: data scientist and business analyst.
Data Scientist versus Statistician: Many statisticians think that data science is about analyzing data, but it is more than that. Data science also involves implementing algorithms that process data automatically, to provide automated predictions and actions, such as:
- Automated bidding systems
- Estimating (in real time) the value of all houses in the United States (Zillow.com)
- High-frequency trading
- Matching a Google Ad with a user and a web page to maximize chances of conversion
- Returning highly relevant results to any Google search
- Book and friend recommendations on Amazon.com or Facebook
- Tax fraud detection and detection of terrorism
- Scoring all credit card transactions (fraud detection)
- Computational chemistry to simulate new molecules for cancer treatment
- Early detection of a pandemic
- Analyzing NASA pictures to find new planets or asteroids
- Weather forecasts
- Automated piloting (planes and cars)
- Client-customized pricing system (in real time) for all hotel rooms
The problems cover astronomy, fraud detection, social network analytics, search engines, finance (transaction scoring), environment, drug development, trading, engineering, pricing optimization (retail), energy (smart grids), bidding and arbitrage systems.
All this involves both statistical science and terabytes of data. Most people doing this stuff do not call themselves statisticians. They call themselves data scientists.
Statisticians have been gathering data and performing linear regressions for several centuries. DAD (discover / access / distill) performed by statisticians 300 years ago, 20 years ago, today, or in 2015 for that matter, has little to do with DAD performed by data scientists today. The key message here is that eventually, as more statisticians pick up on these new skills and more data scientists pick up on statistical science (sampling, experimental design, confidence intervals, and not just the ones described in chapter 5 in our book), the frontier between data scientists and statisticians will blur. Indeed, I can see a new category of data scientists emerging: data scientists with strong statistical knowledge, just as we already have a category of data scientists with significant engineering experience (Hadoop).
Also, what makes data scientists different from computer scientists is that they have a much stronger statistics background, especially in computational statistics, but sometimes also in experimental design, sampling, and Monte Carlo simulations.
Data Scientist versus Data Engineer: One of the main differences between a data scientist and a data engineer has to do with ETL versus DAD:
- ETL (Extract/Transform/Load) is for data engineers, or sometimes data architects or database administrators (DBA).
- DAD (Discover/Access/Distill) is for data scientists.
Data engineers tend to focus on software engineering, database design, production code, and making sure data is flowing smoothly between source (where it is collected) and destination (where it is extracted and processed, with statistical summaries and output produced by data science algorithms, eventually moved back to the source or elsewhere). Data scientists, while they need to understand this data flow (and how it is optimized, especially when working with Hadoop), don’t actually optimize the data flow itself, but rather the data processing step: extracting value from data. But they work with engineers and business people to define the metrics, design data collecting schemes, and make sure data science processes integrate efficiently with the enterprise data systems (storage, data flow). This is especially true for data scientists working in small companies, and a reason why data scientists should be able to write code (more and more, Python) that is reusable by engineers.
Sometimes data engineers do DAD, and sometimes data scientists do ETL, but it’s not common, and when they do it’s usually internal. For example, the data engineer may do a bit of statistical analysis to optimize some database processes, or the data scientist may do a bit of database management to manage a small, local, private database of summarized information.
DAD comprises:
- Discover: Find, identify the sources of good data, and the metrics. Sometimes request the data to be created (work with data engineers and business analysts).
- Access: Access the data. Sometimes via an API, a web crawler, an Internet download, a database access or sometimes in-memory within a database.
- Distill: Extract essence from data, the stuff that leads to decisions, increased ROI, and actions (such as determining optimum bid prices in an automated bidding system). It involves:
  - Exploring the data (creating a data dictionary and exploratory analysis)
  - Cleaning (removing impurities)
  - Refining (data summarization, sometimes multiple layers of summarization or hierarchical summarization)
  - Analyzing: statistical analyses (sometimes including stuff like experimental design that can take place even before the Access stage), both automated and manual. Might or might not require statistical modeling
  - Presenting results or integrating results in some automated process
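To make the first Distill steps concrete, here is a small sketch of exploring, cleaning, and refining a handful of records in plain Python. Everything in it is an assumption made for illustration: the RAW records, the field names, the cleaning rule (drop missing or negative amounts), and the two-layer summarization are hypothetical, not taken from any real system.

```python
from collections import defaultdict
from statistics import mean

FIELDS = ("client", "day", "amount")

# Hypothetical raw records; None marks a missing value.
RAW = [
    ("c1", "2013-01-01", 20.0),
    ("c1", "2013-01-01", 30.0),
    ("c1", "2013-01-02", 10.0),
    ("c2", "2013-01-01", None),
    ("c2", "2013-01-02", 10.0),
    ("c1", "2013-01-02", -5.0),  # negative amount, treated as an impurity here
]


def explore(records):
    """Exploring: a tiny data dictionary with distinct and missing counts per field."""
    summary = {}
    for i, name in enumerate(FIELDS):
        values = [r[i] for r in records]
        summary[name] = {
            "distinct": len({v for v in values if v is not None}),
            "missing": sum(v is None for v in values),
        }
    return summary


def clean(records):
    """Cleaning: drop records with a missing or negative amount (assumed rule)."""
    return [r for r in records if r[2] is not None and r[2] >= 0]


def refine(records):
    """Refining: two layers of summarization.

    Layer 1 totals the amount per (client, day);
    layer 2 averages those daily totals per client.
    """
    daily = defaultdict(float)
    for client, day, amount in records:
        daily[(client, day)] += amount
    per_client = defaultdict(list)
    for (client, _day), total in daily.items():
        per_client[client].append(total)
    return {client: mean(totals) for client, totals in per_client.items()}
```

Calling refine(clean(RAW)) yields one summarized value per client, the kind of compact extract that the Analyzing and Presenting steps would then work from.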
Data science is at the intersection of computer science, business engineering, statistics, data mining, machine learning, operations research, six sigma, automation, and domain expertise. It brings together a number of techniques, processes, and methodologies from different fields, together with business vision and action. Data science is about bridging the different components that contribute to business optimization at large, and eliminating the silos that slow down business efficiency. It has its own unique core, too, including (for instance) the following topics discussed in my book (listed in the “related articles” section):
- Clustering and taxonomy creation for large datasets (chapters 2 and 4)
- Internet topology (chapter 4)
- Model-free confidence intervals (chapter 5)
- Analytics as a Service, APIs (chapter 5)
- Hadoop / Map-Reduce (chapter 5)
- Fast feature selection (chapter 6)
- Predictive power of a feature (chapter 6)
- Advanced visualizations (chapter 4)
- The curse of big data (chapter 2)
- What Map-Reduce can’t do (chapter 2)
- Keyword correlations in big data (chapter 4)
- Eleven features any database, SQL, or NoSQL should have (chapter 4)
- Correlation and R-squared for big data (chapter 4)
- Statistical modeling without models (chapter 4)
- Linear regression on an unusual domain, hyperplane, sphere, or simplex (chapter 1)
Caveat: Some employers are looking for Java or database developers with strong statistical knowledge. These professionals are very rare, so instead the employers sometimes try to hire a data scientist, hoping he/she is strong in developing production code. If you don’t have that level of Java or database expertise, it can be a waste of time to attend these interviews. During your phone interview, you should ask upfront whether the position to be filled is a Java developer with statistics knowledge or a statistician with strong Java skills. Sometimes, though, the hiring manager is unsure what he/she really wants, and you might be able to convince him/her to hire someone like you if you explain the added value that your expertise brings. It is easier for an employer to get a Java software engineer to learn statistics (especially using this book as training material) than the other way around.
Data Scientist versus Data Architect: I recently had the following discussions with a number of data architects, in different communities, in particular (but not limited to) the TDWI group on LinkedIn. This is a summary of the discussion, featuring differences between data scientists and data architects, and how both can work together.
It shows some of the challenges that still need to be addressed before this new analytics revolution is complete. Following are several questions asked by data architects and database administrators, and my answers. The discussion is about optimizing joins in SQL queries, or just moving away from SQL altogether. Several modern databases now offer many of the features discussed here, including hash table joins and fine-tuning the query optimizer by end users. The discussion illustrates the conflicts between data scientists, data architects, and also business analysts. It also touches on many innovative concepts.
Question: You say that one of the bottlenecks with SQL is users writing queries with (say) three joins, when these queries could be split into two queries each with two joins. Can you elaborate?
Answer: Typically, the way I write SQL code is to embed it into a programming language such as Python, and store all lookup tables that I need as hash tables in memory. So I rarely have to do a join, and when I do, it’s just two tables at most.
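The approach described above can be sketched as follows: lookup tables sit in memory as Python dictionaries (hash tables), and each row of the main table is enriched by dictionary lookups instead of SQL joins. The table contents, field names, and the function name join_with_lookups are hypothetical; in practice the fact rows would come from a single-table SQL query.

```python
# Hypothetical lookup tables, loaded once and kept in memory as hash tables.
COUNTRY_BY_IP_PREFIX = {"10.1": "US", "10.2": "FR"}
CATEGORY_BY_KEYWORD = {"loan": "finance", "flight": "travel"}


def join_with_lookups(fact_rows, country_lookup, category_lookup):
    """Enrich each (ip_prefix, keyword, clicks) row with two lookups.

    Behaves like a left join: rows with no match keep a default value
    rather than being dropped. Each lookup is O(1) on average.
    """
    for ip_prefix, keyword, clicks in fact_rows:
        yield (
            ip_prefix,
            keyword,
            clicks,
            country_lookup.get(ip_prefix, "unknown"),
            category_lookup.get(keyword, "other"),
        )
```

Because the function is a generator, the fact table can be streamed row by row; only the lookup tables need to fit in memory.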
In some (rare) cases in which lookup tables were too big to fit in memory, I used sampling methods and worked with subsets and aggregation rules. A typical example is when a field in your data set (web log files) is a user agent (browser, abbreviated as UA). You have more unique UAs than can fit in memory, but as long as you keep the 10 million most popular, and aggregate the 200,000,000 rare UAs into a few million categories (based on UA string), you get good results in most applications.
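A minimal sketch of this idea: keep the most popular user agents as they are, and fold every rare one into a coarse category derived from its string. The aggregation rule used here (keep the text before the first slash) and all function names are assumptions chosen for illustration; the set of popular UAs is assumed to be estimated from a sample of the log.

```python
from collections import Counter


def top_user_agents(sample, keep_top):
    """Estimate the set of popular UAs from a sample of the log."""
    return {ua for ua, _count in Counter(sample).most_common(keep_top)}


def coarse_bucket(ua):
    """Assumed aggregation rule: keep only the product name before the first '/'."""
    return "rare:" + ua.split("/", 1)[0]


def normalize_ua(ua, popular):
    """Popular UAs are kept verbatim; rare ones collapse into a coarse bucket."""
    return ua if ua in popular else coarse_bucket(ua)


def aggregate_counts(ua_stream, popular):
    """Count hits per normalized UA while streaming through the log."""
    counts = Counter()
    for ua in ua_stream:
        counts[normalize_ua(ua, popular)] += 1
    return counts
```

Memory use is then bounded by the number of popular UAs plus the number of coarse buckets, rather than by the number of distinct UA strings.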
Being an algorithm expert (not an SQL expert), it takes me a couple of minutes to do an efficient four-table join via hash tables in Python (using my own script templates). Most of what I do is advanced analytics, not database aggregation: advanced algorithms, but simple to code in Python, such as hidden decision trees. Anyway, my point here is more about non-expert SQL users such as business analysts: Is it easier or more effective to train them to write better SQL code including sophisticated joins, or to train them to learn Python and blend it with SQL code?
To be more specific, what I have in mind is a system where you have to download the lookup tables not very often (maybe once a week) and access the main (fact) table more frequently. If you must re-upload the lookup tables very frequently, then the Python approach loses its efficiency, and you make your colleagues unhappy because of your frequent downloads that slow down the whole system.
Question: People like you (running Python or Perl scripts to access databases) are a DBA’s worst nightmare. Don’t you think you are a source of problems for DBAs?
Answer: Because I’m much better at Python and Perl than SQL, my Python or Perl code is bug-free, easy-to-read, easy-to-maintain, optimized, robust, and re-usable. If I coded everything in SQL, it would be much less efficient. Most of what I do is algorithms and analytics (machine learning stuff), not querying databases. I only occasionally download lookup tables onto my local machine (saved as hash tables and stored as text files), since most don’t change that much from week to week. When I need to update them, I just extract the new rows that have been added since my last update (based on time stamp). And I do some tests before running an intensive SQL script to get an idea of how much time and resources it will consume, and to see whether I can do better. I am an SQL user, just like any statistician or business analyst, not an SQL developer.
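The incremental update mentioned above (extract only the rows added since the last update, based on time stamp) might look like the sketch below. The table name lookup, its columns code, label, and updated_at, and the function name refresh_lookup are all hypothetical, and sqlite3 stands in for the real database driver. Time stamps are assumed to be stored as ISO-8601 text, so that string comparison matches chronological order.

```python
import sqlite3


def refresh_lookup(conn, cache, last_sync):
    """Pull rows changed after last_sync into the local cache (a dict).

    Returns the new high-water mark to store for the next refresh.
    Note: with a strict '>' comparison, a row written later with exactly
    the same time stamp as last_sync would be missed.
    """
    rows = conn.execute(
        "SELECT code, label, updated_at FROM lookup "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_sync,),
    ).fetchall()
    for code, label, updated_at in rows:
        cache[code] = label
        last_sync = max(last_sync, updated_at)
    return last_sync
```

Only the new rows cross the network on each refresh, which is the point of keeping a local copy in the first place.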
But I agree we need to find a balance to minimize data transfers and processes, possibly by having better analytic tools available where the data resides. At a minimum, we need the ability to easily deploy Python code there in non-production tables and systems, and be allocated a decent amount of disk space (maybe 200 GB) and memory (at least several GB).
Question: What are your coding preferences?
Answer: Some people feel more comfortable using a scripting language rather than SQL. SQL can be perceived as less flexible and prone to errors, producing wrong output without anyone noticing due to a bug in the joins.
You can write simple Perl code, which is easy to read and maintain. Perl enables you to focus on the algorithms rather than the code and syntax. Unfortunately, many Perl programmers write obscure code, which creates a bad reputation for Perl (code maintenance and portability). But this does not have to be the case.
You can break down a complex join into several smaller joins using multiple SQL statements and views. You would assume that the DB engine would digest your not-so-efficient SQL code and turn it into something much more efficient. At least you can test this approach and see if it works as fast as one single complex query with many joins. Breaking down multiple joins into several simple statements allows business analysts to write simple SQL code, which is easy for fellow programmers to reuse or maintain.
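Here is a small sketch of that test, using an in-memory SQLite database as a stand-in for a real server. A query with two joins is compared against the same logic split into a view plus a simpler query. The schema, the data, and the names orders_with_region, build_demo_db, and compare are all hypothetical; timing each variant on a real system is left to the reader.

```python
import sqlite3

# One statement with two joins.
SINGLE_QUERY = """
SELECT c.region, p.category, SUM(o.amount) AS total
FROM orders o
JOIN clients c ON c.client_id = o.client_id
JOIN products p ON p.product_id = o.product_id
GROUP BY c.region, p.category
ORDER BY c.region, p.category
"""

# Same logic in two simpler steps: a view with one join...
STEP_1_VIEW = """
CREATE VIEW orders_with_region AS
SELECT o.product_id, o.amount, c.region
FROM orders o
JOIN clients c ON c.client_id = o.client_id
"""

# ...then a query with one join on top of that view.
STEP_2_QUERY = """
SELECT r.region, p.category, SUM(r.amount) AS total
FROM orders_with_region r
JOIN products p ON p.product_id = r.product_id
GROUP BY r.region, p.category
ORDER BY r.region, p.category
"""


def build_demo_db():
    """Create a tiny in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE clients (client_id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE products (product_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, client_id INTEGER,
                             product_id INTEGER, amount INTEGER);
        INSERT INTO clients VALUES (1, 'east'), (2, 'west');
        INSERT INTO products VALUES (10, 'books'), (20, 'toys');
        INSERT INTO orders VALUES (1, 1, 10, 5), (2, 1, 20, 7),
                                  (3, 2, 10, 3), (4, 1, 10, 4);
    """)
    return conn


def compare(conn):
    """Run both variants and return their result sets for comparison."""
    single = conn.execute(SINGLE_QUERY).fetchall()
    conn.execute(STEP_1_VIEW)
    stepwise = conn.execute(STEP_2_QUERY).fetchall()
    return single, stepwise
```

If the two result sets match, the remaining question is only whether the stepwise version runs as fast as the single statement, which depends on the database engine and has to be measured.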
It would be interesting to see some software that automatically corrects SQL syntax errors (not SQL logical errors). It would save lots of time for many non-expert SQL coders like me, as the same typos that typically occur over and over could be automatically fixed. In the meantime, you can use GUIs to produce decent SQL code, using tools provided by most database vendors or open-source, such as Toad for Oracle.
Question: Why do you claim that these built-in SQL optimizers are usually black-box technology for end users? Do you think parameters can’t be fine-tuned by the end user?
Answer: I always like to have a bit of control over what I do, though not necessarily a whole lot. For instance, I’m satisfied with the way Perl handles hash tables and memory allocation. I’d rather use the Perl black-box memory allocation/hash table management system than create it myself from scratch in C, or even worse, write a compiler. I’m just a bit worried about black-box optimization: I’ve seen the damage created by non-expert users who recklessly used black-box statistical software. I’d feel more comfortable if I had at least a bit of control, even as simple as sending an email to the DBA, having her look at my concern or issue, and having her help improve my queries, maybe fine-tuning the optimizer, if deemed necessary and beneficial for the organization and to other users.
Question: Don’t you think your approach is 20 years old?
Answer: The results are more important than the approach, as long as the process is reproducible. If I can beat my competitors (or help my clients do so) with whatever tools I use, then, as one would say, “if it ain’t broke, don’t fix it.” Sometimes I use APIs (for example, Google APIs), sometimes I use external data collected with a web crawler, sometimes Excel or Cubes are good enough, and sometimes vision combined with analytic acumen and intuition (without using any data) works well. Sometimes I use statistical models, and other times a very modern architecture is needed. Many times, I use a combination of many of these. I have several examples of “light analytics” doing better than sophisticated architectures.
Question: Why did you ask whether your data-to-analytic approach makes sense?
Answer: I asked the question because something has been bothering me, based on not-so-old observations (3-4 years old) in which the practices that I mention are well entrenched in the analytic community (by analytic, I mean machine learning, statistics, and data mining, not ETL). It is also an attempt to see if it’s possible to build a better bridge between two very different communities: data scientists and data architects. Database builders often (but not always) need the data scientist to bring insights and value out of organized data. And the data scientists often (but not always) need the data architect to build great, fast, efficient data processing systems so they can better focus on analytics.
Question: So you are essentially maintaining a cache system with regular, small updates to a local copy of the lookup tables. Two users like you doing the same thing would end up with two different copies after some time. How do you handle that?
Answer: You are correct that two users having two different copies (cache) of lookup tables causes problems. Although in my case, I tend to share my cache with other people, so it’s not like five people working on five different versions of the lookup tables. Although I am a senior data scientist, I am also a designer/architect, but not a DB designer/architect, so I tend to have my own local architecture that I share with a team. Sometimes my architecture is stored in a local small DB and occasionally on the production databases, but many times as organized flat files or hash tables stored on local drives, or somewhere in the cloud outside the DB servers, though usually not very far if the data is big. Many times, my important “tables” are summarized extracts — either simple aggregates that are easy to produce with pure SQL, or sophisticated ones such as transaction scores (by client, day, merchant, or more granular) produced by algorithms too complex to be efficiently coded in SQL.
The benefit of my “caching” system is to minimize time-consuming data transfers that penalize everybody in the system. The drawback is that I need to maintain it, and essentially, I am replicating a process already in place in the database system itself.
Finally, for a statistician working on data that is almost correct (not the most recent version of the lookup table, but rather data stored in this “cache” system and updated rather infrequently), or working on samples, this is not an issue. Usually the collected data is an approximation of a signal we try to capture and measure, and it is always messy. The same can be said about predictive models: the ROI extracted from a very decent dataset (my “cache”), from the exact original most-recent version of the dataset, or from a version where 5 percent of noise is artificially injected into it is pretty much the same in most circumstances.
Question: Can you comment on code maintenance and readability?
Answer: Consider the issue of code maintenance when someone writing obscure SQL leaves the company — or worse, when SQL is ported to a different environment (not SQL) — and it’s a nightmare for the new programmers to understand the original code. If easy-to-read SQL (maybe with more statements, fewer elaborate high-dimensional joins) runs just as fast as one complex statement because of the internal user-transparent query optimizer, why not use the easy-to-read code instead? After all, the optimizer is supposed to make both approaches equivalent, right? In other words, if two pieces of code (one short and obscure; one longer and easy to read, maintain, and port) have the same efficiency because they are essentially turned into the same pseudo-code by the optimizer, I would favor the longer version that takes less time to write, debug, maintain, and so on.
There might be a market for a product that turns ugly, obscure, yet efficient code into nice, easy-to-read SQL: an “SQL beautifier.” It would be useful when migrating code to a different platform. This already exists to some extent, since you can easily visualize any query or set of queries with diagrams in all DB systems. The SQL beautifier would be in some ways similar to a program that translates Assembler into C++. In short, a reverse compiler or interpreter.