FAQ Stats

Introduction: Statistics forms the backbone of data science, or of any analysis for that matter. Sound knowledge of statistics helps an analyst make sound business decisions. On one hand, descriptive statistics helps us understand the data and its properties through measures of central tendency and variability. On the other hand, inferential statistics…
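As a rough illustration of the descriptive side, the sketch below computes common measures of central tendency and variability with Python's standard `statistics` module. The `orders` list is a hypothetical sample invented for this example; it is not taken from the post.

```python
import statistics

# Hypothetical sample (assumption): daily order counts for one week.
orders = [12, 15, 11, 15, 20, 18, 14]

# Central tendency: where the data is centred.
mean_orders = statistics.mean(orders)
median_orders = statistics.median(orders)
mode_orders = statistics.mode(orders)

# Variability: how spread out the data is.
value_range = max(orders) - min(orders)
sample_variance = statistics.variance(orders)  # sample variance, divides by n - 1
sample_std = statistics.stdev(orders)          # square root of the sample variance

print("mean:", mean_orders, "median:", median_orders, "mode:", mode_orders)
print("range:", value_range, "variance:", sample_variance, "std dev:", round(sample_std, 3))
```

The sample versions (`variance`, `stdev`) are used here on the assumption that the data is a sample rather than a full population; `pvariance` and `pstdev` would be the population counterparts.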

Clustering

Clustering is an unsupervised learning method, meaning that it is not used to predict an outcome, or dependent variable. The main goal is to segment a set of observations into similar groups, based on the available data. However, although clustering is not designed to predict anything, it can be useful to improve the accuracy of…
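One way to picture segmenting observations into similar groups is k-means. The excerpt does not name a specific algorithm, so k-means is an assumption here, and the points and starting centroids are invented for illustration. A minimal sketch:

```python
def k_means(points, centroids, iterations=10):
    """Group points around centroids (plain k-means, fixed iteration count)."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical 2-D observations forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
start = [(1.0, 1.0), (8.0, 8.0)]  # assumed starting centroids

centroids, clusters = k_means(points, start)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

No outcome variable appears anywhere in the code: the groups come only from distances between the observations themselves.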

Linear Regression

Linear regression is used to determine how an outcome variable, called the dependent variable, linearly depends on a set of known variables, called the independent variables. The dependent variable is typically denoted by y and the independent variables are denoted by x1, x2, …, xk, where k is the number of different independent variables. We are interested in…
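For the simplest case of a single independent variable (k = 1), the line can be fitted by ordinary least squares in closed form. The sketch below assumes that setting; the function name and the data are hypothetical and chosen only for illustration.

```python
def fit_simple_linear_regression(x, y):
    """Least-squares fit of y = intercept + slope * x for one independent variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: x is the independent variable, y the dependent variable.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_simple_linear_regression(x, y)
predicted = intercept + slope * 6.0  # prediction for a new value x = 6
print(round(intercept, 2), round(slope, 2), round(predicted, 2))
```

With more than one independent variable the same idea applies, but the coefficients are usually obtained with matrix methods or a statistics library rather than by hand.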

Logistic Regression

Logistic regression extends the idea of linear regression to cases where the dependent variable, y, has only two possible outcomes, called classes. Examples of dependent variables that could be used with logistic regression are whether a new business will succeed or fail, whether a loan will be approved or disapproved, and whether a stock…
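The link to linear regression can be sketched as follows: a linear combination of the independent variables is passed through the logistic function, which turns it into a probability between 0 and 1 for one of the two classes. The coefficients, feature meanings and 0.5 threshold below are assumptions made up for this example, not fitted values.

```python
import math


def predict_probability(coefficients, intercept, features):
    """Probability of class 1 under a logistic regression model."""
    linear_score = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear_score))


def predict_class(probability, threshold=0.5):
    """Turn a probability into one of the two classes (1 or 0)."""
    return 1 if probability >= threshold else 0


# Hypothetical loan model: features are (income score, existing debt score).
coefficients = [0.8, -0.5]
intercept = -3.0

applicant = [6.0, 1.0]
probability = predict_probability(coefficients, intercept, applicant)
print(round(probability, 3), predict_class(probability))
```

In practice the coefficients would be estimated from labelled data; this sketch only shows how a fitted model produces a prediction.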

Text Analytics

Text analytics is a set of techniques that model and structure the information content of textual sources, which are frequently loosely structured and complex. The ultimate goal is to convert text into data for analysis. One popular and commonly used text analytics technique is called “bag of words”. While fully understanding text is difficult, this approach…
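A bag-of-words representation can be sketched in a few lines: each document is reduced to word counts, ignoring word order, which turns text into numeric data. The tokenisation rule and the two sample sentences below are assumptions chosen for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase word occurs; word order is discarded."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


# Hypothetical documents, e.g. two short customer reviews.
documents = [
    "The service was great, really great!",
    "The delivery was late.",
]

bags = [bag_of_words(doc) for doc in documents]

# Shared vocabulary: every distinct word seen in any document, sorted.
vocabulary = sorted(set().union(*bags))

# Document-term matrix: one row per document, one column per vocabulary word.
matrix = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row of `matrix` can then be used as a set of independent variables in a model such as the ones described in the other posts.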

CART & Random Forests

Classification and regression trees (CART) and random forests are both tree-based methods. Trees are flexible, data-driven methods that determine an outcome using splits, or logical rules, on the independent variables. Trees can capture nonlinear relationships more easily than linear and logistic regression, and can be used for both a continuous outcome (like…

Intro to R

R was written by statisticians Ross Ihaka and Robert Gentleman.

"Data" is the plural of "datum".

The questions to ask before ingesting data can be abbreviated as FLOPS:

Format of your data?
Location of the data?
One-time activity?
Preprocessing required?
Size of the dataset?

Is your data variable categorical or numerical…