Data Science Interview Questions and Answers - 2

8. What are outliers? What are the various ways to detect them?

Outliers, also referred to as anomalies, are values in a dataset that lie far away from most of the other values.

It is important to deal with them because, when you build Machine Learning models, you either have to explain why they occur or remove them so that they don't distort the model.

The following are the most commonly used methods to detect them:

i.) Standard Deviation - Here, a value is usually treated as an outlier if it lies more than three standard deviations away from the mean.

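ii.) Boxplots - Here the data is summarized on a plot as a box spanning the first to the third quartile, with upper and lower whiskers marking the boundaries of the data (commonly 1.5 times the interquartile range beyond the box). Any values that lie beyond these whiskers are anomalies.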

iii.) DBSCAN Clustering - This method groups the data into clusters made up of core points and border points. Two points are treated as neighbours if they lie within a maximum distance "eps" of each other. A core point has at least a minimum number of neighbours, and a border point lies within "eps" of a core point without being a core point itself. Any points that are neither core points nor border points are called noise points.

The biggest challenge with this method is choosing the right value of "eps".

iv.) Isolation Forest - This method works differently from the other methods. It assumes that anomalies are few in number and that their attribute values differ from the normal values, so they can be isolated with fewer random splits than normal points. It assigns a score to each data point on that basis. This method works well with large datasets.

v.) Random Cut Forest - This method also works by associating a score with each data point. A low score means that the point is normal, while a high score tags it as an anomaly. This method works with both online (streaming) and offline data and can handle high-dimensional data.
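The first two methods above are simple enough to sketch with Python's standard library. The following is a minimal illustration, not a definitive implementation: the function names, the sample data, the threshold of 3 and the whisker factor of 1.5 are assumptions chosen for the example.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mean) / std > threshold]


def iqr_outliers(values, k=1.5):
    """Return values beyond the boxplot whiskers at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample: 19 readings between 10 and 13, plus one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 10, 11, 13, 12, 11, 12, 10, 13, 95]

print(zscore_outliers(data))
print(iqr_outliers(data))
```

Both functions flag only the value 95 for this sample. Note that an extreme value inflates the mean and the standard deviation themselves, so on very small samples the standard deviation rule can miss outliers that the quartile-based rule still catches.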

9. What are some common statistical problems that you would always stay attentive to as a Data Scientist?

Some of the statistical things that I would stay attentive to as a Data Scientist are:

i.) Ensure that the dataset is of high quality with no missing or redundant values.

ii.) Understand the objective function clearly so that you can build a good model that meets your expectations.

iii.) Look at the data closely and ask which model would work the best and why.

iv.) Make sure that the data entering the system at run time matches your assumptions. Data that differs from what the model was built on will give you wrong predictions.

v.) Run your model in an actual out-of-sample environment to ensure that it runs well in all conditions.

vi.) Start with a small set of data and make sure that your approach is right. A wrong output doesn't always mean a lack of data; it can also point to a problem with your approach.
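Points i and iv can be partly automated. The sketch below assumes one simple form of such a check: it records the range of a numeric feature at training time and flags incoming values that are missing or fall outside that range. The function names, the feature and the sample values are hypothetical.

```python
import math


def learn_expectations(training_values):
    """Record the value range seen at training time for one numeric feature."""
    return {"min": min(training_values), "max": max(training_values)}


def find_problems(incoming_values, expected):
    """Return (row_index, reason) pairs for values that break the expectations."""
    problems = []
    for i, value in enumerate(incoming_values):
        if value is None or (isinstance(value, float) and math.isnan(value)):
            problems.append((i, "missing"))
        elif not expected["min"] <= value <= expected["max"]:
            problems.append((i, "out of training range"))
    return problems


# Hypothetical feature: customer age seen in training vs. at run time.
training_ages = [18, 25, 40, 63]
incoming_ages = [30, None, 120, float("nan"), 18]

expected = learn_expectations(training_ages)
problems = find_problems(incoming_ages, expected)
for row, reason in problems:
    print(f"row {row}: {reason}")
```

A range check like this is only a first line of defence. It will not notice a shift in the distribution that stays within the training range.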