Data Analyst Interview Questions and Answers - 4

24. What are the important steps in the process of data validation?

Validation of data is performed in two major steps:

i.) Data Screening - Here, the entire dataset is screened for values that look erroneous or questionable, such as missing entries, out-of-range values, or inconsistent formats. These values are flagged and handled as per the guidelines.
ii.) Data Verification - In this step, every suspected value is individually examined and a decision is made on whether to accept it as a valid value, reject it, or replace it with a corrected value.
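
The two steps can be sketched in a few lines of Python. This is only an illustration under assumed rules: the function names `screen` and `verify`, the valid range of 0 to 120, and the policy of replacing missing values with the median while rejecting out-of-range ones are all hypothetical choices, not a standard procedure.

```python
from statistics import median

def screen(values, low, high):
    """Screening: return the positions of missing or out-of-range values."""
    return [i for i, v in enumerate(values)
            if v is None or not (low <= v <= high)]

def verify(values, suspects):
    """Verification: accept, replace, or reject each value (assumed policy)."""
    valid = [v for i, v in enumerate(values) if i not in suspects]
    fill = median(valid)  # replacement value for missing entries
    cleaned = []
    for i, v in enumerate(values):
        if i not in suspects:
            cleaned.append(v)      # accepted as valid
        elif v is None:
            cleaned.append(fill)   # missing value replaced
        # any other suspect (out of range) is rejected and dropped
    return cleaned

ages = [34, 29, None, 41, 230, 38]   # made-up sample data
suspects = screen(ages, low=0, high=120)
cleaned = verify(ages, suspects)
print(suspects, cleaned)
```

In practice the verification decision usually depends on domain rules, so the policy inside `verify` is the part most likely to change.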

25. What is logistic regression?

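It is a statistical method for modeling a categorical (typically binary) outcome as a function of one or more independent variables. It estimates the probability of the outcome by applying the logistic function to a weighted sum of the independent variables.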
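
As a rough sketch of how a fitted logistic regression model turns inputs into a probability, the snippet below applies the logistic (sigmoid) function to a weighted sum. The weights, bias, and input values are made-up numbers for illustration; in a real analysis they would be estimated from data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive outcome for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

weights = [0.8, -0.5]   # hypothetical fitted coefficients
bias = -0.2             # hypothetical intercept
p = predict_proba([1.5, 0.4], weights, bias)
label = 1 if p >= 0.5 else 0   # 0.5 is a common but arbitrary threshold
print(round(p, 3), label)
```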

26. What is an outlier?

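An outlier is a value that lies far from the overall pattern of the other values in a sample. There are two types of outliers:

i.) Univariate - an extreme value in a single variable.
ii.) Multivariate - an unusual combination of values across two or more variables.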
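
One common way to flag univariate outliers is the interquartile range (IQR) rule, sketched below. The multiplier of 1.5 is a widely used convention rather than a fixed law, the function name `iqr_outliers` is hypothetical, and the sample data is made up.

```python
from statistics import quantiles

def iqr_outliers(data, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(data, n=4)   # quartiles (Python 3.8+)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]

sample = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(sample))
```

Multivariate outliers need a method that looks at several variables together, for example a distance measure such as the Mahalanobis distance, which is not shown here.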

27. What is a hash table? What are hash table collisions?

A hash table is a data structure used to implement an associative array, that is, a map of keys to values. A hash function converts each key into the index of the slot where its value is stored.

A hash table collision occurs when two different keys hash to the same slot. Collisions cannot be avoided entirely, but they can be resolved using techniques like Open Addressing and Separate Chaining.

28. What is the difference between Separate Chaining and Open Addressing?

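Separate Chaining uses a secondary data structure, typically a linked list, to store all the items that hash to the same slot. Open Addressing keeps every item in the table itself: on a collision it probes other slots in a fixed sequence (for example linear probing, quadratic probing, or double hashing) and stores the item in the first empty slot.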
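
The difference can be seen in a minimal sketch of both strategies. The class names `ChainedTable` and `ProbingTable` are hypothetical, Python lists stand in for linked lists, the open-addressing version uses linear probing only, and deletion and resizing are left out to keep it short. The example relies on CPython hashing small integers to themselves, so keys 1 and 9 collide in a table of size 8.

```python
class ChainedTable:
    """Separate chaining: each slot holds a list of (key, value) pairs."""
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # colliding keys share the bucket

    def get(self, key):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        raise KeyError(key)


class ProbingTable:
    """Open addressing with linear probing: items live in the table itself."""
    def __init__(self, size=8):
        self.slots = [None] * size

    def _probe(self, key):
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)   # next slot, wrapping

    def put(self, key, value):
        for idx in self._probe(key):
            if self.slots[idx] is None or self.slots[idx][0] == key:
                self.slots[idx] = (key, value)       # first free slot wins
                return
        raise RuntimeError("table is full")

    def get(self, key):
        for idx in self._probe(key):
            if self.slots[idx] is None:              # empty slot: not stored
                break
            if self.slots[idx][0] == key:
                return self.slots[idx][1]
        raise KeyError(key)


chained, probing = ChainedTable(), ProbingTable()
for table in (chained, probing):
    table.put(1, "a")
    table.put(9, "b")   # 9 % 8 == 1, so this collides with key 1
print(chained.buckets[1], probing.slots[1], probing.slots[2])
```

With chaining, both pairs end up in bucket 1. With linear probing, the second key moves on to slot 2.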

29. What is collaborative filtering?

It is an algorithm that creates a list of recommendations for a user based on behavioral data collected from many users: items liked by users with similar behavior are recommended. For example, consider the "Recommended for you" list you see on the YouTube homepage every time you open it. This is based on the information the system has about you as a user.
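
A very small user-based version of the idea can be sketched as follows. The ratings, user names, and item names are invented, and scoring unseen items by similarity-weighted ratings is just one simple choice among many. It is not a description of how any real site computes its recommendations.

```python
import math

# Made-up ratings: user -> {item: rating}
ratings = {
    "ana":  {"video_a": 5, "video_b": 4, "video_c": 1},
    "ben":  {"video_a": 5, "video_b": 5, "video_d": 4},
    "cara": {"video_c": 5, "video_e": 4},
}

def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user, ratings, top_n=3):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("ana", ratings))
```

Here "ana" rates items much like "ben" does, so the item only "ben" has rated ranks ahead of the one only "cara" has rated.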

30. What do you know about n-gram?

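An n-gram is a contiguous sequence of n items (such as words or characters) from a text. An n-gram model is a probabilistic language model that predicts the next item in a sequence from the previous n-1 items.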
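
To make this concrete, here is a minimal bigram (n = 2) sketch that counts word pairs and estimates next-word probabilities from those counts. The function names and the toy sentence are assumptions for illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return all contiguous n-item sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def train_bigram_model(tokens):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for prev, nxt in ngrams(tokens, 2):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(model, prev):
    """Relative frequencies of the words seen after `prev`."""
    followers = model.get(prev, Counter())
    total = sum(followers.values())
    return {word: count / total for word, count in followers.items()}

tokens = "the cat sat on the mat the cat ran".split()
model = train_bigram_model(tokens)
probs = next_word_probs(model, "the")
print(probs)
```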

31. What do you know about MapReduce?

MapReduce is a framework for processing large data sets in parallel across a cluster. It splits the large data set into smaller subsets, processes each of them on a separate server (the map step), and then combines the intermediate results into the final output (the reduce step).
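
The flow can be imitated in a single Python process with the classic word-count example. This is only a sketch of the programming model: the function names are hypothetical, the "chunks" stand in for subsets that a real framework would send to separate servers, and nothing here runs in parallel.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

# Each string plays the role of one subset of the data.
chunks = ["big data big ideas", "data beats ideas", "big data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```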