Data Science Interview Questions and Answers
Last updated on 25th Sep 2020, Blog, Interview Question
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. Data science is related to data mining, machine learning and big data.
Data science is a “concept to unify statistics, data analysis and their related methods” in order to “understand and analyze actual phenomena” with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, domain knowledge and information science. Turing award winner Jim Gray imagined data science as a “fourth paradigm” of science (empirical, theoretical, computational and now data-driven) and asserted that “everything about science is changing because of the impact of information technology”.
1. What are some of the steps for data wrangling and data cleaning before applying machine learning algorithms?
There are many steps that can be taken when data wrangling and data cleaning. Some of the most common steps are listed below:
- 1. Data profiling: Almost everyone starts off by getting an understanding of their dataset. More specifically, you can look at the shape of the dataset with .shape and a description of your numerical variables with .describe().
- 2. Data visualizations: Sometimes, it’s useful to visualize your data with histograms, box plots, and scatterplots to better understand the relationships between variables and also to identify potential outliers.
- 3. Syntax errors: This includes making sure there’s no stray white space, making sure letter casing is consistent, and checking for typos. You can check for typos by using .unique() or by using bar graphs.
- 4. Standardization or normalization: Depending on the dataset you’re working with and the machine learning method you decide to use, it may be useful to standardize or normalize your data so that different scales of different variables don’t negatively impact the performance of your model.
- 5. Handling null values: There are a number of ways to handle null values including deleting rows with null values altogether, replacing null values with the mean/median/mode, replacing null values with a new category (eg. unknown), predicting the values, or using machine learning models that can deal with null values.
- 6. Other things include: removing irrelevant data, removing duplicates, and type conversion.
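A few of the steps above can be sketched with the Python standard library alone. The records and variable names below are made up for illustration; in practice you would more likely do the same work with pandas.

```python
from statistics import mean, median, pstdev

# Hypothetical raw records: (city, age); None marks a missing age.
raw = [(" Toronto", 34), ("toronto ", None), ("Ottawa", 29), ("OTTAWA", 41), ("Ottawa", 29)]

# Syntax errors: strip white space and make letter casing consistent.
cleaned = [(city.strip().title(), age) for city, age in raw]

# Remove exact duplicates while preserving order.
deduped = list(dict.fromkeys(cleaned))

# Null values: replace missing ages with the median of the known ages.
known_ages = [age for _, age in deduped if age is not None]
fill_value = median(known_ages)
imputed = [(city, age if age is not None else fill_value) for city, age in deduped]

# Standardization: rescale ages to zero mean and unit variance.
ages = [age for _, age in imputed]
mu, sigma = mean(ages), pstdev(ages)
standardized = [(age - mu) / sigma for age in ages]
```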
2. How to deal with unbalanced binary classification?
There are a number of ways to handle unbalanced binary classification (assuming that you want to identify the minority class):
- First, you want to reconsider the metrics that you’d use to evaluate your model. Accuracy might not be the best metric to look at, and an example shows why. Let’s say 99 bank withdrawals were not fraudulent and 1 withdrawal was. If your model simply classified every instance as “not fraudulent”, it would have an accuracy of 99% while catching none of the fraud! Therefore, you may want to consider using metrics like precision and recall.
- Another method to improve unbalanced binary classification is by increasing the cost of misclassifying the minority class. By increasing the penalty of such, the model should classify the minority class more accurately.
- Lastly, you can improve the balance of classes by oversampling the minority class or by undersampling the majority class.
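The bank withdrawal example can be checked in a few lines. This is only a sketch of the accuracy problem; the labels 0 and 1 are an arbitrary encoding chosen here.

```python
# 99 legitimate withdrawals (0) and 1 fraudulent withdrawal (1).
y_true = [0] * 99 + [1]
# A naive model that labels every withdrawal as "not fraudulent".
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: share of actual fraud cases that the model caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

The accuracy looks excellent, yet the recall shows that the single fraudulent withdrawal was missed.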
3. What is the difference between a box plot and a histogram?
Box Plot vs Histogram
While boxplots and histograms are visualizations used to show the distribution of the data, they communicate information differently.
Histograms are bar charts that show the frequency of a numerical variable’s values and are used to approximate the probability distribution of the given variable. It allows you to quickly understand the shape of the distribution, the variation, and potential outliers.
Boxplots communicate different aspects of the distribution of data. While you can’t see the shape of the distribution through a box plot, you can gather other information like the quartiles, the range, and outliers. Boxplots are especially useful when you want to compare multiple charts at the same time because they take up less space than histograms.
(Figure: how to read a boxplot.)
4. Describe different regularization methods, such as L1 and L2 regularization?
Both L1 and L2 regularization are methods used to reduce the overfitting of training data. Least Squares minimizes the sum of the squared residuals, which can result in low bias but high variance.
L2 Regularization, also called ridge regression, minimizes the sum of the squared residuals plus lambda times the slope squared. This additional term is called the Ridge Regression Penalty. This increases the bias of the model, making the fit worse on the training data, but also decreases the variance.
If you take the ridge regression penalty and replace the squared slope with the absolute value of the slope (still multiplied by lambda), then you get Lasso regression or L1 regularization.
L2 is less robust but has a stable solution and always one solution. L1 is more robust but has an unstable solution and can possibly have multiple solutions.
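Written out for the single-slope case described above, the three objectives look as follows. The symbols (intercept \beta_0, slope \beta_1, penalty weight \lambda) are notation chosen here, not taken from the original answer.

```latex
\begin{aligned}
\text{Least squares:}\quad & \min_{\beta_0,\beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \\
\text{Ridge (L2):}\quad & \min_{\beta_0,\beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 + \lambda \beta_1^2 \\
\text{Lasso (L1):}\quad & \min_{\beta_0,\beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 + \lambda \lvert \beta_1 \rvert
\end{aligned}
```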
StatQuest has an excellent video on Lasso and Ridge regression.
5. Neural Network Fundamentals
A neural network is a multi-layered model inspired by the human brain. Like the neurons in our brain, each unit in the network is called a node. Nodes are arranged in an input layer, one or more hidden layers, and an output layer. Each node in the hidden layers applies a function to its inputs, and the results are passed forward until they produce the values of the output layer. These functions are called activation functions; the sigmoid function is a common example.
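As a minimal sketch of what a single hidden node computes, the snippet below applies a sigmoid to a weighted sum of inputs. The inputs, weights and bias are arbitrary illustrative numbers.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """One node: weighted sum of the inputs plus a bias, passed through the sigmoid."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(weighted_sum)

# Hypothetical values, chosen only to show the calculation.
output = neuron(inputs=[0.5, -1.0], weights=[0.8, 0.2], bias=0.1)
```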
6. What is cross-validation?
Ans: Cross-validation is essentially a technique used to assess how well a model performs on a new independent dataset. The simplest example of cross-validation is when you split your data into two groups: training data and testing data, where you use the training data to build the model and the testing data to test the model.
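A common generalization of the single split is k-fold cross-validation, where every observation is used for testing exactly once. The helper below is a hypothetical sketch that only produces index splits; real projects would normally shuffle the data first and use a library routine.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
```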
7. How to define/select metrics?
Ans: There isn’t a one-size-fits-all metric. The metric(s) chosen to evaluate a machine learning model depends on various factors:
- Is it a regression or classification task?
- What is the business objective? Eg. precision vs recall
- What is the distribution of the target variable?
There are a number of metrics that can be used, including adjusted r-squared, MAE, MSE, accuracy, recall, precision, f1 score, and the list goes on.
8. Explain what precision and recall are
Recall attempts to answer “What proportion of actual positives was identified correctly?”
Precision attempts to answer “What proportion of positive identifications was actually correct?”
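In terms of confusion-matrix counts, with TP, FP and FN standing for true positives, false positives and false negatives, the two questions translate to:

```latex
\text{Recall} = \frac{TP}{TP + FN}
\qquad
\text{Precision} = \frac{TP}{TP + FP}
```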
9. Explain what a false positive and a false negative are. Why is it important to distinguish them from each other? Provide examples when false positives are more important than false negatives, false negatives are more important than false positives and when these two types of errors are equally important
Ans: A false positive is an incorrect identification of the presence of a condition when it’s absent.
A false negative is an incorrect identification of the absence of a condition when it’s actually present.
An example of when false negatives are more important than false positives is when screening for cancer. It’s much worse to say that someone doesn’t have cancer when they do, instead of saying that someone does and later realizing that they don’t.
This is a subjective argument, but false positives can be worse than false negatives from a psychological point of view. For example, a false positive for winning the lottery could be a worse outcome than a false negative because people normally don’t expect to win the lottery anyways.
10. What is the difference between supervised learning and unsupervised learning? Give concrete examples
Supervised learning involves learning a function that maps an input to an output based on example input-output pairs.
For example, if I had a dataset with two variables, age (input) and height (output), I could implement a supervised learning model to predict the height of a person based on their age.
Unlike supervised learning, unsupervised learning is used to draw inferences and find patterns from input data without references to labeled outcomes. A common use of unsupervised learning is grouping customers by purchasing behavior to find target markets.
11. Assume you need to generate a predictive model using multiple regression. Explain how you intend to validate this model
There are two main ways that you can do this:
A) Adjusted R-squared.
R-squared measures the proportion of variance in the dependent variable that is explained by the independent variables. In simpler terms, while the coefficients estimate trends, R-squared represents the scatter around the line of best fit.
However, adding another independent variable to a model never decreases the R-squared value and almost always increases it. Therefore, a model with several independent variables may seem to be a better fit even if it isn’t. This is where adjusted R² comes in. The adjusted R² compensates for each additional independent variable and only increases if a given variable improves the model more than would be expected by chance. This is important since we are creating a multiple regression model.
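The usual formula for adjusted R² is 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of independent variables. The numbers in this sketch are invented to show how a slightly higher R² can still yield a lower adjusted R² once many variables are added.

```python
def adjusted_r_squared(r_squared, n_samples, n_predictors):
    """Penalize R-squared for the number of independent variables used."""
    return 1 - (1 - r_squared) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical models fitted on the same 50 observations.
small_model = adjusted_r_squared(r_squared=0.80, n_samples=50, n_predictors=3)
large_model = adjusted_r_squared(r_squared=0.81, n_samples=50, n_predictors=10)
```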
B) Cross-validation.
A method familiar to most people is cross-validation: splitting the data into two sets, training and testing data. See the answer to question 6 for more on this.
12. What does NLP stand for?
NLP stands for Natural Language Processing. It is a branch of artificial intelligence that gives machines the ability to read and understand human languages.
13. When would you use random forests Vs SVM and why?
There are a couple of reasons why a random forest can be a better choice of model than a support vector machine:
- Random forests allow you to determine the feature importance directly. SVMs with non-linear kernels don’t offer this.
- Random forests are much quicker and simpler to build than an SVM.
- For multi-class classification problems, SVMs require a one-vs-rest method, which is less scalable and more memory intensive.
14. Why is dimension reduction important?
Dimensionality reduction is the process of reducing the number of features in a dataset. This is important mainly in the case when you want to reduce variance in your model (overfitting).
- 1. It reduces the time and storage space required
- 2. Removal of multicollinearity improves the interpretation of the parameters of the machine learning model
- 3. It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D
- 4. It avoids the curse of dimensionality
15. What is principal component analysis? Explain the sort of problems you would use PCA for.
In its simplest sense, PCA involves projecting higher dimensional data (eg. 3 dimensions) to a smaller space (eg. 2 dimensions). This results in lower-dimensional data (2 dimensions instead of 3) while every original variable still contributes to the new components.
PCA is commonly used for compression purposes, to reduce required memory and to speed up the algorithm, as well as for visualization purposes, making it easier to summarize data.
16. Why is Naive Bayes so bad? How would you improve a spam detection algorithm that uses naive Bayes?
One major drawback of Naive Bayes is its strong assumption that the features are conditionally independent of one another given the class, which is rarely the case in practice.
One way to improve such an algorithm that uses Naive Bayes is by decorrelating the features so that the assumption holds more closely.
17. What are the drawbacks of a linear model?
There are a couple of drawbacks of a linear model:
- A linear model holds some strong assumptions that may not be true in application. It assumes a linear relationship, multivariate normality, no or little multicollinearity, no auto-correlation, and homoscedasticity
- A linear model can’t be used for discrete or binary outcomes.
- You can’t vary the model flexibility of a linear model.
18. Do you think 50 small decision trees are better than a large one? Why?
Another way of asking this question is “Is a random forest a better model than a decision tree?” And the answer is yes because a random forest is an ensemble method that takes many weak decision trees to make a strong learner. Random forests are more accurate, more robust, and less prone to overfitting.
19. Why is mean square error a bad measure of model performance? What would you suggest instead?
Mean Squared Error (MSE) gives a relatively high weight to large errors. Therefore, MSE tends to put too much emphasis on large deviations. A more robust alternative is MAE (mean absolute error).
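A small sketch with made-up residuals shows how a single large error affects the two metrics differently.

```python
def mse(errors):
    """Mean squared error of a list of residuals."""
    return sum(e * e for e in errors) / len(errors)

def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)

# Hypothetical residuals; the second list adds one large deviation.
typical = [1, -1, 2, -2]
with_outlier = typical + [10]

print(mse(typical), mae(typical))            # 2.5 1.5
print(mse(with_outlier), mae(with_outlier))  # 22.0 3.2
```

One extra residual of 10 multiplies the MSE by almost nine, while the MAE roughly doubles.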
20. What are the assumptions required for linear regression? What if some of these assumptions are violated?
The assumptions are as follows:
- 1. The sample data used to fit the model is representative of the population
- 2. The relationship between X and the mean of Y is linear
- 3. The variance of the residual is the same for any value of X (homoscedasticity)
- 4. Observations are independent of each other
- 5. For any value of X, Y is normally distributed.
Extreme violations of these assumptions will make the results unreliable. Small violations of these assumptions will result in a greater bias or variance of the estimate.
21. What is collinearity and what to do with it? How to remove multicollinearity?
Multicollinearity exists when an independent variable is highly correlated with another independent variable in a multiple regression equation. This can be problematic because it undermines the statistical significance of an independent variable.
You could use the Variance Inflation Factor (VIF) to determine if there is any multicollinearity between independent variables. A standard benchmark is that if the VIF is greater than 5 then multicollinearity exists.
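For reference, the VIF of the j-th independent variable is defined from the R² obtained when that variable is regressed on all the other independent variables (written R_j^2 here):

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

Under this definition, the benchmark of 5 corresponds to R_j^2 = 0.8, meaning 80% of the variance of that variable is explained by the other predictors.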
22. How to check if the regression model fits the data well?
There are a couple of metrics that you can use:
R-squared/Adjusted R-squared: Relative measure of fit. This was explained in a previous answer.
F-statistic: Evaluates the null hypothesis that all regression coefficients are equal to zero vs the alternative hypothesis that at least one doesn’t equal zero.
RMSE: Absolute measure of fit.
23. What is a decision tree?
Decision trees are a popular model, used in operations research, strategic planning, and machine learning. Each decision point in the tree is called a node, and the more nodes you have, the more closely your decision tree will fit the training data (generally). The last nodes of the decision tree, where a decision is made, are called the leaves of the tree. Decision trees are intuitive and easy to build but fall short when it comes to accuracy.
24. What is a random forest? Why is it good?
Random forests are an ensemble learning technique that builds off of decision trees. Random forests involve creating multiple decision trees using bootstrapped datasets of the original data and randomly selecting a subset of variables at each step of the decision tree. The model then selects the mode of all of the predictions of each decision tree. By relying on a “majority wins” model, it reduces the risk of error from an individual tree.
For example, suppose we built four decision trees and the third one predicted 0 while the other three predicted 1. Relying on the third tree alone would give 0, but the mode of all 4 decision trees is 1. This is the power of random forests.
Random forests offer several other benefits: they perform well, can model non-linear boundaries, often need no separate cross-validation (the out-of-bag error serves as an estimate), and provide feature importance.
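The "majority wins" step can be sketched in a couple of lines. The four predictions are hypothetical and simply mirror the example, with the third tree predicting 0.

```python
from collections import Counter

# Hypothetical class predictions from four trees for one observation.
tree_predictions = [1, 1, 0, 1]

# The forest returns the most common prediction (the mode).
forest_prediction = Counter(tree_predictions).most_common(1)[0][0]
```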
25. What is a kernel? Explain the kernel trick
A kernel is a way of computing the dot product of two vectors x and y in some (possibly very high dimensional) feature space, which is why kernel functions are sometimes called “generalized dot product”.
The kernel trick is a method of using a linear classifier to solve a nonlinear problem by transforming linearly inseparable data to linearly separable ones in a higher dimension.
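To make the idea concrete, the sketch below uses the degree-2 polynomial kernel k(x, y) = (x · y)² in two dimensions, chosen here purely as an illustration. It checks that the kernel gives the same number as an explicit dot product in a 3-dimensional feature space, without ever building that space.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(v):
    """Explicit feature map matching the degree-2 polynomial kernel in 2D."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(u, v):
    """Same dot product, computed directly from the original 2D vectors."""
    return dot(u, v) ** 2

x, y = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(y))   # dot product in the mapped space
implicit = poly_kernel(x, y)     # kernel evaluated in the original space
```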
26. Is it beneficial to perform dimensionality reduction before fitting an SVM? Why or why not?
When the number of features is greater than the number of observations, then performing dimensionality reduction will generally improve the SVM.
27. What is overfitting?
Overfitting is an error where the model ‘fits’ the data too well, resulting in a model with high variance and low bias. As a consequence, an overfit model will inaccurately predict new data points even though it has a high accuracy on the training data.
28. What is boosting?
Boosting is an ensemble method to improve a model by reducing its bias and variance, ultimately converting weak learners to strong learners. The general idea is to train a weak learner and sequentially iterate and improve the model by learning from the previous learner.
29. The probability that an item is available at location A is 0.6, and 0.8 at location B. What is the probability that the item would be found on the Amazon website?
We need to make some assumptions about this question before we can answer it. Let’s assume that there are two possible places to purchase a particular item on Amazon and the probability of finding it at location A is 0.6 and at location B is 0.8. The probability of finding the item on Amazon can be worked out as follows:
We can reword the above as P(A) = 0.6 and P(B) = 0.8. Furthermore, let’s assume that these are independent events, meaning that the probability of one event is not impacted by the other. We can then use the formula…
P(A or B) = P(A) + P(B) - P(A and B)
P(A or B) = 0.6 + 0.8 - (0.6*0.8)
P(A or B) = 0.92
30. You randomly draw a coin from 100 coins (1 unfair coin that is head-head and 99 fair coins that are head-tail) and flip it 10 times. If the result is 10 heads, what is the probability that the coin is unfair?
This can be answered using Bayes’ Theorem. The extended equation for Bayes’ Theorem is the following:
P(A|B) = P(B|A) * P(A) / [P(B|A) * P(A) + P(B|¬A) * P(¬A)]
Assume that the event of picking the unfair coin is denoted as A and the event of flipping 10 heads in a row is denoted as B. Then P(A) is equal to 0.01, P(B|A) is equal to 1, P(B|¬A) is equal to 0.5^10, and P(¬A) is equal to 0.99.
If you fill in the equation, then P(A|B) = 0.9118 or 91.18%.
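The arithmetic can be verified with exact fractions. The variable names are chosen here for readability.

```python
from fractions import Fraction

p_unfair = Fraction(1, 100)                 # P(A): picked the double-headed coin
p_fair = 1 - p_unfair                       # P(not A)
p_heads_given_unfair = Fraction(1)          # P(B|A): ten heads is certain
p_heads_given_fair = Fraction(1, 2) ** 10   # P(B|not A)

posterior = (p_heads_given_unfair * p_unfair) / (
    p_heads_given_unfair * p_unfair + p_heads_given_fair * p_fair
)
print(posterior, float(posterior))  # 1024/1123, about 0.9118
```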
31. Difference between convex and non-convex cost function; what does it mean when a cost function is non-convex?
A convex function is one where a line drawn between any two points on the graph lies on or above the graph. It has one minimum.
A non-convex function is one where a line drawn between any two points on the graph may intersect other points on the graph. It is characterized as “wavy”.
When a cost function is non-convex, it means that there’s a likelihood that the optimizer may find a local minimum instead of the global minimum, which is typically undesired in machine learning models from an optimization perspective.
32. Walk through the probability fundamentals
Eight rules of probability
Rule #1: For any event A, 0 ≤ P(A) ≤ 1; in other words, the probability of an event can range from 0 to 1.
Rule #2: The sum of the probabilities of all possible outcomes always equals 1.
Rule #3: P(not A) = 1 - P(A); this rule explains the relationship between the probability of an event and its complement event. A complement event is one that includes all possible outcomes that aren’t in A.
Rule #4: If A and B are disjoint events (mutually exclusive), then P(A or B) = P(A) + P(B); this is called the addition rule for disjoint events.
Rule #5: P(A or B) = P(A) + P(B) - P(A and B); this is called the general addition rule.
Rule #6: If A and B are two independent events, then P(A and B) = P(A) * P(B); this is called the multiplication rule for independent events.
Rule #7: The conditional probability of event B given event A is P(B|A) = P(A and B) / P(A)
Rule #8: For any two events A and B, P(A and B) = P(A) * P(B|A); this is called the general multiplication rule
Factorial Formula: n! = n x (n - 1) x (n - 2) x … x 2 x 1
Use when the number of items is equal to the number of places available.
Eg. Find the total number of ways 5 people can sit in 5 empty seats.
= 5 x 4 x 3 x 2 x 1 = 120
Fundamental Counting Principle (multiplication)
This method should be used when repetitions are allowed and the number of ways to fill an open place is not affected by previous fills.
Eg. There are 3 types of breakfasts, 4 types of lunches, and 5 types of desserts. The total number of combinations is = 5 x 4 x 3 = 60
Permutations: P(n,r)= n! / (n−r)!
This method is used when replacements are not allowed and order of item ranking matters.
Eg. A code has 4 digits in a particular order and the digits range from 0 to 9. How many permutations are there if one digit can only be used once?
P(n,r) = 10!/(10 - 4)! = (10x9x8x7x6x5x4x3x2x1)/(6x5x4x3x2x1) = 5040
Combinations Formula: C(n,r)=(n!)/[(n−r)!r!]
This is used when replacements are not allowed and the order in which items are ranked does not matter.
Eg. To win the lottery, you must select the 5 correct numbers in any order from 1 to 52. What is the number of possible combinations?
C(n,r) = 52! / [(52 - 5)! 5!] = 2,598,960
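The four counting examples above can be reproduced with Python’s math module (math.perm and math.comb need Python 3.8 or later).

```python
import math

seatings = math.factorial(5)   # 5 people in 5 seats
meals = 3 * 4 * 5              # breakfasts x lunches x desserts
codes = math.perm(10, 4)       # 4-digit codes, no digit repeated, order matters
lottery = math.comb(52, 5)     # 5 numbers out of 52, order ignored

print(seatings, meals, codes, lottery)  # 120 60 5040 2598960
```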
33. Describe Markov chains?
“A Markov chain is a mathematical system that experiences transitions from one state to another according to certain probabilistic rules. The defining characteristic of a Markov chain is that no matter how the process arrived at its present state, the possible future states are fixed. In other words, the probability of transitioning to any particular state is dependent solely on the current state and time elapsed.”
The actual math behind Markov chains requires knowledge of linear algebra and matrices.
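As a minimal sketch of the idea, consider a two-state weather chain. The states and transition probabilities below are invented for illustration; the point is that the next distribution depends only on the current one.

```python
# Hypothetical transition probabilities: transition[current][next].
transition = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

def step(distribution):
    """Advance the state distribution by one time step."""
    nxt = {state: 0.0 for state in transition}
    for state, prob in distribution.items():
        for target, p in transition[state].items():
            nxt[target] += prob * p
    return nxt

# Start from a sunny day and iterate; the distribution settles down.
dist = {"sunny": 1.0, "rainy": 0.0}
for _ in range(50):
    dist = step(dist)
```

For these particular numbers the long-run share of sunny days approaches 5/6, regardless of the starting state.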
34. A box has 12 red cards and 12 black cards. Another box has 24 red cards and 24 black cards. You want to draw two cards at random from one of the two boxes, one card at a time. Which box has a higher probability of getting cards of the same color and why?
The box with 24 red cards and 24 black cards has a higher probability of getting two cards of the same color. Let’s walk through each step.
Let’s say the first card you draw from each box is red.
This means that in the box with 12 reds and 12 blacks, there are now 11 reds and 12 blacks. Therefore your probability of drawing another red is 11/(11+12) or 11/23.
In the box with 24 reds and 24 blacks, there would then be 23 reds and 24 blacks. Therefore your probability of drawing another red is 23/(23+24) or 23/47.
By symmetry the same holds if the first card is black. Since 23/47 > 11/23, the second box with more cards has a higher probability of producing two cards of the same color.
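The comparison can be confirmed with exact fractions. The helper name is made up for this sketch.

```python
from fractions import Fraction

def p_same_color(per_color):
    """Probability that the second card matches the first card's color,
    for a box with `per_color` red and `per_color` black cards."""
    remaining_same = per_color - 1
    remaining_total = 2 * per_color - 1
    return Fraction(remaining_same, remaining_total)

small_box = p_same_color(12)   # 11/23
large_box = p_same_color(24)   # 23/47
```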
35. You are at a Casino and have two dice to play with. You win $10 every time you roll a 5. If you play till you win and then stop, what is the expected payout?
- Let’s assume that it costs $5 every time you want to play.
- There are 36 possible combinations with two dice.
- Of the 36 combinations, there are 4 combinations that result in rolling a five: (1,4), (2,3), (3,2) and (4,1). This means that there is a 4/36 or 1/9 chance of rolling a 5.
- A 1/9 chance of winning means that, on average, you’ll play nine times: eight losses and one win.
- Therefore, your expected payout is equal to $10.00 * 1 - $5.00 * 9 = -$35.00.
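The steps above can be sketched as a short calculation. It keeps the same assumption of a $5 cost per play and reads "rolling a 5" as the two dice summing to 5.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes of two dice.
rolls = list(product(range(1, 7), repeat=2))
p_win = Fraction(sum(a + b == 5 for a, b in rolls), len(rolls))

# Playing until the first win takes 1/p plays on average (geometric distribution).
expected_plays = 1 / p_win

cost_per_play = 5   # assumed, as stated above
prize = 10
expected_payout = prize - cost_per_play * expected_plays
```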
36. How can you tell if a given coin is biased?
This isn’t a trick question. The answer is simply to perform a hypothesis test:
- 1. The null hypothesis is that the coin is not biased and the probability of flipping heads should equal 50% (p=0.5). The alternative hypothesis is that the coin is biased and p != 0.5.
- 2. Flip the coin 500 times.
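- 3. Calculate the Z-score (if the sample size is less than 30, you would calculate the t-statistic instead) and the corresponding two-sided p-value.
- 4. Compare the p-value against alpha (for example 0.05; because the test is two-tailed, this puts 0.05/2 = 0.025 in each tail).
- 5. If p-value > alpha, the null is not rejected and there is no evidence that the coin is biased.
If p-value < alpha, the null is rejected and the coin is considered biased.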
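A minimal sketch of this test follows. The count of 280 heads is invented for the example, and the code compares a two-sided p-value against alpha = 0.05, which is equivalent to comparing each tail against 0.025.

```python
import math

def two_sided_p_value(heads, flips, p0=0.5):
    """Z-test for a proportion using the normal approximation."""
    expected = flips * p0
    std = math.sqrt(flips * p0 * (1 - p0))
    z = (heads - expected) / std
    # Standard normal CDF evaluated at |z| via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)

z, p_value = two_sided_p_value(heads=280, flips=500)  # hypothetical result
alpha = 0.05
reject_null = p_value < alpha
```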
37. Make an unfair coin fair
Since a coin flip is a binary outcome, you can make an unfair coin fair by flipping it twice. If you flip it twice, there are two outcomes that you can bet on: heads followed by tails or tails followed by heads.
P(heads) * P(tails) = P(tails) * P(heads)
This holds because each coin toss is an independent event, so both orderings are equally likely. If you get heads → heads or tails → tails, you discard the result and flip the pair again.
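A quick simulation illustrates the procedure. The 70% bias and the fixed seed are arbitrary choices for this sketch.

```python
import random

def biased_flip(rng, p_heads=0.7):
    """One flip of an unfair coin (bias chosen arbitrarily)."""
    return "H" if rng.random() < p_heads else "T"

def fair_flip(rng, p_heads=0.7):
    """Flip twice; keep the first result only when the two flips differ."""
    while True:
        first, second = biased_flip(rng, p_heads), biased_flip(rng, p_heads)
        if first != second:
            return first

rng = random.Random(0)  # fixed seed so the run is repeatable
results = [fair_flip(rng) for _ in range(20000)]
share_heads = results.count("H") / len(results)
```

The share of heads should land close to 0.5 even though the underlying coin is heavily biased.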
38. You are about to get on a plane to London, you want to know whether you have to bring an umbrella or not. You call three of your random friends and ask each one of them if it’s raining. The probability that your friend is telling the truth is 2/3 and the probability that they are playing a prank on you by lying is 1/3. If all 3 of them tell that it is raining, then what is the probability that it is actually raining in London.
You can tell that this question is related to Bayesian theory because of the last statement which essentially follows the structure, “What is the probability A is true given B is true?” Therefore we need to know the probability of it raining in London on a given day. Let’s assume it’s 25%.
P(A) = probability of it raining = 25%
P(B) = probability that all 3 friends say that it’s raining
P(A|B) = probability that it’s raining given that all 3 friends say it is raining
P(B|A) = probability that all 3 friends say that it’s raining given it’s raining = (2/3)³ = 8/27
Step 1: Solve for P(B)
In Bayes’ rule, P(A|B) = P(B|A) * P(A) / P(B), the denominator can be expanded as
P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
P(B) = (2/3)³ * 0.25 + (1/3)³ * 0.75 = 0.25*8/27 + 0.75*1/27
Step 2: Solve for P(A|B)
P(A|B) = 0.25 * (8/27) / ( 0.25*8/27 + 0.75*1/27)
P(A|B) = 8 / (8 + 3) = 8/11
Therefore, if all three friends say that it’s raining, then there’s an 8/11 chance that it’s actually raining.
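The same calculation with exact fractions, keeping the assumed 25% prior chance of rain:

```python
from fractions import Fraction

p_rain = Fraction(1, 4)                       # assumed prior, as above
p_all_say_rain_if_rain = Fraction(2, 3) ** 3  # all three tell the truth
p_all_say_rain_if_dry = Fraction(1, 3) ** 3   # all three lie

p_all_say_rain = (p_all_say_rain_if_rain * p_rain
                  + p_all_say_rain_if_dry * (1 - p_rain))
posterior = p_all_say_rain_if_rain * p_rain / p_all_say_rain
print(posterior)  # 8/11
```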
39. You are given 40 cards with four different colors- 10 Green cards, 10 Red Cards, 10 Blue cards, and 10 Yellow cards. The cards of each color are numbered from one to ten. Two cards are picked at random. Find out the probability that the cards picked are not of the same number and same color.
Fix the first card drawn; say it is the yellow 4. Since the events are not independent, we can use the rule:
P(A and B) = P(A) * P(B|A), which also gives
P(not A and not B) = P(not A) * P(not B | not A)
P(not 4 and not yellow) = P(not 4) * P(not yellow | not 4)
P(not 4 and not yellow) = (36/39) * (27/36)
P(not 4 and not yellow) = 0.692
Therefore, the probability that the second card has neither the same number nor the same color as the first is 69.2%.
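The result can be double-checked by brute force, enumerating every ordered pair of distinct cards.

```python
from fractions import Fraction
from itertools import permutations

colors = ("green", "red", "blue", "yellow")
deck = [(color, number) for color in colors for number in range(1, 11)]

# Every ordered way to draw two different cards.
draws = list(permutations(deck, 2))
favorable = [(a, b) for a, b in draws if a[0] != b[0] and a[1] != b[1]]

probability = Fraction(len(favorable), len(draws))
print(probability, round(float(probability), 3))  # 9/13 0.692
```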
40. How do you assess the statistical significance of an insight?
You would perform hypothesis testing to determine statistical significance. First, you would state the null hypothesis and alternative hypothesis. Second, you would calculate the p-value, the probability of obtaining the observed results of a test assuming that the null hypothesis is true. Last, you would set the level of the significance (alpha) and if the p-value is less than the alpha, you would reject the null — in other words, the result is statistically significant.
41. Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?
A long-tailed distribution is a type of heavy-tailed distribution that has a tail (or tails) that drop off gradually and asymptotically.
3 practical examples include the power law, the Pareto principle (more commonly known as the 80–20 rule), and product sales (i.e. best selling products vs others).
It’s important to be mindful of long-tailed distributions in classification and regression problems because the least frequently occurring values make up the majority of the population. This can ultimately change the way that you deal with outliers, and it also conflicts with some machine learning techniques with the assumption that the data is normally distributed.
42. What is the Central Limit Theorem? Explain it. Why is it important?
Statistics How To provides the best definition of CLT, which is:
“The central limit theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger no matter what the shape of the population distribution.” 
The central limit theorem is important because it is used in hypothesis testing and also to calculate confidence intervals.
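A small simulation gives a feel for the theorem. It draws samples from a skewed exponential distribution (chosen arbitrarily, with true mean 1 and standard deviation 1) and looks at how the sample means behave.

```python
import random
from statistics import mean, pstdev

rng = random.Random(42)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from an exponential distribution with mean 1."""
    return sum(rng.expovariate(1.0) for _ in range(n)) / n

# 2000 sample means, each based on 50 observations.
sample_means = [sample_mean(50) for _ in range(2000)]
center = mean(sample_means)
spread = pstdev(sample_means)
```

The sample means cluster around 1 with a spread near 1/sqrt(50), which is what the normal approximation predicts.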
43. What is the statistical power?
‘Statistical power’ refers to the power of a binary hypothesis test, which is the probability that the test rejects the null hypothesis given that the alternative hypothesis is true.
44. Explain selection bias (with regard to a dataset, not variable selection). Why is it important? How can data management procedures such as missing data handling make it worse?
Selection bias is the phenomenon of selecting individuals, groups or data for analysis in such a way that proper randomization is not achieved, ultimately resulting in a sample that is not representative of the population.
Understanding and identifying selection bias is important because it can significantly skew results and provide false insights about a particular population group.
Types of selection bias include:
- sampling bias: a biased sample caused by non-random sampling
- time interval: selecting a specific time frame that supports the desired conclusion. e.g. conducting a sales analysis near Christmas.
- exposure: includes clinical susceptibility bias, protopathic bias, indication bias.
- data: includes cherry-picking, suppressing evidence, and the fallacy of incomplete evidence.
- attrition: attrition bias is similar to survivorship bias, where only those that ‘survived’ a long process are included in an analysis, or failure bias, where only those that ‘failed’ are included.
- observer selection: related to the Anthropic principle, which is a philosophical consideration that any data we collect about the universe is filtered by the fact that, in order for it to be observable, it must be compatible with the conscious and sapient life that observes it. 
Handling missing data can make selection bias worse because different methods impact the data in different ways. For example, if you replace null values with the mean of the data, you are adding bias in the sense that you’re assuming that the data is not as spread out as it might actually be.
45. Provide a simple example of how an experimental design can help answer a question about behavior. How does experimental data contrast with observational data?
Observational data comes from observational studies which are when you observe certain variables and try to determine if there is any correlation.
Experimental data comes from experimental studies which are when you control certain variables and hold them constant to determine if there is any causality.
An example of experimental design is the following: split a group up into two. The control group lives their lives normally. The test group is told to drink a glass of wine every night for 30 days. Then research can be conducted to see how wine affects sleep.
46. Is mean imputation of missing data acceptable practice? Why or why not?
Mean imputation is the practice of replacing null values in a data set with the mean of the data.
Mean imputation is generally bad practice because it doesn’t take into account feature correlation. For example, imagine we have a table showing age and fitness score and imagine that an eighty-year-old has a missing fitness score. If we took the average fitness score from an age range of 15 to 80, then the eighty-year-old will appear to have a much higher fitness score than he actually should.
Second, mean imputation reduces the variance of the data and increases bias in our data. This leads to a less accurate model and a narrower confidence interval due to a smaller variance.
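The variance reduction is easy to see on a toy example. The observed values and the number of missing entries are invented for this sketch.

```python
from statistics import mean, pvariance

observed = [2, 4, 6, 8, 10]   # hypothetical known values
n_missing = 3                 # hypothetical number of nulls

# Mean imputation: every missing value becomes the observed mean.
imputed = observed + [mean(observed)] * n_missing

print(pvariance(observed), pvariance(imputed))  # 8 5
```

The mean stays at 6, but the variance drops from 8 to 5 because the filled-in values add no spread.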
47. What is an outlier? Explain how you might screen for outliers and what would you do if you found them in your dataset. Also, explain what an inlier is and how you might screen for them and what would you do if you found them in your dataset.
An outlier is a data point that differs significantly from other observations.
Depending on the cause of the outlier, they can be bad from a machine learning perspective because they can worsen the accuracy of a model. If the outlier is caused by a measurement error, it’s important to remove them from the dataset. There are a couple of ways to identify outliers:
Z-score/standard deviations: if the data is normally distributed, about 99.7% of it lies within three standard deviations of the mean, so we can calculate the size of one standard deviation, multiply it by 3, and flag the data points that fall outside this range. Equivalently, we can calculate the z-score of a given point, and if its absolute value is greater than 3, the point is treated as an outlier.
Note that there are a few contingencies to consider when using this method: the data must be normally distributed, the method is not suitable for small data sets, and the presence of too many outliers can throw off the z-scores.
Interquartile Range (IQR): IQR, the concept used to build boxplots, can also be used to identify outliers. The IQR is equal to the difference between the 3rd quartile and the 1st quartile. A point is an outlier if it is less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR. For normally distributed data, this comes to approximately 2.698 standard deviations from the mean.
Other methods include DBScan clustering, Isolation Forests, and Robust Random Cut Forests.
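As a rough illustration of the z-score and IQR screens described above, here is a minimal sketch using only Python's standard library. The function names, the sample data, and the choice of the population standard deviation are assumptions made for this example, not part of the original answer.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation (an assumption)
    return [v for v in values if abs(v - mean) / stdev > threshold]

def iqr_outliers(values, k=1.5):
    """Return the values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical data with one obviously extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]
print(zscore_outliers(data))
print(iqr_outliers(data))
```

On a sample this small the z-score of the extreme point only just clears 3, which illustrates why the z-score method is not recommended for small data sets.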
An inlier is a data observation that lies within the rest of the dataset and is unusual or an error. Since it lies in the dataset, it is typically harder to identify than an outlier and requires external data to identify them. Should you identify any inliers, you can simply remove them from the dataset to address them.
48. How do you handle missing data? What imputation techniques do you recommend?
There are several ways to handle missing data:
- Delete rows with missing data
- Mean/Median/Mode imputation
- Assigning a unique value
- Predicting the missing values
- Using an algorithm which supports missing values, like random forests
The best method is to delete rows with missing data as it ensures that no bias or variance is added or removed, and ultimately results in a robust and accurate model. However, this is only recommended if there’s a lot of data to start with and the percentage of missing values is low.
49. You have data on the duration of calls to a call center. Generate a plan for how you would code and analyze these data. Explain a plausible scenario for what the distribution of these durations might look like. How could you test, even graphically, whether your expectations are borne out?
First, I would conduct exploratory data analysis (EDA) to clean, explore, and understand my data. As part of my EDA, I could plot a histogram of the call durations to see the underlying distribution.
My guess is that the duration of calls would follow a lognormal distribution. I believe it’s positively skewed because the lower end is bounded at 0, since a call can’t last a negative number of seconds, while on the upper end there is likely to be a small proportion of calls that are relatively very long.
You could use a QQ plot to confirm whether the duration of calls follows a lognormal distribution or not.
50. Explain likely differences between administrative datasets and datasets gathered from experimental studies. What are likely problems encountered with administrative data? How do experimental methods help alleviate these problems? What problem do they bring?
Administrative datasets are typically datasets used by governments or other organizations for non-statistical reasons.
Administrative datasets are usually larger and more cost-efficient than experimental studies. They are also regularly updated, assuming that the organization associated with the administrative dataset is active and functioning. At the same time, administrative datasets may not capture all of the data that one may want and may not be in the desired format either. They are also prone to quality issues and missing entries.
51. You are compiling a report for user content uploaded every month and notice a spike in uploads in October. In particular, a spike in picture uploads. What might you think is the cause of this, and how would you test it?
There are a number of potential reasons for a spike in photo uploads:
- 1. A new feature may have been implemented in October which involves uploading photos and gained a lot of traction by users. For example, a feature that gives the ability to create photo albums.
- 2. Similarly, it’s possible that the process of uploading photos before was not intuitive and was improved in the month of October.
- 3. There may have been a viral social media movement that involved uploading photos that lasted for all of October. Eg. Movember but something more scalable.
- 4. It’s possible that the spike is due to people posting pictures of themselves in costumes for Halloween.
The method of testing depends on the cause of the spike, but you would conduct hypothesis testing to determine if the inferred cause is the actual cause.
52. Give examples of data that does not have a Gaussian distribution, nor log-normal.
- Any type of categorical data won’t have a Gaussian or lognormal distribution.
- Exponential distributions — eg. the amount of time that a car battery lasts or the amount of time until an earthquake occurs.
53. What is root cause analysis? How to identify a cause vs. a correlation? Give examples
Root cause analysis: a method of problem-solving used for identifying the root cause(s) of a problem.
Correlation measures the relationship between two variables, ranging from -1 to 1. Causation is when a first event appears to have caused a second event. Causation essentially looks at direct relationships while correlation can look at both direct and indirect relationships.
Example: a higher crime rate is associated with higher sales in ice cream in Canada, aka they are positively correlated. However, this doesn’t mean that one causes another. Instead, it’s because both occur more when it’s warmer outside.
You can test for causation using hypothesis testing or A/B testing.
54. Give an example where the median is a better measure than the mean
When there are a number of outliers that positively or negatively skew the data.
55. Given two fair dice, what is the probability of getting scores that sum to 4? to 8?
There are 3 combinations that sum to 4 (1+3, 3+1, 2+2):
P(rolling a 4) = 3/36 = 1/12
There are 5 combinations that sum to 8 (2+6, 6+2, 3+5, 5+3, 4+4):
P(rolling an 8) = 5/36
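The counts above can be checked by enumerating all 36 equally likely outcomes of two fair dice. This is a small sketch; the helper name prob_sum is made up for the example.

```python
from fractions import Fraction
from itertools import product

def prob_sum(target):
    """Probability that two fair six-sided dice sum to `target`."""
    outcomes = list(product(range(1, 7), repeat=2))  # all 36 ordered rolls
    hits = sum(1 for a, b in outcomes if a + b == target)
    return Fraction(hits, len(outcomes))

print(prob_sum(4))  # 1/12
print(prob_sum(8))  # 5/36
```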
56. What is the Law of Large Numbers?
The Law of Large Numbers is a theorem that states that as the number of trials increases, the average of the results will become closer to the expected value.
E.g., the proportion of heads from flipping a fair coin 100,000 times should be closer to 0.5 than the proportion from flipping it 100 times.
57. How do you calculate the needed sample size?
You can use the margin of error (ME) formula to determine the desired sample size: ME = t * S / sqrt(n), which rearranges to n = (t * S / ME)^2, where:
- t = t-score or z-score used to calculate the confidence interval
- ME = the desired margin of error
- S = sample standard deviation
- n = the required sample size
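A minimal sketch of the sample-size calculation, assuming the standard margin-of-error relationship ME = z * S / sqrt(n) for a mean. The function name and the example numbers (z = 1.96, S = 15, ME = 2) are hypothetical.

```python
import math

def required_sample_size(z, s, me):
    """Smallest n such that z * s / sqrt(n) <= me."""
    return math.ceil((z * s / me) ** 2)

# Hypothetical inputs: 95% confidence, standard deviation 15, margin of error 2.
print(required_sample_size(1.96, 15, 2))  # 217
```

The result is rounded up because a fractional observation cannot be sampled and rounding down would exceed the desired margin of error.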
58. When you sample, what bias are you inflicting?
Potential biases include the following:
- Sampling bias: a biased sample caused by non-random sampling
- Undercoverage bias: some members of the population are inadequately represented in the sample
- Survivorship bias: error of overlooking observations that did not make it past a form of selection process.
59. How do you control for biases?
There are many things that you can do to control and minimize bias. Two common things include randomization, where participants are assigned by chance, and random sampling, sampling in which each member has an equal probability of being chosen.
60. What are confounding variables?
A confounding variable, or a confounder, is a variable that influences both the dependent variable and the independent variable, causing a spurious association, a mathematical relationship in which two or more variables are associated but not causally related.
61. What is A/B testing?
A/B testing is a form of two-sample hypothesis testing used to compare two versions, the control and the variant, of a single variable. It is commonly used to improve and optimize user experience and marketing.
62. How do you prove that males are on average taller than females by knowing just gender height?
You can use hypothesis testing to prove that males are taller on average than females.
The null hypothesis would state that males and females are the same height on average, while the alternative hypothesis would state that the average height of males is greater than the average height of females.
Then you would collect a random sample of heights of males and females and use a t-test to determine if you reject the null or not.
63. Infection rates at a hospital above a 1 infection per 100 person-days at risk are considered high. A hospital had 10 infections over the last 1787 person-days at risk. Give the p-value of the correct one-sided test of whether the hospital is below the standard.
Since we are looking at the number of events (# of infections) occurring within a given timeframe, this is a Poisson distribution question.
The probability of observing k events in an interval: P(k) = (lambda^k * e^(-lambda)) / k!
Null (H0): 1 infection per 100 person-days
Alternative (H1): <1 infection per 100 person-days
k (actual) = 10 infections
lambda (theoretical) = (1/100)*1787 = 17.87
p = P(X <= 10) = 0.032372 or 3.2372%, calculated using POISSON.DIST in Excel or ppois in R
Since p-value < alpha (assuming 5% level of significance), we reject the null and conclude that the hospital is below the standard.
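The same p-value can be reproduced without Excel or R by summing the Poisson probabilities for 0 through 10 events. The sketch below uses only the standard library; the helper name poisson_cdf is an assumption of this example.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for a Poisson random variable with mean `lam`."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))

lam = (1 / 100) * 1787          # expected infections under the null
p_value = poisson_cdf(10, lam)  # probability of 10 or fewer infections
print(round(p_value, 6))
```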
64. You flip a biased coin (p(head)=0.8) five times. What’s the probability of getting three or more heads?
Use the general binomial probability formula to answer this question:
P(k heads) = C(n, k) * p^k * (1-p)^(n-k)
p = 0.8
n = 5
k = 3,4,5
P(3 or more heads) = P(3 heads) + P(4 heads) + P(5 heads) = 0.94 or 94%
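The sum can be verified with a short script built on the binomial probability mass function. This is a sketch; binom_pmf is a helper defined here, not a library call.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

prob = sum(binom_pmf(k, 5, 0.8) for k in (3, 4, 5))
print(round(prob, 5))  # 0.94208
```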
65. A random variable X is normal with mean 1020 and a standard deviation 50. Calculate P(X>1200)
p = 1 - norm.dist(1200, 1020, 50, true)
p = 0.000159
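For readers working outside a spreadsheet, the same upper-tail probability can be sketched with statistics.NormalDist from Python's standard library. The function name upper_tail is made up for the example.

```python
from statistics import NormalDist

def upper_tail(x, mu, sigma):
    """P(X > x) for a normal random variable with mean mu and std dev sigma."""
    return 1 - NormalDist(mu=mu, sigma=sigma).cdf(x)

p = upper_tail(1200, 1020, 50)  # 1200 is 3.6 standard deviations above the mean
print(p)
```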
66. Consider the number of people that show up at a bus station is Poisson with mean 2.5/h. What is the probability that at most three people show up in a four hour period?
x = 3
mean = 2.5*4 = 10
p = poisson.dist(3,10,true)
p = 0.010336
67. An HIV test has a sensitivity of 99.7% and a specificity of 98.5%. A subject from a population of prevalence 0.1% receives a positive test result. What is the precision of the test (i.e the probability he is HIV positive)?
Equation for Precision (PV): PV = (prevalence * sensitivity) / [(prevalence * sensitivity) + ((1 - prevalence) * (1 - specificity))]
Precision = Positive Predictive Value = PV
PV = (0.001*0.997)/[(0.001*0.997)+((1 - 0.001)*(1 - 0.985))]
PV = 0.0624 or 6.24%
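The calculation is an application of Bayes' theorem, and a small sketch makes the two competing terms explicit. The function and variable names are chosen for this example only.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prevalence                # diseased and tests positive
    false_pos = (1 - specificity) * (1 - prevalence)   # healthy but tests positive
    return true_pos / (true_pos + false_pos)

pv = positive_predictive_value(0.997, 0.985, 0.001)
print(round(pv, 4))  # 0.0624
```

The low result comes from the tiny prevalence: false positives among the large healthy population outnumber true positives.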
68. You are running for office and your pollster polled a hundred people. Sixty of them claimed they would vote for you. Can you relax?
- Assume that there’s only you and one other opponent.
- Also, assume that we want a 95% confidence interval. This gives us a z-score of 1.96.
Confidence interval formula: p-hat +/- z* * sqrt(p-hat*(1 - p-hat)/n)
p-hat = 60/100 = 0.6
z* = 1.96
n = 100
This gives us a confidence interval of [50.4%, 69.6%]. The lower bound is only just above 50%, so at a 95% confidence level you can relax, but only barely.
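A minimal sketch of the normal-approximation interval for a proportion, using the inputs from this question. The function name proportion_ci is an assumption of this example.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

low, high = proportion_ci(60, 100)
print(round(low, 3), round(high, 3))  # 0.504 0.696
```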
69. Geiger counter records 100 radioactive decays in 5 minutes. Find an approximate 95% interval for the number of decays per hour.
- Since this is a Poisson distribution question, mean = lambda = variance, which also means that standard deviation = square root of the mean
- a 95% confidence interval implies a z score of 1.96
- one standard deviation = sqrt(100) = 10
Therefore the confidence interval for the 5-minute count = 100 +/- 19.6 = [80.4, 119.6]. Multiplying by 12 to convert to decays per hour gives [964.8, 1435.2].
70. The homicide rate in Scotland fell last year to 99 from 115 the year before. Is this reported change really noteworthy?
- Since this is a Poisson distribution question, mean = lambda = variance, which also means that standard deviation = square root of the mean
- a 95% confidence interval implies a z score of 1.96
- one standard deviation = sqrt(115) = 10.724
Therefore the confidence interval = 115+/- 21.45 = [93.55, 136.45]. Since 99 is within this confidence interval, we can assume that this change is not very noteworthy.
71. Consider influenza epidemics for two-parent heterosexual families. Suppose that the probability is 17% that at least one of the parents has contracted the disease. The probability that the father has contracted influenza is 12% while the probability that both the mother and father have contracted the disease is 6%. What is the probability that the mother has contracted influenza?
Using the General Addition Rule in probability:
P(mother or father) = P(mother) + P(father) - P(mother and father)
P(mother) = P(mother or father) + P(mother and father) - P(father)
P(mother) = 0.17 + 0.06 - 0.12
P(mother) = 0.11
72. Suppose that diastolic blood pressures (DBPs) for men aged 35–44 are normally distributed with a mean of 80 (mm Hg) and a standard deviation of 10. What is the probability that a random 35–44 year old has a DBP less than 70?
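Since 70 is one standard deviation below the mean, take the area of the Gaussian distribution to the left of one standard deviation below the mean.
P(DBP < 70) = 2.3% (more than two standard deviations below the mean) + 13.6% (between two and one standard deviations below the mean) = 15.9%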
73. In a population of interest, a sample of 9 men yielded a sample average brain volume of 1,100cc and a standard deviation of 30cc. What is a 95% Student’s T confidence interval for the mean brain volume in this new population?
Confidence interval for a sample mean: mean +/- t-score * (standard deviation / sqrt(sample size))
Given a confidence level of 95% and degrees of freedom equal to 8, the t-score = 2.306
Confidence interval = 1100 +/- 2.306*(30/3)
Confidence interval = [1076.94, 1123.06]
74. A diet pill is given to 9 subjects over six weeks. The average difference in weight (follow up - baseline) is -2 pounds. What would the standard deviation of the difference in weight have to be for the upper endpoint of the 95% T confidence interval to touch 0?
Upper bound = mean + t-score*(standard deviation/sqrt(sample size))
0 = -2 + 2.306*(s/3)
2 = 2.306 * s / 3
s = 2.601903
Therefore the standard deviation would have to be approximately 2.60 for the upper bound of the 95% T confidence interval to touch 0.
75. In a study of emergency room waiting times, investigators consider a new and the standard triage systems. To test the systems, administrators selected 20 nights and randomly assigned the new triage system to be used on 10 nights and the standard system on the remaining 10 nights. They calculated the nightly median waiting time (MWT) to see a physician. The average MWT for the new system was 3 hours with a variance of 0.60 while the average MWT for the old system was 5 hours with a variance of 0.68. Consider the 95% confidence interval estimate for the differences of the mean MWT associated with the new system. Assume a constant variance. What is the interval? Subtract in this order (New System - Old System).
Confidence interval = mean +/- t-score * standard error
mean = new mean - old mean = 3 - 5 = -2
t-score = 2.101 given df = 18 (20 - 2) and a confidence level of 95%
standard error = sqrt((0.60*9 + 0.68*9)/(10+10-2)) * sqrt(1/10+1/10)
standard error = 0.358
confidence interval = [-2.75, -1.25]
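A sketch of the pooled-variance (constant variance) interval, treating 0.60 and 0.68 as variances as the question states. The t-score 2.101 is taken from the answer rather than computed, and the function name is an assumption of this example.

```python
import math

def pooled_t_interval(mean1, var1, n1, mean2, var2, n2, t_score):
    """Confidence interval for mean1 - mean2 assuming a common variance."""
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    std_error = math.sqrt(pooled_var) * math.sqrt(1 / n1 + 1 / n2)
    diff = mean1 - mean2
    return diff - t_score * std_error, diff + t_score * std_error

# New system first, old system second; t-score for df = 18 at 95% is 2.101.
low, high = pooled_t_interval(3, 0.60, 10, 5, 0.68, 10, t_score=2.101)
print(round(low, 2), round(high, 2))  # -2.75 -1.25
```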
76. To further test the hospital triage system, administrators selected 200 nights and randomly assigned a new triage system to be used on 100 nights and a standard system on the remaining 100 nights. They calculated the nightly median waiting time (MWT) to see a physician. The average MWT for the new system was 4 hours with a standard deviation of 0.5 hours while the average MWT for the old system was 6 hours with a standard deviation of 2 hours. Consider the hypothesis of a decrease in the mean MWT associated with the new treatment. What does the 95% independent group confidence interval with unequal variances suggest vis a vis this hypothesis? (Because there’s so many observations per group, just use the Z quantile instead of the T.)
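Assuming we subtract in this order (New System - Old System):
Confidence interval for two independent samples: (new mean - old mean) +/- z-score * standard error
mean = new mean - old mean = 4 - 6 = -2
z-score = 1.96 for a confidence level of 95%
standard error = sqrt((0.5^2*99 + 2^2*99)/(100+100-2)) * sqrt(1/100+1/100)
standard error = 0.206155
lower bound = -2 - 1.96*0.206155 = -2.40406
upper bound = -2 + 1.96*0.206155 = -1.59594
confidence interval = [-2.40406, -1.59594]
Because the group sizes are equal, this matches the unequal-variance standard error sqrt(0.5^2/100 + 2^2/100). Since the whole interval lies below zero, it supports the hypothesis of a decrease in the mean MWT under the new system.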
77. Write a SQL query to get the second highest salary from the Employee table. For example, given the Employee table below, the query should return 200 as the second highest salary. If there is no second highest salary, then the query should return null.
| Id | Salary |
| 1 | 100 |
| 2 | 200 |
| 3 | 300 |
SOLUTION A: Using IFNULL, OFFSET
- IFNULL(expression, alt): returns the expression if it is not null, otherwise returns alt. We’ll use this to return null if there’s no second-highest salary.
- OFFSET: used together with LIMIT and the ORDER BY clause to skip the first n rows that you specify. This is useful because you want the second row (2nd highest salary).
- SELECT
- IFNULL(
- (SELECT DISTINCT Salary
- FROM Employee
- ORDER BY Salary DESC
- LIMIT 1 OFFSET 1
- ), null) AS SecondHighestSalary
- FROM Employee LIMIT 1
SOLUTION B: Using MAX()
This query says to choose the MAX salary that isn’t equal to the MAX salary, which is equivalent to saying to choose the second-highest salary!
- SELECT MAX(salary) AS SecondHighestSalary
- FROM Employee
- WHERE salary != (SELECT MAX(salary) FROM Employee)
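To try the MAX() approach without a database server, the sketch below runs it against an in-memory SQLite database from Python. SQLite stands in for the MySQL-style dialect used above, and the wrapper function is purely illustrative.

```python
import sqlite3

def second_highest_salary(salaries):
    """Run the MAX()-based query against a throwaway in-memory Employee table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Employee (Id INTEGER PRIMARY KEY, Salary INTEGER)")
    conn.executemany(
        "INSERT INTO Employee (Salary) VALUES (?)", [(s,) for s in salaries]
    )
    row = conn.execute(
        "SELECT MAX(Salary) AS SecondHighestSalary FROM Employee "
        "WHERE Salary != (SELECT MAX(Salary) FROM Employee)"
    ).fetchone()
    conn.close()
    return row[0]  # None when there is no second-highest salary

print(second_highest_salary([100, 200, 300]))  # 200
print(second_highest_salary([100]))            # None
```

MAX() over an empty filtered set yields NULL, which is why this version returns null without needing IFNULL.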
78. Write a SQL query to find all duplicate emails in a table named Person.
| Id | Email |
| 1 | email@example.com |
| 2 | firstname.lastname@example.org |
| 3 | email@example.com |
SOLUTION A: COUNT() in a Subquery
First, a subquery is created to show the count of the frequency of each email. Then the subquery is filtered WHERE the count is greater than 1.
- SELECT Email
- FROM (
- SELECT Email, count(Email) AS count
- FROM Person
- GROUP BY Email
- ) as email_count
- WHERE count > 1
SOLUTION B: HAVING Clause
- HAVING is a clause that essentially allows you to use a WHERE statement in conjunction with aggregates (GROUP BY).
- SELECT Email
- FROM Person
- GROUP BY Email
- HAVING count(Email) > 1
79. Given a Weather table, write a SQL query to find all dates’ Ids with higher temperature compared to its previous (yesterday’s) dates.
| Id(INT) | RecordDate(DATE) | Temperature(INT) |
| 1 | 2015-01-01 | 10 |
| 2 | 2015-01-02 | 25 |
| 3 | 2015-01-03 | 20 |
| 4 | 2015-01-04 | 30 |
- DATEDIFF calculates the difference between two dates and is used to make sure we’re comparing today’s temperature to yesterday’s temperature.
In plain English, the query is saying, Select the Ids where the temperature on a given day is greater than the temperature yesterday.
- SELECT DISTINCT a.Id
- FROM Weather a, Weather b
- WHERE a.Temperature > b.Temperature
- AND DATEDIFF(a.RecordDate, b.RecordDate) = 1
The Employee table holds all employees. Every employee has an Id, a salary, and there is also a column for the department Id.
| Id | Name | Salary | DepartmentId |
| 1 | Joe | 70000 | 1 |
| 2 | Jim | 90000 | 1 |
| 3 | Henry | 80000 | 2 |
| 4 | Sam | 60000 | 2 |
| 5 | Max | 90000 | 1 |
The Department table holds all departments of the company.
| Id | Name |
| 1 | IT |
| 2 | Sales |
80. Write a SQL query to find employees who have the highest salary in each of the departments. For the above tables, your SQL query should return the following rows (order of rows does not matter).
| Department | Employee | Salary |
| IT | Max | 90000 |
| IT | Jim | 90000 |
| Sales | Henry | 80000 |
- 1. The IN clause allows you to use multiple OR clauses in a WHERE statement. For example WHERE country = 'Canada' or country = 'USA' is the same as WHERE country IN ('Canada', 'USA').
- 2. In this case, we want to filter the Employee table to only show the highest Salary per Department (i.e. DepartmentId). Then we can join the two tables WHERE the DepartmentId and Salary pair is in the filtered result.
- SELECT
- Department.name AS 'Department',
- Employee.name AS 'Employee',
- Salary
- FROM Employee
- INNER JOIN Department ON Employee.DepartmentId = Department.Id
- WHERE (DepartmentId, Salary) IN
- ( SELECT
- DepartmentId, MAX(Salary)
- FROM Employee
- GROUP BY DepartmentId
- )
81. Mary is a teacher in a middle school and she has a table storing students’ names and their corresponding seat ids. The column id is a continuous increment. Mary wants to change seats for the adjacent students.
Can you write a SQL query to output the result for Mary?
| id | student |
| 1 | Abbot |
| 2 | Doris |
| 3 | Emerson |
| 4 | Green |
| 5 | Jeames |
For the sample input, the output is:
| id | student |
| 1 | Doris |
| 2 | Abbot |
| 3 | Green |
| 4 | Emerson |
| 5 | Jeames |
If the number of students is odd, there is no need to change the last one’s seat.
- 1. Think of a CASE WHEN THEN statement like an IF statement in coding.
- 2. The first WHEN statement checks whether there’s an odd number of rows and the current row is the last one; if so, the id does not change.
- 3. The second WHEN statement adds 1 to each odd id (e.g. 1,3,5 becomes 2,4,6).
- 4. Similarly, the ELSE branch subtracts 1 from each even id (2,4,6 becomes 1,3,5).
- SELECT
- CASE
- WHEN((SELECT MAX(id) FROM seat)%2 = 1) AND id = (SELECT MAX(id) FROM seat) THEN id
- WHEN id%2 = 1 THEN id + 1
- ELSE id - 1
- END AS id, student
- FROM seat
- ORDER BY id
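The seat-swapping query can be exercised against an in-memory SQLite database, as sketched below. SQLite is used here only as a convenient stand-in, the wrapper function is hypothetical, and the computed column is aliased new_id (rather than id) to keep the ORDER BY unambiguous.

```python
import sqlite3

def swap_seats(students):
    """Swap adjacent seats; the last student stays put when the count is odd."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE seat (id INTEGER PRIMARY KEY, student TEXT)")
    conn.executemany(
        "INSERT INTO seat (id, student) VALUES (?, ?)",
        list(enumerate(students, start=1)),
    )
    rows = conn.execute(
        """
        SELECT
            CASE
                WHEN (SELECT MAX(id) FROM seat) % 2 = 1
                     AND id = (SELECT MAX(id) FROM seat) THEN id
                WHEN id % 2 = 1 THEN id + 1
                ELSE id - 1
            END AS new_id,
            student
        FROM seat
        ORDER BY new_id
        """
    ).fetchall()
    conn.close()
    return rows

print(swap_seats(["Abbot", "Doris", "Emerson", "Green", "Jeames"]))
```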
82. If there are 8 marbles of equal weight and 1 marble that weighs a little bit more (for a total of 9 marbles), how many weighings are required to determine which marble is the heaviest?
Two weighings would be required:
- 1. You would split the nine marbles into three groups of three and weigh two of the groups against each other. If the scale balances (alternative 1), you know that the heavy marble is in the third group of marbles. Otherwise, you’ll take the group that weighs more (alternative 2).
- 2. Then you would repeat the same step, but with three groups of one marble instead of three groups of three.
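The two-weighing strategy can be simulated to confirm that it finds the heavy marble wherever it sits. This is a sketch: marbles are represented by a hypothetical list of nine weights, and each call to heavier_group models one use of the balance.

```python
def find_heavy_marble(weights):
    """Return the index of the single heavier marble using two weighings."""

    def heavier_group(groups):
        # One weighing: put the first two groups on the balance.
        left, right, rest = groups
        left_weight = sum(weights[i] for i in left)
        right_weight = sum(weights[i] for i in right)
        if left_weight > right_weight:
            return left
        if right_weight > left_weight:
            return right
        return rest  # scale balances, so the heavy marble was left off

    # First weighing: three groups of three marbles.
    group = heavier_group([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
    # Second weighing: three groups of one marble.
    return heavier_group([[i] for i in group])[0]

weights = [10] * 9
weights[7] = 11  # hypothetical position of the heavy marble
print(find_heavy_marble(weights))  # 7
```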
83. How would the change of prime membership fee affect the market?
I’m not 100% sure about the answer to this question but will give my best shot!
Let’s take the instance where there’s an increase in the prime membership fee — there are two parties involved, the buyers and the sellers.
For the buyers, the impact of an increase in a prime membership fee ultimately depends on the price elasticity of demand for the buyers. If the price elasticity is high, then a given increase in price will result in a large drop in demand and vice versa. Buyers that continue to pay the membership fee are likely Amazon’s most loyal and active customers; they are also likely to place a higher emphasis on products with prime.
Sellers will take a hit, as there is now a higher cost of purchasing Amazon’s basket of products. That being said, some products will take a harder hit while others may not be impacted. It is likely that premium products that Amazon’s most loyal customers purchase would not be affected as much, like electronics.
84. If 70% of Facebook users on iOS use Instagram, but only 35% of Facebook users on Android use Instagram, how would you investigate the discrepancy?
There are a number of possible variables that can cause such a discrepancy that I would check to see:
- The demographics of iOS and Android users might differ significantly. For example, according to Hootsuite, 43% of females use Instagram as opposed to 31% of men. If the proportion of female users for iOS is significantly larger than for Android then this can explain the discrepancy (or at least a part of it). This can also be said for age, race, ethnicity, location, etc…
- Behavioral factors can also have an impact on the discrepancy. If iOS users use their phones more heavily than Android users, it’s more likely that they’ll indulge in Instagram and other apps than someone who spent significantly less time on their phones.
- Another possible factor to consider is how Google Play and the App Store differ. For example, if Android users have significantly more apps (and social media apps) to choose from, that may cause greater dilution of users.
- Lastly, any differences in the user experience can deter Android users from using Instagram compared to iOS users. If the app is more buggy for Android users than iOS users, they’ll be less likely to be active on the app.
85. Likes/users and minutes spent on a platform are increasing but total number of users are decreasing. What could be the root cause of it?
Generally, you would want to probe the interviewer for more information but let’s assume that this is the only information that he/she is willing to give.
Focusing on likes per user, there are two reasons why this would have gone up. The first reason is that the engagement of users has generally increased on average over time — this makes sense because as time passes, active users are more likely to be loyal users as using the platform becomes a habitual practice. The other reason why likes per user would increase is that the denominator, the total number of users, is decreasing. Assuming that users that stop using the platform are inactive users, aka users with little engagement and fewer likes than average, this would increase the average number of likes per user.
The explanation above can also be applied to minutes spent on the platform. Active users are becoming more engaged over time, while users with little usage are becoming inactive. Overall the increase in engagement outweighs the users with little engagement.
To take it a step further, it’s possible that the ‘users with little engagement’ are bots that Facebook has been able to detect. But over time, Facebook has been able to develop algorithms to spot and remove bots. If there were a significant number of bots before, this can potentially be the root cause of this phenomenon.
86. Facebook sees that likes are up 10% year over year, why could this be?
The total number of likes in a given year is a function of the total number of users and the average number of likes per user (which I’ll refer to as engagement).
Some potential reasons for an increase in the total number of users are the following: users acquired due to international expansion and younger age groups signing up for Facebook as they get older.
Some potential reasons for an increase in engagement are an increase in usage of the app from users that are becoming more and more loyal, new features and functionality, and an improved user experience.
87. If we were testing product X, what metrics would you look at to determine if it is a success?
The metrics that determine a product’s success are dependent on the business model and what the business is trying to achieve through the product. The book Lean Analytics lays out a great framework that one can use to determine what metrics to use in a given scenario.
[Figure: framework from Lean Analytics]
88. If a PM says that they want to double the number of ads in News Feed, how would you figure out if this is a good idea or not?
You can perform an A/B test by splitting the users into two groups: a control group with the normal number of ads and a test group with double the number of ads. Then you would choose the metric to define what a “good idea” is. For example, we can say that the null hypothesis is that doubling the number of ads won’t have any impact on the time spent on Facebook and the alternative hypothesis is that doubling the number of ads will reduce the time spent on Facebook. However, you can choose a different metric like the number of active users or the churn rate. Then you would conduct the test and determine the statistical significance of the result to reject or not reject the null.
89. What is: lift, KPI, robustness, model fitting, design of experiments, 80/20 rule?
Lift: lift is a measure of the performance of a targeting model measured against a random choice targeting model; in other words, lift tells you how much better your model is at predicting things than if you had no model.
KPI: stands for Key Performance Indicator, which is a measurable metric used to determine how well a company is achieving its business objectives. Eg. error rate.
Robustness: generally robustness refers to a system’s ability to handle variability and remain effective.
Model fitting: refers to how well a model fits a set of observations.
Design of experiments: also known as DOE, it is the design of any task that aims to describe and explain the variation of information under conditions that are hypothesized to reflect the variable.  In essence, an experiment aims to predict an outcome based on a change in one or more inputs (independent variables).
80/20 rule: also known as the Pareto principle; states that 80% of the effects come from 20% of the causes. Eg. 80% of sales come from 20% of customers.
90. Define quality assurance, six sigma.
Quality assurance: an activity or set of activities focused on maintaining a desired level of quality by minimizing mistakes and defects.
Six sigma: a specific type of quality assurance methodology composed of a set of techniques and tools for process improvement. A six sigma process is one in which 99.99966% of all outcomes are free of defects.
91. What is the difference between supervised and unsupervised machine learning?
Supervised machine learning: Supervised machine learning requires labelled training data.
Unsupervised machine learning: Unsupervised machine learning doesn’t require labelled data.
92. What is the bias-variance trade-off?
“Bias is error introduced in your model due to oversimplification of machine learning algorithm.” It can lead to under fitting. When you train your model at that time the model makes simplified assumptions to make the target function easier to understand.
Low bias machine learning algorithms:
- 1. Decision Trees
- 2. k-NN and SVM
High bias machine learning algorithms
- 1. Linear Regression
- 2. Logistic Regression
“Variance is an error introduced in your model due to a complex machine learning algorithm, your model learns noise also from the training data set and performs badly on the test data set.” It can lead to high sensitivity and over fitting.
Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens till a particular point. As you continue to make your model more complex, you end up over-fitting your model and hence your model will start suffering from high variance.
Bias, Variance trade off:
The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance.
- 1. The k-nearest neighbours algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k which increases the number of neighbours that contribute to the prediction and in turn increases the bias of the model.
- 2. The support vector machine algorithm has low bias and high variance, but the trade-off can be changed by increasing the C parameter that influences the number of violations of the margin allowed in the training data which increases the bias but decreases the variance.
There is no escaping the relationship between bias and variance in machine learning. Increasing the bias will decrease the variance. Increasing the variance will decrease the bias.
93. What are exploding gradients?
Gradient is the direction and magnitude calculated during training of a neural network that is used to update the network weights in the right direction and by the right amount.
“Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network model weights during training.” At an extreme, the values of weights can become so large as to overflow and result in NaN values.
This makes your model unstable and unable to learn from your training data.
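A minimal sketch of how gradients can explode, assuming a deliberately simplified network: a chain of scalar linear layers h_t = weight * h_(t-1), for which the chain rule gives a gradient of weight raised to the number of layers. The function name `gradient_through_layers` and the chosen weight and depths are illustrative assumptions, not a real training setup.

```python
import math


def gradient_through_layers(weight, num_layers):
    """Gradient of the output with respect to the input for a chain of
    scalar linear layers h_t = weight * h_(t-1).

    By the chain rule this is weight multiplied by itself num_layers times.
    """
    grad = 1.0
    for _ in range(num_layers):
        grad *= weight
    return grad


# With a weight above 1 the gradient grows exponentially with depth.
for depth in (10, 100, 2000):
    print(depth, gradient_through_layers(1.5, depth))

# Once the value overflows to infinity, further arithmetic can produce NaN.
overflowed = gradient_through_layers(1.5, 2000)
print(math.isinf(overflowed), math.isnan(overflowed - overflowed))
```

At a depth of 2000 the repeated multiplication exceeds the largest representable float and becomes infinity, and subtracting infinity from itself yields NaN, which mirrors the overflow described above.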
94. What is a confusion matrix?
For a binary classifier, the confusion matrix is a 2×2 table that contains the four possible outcomes of its predictions. Various measures, such as error rate, accuracy, specificity, sensitivity, precision and recall, are derived from it.
A data set used for performance evaluation is called a test data set. It should contain both the correct (observed) labels and the predicted labels.
The predicted labels will exactly match the observed labels if the performance of a binary classifier is perfect.
In real-world scenarios, the predicted labels usually match only part of the observed labels.
A binary classifier predicts every data instance of a test data set as either positive or negative. This produces four outcomes:
- 1. True positive(TP) — Correct positive prediction
- 2. False positive(FP) — Incorrect positive prediction
- 3. True negative(TN) — Correct negative prediction
- 4. False negative(FN) — Incorrect negative prediction
Basic measures derived from the confusion matrix
- 1. Error Rate = (FP+FN)/(P+N), where P = TP+FN is the number of actual positives and N = TN+FP is the number of actual negatives
- 2. Accuracy = (TP+TN)/(P+N)
- 3. Sensitivity(Recall or True positive rate) = TP/P
- 4. Specificity(True negative rate) = TN/N
- 5. Precision(Positive predictive value) = TP/(TP+FP)
- 6. F-Score(Weighted harmonic mean of precision and recall) = (1+b²)(PREC·REC)/(b²·PREC+REC) where b is commonly 0.5, 1 or 2.
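These measures can be computed directly from the four counts. The sketch below assumes labels encoded as 1 (positive) and 0 (negative) and non-zero denominators. The helper names `confusion_counts` and `basic_measures` and the sample labels are made up for illustration, and the F-score uses the standard F-beta form with (1+b²) in the numerator.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def basic_measures(tp, fp, tn, fn, b=1.0):
    """Derive the basic measures; assumes no denominator is zero."""
    p = tp + fn  # actual positives
    n = tn + fp  # actual negatives
    precision = tp / (tp + fp)
    recall = tp / p
    return {
        "error_rate": (fp + fn) / (p + n),
        "accuracy": (tp + tn) / (p + n),
        "sensitivity": recall,
        "specificity": tn / n,
        "precision": precision,
        "f_score": (1 + b**2) * precision * recall / (b**2 * precision + recall),
    }


# Hypothetical test set: 4 actual positives and 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
measures = basic_measures(*counts)
print(counts)
print(measures)
```

For this sample the counts are TP = 3, FP = 1, TN = 5 and FN = 1, so accuracy is 8/10 and both precision and recall are 3/4.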
95. Explain how a ROC curve works.
The ROC curve is a graphical representation of the trade-off between the true positive rate and the false positive rate at various classification thresholds. It plots sensitivity (true positive rate) on the y-axis against the false positive rate (1 − specificity) on the x-axis, with one point per threshold.
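As a rough sketch of how the points of a ROC curve are obtained, the code below sweeps a threshold over the predicted scores and records the false positive rate and true positive rate at each step. The function name `roc_points`, the labels and the scores are all hypothetical, and an instance is treated as positive when its score is greater than or equal to the threshold.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs.

    Starts at (0, 0) and adds one point per distinct score used as a
    threshold, from the highest threshold to the lowest.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical labels (1 = positive) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(round(fpr, 2), round(tpr, 2))
```

Lowering the threshold can only add predicted positives, so both rates move from (0, 0) towards (1, 1). A classifier whose curve rises quickly towards a true positive rate of 1 while the false positive rate stays low separates the classes well.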
96. What is selection bias?
Selection bias occurs when the sample obtained is not representative of the population intended to be analysed.
97. Explain the SVM machine learning algorithm in detail.
SVM stands for support vector machine. It is a supervised machine learning algorithm that can be used for both regression and classification. If you have n features in your training data set, SVM plots each data point in n-dimensional space, with the value of each feature being the value of a particular coordinate. SVM then uses hyperplanes to separate the different classes, based on the provided kernel function.
98. What are support vectors in SVM?
Support vectors are the data points closest to the classifier (the separating hyperplane). In the usual diagram, thinner lines drawn on either side of the classifier pass through these points, and the distance between the two thin lines is called the margin.
99. What are the different kernel functions in SVM?
There are four commonly used types of kernels in SVM.
- 1. Linear Kernel
- 2. Polynomial kernel
- 3. Radial basis kernel
- 4. Sigmoid kernel
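The four kernels can be written as small functions of two feature vectors. This sketch uses their common textbook forms. The parameter names `gamma`, `coef0` and `degree` and their default values are assumptions chosen for illustration and are not taken from the text above.

```python
import math


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    # Radial basis kernel: depends only on the squared distance between x and y.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)


# Hypothetical feature vectors.
x = [1.0, 2.0]
y = [3.0, 4.0]

print(linear_kernel(x, y))
print(polynomial_kernel(x, y, degree=2))
print(rbf_kernel(x, y, gamma=0.1))
print(sigmoid_kernel(x, y, gamma=0.01))
```

Each function returns a similarity score for a pair of points. The radial basis kernel returns 1 for identical points and decays towards 0 as the points move apart.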
100. What is selection bias?
Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample.
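A small sketch of the effect, assuming a made-up population of the values 1 to 100. A random sample is drawn with `random.sample`, while the biased sample only ever contains values above 70, for example because only those members could be reached. All names and numbers are hypothetical.

```python
import random
from statistics import mean

# Hypothetical population: the integers 1 to 100.
population = list(range(1, 101))

# Random sample: every member has the same chance of being selected.
rng = random.Random(0)
random_sample = rng.sample(population, 20)

# Non-random sample: only members with a value above 70 can be selected.
biased_sample = [value for value in population if value > 70][:20]

print("population mean:", mean(population))
print("random sample mean:", mean(random_sample))
print("biased sample mean:", mean(biased_sample))
```

The biased sample overestimates the population mean because low values can never enter it, whatever the sample size.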