SAS predictive modeling interview questions LEARNOVITA

SAS Predictive Modeling Interview Questions and Answers [ FRESHERS ]

Last updated on 23rd Sep 2022, Blog, Interview Question

About author

Sanjay (Sr Big Data DevOps Engineer )

Highly experienced in his industry domain, with 7+ years of experience. He has also been a technical blog writer for the past 4 years, sharing informative content for job seekers.


1. What is predictive modeling?

Ans:

Predictive modeling uses historical data and statistical techniques to predict future outcomes, and it is one of the most sought-after skills today. It is used in almost every domain, from finance and retail to manufacturing, and is regarded as a method for solving complex business problems. It helps businesses grow, e.g. through a predictive acquisition model or an optimization engine that solves a network problem.

2. What are the essential steps in a predictive modeling project?

Ans:

  • Establish the business objective of the predictive model.
  • Pull historical data, both internal and external.
  • Select the observation and performance windows.
  • Create newly derived variables.
  • Split the data into training, validation and test samples.
  • Clean the data: treat missing values and outliers.
  • Variable reduction / selection.
  • Variable transformation.
  • Develop the model.
  • Validate the model.
  • Check model performance.
  • Deploy the model.
  • Monitor the model.
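
The data-splitting step in the list above could be sketched in SAS roughly as follows. This is a minimal sketch, not the only way to do it: the data set name work.modeling_data, the 70% sampling rate and the seed are assumptions chosen for illustration.

```sas
/* Assumption: work.modeling_data is a hypothetical input data set. */
proc surveyselect data=work.modeling_data
                  out=work.flagged
                  method=srs        /* simple random sampling */
                  samprate=0.7      /* flag about 70% of rows for training */
                  seed=12345        /* fixed seed so the split is reproducible */
                  outall;           /* keep every row and add Selected = 0/1 */
run;

/* Route flagged rows to training and the rest to a holdout sample */
data work.train work.holdout;
    set work.flagged;
    if Selected = 1 then output work.train;
    else output work.holdout;
run;
```

The holdout sample can be split again in the same way to obtain separate validation and test samples.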

3. Explain the problem statement of a project.

Ans:

A problem statement is usually one or two sentences that define the problem a process improvement project will address. In general, it outlines the negative points of the current situation and explains why they matter.

4. What is the difference between linear and logistic regression?

Ans:

Linear regression requires the dependent variable to be continuous, i.e. numeric, while binary logistic regression requires the dependent variable to be binary, with two categories only (0/1). Multinomial or ordinal logistic regression can have a dependent variable with more than two categories.

Linear regression is based on least squares estimation, which says the regression coefficients should be chosen so that they minimize the sum of the squared distances of each observed response to its fitted value. Logistic regression is based on maximum likelihood estimation, which says the coefficients should be chosen so that they maximize the probability of Y given X (the likelihood).

5. How do you treat outliers?

Ans:

  • Percentile capping.
  • Box-plot method.
  • Mean plus or minus 3 standard deviations.
  • Weight of evidence.
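
As an illustration of the first item, percentile capping might look like the sketch below. It uses the sample data set sashelp.class and caps weight at the 1st and 99th percentiles; the choice of percentiles and the output names (work.limits, work.capped, weight_capped) are assumptions for the example.

```sas
/* Step 1: compute the 1st and 99th percentiles of the variable */
proc means data=sashelp.class noprint;
    var weight;
    output out=work.limits p1=lower p99=upper;
run;

/* Step 2: cap values that fall outside the two percentiles */
data work.capped;
    if _n_ = 1 then set work.limits(keep=lower upper);  /* limits are retained */
    set sashelp.class;
    /* Guard against missing values, because MAX and MIN ignore them */
    if not missing(weight) then
        weight_capped = min(max(weight, lower), upper);
    drop lower upper;
run;
```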

6. What is multicollinearity and how do you deal with it?

Ans:

Multicollinearity means a high correlation between independent variables. The absence of multicollinearity is one of the assumptions in linear and logistic regression. It can be identified by looking at the VIF (variance inflation factor) of the variables. VIF > 2.5 implies a moderate collinearity issue, and VIF > 5 is considered high collinearity. It can be handled by an iterative process: first, remove the variable with the highest VIF, then check the VIF of the remaining variables. If the VIF of any remaining variable is > 2.5, repeat the first step until all VIFs are <= 2.5.
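
A minimal sketch of checking VIF scores in SAS, using the sample data set sashelp.class (the choice of response and predictors is purely illustrative):

```sas
/* The VIF option on the MODEL statement adds a Variance Inflation
   column to the parameter estimates table */
proc reg data=sashelp.class;
    model weight = height age / vif;
run;
quit;
```

In the iterative process described above, the predictor with the highest VIF would be dropped from the MODEL statement and the step rerun.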

7. Explain collinearity between continuous and categorical variables.

Ans:

Collinearity between categorical and continuous variables is very common. The choice of reference category for the dummy variables affects multicollinearity, which means that changing the reference category can avoid collinearity. Pick the reference category with the highest proportion of cases.

8. What are the applications of predictive modeling?

Ans:

  • Acquisition: cross-sell / up-sell.
  • Retention: predictive attrition model.
  • Customer lifetime value model.
  • Next best offer.
  • Market mix model.
  • Pricing model.
  • Campaign response model.
  • Probability of customers defaulting on a loan.
  • Segmenting customers based on homogeneous attributes.
  • Demand forecasting.
  • Usage simulation.
  • Underwriting.
  • Optimization: optimize a network.

9. Is VIF a correct method to compute collinearity in this case?

Ans:

VIF is not the correct method in this case. VIFs should only be run for continuous variables. A t-test can be used to check collinearity between a continuous variable and a dummy variable.

10. What is the difference between factor analysis and PCA?

Ans:

  • In principal components analysis, the components are calculated as linear combinations of the original variables. In factor analysis, the original variables are expressed as linear combinations of the factors.
  • Principal components analysis is used as a variable reduction technique, whereas factor analysis is used to understand what constructs underlie the data.
  • In principal components analysis, the goal is to explain as much of the total variance in the variables as possible. The goal in factor analysis is to explain the covariances or correlations between the variables.

11. What are the effective steps in a predictive modeling project?

Ans:

  • Set up a business target.
  • Examine historical data, both internal and external.
  • Choose the observation and performance windows.
  • Create newly derived variables.
  • Split the data into training, validation and test samples.
  • Clean the data: treat missing values and outliers.
  • Variable reduction / selection.
  • Variable transformation.
  • Create the model.
  • Validate the model.
  • Check model performance.

12. Contrast linear and logistic regression.

Ans:

In linear regression, the dependent variable needs to be continuous, i.e. without any breaks. The variable can be as large as possible as long as it is not split into categories. In logistic regression, by contrast, the dependent variable must be binary, i.e. either 0 or 1, although it can have more than two categories in multinomial or ordinal logistic regression.

Least squares estimation is used in linear regression: the coefficients chosen should minimize the sum of the squared distances. Maximum likelihood estimation, used in logistic regression, requires that the coefficients chosen yield the maximum probability of Y given X.

13. What is meant by SAP security?

Ans:

SAP security means providing business users with the right access according to their authority or responsibility, and granting permissions according to their roles.

14. Explain what "roles" are in SAP security.

Ans:

"Roles" refers to a group of t-codes assigned to execute a particular business task. Each role in SAP requires particular privileges to execute a function in SAP; these privileges are called AUTHORIZATIONS.

15. What prerequisites should be met before assigning SAP_ALL to a user, even if there is approval from the authority?

Ans:

  • Enabling the audit log using the SM19 t-code.
  • Retrieving the audit log using the SM20 t-code.

16. Explain what SOD is in SAP security.

Ans:

SOD means Segregation of Duties; it is implemented in SAP in order to detect and prevent error or fraud during a business transaction. For example, if a user or employee has the privilege to access bank account details and the payment run, it is possible that they could divert vendor payments to their own account.

17.What are the role templates used for?

Ans:

Role templates comprise transactions, web addresses and reports. They are predefined activity groups in SAP.

18. Is it possible to change a role template? How?

Ans:

  • Yes, you can change a user role template. There are exactly three ways to work with user role templates.
  • You can use them as they are delivered in SAP.
  • You can modify them according to your requirements through PFCG.
  • You can create them from scratch.
  • For all of the above, you need to use the PFCG transaction to maintain them.

19. What is the user type for a background job user?

Ans:

  • System User.
  • Communication User.

20. Explain the important model performance statistics.

Ans:

  • AUC > 0.7, with no significant difference between the AUC score of training versus validation.
  • KS should fall in the top 3 deciles and should be more than 30.
  • Rank ordering: no break in rank ordering.
  • Same signs of the parameter estimates in both training and validation.

21. What is a p-value and how is it used for variable selection?

Ans:

The p-value is the lowest level of significance at which you can reject the null hypothesis. In the case of independent variables, it indicates whether the coefficient of a variable is significantly different from zero.

22. Explain the problem statement of the project. What are its financial impacts?

Ans:

Cover the objective or main goal of the predictive model. Compare the monetary benefits of the predictive model versus no model. Also highlight the non-monetary benefits (if any).

23. State the difference between a derived role and a single role.

Ans:

T-codes can be added to or deleted from a single role, whereas a derived role does not allow that.

24. Explain what an authorization object class and an authorization object are.

Ans:

Authorization Object Class: Authorization objects fall under authorization object classes, and they are grouped by functional area such as HR, finance, accounting, etc.

Authorization Object: Authorization objects are groups of authorization fields that regulate a particular activity. An authorization relates to a particular action, while an authorization field lets security administrators configure specific values in that particular action.

25. Explain what PFCG_TIME_DEPENDENCY is.

Ans:

PFCG_TIME_DEPENDENCY is a report that is used for user master comparison. It also clears expired profiles from the user master record. To execute this report directly, the PFUD transaction code can also be used.

26. Which two tables do authorization objects need in order to be maintained?

Ans:

  • USOBT
  • USOBX

27. How can you lock all users simultaneously in SAP?

Ans:

All the users in SAP can be locked simultaneously by running the EWZ5 t-code.

28. How do you handle missing values?

Ans:

  • Impute missing values using the following strategies, or treat missing values as a separate category.
  • Mean imputation for continuous variables (no outliers).
  • Median imputation for continuous variables (if outliers are present).
  • Cluster imputation for continuous variables.
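
One way to sketch mean or median imputation in SAS is PROC STDIZE with the REPONLY option. The data set work.modeling_data and the variable income below are hypothetical names used only for illustration.

```sas
/* Assumption: work.modeling_data and income are hypothetical names. */
proc stdize data=work.modeling_data
            out=work.imputed
            method=median    /* use METHOD=MEAN when there are no outliers */
            reponly;         /* only replace missing values, do not standardize */
    var income;
run;
```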

29. How is VIF calculated and interpreted?

Ans:

VIF measures how much the variance of an estimated regression coefficient is inflated because of collinearity. If the VIF of a predictor variable were 9 (√9 = 3), the standard error of the coefficient of that predictor would be 3 times as large as it would be if that predictor were uncorrelated with the other predictors.

Steps for calculating VIF: run a linear regression in which one of the independent variables is treated as the target variable and all the other independent variables are treated as independent variables. Then calculate the VIF of the variable: VIF = 1/(1 - R-squared).
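
The two steps above can be reproduced by hand as in the following sketch, which uses sashelp.class and treats height as the predictor being checked against age (an arbitrary choice for illustration). The output data set names are assumptions.

```sas
/* Step 1: regress one predictor (height) on the remaining predictors (age).
   The RSQUARE option writes R-squared to the OUTEST= data set as _RSQ_. */
proc reg data=sashelp.class outest=work.est rsquare noprint;
    model height = age;
run;
quit;

/* Step 2: VIF = 1 / (1 - R-squared) */
data work.vif;
    set work.est;
    vif = 1 / (1 - _RSQ_);
    keep _depvar_ _rsq_ vif;
run;
```

With only two predictors in a model, both have the same VIF, so the value computed here should match what the VIF option of PROC REG reports for a model with height and age.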

30. Do you remove the intercept while calculating VIF?

Ans:

No. VIF depends on the intercept because there is an intercept in the regression used to determine VIF. If the intercept is removed, R-squared is not meaningful because it can be negative, in which case you would get VIF < 1, implying that the standard error of a variable would go up if that independent variable were uncorrelated with the other predictors.

31. Explain collinearity between continuous and categorical variables. Is VIF a correct method to compute collinearity in this case?

Ans:

Collinearity between categorical and continuous variables is very common. The choice of reference category for the dummy variables influences multicollinearity, which means that changing the reference category can avoid collinearity. Pick the reference category with the highest share of cases.

32. List the reasons for choosing SAS over other data analytics tools.

Ans:

SAS can be compared with popular alternatives in the market on the following aspects:

  • It includes everything.
  • Simple to use.
  • Industry knowledge.
  • Better algorithm testing and data security.
  • Key conclusions.

33. What is SAS?

Ans:

SAS stands for Statistical Analysis System. SAS is a software suite for advanced analytics, multivariate analysis, business intelligence, data management and predictive analytics. It is developed by SAS Institute. SAS provides a graphical point-and-click user interface for non-technical users and more advanced options through the SAS language.

34. What are the features of SAS?

Ans:

Business Solutions: SAS provides business analysis that can be used as business products by various companies.

Analytics: SAS is a market leader in analytics for different business products and services.

Data Access & Management: SAS can also be used as DBMS software.

Reporting & Graphics: SAS helps to visualize analysis in the form of summary, list and graphic reports.

Visualization: You can visualize reports in the form of graphs ranging from simple scatter plots and bar charts to complex multi-page classification panels.

35. Mention a few capabilities of the SAS framework.

Ans:

Access: SAS allows us to access data from multiple sources such as Excel files, raw databases, Oracle databases and SAS data sets.

Manage: We can then manage this data to subset it, create variables, and validate and clean the data.

Analyze: Analysis is then performed on this data. We can perform simple analyses such as frequencies and averages, and complex analyses including regression and forecasting. SAS is a gold standard for statistical analysis.

Present: Finally, we can present the analysis in the form of list, summary and graphic reports. We can print these reports, write them to a data file or publish them online.

36. What is the function of the OUTPUT statement in a SAS program?

Ans:

  • You can use the OUTPUT statement to save summary statistics in a SAS data set. This information can then be used to create customized reports or to save historical information about a process. With it you can:
  • Specify the statistics to save in the output data set.
  • Specify the name of the output data set.
  • Compute and save percentiles not automatically computed by the CAPABILITY procedure.

37. What is the function of the STOP statement in a SAS program?

Ans:

The STOP statement causes SAS to stop processing the current DATA step immediately and resume processing the statements after the end of the current DATA step.
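
A common place where STOP is needed is direct access with the POINT= option. The sketch below reads only the last observation of sashelp.class; the output data set name work.last_row is an assumption.

```sas
/* POINT= reads observations by number and never hits end-of-file,
   so STOP is required to end the DATA step after one read. */
data work.last_row;
    set sashelp.class nobs=total point=total;  /* total = number of observations */
    output;
    stop;
run;
```

Without the STOP statement this step would loop indefinitely, because nothing else tells SAS that the input is exhausted.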

38. What is the difference between using the DROP= data set option in the DATA statement and in the SET statement?

Ans:

  • If you do not want to process certain variables and do not want them to appear in the new data set, specify the DROP= data set option in the SET statement.
  • If you need to process certain variables but do not want them to appear in the new data set, specify the DROP= data set option in the DATA statement.
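
The two placements can be compared in a short sketch using sashelp.class. The output data set names and the age_group variable are made up for the example.

```sas
/* DROP= on SET: age is never read, so it cannot be used in this step */
data work.no_age;
    set sashelp.class(drop=age);
run;

/* DROP= on DATA: age is read and used, but not written to the output */
data work.age_groups(drop=age);
    set sashelp.class;
    length age_group $ 8;
    if age >= 13 then age_group = 'teen';
    else age_group = 'child';
run;
```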

39. Given an unsorted data set, how do you read the last observation into a new data set?

Ans:

    data work.calculus;
        set work.comp end=last;
        if last;
    run;

Here calculus is the new data set to be created and comp is the existing data set. last is a temporary variable (initialized to 0) that is set to 1 when the SET statement reads the last observation.

40. What is the difference between reading data from an external file and reading data from an existing data set?

Ans:

The main difference is that when reading an existing data set with a SET statement, SAS retains the values of the variables from one observation to the next. When reading data from an external file with an INPUT statement, the variables are reset to missing at the start of each iteration of the DATA step, so a value must be retained explicitly (for example with a RETAIN statement) if it needs to be carried forward.

41. How many data types are there in SAS?

Ans:

There are two data types in SAS: character and numeric. Dates are stored as numeric values (the number of days since January 1, 1960), and there are built-in functions and formats for working with them.

42. What is the difference between SAS functions and procedures?

Ans:

Functions expect argument values to be supplied across an observation in a SAS data set, whereas a procedure expects one variable value per observation. For example:

    data average;
        set temp;
        avgtemp = mean(of T1-T24);
    run;

43. What are the differences between the SUM function and the "+" operator?

Ans:

The SUM function returns the sum of the non-missing arguments, whereas the "+" operator returns a missing value if any of the arguments is missing.
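
A small sketch that shows the difference in the log (the variable names are arbitrary):

```sas
data _null_;
    a = 1;
    b = .;                     /* missing value */
    c = 3;
    total_fn = sum(a, b, c);   /* 4: SUM ignores the missing value */
    total_op = a + b + c;      /* missing: "+" propagates missing values */
    put total_fn= total_op=;   /* writes total_fn=4 total_op=. to the log */
run;
```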

44. What are the differences between PROC MEANS and PROC SUMMARY?

Ans:

PROC MEANS: When a BY statement is used, PROC MEANS produces subgroup statistics only if the input data has been previously sorted (using PROC SORT) by the BY variables.

PROC SUMMARY: PROC SUMMARY automatically produces statistics for all subgroups, giving in one run all the information you would otherwise get by repeatedly sorting the data set by the variables that define each subgroup and running PROC MEANS. PROC SUMMARY does not print any output by default, so you need to use an OUTPUT statement to create a new data set and then use PROC PRINT to see the computed statistics.

45. Give an example where SAS fails to convert a character value to a numeric value automatically.

Ans:

Suppose the value of the variable PayRate begins with a dollar sign ($). When SAS tries to automatically convert the values of PayRate to numeric values, the dollar sign blocks the process and the values cannot be converted. Therefore, it is always best to use the INPUT and PUT functions explicitly in programs when conversions occur.

46. How do you delete duplicate observations in SAS?

Ans:

By using NODUPS in PROC SORT:

    proc sort data=SAS-Dataset nodups;
        by var;
    run;

By using an SQL query inside PROC SQL:

    proc sql;
        create table SAS-Dataset as
        select distinct * from Old-SAS-Dataset;
    quit;

By cleaning the data in a DATA step (temp must be sorted by group):

    data clean;
        set temp;
        by group;
        if first.group then output;   /* keep the first observation of each group */
    run;

47. How does PROC SQL work?

Ans:

PROC SQL processes all observations simultaneously. The following steps happen when PROC SQL is executed:

  • SAS scans every statement in the SQL procedure and checks for syntax errors, such as missing semicolons and invalid statements.
  • The SQL optimizer scans the query inside the statement and decides how the SQL query should be executed in order to minimize run time.
  • Any tables in the FROM clause are loaded into the data engine, where they can then be accessed in memory.
  • The code and calculations are executed.
  • The final table is created in memory.
  • The final table is sent to the output table described in the SQL statement.

48. Briefly explain the INPUT and PUT functions.

Ans:

INPUT function: character-to-numeric conversion, input(source, informat).

PUT function: numeric-to-character conversion, put(source, format).
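
Both conversions can be sketched as below, including a value with a dollar sign like the PayRate case in question 45. The variable names and the choice of the COMMA10. informat and Z5. format are illustrative.

```sas
data _null_;
    pay_char = '$1,250';
    pay_num  = input(pay_char, comma10.);   /* COMMAw. strips $ and commas */
    id_num   = 42;
    id_char  = put(id_num, z5.);            /* Zw. pads with leading zeros */
    put pay_num= id_char=;                  /* pay_num=1250 id_char=00042 */
run;
```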

49. What would be the result of the following SAS functions (given that 31 Dec 2000 is a Sunday)?

Ans:

    Weeks  = intck('week',  '31dec2000'd, '01jan2001'd);
    Years  = intck('year',  '31dec2000'd, '01jan2001'd);
    Months = intck('month', '31dec2000'd, '01jan2001'd);

Here we calculate the weeks between 31st December 2000 and 1st January 2001. 31st December 2000 was a Sunday, so 1st January 2001 is a Monday in the same week. Hence, Weeks = 0.

  • Years = 1, since the two days are in different calendar years.
  • Months = 1, since the two days are in different calendar months.

50. What are the applications of predictive modeling?

Ans:

  • Customer targeting.
  • Churn prevention.
  • Sales forecasting.
  • Market analysis.
  • Risk assessment.
  • Financial modeling.

51. What length is assigned to the target variable by the SCAN function?

Ans:

The SCAN function assigns a length of 200 to the target variable.

52. Name a few SAS functions.

Ans:

SCAN, SUBSTR, TRIM, CATX, INDEX, TRANWRD, FIND, SUM.

53. What does the TRANWRD function do?

Ans:

The TRANWRD function replaces or removes all occurrences of a pattern of characters within a character string.
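
A minimal sketch of TRANWRD, with a made-up string for illustration:

```sas
data _null_;
    length title $ 20;
    title = 'Mrs. Joan Smith';
    title = tranwrd(title, 'Mrs.', 'Ms.');   /* replace every occurrence of 'Mrs.' */
    put title=;                              /* title=Ms. Joan Smith */
run;
```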

54. What are the four primary aspects of predictive analytics?

Ans:

  • Data Sourcing.
  • Data Utility.
  • Deep Learning, Machine Learning, and Automation.
  • Objectives and Usage.

55. How do you use a DO loop if you don't know how many times it should execute?

Ans:

Use DO UNTIL or DO WHILE to specify a condition.

56. How do dates work in SAS data?

Ans:

  • Data is central to every data set. In SAS, data is available in tabular form, where variables occupy the columns and observations occupy the rows.
  • SAS treats numbers as numeric data and everything else as character data. SAS has two data types: numeric and character.
  • Dates in SAS are represented in a special way compared with other languages: a date is stored as a numeric value, the number of days since January 1, 1960, and is displayed with a date format.
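
The numeric nature of SAS dates can be seen in a short sketch like this one (the variable names are arbitrary):

```sas
data _null_;
    d0 = '01jan1960'd;     /* day zero of the SAS calendar */
    d1 = '02jan1960'd;
    put d0= d1=;           /* unformatted: d0=0 d1=1 */
    put d1= date9.;        /* formatted for display: d1=02JAN1960 */
run;
```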

57. What exactly does the term ensembling stand for in predictive modeling?

Ans:

  • In general, ensembling is a technique of combining two or more algorithms of similar or dissimilar types, called base learners.
  • This is done to make a more robust system that incorporates the predictions from all the base learners.

58. What is the difference between DO WHILE and DO UNTIL?

Ans:

An important difference between the DO UNTIL and DO WHILE statements is that the DO WHILE expression is evaluated at the top of the DO loop. If the expression is false the first time it is evaluated, the DO loop never executes. A DO UNTIL loop, by contrast, always executes at least once.
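
The difference can be observed with a sketch like the following, where both conditions are chosen so that DO WHILE is false on entry and DO UNTIL is already true:

```sas
data _null_;
    n = 10;

    /* DO WHILE: condition checked at the top, so the body never runs here */
    do while (n < 5);
        put 'DO WHILE body executed';
        n = n + 1;
    end;

    /* DO UNTIL: condition checked at the bottom, so the body runs once */
    do until (n >= 5);
        put 'DO UNTIL body executed';
        n = n + 1;
    end;
run;
```

Only the DO UNTIL message is written to the log.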

59. How do you specify a number of iterations and a specific condition within a single DO loop?

Ans:

    data work;
        do i = 1 to 20 until (Sum >= 20000);
            Year + 1;
            Sum + 2000;
            Sum + Sum * .10;
        end;
    run;

The loop ends after 20 iterations or as soon as Sum reaches 20,000, whichever comes first.

60. What are the parameters of the SCAN function?

Ans:

The SCAN function is used as follows: scan(argument, n, delimiters). Here, argument specifies the character variable or expression to scan, n specifies which word to read, and delimiters are special characters that must be enclosed in single quotation marks.
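
A minimal sketch of SCAN with a comma delimiter (the string and variable names are made up):

```sas
data _null_;
    colors = 'red,green,blue';
    second = scan(colors, 2, ',');    /* second comma-delimited word */
    final  = scan(colors, -1, ',');   /* a negative n counts from the right */
    put second= final=;               /* second=green final=blue */
run;
```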

61. If a variable contains only numbers, can it be a character data type?

Ans:

Yes, it depends on how you use the variable. There are some numbers we want to use as a categorical value rather than a quantity. An example is a variable called "Foreigner", where the observations have the value "0" or "1", representing not a foreigner and foreigner respectively. Similarly, the ID in a particular table can be a number but does not represent any quantity. Phone numbers are another popular example.

62. If a variable contains letters or special characters, can it be a numeric data type?

Ans:

No, it must be a character data type.

63. What can be the size of the largest data set in SAS?

Ans:

  • The number of observations is limited only by the computer's capacity to handle and store them.
  • Prior to SAS 9.1, SAS data sets could contain up to 32,767 variables. In SAS 9.1, the maximum number of variables in a SAS data set is limited by the resources available on the computer.

64. Give some examples where PROC REPORT's defaults are different from PROC PRINT's defaults.

Ans:

  • No record numbers in PROC REPORT.
  • Labels (not variable names) are used as headers in PROC REPORT.
  • REPORT needs the NOWINDOWS option.

65. Give some examples where PROC REPORT's defaults are the same as PROC PRINT's defaults.

Ans:

  • Variables/columns appear in position order.
  • Rows are ordered as they appear in the data set.

66. What is the purpose of the trailing @ and @@? How do you use them?

Ans:

The trailing @ is a line-hold specifier. Using a trailing @ in an INPUT statement gives you the ability to read part of a raw data line, test it, and then decide how to read additional data from the same record.
  • The single trailing @ tells SAS to "hold the line".
  • The double trailing @@ tells SAS to "hold the line more strongly". An INPUT statement ending with @@ instructs the program to release the current raw data line only when there are no data values left to be read from that line. The @@ therefore holds an input record even across multiple iterations of the DATA step.
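
Both line-hold specifiers can be sketched with in-stream data. The data set names, variables and record layout below are assumptions made for the example.

```sas
/* Single trailing @: read the record type, test it, then read the rest */
data work.sales;
    input type $1. @;          /* hold the line after reading column 1 */
    if type = 'S' then do;
        input amount;          /* continue reading the same record */
        output;
    end;
    datalines;
S 100
R 999
S 250
;
run;

/* Double trailing @@: several observations on one record */
data work.scores;
    input name $ score @@;     /* hold the line across DATA step iterations */
    datalines;
Ann 90 Bob 85 Cal 77
;
run;
```

The first step keeps only the two 'S' records; the second creates three observations from a single data line.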

67. What is the difference between an order variable and a group variable in PROC REPORT?

Ans:

  • If a variable is used as a group variable, rows that have the same values are collapsed.
  • Order variables produce a list report, whereas group variables produce a summary report.

68. Give some ways in which you can define variables to produce a summary report (using PROC REPORT).

Ans:

All of the variables in a summary report must be defined as group, analysis, across or computed variables.

69. What are the default statistics for the MEANS procedure?

Ans:

N (count), mean, standard deviation, minimum, and maximum.

70. How do you limit the decimal places for a variable using PROC MEANS?

Ans:

By using the MAXDEC= option.

71. What is the difference between the CLASS statement and the BY statement in PROC MEANS?

Ans:

  • Unlike CLASS processing, BY processing requires that the data already be sorted or indexed in the order of the BY variables.
  • BY-group results have a layout that is different from the layout of CLASS-group results.

72. What is the difference between PROC MEANS and PROC SUMMARY?

Ans:

The difference between the two procedures is that PROC MEANS produces a report by default. By contrast, to produce a report with PROC SUMMARY, you must include the PRINT option in the PROC SUMMARY statement.
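
The default behavior of the two procedures can be compared with a sketch on sashelp.class; the output data set and statistic names are assumptions.

```sas
/* PROC MEANS prints a report by default */
proc means data=sashelp.class;
    class sex;
    var height weight;
run;

/* PROC SUMMARY prints nothing unless PRINT is specified;
   here the statistics go to an output data set instead */
proc summary data=sashelp.class;
    class sex;
    var height weight;
    output out=work.class_stats mean=mean_height mean_weight;
run;
```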

73. How do you specify the variables to be processed by the FREQ procedure?

Ans:

By using the TABLES statement.

74. Describe the CROSSLIST option in the TABLES statement.

Ans:

Adding the CROSSLIST option to the TABLES statement displays crosstabulation tables in ODS column format.

75. How do you create list output for crosstabulations in PROC FREQ?

Ans:

To generate list output for crosstabulations, add a slash (/) and the LIST option to the TABLES statement in the PROC FREQ step:

    TABLES variable-1*variable-2 <* … variable-n> / LIST;
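
The default, LIST and CROSSLIST layouts can be compared side by side in a sketch like this, using sashelp.class as sample data:

```sas
proc freq data=sashelp.class;
    tables sex*age;              /* default crosstabulation table */
    tables sex*age / list;       /* same counts in list format */
    tables sex*age / crosslist;  /* crosstabulation in ODS column format */
run;
```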

76. Where do you use PROC MEANS over PROC FREQ?

Ans:

Use PROC MEANS for numeric variables and PROC FREQ for categorical variables.

77. Explain how merging helps to combine data sets.

Ans:

  • Merging combines observations from two or more SAS data sets into a single observation in a new data set.
  • A one-to-one merge combines observations based on their position in the data sets. Use the MERGE statement for one-to-one merging.
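
A one-to-one merge might be sketched as follows; the two small input data sets are made up for the example.

```sas
data work.names;
    input name $;
    datalines;
Ann
Bob
;
run;

data work.scores;
    input score;
    datalines;
90
85
;
run;

/* One-to-one merge: no BY statement, so rows are paired by position */
data work.combined;
    merge work.names work.scores;
run;
```

The first observation of work.names is combined with the first observation of work.scores, the second with the second, and so on.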

78. What do you understand by bagging?

Ans:

Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also lowers variance and helps prevent overfitting.

79. What is interleaving in SAS?

Ans:

Interleaving combines individual, sorted SAS data sets into one sorted SAS data set. Observations are ordered by the value of the variable by which the data sets are sorted. You interleave data sets using a SET statement along with a BY statement.
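
Interleaving could be sketched as below, by first splitting sashelp.class into two sorted data sets (the names work.boys, work.girls and work.interleaved are assumptions):

```sas
/* Create two data sets, each sorted by age */
proc sort data=sashelp.class out=work.boys;
    where sex = 'M';
    by age;
run;

proc sort data=sashelp.class out=work.girls;
    where sex = 'F';
    by age;
run;

/* SET with BY interleaves the two sorted data sets into one sorted data set */
data work.interleaved;
    set work.boys work.girls;
    by age;
run;
```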

80. Which two tables do authorization objects need in order to be maintained?

Ans:

  • USOBT
  • USOBX

81. You want to run a regression to predict the probability of a flight delay, but there are flights with delays of up to 12 hours that are really messing up your model. How can you address this?

Ans:

This is equivalent to making the model more robust to outliers.

82. Analyze this dataset and give me a model that can predict this response variable.

Ans:

  • Start by fitting a simple model, do some feature engineering accordingly, and then try more complicated models. Always split the dataset into train, validation and test sets and use cross-validation to check performance.
  • Determine whether the problem is classification or regression.
  • Favor simple models that run quickly and can be easily explained.
  • Mention cross-validation as a means to evaluate the model.
  • Plot and visualize the data.

83. What could be some issues if the distribution of the test data is significantly different from the distribution of the training data?

Ans:

A model that has high training accuracy might have low test accuracy. Without further knowledge, it is hard to know which dataset represents the population, and thus the generalizability of the algorithm is hard to measure. This can be mitigated by repeated splitting of the train and test datasets.

A change in the data distribution is called dataset shift. If the train and test data have different distributions, the classifier would likely overfit to the train data.

84. What are some ways to make a model more robust to outliers?

Ans:

You can use regularization such as L1 or L2 to reduce variance (at the cost of increased bias).

Changes to the algorithm:

  • Use tree-based methods instead of regression methods, as they are more resistant to outliers. For statistical tests, use non-parametric tests instead of parametric ones.
  • Use robust error metrics such as MAE or Huber loss instead of MSE.

Changes to the data:

  • Winsorize the data.
  • Transform the data (e.g. log).
  • Remove outliers only if you are certain they are anomalies not worth predicting.

85. What differences would you expect in a model that minimizes squared error versus a model that minimizes absolute error? In which cases would each error metric be appropriate?

Ans:

MSE: MSE is more sensitive to outliers. Its gradient is easier to compute.

MAE: MAE is more robust to outliers, but it is harder to fit a model with because the absolute value is not differentiable at zero, so fitting typically requires linear programming or another iterative method. When there is less variability in the model and the model is computationally simple to fit, use MAE; otherwise, use MSE.

86. What error metric would you use to evaluate how good a binary classifier is? What if the classes are imbalanced? What if there are more than 2 groups?

Ans:

Accuracy: The proportion of instances predicted correctly. Pros: intuitive and simple to explain. Cons: works poorly when the class labels are imbalanced and the signal from the data is weak.

AUROC: Plot the false positive rate (FPR) on the x axis and the true positive rate (TPR) on the y axis for various thresholds. Given a random positive instance and a random negative instance, the AUC is the probability that the model can tell which is which. Pros: works well for testing the ability to distinguish the two classes. Cons: the predictions cannot be interpreted as probabilities, so it cannot explain the uncertainty of the model.
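
In SAS, the ROC curve and its area can be obtained from PROC LOGISTIC. The sketch below uses sashelp.class with sex as a stand-in binary response, which is purely an illustrative choice.

```sas
ods graphics on;

/* PLOTS(ONLY)=ROC requests the ROC curve; the area under it is also
   reported as the c statistic in the association table */
proc logistic data=sashelp.class plots(only)=roc;
    model sex(event='F') = height weight;
run;
```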

87. What are the different ways to predict a binary response variable? Can you compare two of them and tell me when one would be more appropriate?

Ans:

Logistic regression:

  • Suitable when the features are roughly linear and the problem is roughly linearly separable.
  • (+) Robust to noise; L1 or L2 regularization can be used for model selection and to avoid overfitting.
  • (+) The output comes as probabilities.
  • (+) Efficient, and the computation can be distributed.
  • (+) Can be used as a baseline for other algorithms.
  • (-) Can hardly handle categorical features.

SVM:

  • (+) With a nonlinear kernel, can deal with problems that are not linearly separable.
  • (-) Slow to train; for most industry-scale applications, not really efficient.

88. What is regularization and where might it be helpful? What is an example of using regularization in a model?

Ans:

Regularization is useful for reducing variance in a model, i.e. avoiding overfitting. For example, you can use L1 regularization in Lasso regression to penalize large coefficients.
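
A Lasso fit can be sketched in SAS with PROC GLMSELECT; the sample data set and the choice of variables are only for illustration.

```sas
/* SELECTION=LASSO applies an L1 penalty, shrinking coefficients and
   dropping predictors whose coefficients are shrunk to zero */
proc glmselect data=sashelp.class;
    model weight = height age / selection=lasso;
run;
```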

89. Why might it be preferable to include fewer predictors rather than many?

Ans:

  • Adding irrelevant features increases the model's tendency to overfit because those features introduce more noise. When two variables are correlated, they might be harder to interpret in the case of regression, etc.
  • Curse of dimensionality.
  • Adding random noise makes the model more complicated but no more useful.
  • Computational cost.
  • Ask for more details.

90.Given training data on tweets and their retweets, how would you predict the number of retweets of a given tweet after 7 days, having observed only 2 days’ worth of data?

Ans:

  • Build a time series model on the training data with a seven-day cycle, then apply it to new tweets for which only 2 days of data are available.
  • Ask for more details.
  • Build a regression function to estimate the number of retweets as a function of time.
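One simple way to make the regression idea concrete is sketched below. It is an assumption-laden illustration, not the only reading of the answer: it regresses the 7-day retweet count on the 2-day count using ordinary least squares over historical tweets. The training numbers are made up, and the helper names `fit_line` and `predict_day7` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_day7(model, retweets_day2):
    """Apply the fitted line to a tweet observed for only 2 days."""
    slope, intercept = model
    return slope * retweets_day2 + intercept


# Toy, made-up training data: retweet counts after 2 days and after 7 days.
day2_counts = [10, 20, 30, 40]
day7_counts = [15, 30, 45, 60]

model = fit_line(day2_counts, day7_counts)
forecast = predict_day7(model, 50)
print(model, forecast)
```

In practice the relationship is unlikely to be this clean, so extra features (follower count, posting time) or a transformation of the counts would probably be needed.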

91.How could you collect and analyze data to use social media to predict the weather?

Ans:

  • Collect social media data using the Twitter, Facebook, and Instagram APIs. Then, taking Twitter as an example, construct features from each tweet, such as the tweet date, the number of favorites and retweets, and of course features created from the tweet content itself.
  • Then use a multivariate time series model to predict the weather.
  • Ask for more details.

92.How would you construct a feed to show relevant content for a site that involves user interactions with items?

Ans:

You can do so by building a recommendation engine. The easiest approach is to show content that is popular with other users, which is still a valid strategy if, for example, the content consists of news articles. To be more accurate, you can build content-based filtering or collaborative filtering. If there is enough user usage data, try collaborative filtering and recommend content that other, similar users have consumed. If there isn’t, recommend similar items based on a vectorization of the items (content-based filtering).
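The content-based option can be sketched as follows. This is a minimal illustration, assuming each item has already been turned into a numeric vector; the item names, the vectors, and the `recommend` helper are all hypothetical. Unseen items are ranked by their cosine similarity to the closest item the user has already consumed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(item_vectors, consumed, top_k=2):
    """Rank unseen items by similarity to the most similar consumed item."""
    scores = {}
    for item, vec in item_vectors.items():
        if item in consumed:
            continue
        scores[item] = max(cosine(vec, item_vectors[c]) for c in consumed)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy, made-up item vectors (for example topic weights: sports, finance, food).
item_vectors = {
    "nba_recap": [1.0, 0.0, 0.0],
    "nfl_preview": [0.9, 0.1, 0.0],
    "stock_update": [0.0, 1.0, 0.0],
    "recipe": [0.0, 0.0, 1.0],
}

top_pick = recommend(item_vectors, {"nba_recap"}, top_k=1)
print(top_pick)  # ['nfl_preview']
```

Collaborative filtering would use the same similarity idea, but on vectors of user interactions instead of item content.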

93.How would you design a "People You May Know" feature on LinkedIn or Facebook?

Ans:

  • Find strongly related but unconnected people in a weighted connection graph.
  • Define similarity as how strongly two people are connected.
  • Given a certain feature, calculate the similarity based on friend connections (neighbors).
  • Check-ins: people being at the same location all the time.
  • Same college or workplace.
  • Randomly drop edges from the graph and test how well the algorithm recovers them.
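The neighbor-based similarity in the list above can be sketched with a common-neighbors count. This is only one of several possible similarity measures, and the graph, the names, and the `suggest` helper below are made up for illustration.

```python
from collections import Counter


def suggest(graph, person, top_k=3):
    """Rank non-friends by the number of friends they share with person."""
    friends = graph[person]
    shared = Counter()
    for friend in friends:
        for candidate in graph[friend]:
            if candidate != person and candidate not in friends:
                shared[candidate] += 1
    return shared.most_common(top_k)


# Toy, made-up undirected friendship graph stored as adjacency sets.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan", "eve"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
    "eve": {"bob"},
}

suggestions = suggest(graph, "ann")
print(suggestions)  # [('dan', 2), ('eve', 1)]
```

To evaluate such a scorer, known edges can be hidden from the graph and the ranking checked for how often the hidden connections reappear near the top.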

94.How would you predict who someone may want to send a Snapchat or Gmail message to?

Ans:

  • Ask for more details.
  • Rank the people to whom the user has sent the most messages in the past, applying a time decay so that recent messages count more.
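A time-decayed frequency score can be sketched as below. The exponential decay and the 30-day half-life are assumptions chosen for illustration, as are the message log and the `rank_recipients` helper; the source does not specify a particular decay function.

```python
import math
from collections import defaultdict


def rank_recipients(sent_log, now, half_life_days=30.0):
    """Score each recipient by message count, discounting older messages."""
    decay_rate = math.log(2) / half_life_days
    scores = defaultdict(float)
    for recipient, day_sent in sent_log:
        age = now - day_sent
        scores[recipient] += math.exp(-decay_rate * age)
    return sorted(scores, key=scores.get, reverse=True)


# Toy, made-up log of (recipient, day the message was sent).
sent_log = [
    ("old_friend", 0), ("old_friend", 0), ("old_friend", 0),
    ("coworker", 95), ("coworker", 99),
]

ranking = rank_recipients(sent_log, now=100)
print(ranking)  # ['coworker', 'old_friend']
```

Here the coworker ranks first despite fewer total messages, because the three messages to the old friend are 100 days old and have decayed to a small weight.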

95.How would you suggest to a franchise where to open a new store?

Ans:

  • Build a master dataset with the local demographic information available for each location.
  • Include local income levels, proximity to traffic, weather, population density, and proximity to other businesses.
  • Add a reference dataset on local, regional, and national macroeconomic conditions.

96.In a search engine, given partial data on what the user has typed, how would you predict the user’s eventual search query?

Ans:

Based on how often words have followed a given sequence of words in the past, you can construct conditional probabilities for the set of word sequences that can come next (n-gram). The sequences with the highest conditional probabilities can be shown as the top candidates.
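The simplest case of this idea is a bigram model, which conditions only on the last word typed. The sketch below is a minimal illustration with made-up past queries; the function names `train_bigrams` and `suggest_next` are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigrams(queries):
    """Count how often each word follows each other word in past queries."""
    counts = defaultdict(Counter)
    for query in queries:
        words = query.lower().split()
        for prev_word, next_word in zip(words, words[1:]):
            counts[prev_word][next_word] += 1
    return counts


def suggest_next(counts, typed, top_k=2):
    """Return the most likely next words with their conditional probabilities."""
    words = typed.lower().split()
    if not words or words[-1] not in counts:
        return []
    following = counts[words[-1]]
    total = sum(following.values())
    return [(word, n / total) for word, n in following.most_common(top_k)]


# Toy, made-up query log.
past_queries = [
    "weather in boston",
    "weather in boston",
    "weather in chicago",
    "hotels in boston",
]

counts = train_bigrams(past_queries)
candidates = suggest_next(counts, "weather in")
print(candidates)  # [('boston', 0.75), ('chicago', 0.25)]
```

Conditioning on more than one previous word (trigrams and beyond) uses the same counting scheme with longer keys, at the cost of needing much more data.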

97.How would you build a model to predict a March Madness bracket?

Ans:

Represent team A and team B by one feature vector each. Take the difference of the two vectors and use it as the input to a model that predicts the probability that team A wins. Train the model on past tournament data, then make predictions for a new tournament by running the trained model on each round of the tournament.
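The feature-difference approach can be sketched with a small logistic regression trained by stochastic gradient ascent on the log-likelihood. Everything specific here is an assumption for illustration: the source does not name a model, the two team features are made up, and so are the past games. The model has no intercept, which keeps it symmetric: P(A beats B) and P(B beats A) sum to 1.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(diffs, labels, lr=0.1, epochs=500):
    """Logistic regression without intercept on feature differences."""
    w = [0.0] * len(diffs[0])
    for _ in range(epochs):
        for x, y in zip(diffs, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j in range(len(w)):
                w[j] += lr * (y - p) * x[j]  # gradient step on the log-likelihood
    return w


def win_probability(w, team_a, team_b):
    """Predicted probability that team_a beats team_b."""
    diff = [a - b for a, b in zip(team_a, team_b)]
    return sigmoid(sum(wi * di for wi, di in zip(w, diff)))


# Toy, made-up past games: (team A features, team B features, 1 if A won).
past_games = [
    ([0.9, 0.8], [0.4, 0.5], 1),
    ([0.3, 0.4], [0.7, 0.9], 0),
    ([0.8, 0.6], [0.5, 0.5], 1),
    ([0.2, 0.5], [0.6, 0.4], 0),
]
diffs = [[a - b for a, b in zip(ta, tb)] for ta, tb, _ in past_games]
labels = [won for _, _, won in past_games]

weights = train(diffs, labels)
strong, weak = [0.9, 0.7], [0.3, 0.4]
p_strong = win_probability(weights, strong, weak)
p_weak = win_probability(weights, weak, strong)
print(p_strong, p_weak)
```

To fill in a bracket, the predicted winner of each game would be advanced and the model applied again to the next round's matchups.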

98.Is regression a predictive model?

Ans:

Regression analysis is a form of predictive modelling that investigates the relationship between a dependent (target) variable and one or more independent (predictor) variables. This technique is used for forecasting, time series modelling, and finding cause-and-effect relationships between variables.

99.What is an example of predictive modeling?

Ans:

Examples include using neural networks to predict which winery a glass of wine originated from, or bagged decision trees to predict the credit rating of a borrower. Predictive modeling is often performed using curve and surface fitting, time series regression, or machine learning approaches.

100.What is the goal of predictive modeling?

Ans:

Predictive modeling is a commonly used statistical technique for predicting future behavior. Predictive modeling solutions are a form of data-mining technology that works by analyzing historical and current data and generating a model to help predict future outcomes.
