Preparing for a Data Interview

Published in

Analytics Vidhya

21 min read · Jan 8, 2021

The interview process can be very intimidating (trust me, I’m in the middle of it right now). Job descriptions can be so vague that it is tough to know where to even begin. In this blog post we will go over how to search for a job, how to network effectively, and possible interview questions and answers to prepare you for your first interview.

The Job Search

The first step in an effective job search is accepting that the search will not be easy. A job search is a full-time commitment. If your plan is to simply apply to a couple of job posts via LinkedIn, I would think again. The goal is to show you are knowledgeable in the field of data. To do this, you need to develop a plan to network, stay current on technology, and show that you have the necessary soft skills for success.

Effective Networking

Let’s begin with my least favorite aspect of the job search: networking. If you are an introvert (much like myself), this may be the toughest part of the job search for you. Networking means taking a step outside of your comfort zone to connect with others in the field you want a career in. There are three main types of networking: Connected Networking, Cold Networking, and Meet Ups.

Connected Networking: This form of networking is probably the easiest. Connected Networking means using the people around you to connect you to others within your field. The key here is to not rule anyone out. Just yesterday I contacted one of my friends whose career has nothing to do with tech, but through that one phone call he connected me with three different people at three different companies in the tech field. By simply asking friends and family for connections, you are able to create valuable contacts. Once you receive a contact from a friend or family member, reach out and ask for a 10–15 minute phone call to learn more about their career. Great questions to ask could be…

How did you get started in your role as ______?
What does a typical day look like in your role?
What types of projects are you currently working on?
As an aspiring _____, what skills and experiences do you think I need to obtain a job?
Did you have a formal training or did you learn on the job?
(if person had the same job title at more than one company) How did your job as a _____ differ between your two employers?
How did you end up at your current company?
Do you work more independently or as a team?
What is a piece of advice you would give an aspiring _____?
If you were the hiring manager, what are the top 3 traits you would be looking for in an interviewee?

Cold Networking: This type of networking is reaching out to individuals whom you don’t already know. Examples of these contacts could be someone who has the role you want at the company you want. When reaching out DO NOT ask for a job. The goal of cold networking is the same as Connected Networking: to build a relationship with the individual. Setting up a phone call or coffee chat (I write this during the COVID pandemic so currently a phone call) can help you get to know the person and learn what that specific company wants and needs. If you are lucky, this cold outreach could potentially lead to a job at your dream company. Examples of cold outreaches would be messaging via LinkedIn or finding the individual’s company e-mail address and reaching out. Remember to always keep your e-mails/messages professional and to always have an appropriate ask within the e-mail. An example of an appropriate ask would be if they have time for a short phone call to discuss their career. An inappropriate ask would be if they can get you an interview. Cold Networking is slightly tougher than Connected Networking, but it has the potential to link you to a specific company.

Meet Ups: Another great way to network is Meet Ups. Currently most in-person Meet Ups have been cancelled and are instead taking place on a virtual platform. This is another way to connect with people in your field. Looking at Meetup.com today, there are countless virtual Meet Ups going on this week on the topics of data science, analytics, and Python. If you connect with someone, don’t be afraid to ask for their contact information in order to stay in touch.

An important note on networking: if you are successful and are able to chat with someone over the phone or over coffee, ALWAYS write a thank you letter within 24 hours. Whoever you spoke to took time out of their busy day to talk to you. The least you can do is formally thank them.

Staying current on Tech

Technology is constantly changing. Python was first released in 1991, yet it did not win TIOBE’s programming language of the year award until 2007. During those 16 years Python did not stay stagnant. It was constantly updated and improved for easier use, which is what has made it such a popular language today. Just as Python keeps evolving, so must the data scientist. If you can show the interviewer how you are staying current, you will be much more likely to get the job over someone who is stagnant. Below are ideas on how you can stay current on your data skills:

Podcasts:

The Python Podcast.__init__ → podcast about Python
Chai Time Data Science → Series where Sanyam Bhutani interviews his Data Science Heroes
Towards Data Science → A Medium publication sharing concepts, ideas, and codes.
Data Science Imposters Podcast → Explore data science, analytics, big data, machine learning as discussed within the podcast
The Real Python Podcast → A weekly Python podcast hosted by Christopher Bailey with interviews, coding tips, and conversations with guests from the Python community.
Data36 Data Science Podcast → Podcast great for aspiring or junior data science professionals
Learn to Code → Hosted by Chris Castiglione. Chris interviews successful business founders, startups, and programmers to ask them: How did you learn to code? What tips and tricks do you have for finding meaningful work?
Python Course → Intro to Learning Python

**All descriptions taken via the Podcast’s summary on Spotify

Meet Ups: Once again we see the importance of Meet Ups. Not only are Meet Ups used as a form of Networking, but they can also be used as a learning tool. There are many different forms where one can learn new skills or simply talk to others in the data industry for new ideas.

GitHub Projects: In the words of my fourth grade teacher, “Find something you are passionate about and you will never work a day in your life.” This is the time to explore your passions and to show that you are still exploring the field of data science. Take a deep dive into data that you find interesting. For example, I love science. As an ex-science teacher I can deep dive into this content for hours on end. A new project I am thinking about is something using NASA’s Open API. This allows me to explore my interests while expanding my GitHub projects. When creating GitHub projects, try to pretend you are building them for a client. Always include proper markdown, a presentation, and a thorough README explaining what each project is and how someone would be able to replicate it.

Hackathons: Hackathons are a great way to continue learning and to meet other people in the industry. If you google “Hackathon”, countless opportunities pop up. Some are specific to a region or to where you are in life (ex. just for high schoolers or just for university students). A site such as devpost.com lets you sign up to hear about new hackathons. When you sign up, it asks about your interests and which areas of tech you are confident in. The website supports you as a learner: you can learn, create, or compete in hackathons. It may be a step outside of your comfort zone, but I encourage you to try!

Books:

Naked Statistics: This is my current read where Charles Wheelan “stripped the dread from data”. In his book Charles almost makes statistics simple to understand using real life examples (some rather comical I might say) to help one understand the intimidating topic of statistics.
Data Science From Scratch: This is on my list of books to read. Joel Grus breaks down data science to the nitty gritty. Instead of using each data science library Grus breaks down the problem and implements it from scratch providing the reader with a deeper understanding of what is occurring.

There are countless other books out there to read on data science. (I personally just took a deep dive and added about 5 books to my list.) I won’t recommend any more books than this because I’m a newbie and haven’t read them yet! If you have recommendations on great data science books, please leave a comment at the bottom of this post and I will happily add them to my list!

Blogging: When I say blogging here, there are actually two parts to it: following informational blogs that allow you to grow in the field, and creating your own blog.

Following informational bloggers:

Data Science Central → The industry’s online resource for data practitioners to come and blog.
Kaggle Winner’s Blog → This blog is posted through Medium and features interviews with top Kagglers about their projects and career paths.
Simply Statistics → “Written by three biostatistics professors who are fired up about the new era where data are abundant and statisticians are scientists.”
Madhavthaker.com → This is a data scientist I have been closely following myself. “An educational forum for all aspiring data scientists.” Madhav has Q&A content, blog posts, and a YouTube channel where he goes over anything and everything data science.
Analytics Vidhya → Another one of my favorites, Analytics Vidhya has blog posts written by different individuals in the data science community. This site also has free courses, hackathons, and job postings. If you are feeling confident, you can even compete in the Data Science Blogathons with your own content.

**Some of the above descriptions were summarized via the about section of the blogs.

Creating your own blog:

Creating your own blog is helpful in so many ways. First, you learn new content about data science and are able to dive deeper into understanding the material. As an ex-teacher I truly believe the statement, “You don’t truly know a topic until you can teach it to another person.” Breaking down a topic so that another person can read and understand it not only helps the data science community, but also helps you. Another benefit of creating your own blog is that it is another source of networking. Share your blog posts via LinkedIn, then read and respond to others’ posts. This is an opportunity to not only share what you know, but to learn from others.

Soft Skills

You could be an amazing data scientist, but if you don’t have the soft skills to interview and communicate with your future co-workers it may be tough to land a job. The number one soft skill to have is the ability to communicate. As a data scientist you work on complex projects that others in your company might not be able to understand without proper explanation. Being able to present a topic or an idea is essential. Ways to improve on your communication skills include: taking a public speaking class, inviting your non-tech friends over to practice explaining topics, practicing written communication via emails, and non-verbal communication such as handshakes and eye contact. Other soft skills that are important to have include the ability to work on a team, leadership, critical thinking, work ethic, adaptability, and reliability.

Example Interview Questions and Tips

Non-Technical Interview Questions:

1. Tell me about yourself:

Let me start off by saying that this is not an opportunity to discuss your out-of-work hobbies with the interviewer. “Tell me about yourself” is the perfect opportunity to discuss your fit in the company without coming right out and saying it. In a minute or so, share your past experience, what led you to this interview, and where you would like to be in the future. Ex. “I spent the last five years teaching, and although I loved my job I was ready for a change into a career where I could continue to push myself and grow. After talking to my husband I realized I always had a passion for data and problem solving. While teaching I led the data collection and was always entranced by what I could solve. Therefore I chose to leave the teaching field and pursue a new passion for data, and enrolled in a data science bootcamp. It is here that I learned so many new concepts and was able to apply them to complex projects. I am hoping to find a company where I can continue to grow not only in my technical skills, but as a team player as well.”

2. Why are you interested in working for us?

This is basically asking “Did you do your homework?” and “Do you actually care about this company or do you just want a job?” Before interviewing with a company, take time to look at what they have accomplished. What awards have they received? How have they positively impacted the community? What do you love about their projects? Take time to state what makes that company great, and also try to plug in how YOU would love to be a part of their goal and mission.

3. What are your biggest weaknesses?

This is another question whose answer I would prepare ahead of time. The goal here is to take your weakness and make it a strength. For teaching, my answer to the question was, “I love spending time with the kids. When they visit after school I have a hard time kicking them out in order to go home.” See how this answer is a weakness, but it’s also a positive for that specific career? It shows passion for the career. Take time to think about your data career: what could be a weakness that is also a positive? Do you get lost in the data and spend hours looking at one item? Is it that you feel like you never know enough and are constantly researching new libraries?

4. Tell me about a time where there was a challenge or confrontation at work, what did you do about it?

Aka are you a team player? If there are disagreements can you be the peace keeper? Do you push the blame on others or do you push to solve the issue? How do you solve problems?

5. What is your expected salary?

I know a lot of people answer here with “I’m negotiable”, but I really don’t think that is the best answer. Do your research. What are others in the same role making? What is your goal salary? If you do research on a job and you want to make $75,000, respond with a range that includes that number. “Based on my research for the job title of X in the city we live in, I would like to make between $70,000 and $90,000.” I always like the second number to be a little bit high, just in case they lowball. Providing a range shows that you have done your research and are once again prepared for the job.

Other pieces of advice for your non-technical interview:

RESEARCH THE COMPANY. In detail. What is the company doing now? What questions do you have about the company? Why are you a great fit?
Make it a conversation → No one likes an interview where it is simply a one way Q&A session. Ask questions back DURING the interview. Don’t save them all for the end. If the interviewer brings up a topic you are interested in, ask right away!
Use the STAR method to respond. This is extremely useful in a data science interview. Tell the interviewer the situation, task, action, and final result. This allows for detail and gives you more room to talk about what you know and have accomplished.
Ask “What does it look like to meet expectations in this role? What does it look like to exceed expectations in this role?” If you do get the job, this question is already setting you up for success! You will have a list of items that your company already told you they desire. What an awesome question!
Before the interview research questions they may ask and practice the answer out loud to yourself, a roommate, or a spouse.
Smile.
Always send a thank you e-mail within 24 hours of the interview.

Technical Interview Questions:

Probability/Problem Solving:

1. What is the probability that I select two cards of the same kind (ex. 2 kings or 2 sevens) from a full deck?

Great probability questions are those having to do with flipping a quarter or using a deck of cards like above. Let’s break down this question: The first card you grab does not matter. It can be a 4 or a queen; no matter what the first card is, it is 100% a card I can use. The real importance comes with the second card. If my first card was the four of hearts, that means I need to grab the four of spades, clubs, or diamonds. How many cards are in the deck now? 52 - 1 = 51 cards. How many fours are left? 3. Therefore the probability that I grab a four is 3/51 (which simplifies to 1/17), and that is also the answer to the question.
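If you want to double-check an answer like this, you can brute-force it by listing every possible pair of cards. The snippet below is just a sanity check I would run for myself; the way the deck is represented (rank and suit tuples) is my own choice for illustration.

```python
from fractions import Fraction
from itertools import permutations

# Build a 52-card deck as (rank, suit) pairs; this layout is an assumption for the demo.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "spades", "clubs", "diamonds"]
deck = [(rank, suit) for rank in ranks for suit in suits]

# Every ordered way to draw two different cards: 52 * 51 = 2652 pairs.
draws = list(permutations(deck, 2))
same_rank = sum(1 for first, second in draws if first[0] == second[0])

probability = Fraction(same_rank, len(draws))
print(probability)  # 1/17, which is 3/51 in lowest terms
```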

2. A watermelon must be cut into 8 equal parts using 3 cuts. How is this possible?

This is a common problem solving question. Take time to verbalize your thoughts on the problem, and if allowed perhaps draw out the answer for your interviewer. The answer here would be to cut the watermelon in half horizontally and then to make an X cut vertically. There you go, 8 equal pieces!

3. There is a fair coin (one side heads, one side tails) and an unfair coin (both sides tails). You pick one at random, flip it 5 times, and observe that it comes up as tails all five times. What is the chance that you are flipping the unfair coin?

Let’s first think about the probability of flipping tails 5 times in a row with the fair coin. The probability of tails on one flip is 1/2, so five in a row is (1/2)**5 or 1/32. With the unfair coin, tails comes up 100% of the time, so the probability of five tails is 1. Since the coin is chosen at random, you have a 1/2 chance of holding the unfair coin and a 1/2 chance of holding the fair coin. Call T the event of seeing five tails in a row and U the event of holding the unfair coin. The overall probability of T is (1/2 * 1) + (1/2 * 1/32) = 33/64, or about 52%. Now we use Bayes’ Law: 𝑃(𝑈∣𝑇)𝑃(𝑇)=𝑃(𝑇∣𝑈)𝑃(𝑈).

𝑃(𝑈∣𝑇)*(33/64)=(1)*(1/2)

𝑃(𝑈∣𝑇) = 32/33, or about 97%
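The same calculation can be written out in a few lines of Python. This is only a sketch to confirm the arithmetic; the variable names are mine, and I use exact fractions so nothing gets lost to rounding.

```python
from fractions import Fraction

# Prior: each coin is equally likely to be picked.
p_unfair = Fraction(1, 2)
p_fair = Fraction(1, 2)

# Likelihood of seeing five tails in a row with each coin.
p_tails_given_unfair = Fraction(1, 1) ** 5  # both sides are tails
p_tails_given_fair = Fraction(1, 2) ** 5    # 1/32

# Total probability of five tails, then Bayes' rule.
p_tails = p_unfair * p_tails_given_unfair + p_fair * p_tails_given_fair
posterior_unfair = p_unfair * p_tails_given_unfair / p_tails

print(p_tails)           # 33/64
print(posterior_unfair)  # 32/33
print(round(float(posterior_unfair), 2))  # 0.97
```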

4. Two fair dice are rolled. What is the probability that the two dice sum to 6?

Ways to roll a six:

Die A: 1, Die B: 5
Die A: 2, Die B: 4
Die A: 3, Die B: 3
Die A: 4, Die B: 2
Die A: 5, Die B: 1

There are 5 possible ways for the dice to sum to six, out of a total of 36 outcomes (6*6). Therefore the probability of rolling a sum of six is 5/36.
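Counting by hand works for small problems, but it is easy to miss a combination. Here is a small sketch (my own illustration, not something an interviewer would require) that lists every outcome and counts the ones that sum to six.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes of rolling two six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))
sums_to_six = [roll for roll in outcomes if sum(roll) == 6]

print(sums_to_six)  # [(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)]
print(Fraction(len(sums_to_six), len(outcomes)))  # 5/36
```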

Statistics:

1. Explain the Central Limit Theorem.

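The Central Limit Theorem says that if you take many random samples of size n from a population with mean μ and standard deviation σ, the distribution of the sample means will be approximately normal, with mean μ and standard deviation σ/√n, as long as n is large enough (n ≥ 30 is a common rule of thumb). This holds even when the population itself is not normally distributed.

Example: Farmer Joe is known for having the largest pumpkins in town. Most of his pumpkins are over 10 lbs! For his pumpkin patch the mean weight of the pumpkins is 10.4 lbs with a standard deviation of .5 lbs. Betsy tries to sell you an 8 lb pumpkin. Did this come from Joe’s farm? If his pumpkin weights are roughly normally distributed, about 99.7% of them fall within 3 standard deviations of the mean (8.9 to 11.9 lbs). An 8 lb pumpkin is 4.8 standard deviations below the mean, so we can be very confident that it did NOT come from Farmer Joe. (Strictly speaking, this example leans on the empirical rule for a normal distribution, which is the shape the Central Limit Theorem leads to.)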
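To see the Central Limit Theorem in action, you can simulate it. The sketch below uses a deliberately skewed, made-up population (an exponential distribution with mean 1 and standard deviation 1) and checks that the sample means behave the way the theorem predicts. The sample size, number of samples, and seed are arbitrary choices of mine.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Skewed population: exponential with mean 1 and standard deviation 1.
population_sd = 1.0
sample_size = 50
num_samples = 2000

# Draw many samples and record the mean of each one.
sample_means = [
    statistics.mean([random.expovariate(1.0) for _ in range(sample_size)])
    for _ in range(num_samples)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_sd = population_sd / sample_size ** 0.5  # sigma / sqrt(n), about 0.141

print(round(mean_of_means, 3))  # should land close to the population mean of 1
print(round(sd_of_means, 3))    # should land close to expected_sd
print(round(expected_sd, 3))
```

If you plot a histogram of sample_means you should see a bell shape, even though the population itself is far from symmetric.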

2. Differentiate between a Type 1 and Type 2 error.

Type 1 Error: False positive (rejecting a null hypothesis that is actually true) → In health care, an example is a false positive mammogram. This could give the patient unneeded fear that they have breast cancer and lead to unnecessary biopsies.

Type 2 Error: False negative (failing to reject a null hypothesis that is actually false) → False negatives can be even more harmful within health care. If someone has breast cancer but receives a false negative, they will not get the proper treatment.

3. Explain selection bias.

Selection bias occurs when the participants selected for a study (or assigned to its groups) differ systematically from the target population, so the sample does not accurately reflect it. Examples of selection bias include self-selection, the researcher deciding who will be tested and how, pre-screening of trial participants, time bias, and exposure bias. It is important to overcome selection bias in order to ensure your study is valid. Ideas to combat selection bias include larger sample groups, matching individuals in the study and control groups as closely as possible, conducting blind studies, and using technology to randomly select participants.

4. Explain p-value and statistical significance.

The p-value is the probability of seeing a result at least as extreme as the one observed, assuming the null hypothesis is true. The main question here is, “Is your p-value less than your alpha value?” The most common alpha is .05, although depending on the situation it could be different. If your p-value is lower than your alpha, you reject the null hypothesis and the result is said to be statistically significant. If p is higher than the alpha, you fail to reject the null hypothesis.
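A tiny worked example can make this concrete. Suppose (this scenario is made up) you flip a coin 20 times and get 16 heads, and your null hypothesis is that the coin is fair. The sketch below computes an exact two-sided p-value from the binomial distribution and compares it to an alpha of .05.

```python
from math import comb

flips = 20
heads_observed = 16
alpha = 0.05

def prob_heads(k):
    """Probability of exactly k heads in 20 flips of a fair coin."""
    return comb(flips, k) / 2 ** flips

# One tail: results at least as extreme as 16 heads (16, 17, ..., 20).
upper_tail = sum(prob_heads(k) for k in range(heads_observed, flips + 1))

# Two-sided p-value: the fair-coin distribution is symmetric, so double the tail.
p_value = 2 * upper_tail

print(round(p_value, 4))  # 0.0118
print(p_value < alpha)    # True, so we reject the null hypothesis
```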

Python:

1. Using Python, remove any duplicates from the list. a = [2, 2, 3, 4, 5, 5, 5, 6, 7, 7]

There are a couple of ways you can solve this problem. The first uses only Python, and the second shows how to solve it with pandas.

# using python (keeps the original order)
new = []
for i in a:
    if i not in new:
        new.append(i)
a = new

# using pandas
import pandas as pd
a = pd.DataFrame(a)
a.drop_duplicates(inplace=True)  # a is now a DataFrame with one row per unique value
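Two shorter options are worth knowing in case the interviewer asks for a more Pythonic answer. This is a sketch of alternatives rather than the one right answer; which you pick depends on whether the original order matters.

```python
a = [2, 2, 3, 4, 5, 5, 5, 6, 7, 7]

# dict keys are unique and keep insertion order (Python 3.7+).
ordered_unique = list(dict.fromkeys(a))
print(ordered_unique)  # [2, 3, 4, 5, 6, 7]

# A set also removes duplicates, but it does not promise any order,
# so sort the result if order matters.
sorted_unique = sorted(set(a))
print(sorted_unique)  # [2, 3, 4, 5, 6, 7]
```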

2. Identify some commonly used built-in Python modules.

os
datetime
math
itertools
random
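It helps to be able to say what each module is for, not just name it. Here is a one-line taste of each; the file path and the date are made-up inputs for illustration.

```python
import datetime
import itertools
import math
import os
import random

# os: work with paths and the file system (this file name is made up).
print(os.path.join("data", "raw", "sales.csv"))

# datetime: dates and times, here the day of the week for a given date.
print(datetime.date(2021, 1, 8).strftime("%A"))

# math: mathematical functions.
print(math.sqrt(16))  # 4.0

# itertools: tools for looping over combinations, permutations, and more.
pairs = list(itertools.combinations("ABC", 2))
print(pairs)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]

# random: pseudo-random numbers and choices.
random.seed(0)
flip = random.choice(["heads", "tails"])
print(flip)
```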

3. What are docstrings and what is the proper way to write them?

Docstrings are documentation strings enclosed in triple quotes, usually used to explain a function. You may have one-line docstrings or multiline docstrings. The formal way to write a docstring is shown below:

# one line
def function(arg1, arg2, arg3):
    """Summary: do this and return this."""

# multiline
def function(arg1, arg2, arg3):
    """Summary line.

    Extended description of function.

    Keyword Arguments:
        arg1 (int): Description of arg1
        arg2 (str): Description of arg2
        arg3 (float): Description of arg3

    Returns:
        bool: Description of return value

    Examples:
        Examples should be written in doctest format, and should illustrate how to use the function.

        >>> function(a, b, 3)

    """

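To make the template above less abstract, here is a small filled-in example. The function is_even is hypothetical (I made it up for this post), and I run its docstring examples with the standard library's doctest module to show that they double as tests.

```python
import doctest


def is_even(number):
    """Return True if number is even.

    Args:
        number (int): The value to check.

    Returns:
        bool: True when number is divisible by 2, otherwise False.

    Examples:
        >>> is_even(4)
        True
        >>> is_even(7)
        False
    """
    return number % 2 == 0


# The docstring is stored on the function and is what help(is_even) displays.
print(is_even.__doc__.splitlines()[0])  # Return True if number is even.

# Run the examples in the docstring; nothing is printed when they all pass.
doctest.run_docstring_examples(is_even, {"is_even": is_even})
```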
4. What is the difference between lists and tuples in Python?

Lists are mutable, whereas tuples are immutable.

my_list = ['apple', 5, 'orange']

my_tuple = ('apple', 5, 'orange')
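If the interviewer asks you to show what mutable actually means, a quick demonstration like this one works well. The variable names are my own.

```python
fruits_list = ["apple", 5, "orange"]
fruits_tuple = ("apple", 5, "orange")

# Lists can be changed in place.
fruits_list[1] = "banana"
fruits_list.append("pear")
print(fruits_list)  # ['apple', 'banana', 'orange', 'pear']

# Tuples cannot: item assignment raises a TypeError.
try:
    fruits_tuple[1] = "banana"
    tuple_changed = True
except TypeError as error:
    tuple_changed = False
    print("TypeError:", error)

# Because tuples are immutable (and hashable when their items are),
# they can be used as dictionary keys, while lists cannot.
prices = {("apple", "red"): 0.5}
print(prices[("apple", "red")])  # 0.5
```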

SQL:

1. Explain what SQL is.

SQL stands for Structured Query Language. It is a type of programming language that communicates with the Database. SQL can be used to retrieve, insert, delete, and update information from the Database.

2. Explain the different types of JOINS.

Inner Join: Returns records that have matching values in both tables.
Left (Outer) Join: Returns all records from the left table and the matched records from the right table
Right (Outer) Join: Returns all records from the right table and the matched records from the left table.
Full (Outer) Join: Returns all records from both tables, matching rows where possible and filling in NULL where there is no match on the other side.
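The easiest way to get comfortable with joins is to run them on a tiny table. The sketch below uses Python's built-in sqlite3 module with two made-up tables (employees and departments are my own example data). I only show the inner and left joins here because older SQLite versions do not support RIGHT or FULL joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT, dept_id INTEGER);
    CREATE TABLE departments (dept_id INTEGER, dept_name TEXT);
    INSERT INTO employees VALUES (1, 'Ana', 10), (2, 'Ben', 20), (3, 'Cy', NULL);
    INSERT INTO departments VALUES (10, 'Data'), (30, 'Sales');
""")

# Inner join: only employees whose dept_id has a match in departments.
inner_rows = conn.execute("""
    SELECT e.name, d.dept_name
    FROM employees AS e
    INNER JOIN departments AS d ON e.dept_id = d.dept_id
    ORDER BY e.id
""").fetchall()
print(inner_rows)  # [('Ana', 'Data')]

# Left join: every employee, with None where no department matches.
left_rows = conn.execute("""
    SELECT e.name, d.dept_name
    FROM employees AS e
    LEFT JOIN departments AS d ON e.dept_id = d.dept_id
    ORDER BY e.id
""").fetchall()
print(left_rows)  # [('Ana', 'Data'), ('Ben', None), ('Cy', None)]

conn.close()
```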

3. Using the employee table below fetch all active employees aged 30 and over.

'''SELECT Name, Age, Status
FROM employee
WHERE Status = 'Active' AND Age >= 30;'''

4. Using the table above, create another table of active employees.

'''CREATE TABLE active_employees
   AS (SELECT *
      FROM employee WHERE Status = 'Active');'''
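If you want to practice these two queries, you can run them against a throwaway database. The rows below are made up by me (they are not the table from the question), and I use Python's built-in sqlite3 module so there is nothing to install.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (Name TEXT, Age INTEGER, Status TEXT);
    INSERT INTO employee VALUES
        ('Ana', 34, 'Active'),
        ('Ben', 28, 'Active'),
        ('Cy', 45, 'Inactive'),
        ('Dee', 30, 'Active');
""")

# Active employees aged 30 and over.
active_30_plus = conn.execute("""
    SELECT Name, Age, Status
    FROM employee
    WHERE Status = 'Active' AND Age >= 30
    ORDER BY Name
""").fetchall()
print(active_30_plus)  # [('Ana', 34, 'Active'), ('Dee', 30, 'Active')]

# Create a second table from the active rows, then count them.
conn.execute("""
    CREATE TABLE active_employees AS
    SELECT * FROM employee WHERE Status = 'Active'
""")
active_count = conn.execute("SELECT COUNT(*) FROM active_employees").fetchone()[0]
print(active_count)  # 3

conn.close()
```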

Machine Learning:

1. Explain K nearest neighbors.

K nearest neighbors (KNN) is a lazy algorithm used to solve classification and regression problems. Simply put, KNN stores the dataset and, when given a new data point, assigns it to the category that is most common among its K closest neighbors. Some guidelines for KNN: all features should be on the same scale; an odd K helps avoid tied votes in two-class problems; and votes can be weighted by the distance to the neighbor (closer observations are worth more). A smaller K produces lower bias and higher variance (overfit). A larger K produces higher bias and lower variance (underfit).
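Being able to write KNN from scratch is a nice way to show you understand it. The following is a minimal sketch of the classification case with unweighted votes; the function name knn_predict and the toy data are my own, and a real project would more likely use a library implementation.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # Distance from the new point to every stored training point.
    distances = [
        (dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels.
    nearest = sorted(distances)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up): two features, already on the same scale.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ["small", "small", "small", "large", "large", "large"]

print(knn_predict(points, labels, (1.1, 1.0)))  # small
print(knn_predict(points, labels, (4.0, 4.0)))  # large
```

Notice there is no training step at all: the work happens at prediction time, which is why KNN is called a lazy algorithm.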

2. What is wrong with training and testing a machine learning model on the same data?

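If you test the model on the same data you trained it on, the evaluation is not meaningful. The model can simply memorize the training data, so it will look far more accurate than it really is, and an overfit model will go unnoticed. Hold out a separate test set (or use cross-validation) to see how the model performs on data it has not seen.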
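Here is a small experiment that shows how misleading this can be. I generate made-up data where the labels are pure coin flips, so there is nothing real to learn, and then score a 1-nearest-neighbor model on both the training data and a held-out set. The data, sizes, and seed are all arbitrary choices for illustration.

```python
import random
from math import dist

random.seed(0)

# Made-up data: random points with labels assigned by coin flip,
# so there is no real pattern for a model to learn.
points = [(random.random(), random.random()) for _ in range(300)]
labels = [random.choice([0, 1]) for _ in range(300)]

train_x, test_x = points[:200], points[200:]
train_y, test_y = labels[:200], labels[200:]


def predict_1nn(point):
    """Return the label of the closest training point (1 nearest neighbour)."""
    nearest_index = min(range(len(train_x)), key=lambda i: dist(train_x[i], point))
    return train_y[nearest_index]


def accuracy(xs, ys):
    """Share of points whose predicted label matches the true label."""
    return sum(predict_1nn(x) == y for x, y in zip(xs, ys)) / len(ys)


train_accuracy = accuracy(train_x, train_y)
test_accuracy = accuracy(test_x, test_y)

print(train_accuracy)  # 1.0, because each training point is its own nearest neighbour
print(test_accuracy)   # roughly coin-flip accuracy on data the model has not seen
```

A perfect score on the training data says nothing here. Only the held-out score reveals that the model has learned noise.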

3. Explain what regularization is and why it is useful.

Regularization is used to prevent overfitting of a model by penalizing large coefficients. Common examples of regularization are Lasso (L1 Regularization) and Ridge (L2 Regularization).

The Lasso method adds a penalty on the sum of the absolute values of the coefficients, which can drive some of them all the way to zero.
The Ridge method changes the cost function by adding a penalty on the sum of the squared coefficients, which shrinks them toward zero.
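To see the difference between the two penalties, it helps to look at the simplest possible model: one feature, no intercept. In that special case both methods have a closed-form answer. The sketch below is my own simplified illustration (the function fit_slope and the data are hypothetical), not how a library would implement it.

```python
def fit_slope(x, y, penalty="none", lam=0.0):
    """Fit y ~ w * x (one feature, no intercept) and return w."""
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    if penalty == "ridge":
        # Minimises sum((y - w*x)^2) + lam * w^2
        return sum_xy / (sum_xx + lam)
    if penalty == "lasso":
        # Minimises sum((y - w*x)^2) + lam * |w| (soft thresholding)
        shrunk = max(abs(sum_xy) - lam / 2, 0.0)
        sign = 1.0 if sum_xy >= 0 else -1.0
        return sign * shrunk / sum_xx
    return sum_xy / sum_xx  # ordinary least squares


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

print(fit_slope(x, y))                            # 2.0
print(fit_slope(x, y, penalty="ridge", lam=14))   # 1.0
print(fit_slope(x, y, penalty="lasso", lam=28))   # 1.0
print(fit_slope(x, y, penalty="lasso", lam=100))  # 0.0
print(fit_slope(x, y, penalty="ridge", lam=1000) > 0)  # True
```

With a large enough penalty Lasso sets the coefficient to exactly zero, while Ridge keeps shrinking it but never quite gets there. That is why Lasso is often used for feature selection.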

4. What is more important, precision or recall?

This is another answer where it depends on the situation. In terms of true positives (TP), false positives (FP), and false negatives (FN): precision = TP / (TP + FP) and recall = TP / (TP + FN).

Precision is the percentage of your positive results that are relevant: the ratio of correct positive predictions to all positive predictions made. False negatives do not appear in the precision formula, so precision does not take them into account. Think of a COVID-19 test: if it had a precision of 97%, that would mean 97% of its positive predictions are correct. If you are dealing with a scenario where you do not want false negatives, it is better to look at recall. If the COVID-19 test gave a high number of false negatives, that could be dangerous because people would be spreading the virus and not even know. (I caught COVID-19 from a loved one who had a false negative.) Recall is the percentage of actual positive cases that are correctly identified; its denominator includes both true positives and false negatives.
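Computing both numbers by hand on a small example is good interview practice. In this sketch the test results are made up (1 means infected, 0 means not infected), and precision_recall is a helper I wrote for illustration.

```python
def precision_recall(true_labels, predicted_labels):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up test results for ten people.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(actual, predicted)
print(round(precision, 2))  # 0.67 -> 2 of the 3 positive predictions were right
print(recall)               # 0.5  -> only 2 of the 4 infected people were caught
```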

Random:

1. What is the difference between Data Mining and Data Profiling?

Data mining discovers relationships and patterns in large data sets that are then used to make better business decisions and predict outcomes. Techniques: clustering, classification, regression, prediction, association, anomaly detection.

Data profiling is examining data from an existing source and summarizing information about the data. Techniques: Structure discovery, Content discovery, and Relationship discovery.

2. Explain the difference between univariate, bivariate, and multivariate analysis.

Univariate: summarizes only one variable at a time. Ex. Finding the average ACT score for a group of high school students.

Bivariate: examines the relationship between two variables. Ex. Age of high school students vs. their ACT score.

Multivariate: examines three or more variables at once. Ex. Studying psychological variables, public vs. private schools, and age in comparison to ACT score.

3. What do you know about Tableau vs. PowerBI?

Power BI — An easy to use platform which uses Microsoft systems such as Azure, SQL, and Excel to build data visualizations. PowerBI is free for organizations that already have Office 365 and is cheaper than Tableau.

Tableau — As of now is the more popular BI interface known for its visualization capabilities and ease of use. Tableau is a better source for use with large datasets as it allows the user to bring in data from spreadsheets, text files, and CSV. Tableau has more data storage on the cloud vs. PowerBI.

4. How would you handle missing data?

To handle missing data you need to get an overall feel as to how the removal would impact the data at large. A couple ways to handle missing data include:

Remove rows with missing values (if values are missing randomly or if you don’t lose too much data)
Build another predictive model to predict missing values
Use a model that can incorporate missing data (some tree-based methods can)
Quantitative Values could be replaced with mean or median
Qualitative Values could be replaced with most common value
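A few of the options above can be sketched with nothing but the standard library. The ages and cities below are made-up data with None marking a missing value; with pandas you would likely reach for dropna and fillna instead.

```python
import statistics

ages = [34, None, 28, 45, None, 30]
cities = ["Austin", "Boston", None, "Austin", "Denver", None]

# Option 1: drop rows where any value is missing.
rows = list(zip(ages, cities))
complete_rows = [row for row in rows if None not in row]
print(complete_rows)  # [(34, 'Austin'), (45, 'Austin')]

# Option 2: fill quantitative gaps with the median of the known values.
known_ages = [age for age in ages if age is not None]
median_age = statistics.median(known_ages)
filled_ages = [median_age if age is None else age for age in ages]
print(filled_ages)  # [34, 32.0, 28, 45, 32.0, 30]

# Option 3: fill qualitative gaps with the most common value.
known_cities = [city for city in cities if city is not None]
most_common_city = statistics.mode(known_cities)
filled_cities = [most_common_city if city is None else city for city in cities]
print(filled_cities)  # ['Austin', 'Boston', 'Austin', 'Austin', 'Denver', 'Austin']
```

Notice how dropping rows threw away four of the six records here. That is why it is worth checking how much data you lose before choosing that option.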

Other pieces of advice for your technical interview:

Don’t just respond with a number answer. Walk the interviewer through your thought process. They are more interested in how you think than in whether you give the correct response.
Another great read: Google career post interview tips to help prepare you for interviews at their company. When applying for a company take a look to see if they have something similar! https://careers.google.com/interview-tips/

Thank you for taking the time to read my blog post. If you have any questions or advice, feel free to reach out! (https://www.linkedin.com/in/laurenesser/)

Written by Lauren Esser