SafeGraph Interview Problem

About

SafeGraph provided three questions as part of the hiring process for a Technical Product Manager role. My answers to all three are below.

You can view a PDF of the questions here.

Prerequisites

Load and initialize notebook extensions, install R packages as needed, load dependencies, and set up globals.

Question 1

Question

You are going to launch a new API meant for data science users and you want to have at least one client library ready at launch. Do you build a client in R, Python or both? How do you decide?

Answer

That depends on the expected behavior and knowledge of the user base.

Initially I had these questions:

  1. What percentage of users is expected to use R vs. Python?
    1. We should be able to put together an estimate with current usage data and the expected profile of the new API users.
    2. If they are the same, then which group uses any current client libraries vs raw API calls?
  2. Do more R users know Python or the other way around?
    1. If there is more familiarity with one over the other, the language with the most reach should be favored.
  3. What are the common use cases?
    1. If they are mostly basic data retrieval, it might make sense to roll out an MVP in both languages that supports the most commonly used GET requests.
  4. What are the major friction points?
    1. Usually things like auth, batching, pagination, etc. Removing friction should be the top priority.

Eventually I decided to just implement a basic call myself. I tried the query used in the cURL version of the directions, but something about the JSON encoding wasn't working correctly. The query in the cURL example looked like a serialized version of the GraphQL query, so I tried passing the GraphQL query as a multiline string and letting Python's standard json library serialize it (see FIG 1A). Success! I also implemented it using the provided Python client library (see FIG 1B). And I implemented it in R, cheating a bit by reusing the query string generated by Python (see FIG 1C). None of the responses were exactly the same, but the data itself matched in all three cases!

Given what I know now, I would still want the answers to the above questions, but I would lean towards shipping a more complete solution in one of the two languages. Authorization isn't too hard, so if we can't get a client library out in time for the other language, we should be able to produce docs covering its common use cases. I had never used R before and figured out a basic API call in less than an hour.

Figure 1A: Basic SafeGraph API request in Python

Based on the cURL version of the docs. It uses Python to encode the query string. Authorization handled with a simple header.
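
The notebook cell for this figure is not reproduced here. A minimal sketch of the approach described, using only the standard library, might look like the following. The endpoint URL, the `apikey` header name, and the query fields are assumptions to verify against the current API docs, and `build_request` and `run_query` are hypothetical helper names.

```python
import json
import urllib.request

# Assumed endpoint; check it against the current API docs.
API_URL = "https://api.safegraph.com/v1/graphql"

# Illustrative query; the fields and placekey are placeholders, not a verified schema.
QUERY = """
query {
  lookup(placekey: "EXAMPLE-PLACEKEY") {
    placekey
  }
}
"""


def build_request(query: str, api_key: str) -> urllib.request.Request:
    """Wrap a GraphQL query in a JSON body and attach the auth header."""
    # json.dumps escapes the newlines and quotes in the multiline query.
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        # "apikey" is an assumed header name.
        headers={"Content-Type": "application/json", "apikey": api_key},
        method="POST",
    )


def run_query(query: str, api_key: str) -> dict:
    """Send the request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(build_request(query, api_key)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect the encoded body without a network call or a real key.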

Figure 1B: Basic SafeGraph API request using official Python client library

Based on the Python version of the docs.

Figure 1C: Basic SafeGraph API request in R

Based on the Python and cURL versions of the docs. Since the payload encoded and serialized by Python already works, I just pass that in along with my API key. I then use R's httr package to form a valid request.

Question 2

Question

In the first iteration of an API, the engineer creates a response that looks like this:

[Image: payload snippet.png]

You notice that there is both a “safegraph_brand_ids” field and a “brands” field. Do you keep both? If not, which one do you keep? How do you decide?

Answer

“Don't ever take a fence down until you know the reason it was put up.”
― G. K. Chesterton

Again, my approach would be to gather a little more data before making a decision. Below I outline my thinking and approach.

Main questions:

  1. What are the use cases the developer believes they were solving? Are there other ways to solve it that save us complexity?
    1. For example, if it’s an index, customers can compute it themselves. Alternatively, we could return a list of (id, name) tuples, or a dict where the key is the brand name and the value is the id (or vice versa if brand names are not unique).
  2. What is the ongoing maintenance cost of including both vs one?
    1. If it’s just a convenience, but it’s cheap and helps with overall satisfaction (and therefore retention and the likelihood of additional sales or upsells), then we might as well keep it.
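
To make the alternatives above concrete, here is a rough sketch of how a client could derive an id-to-name lookup from the two fields on its own. It assumes the two lists are parallel (same length, same order), which the screenshot does not confirm. The sample values are made up, and `pair_brands` is a hypothetical helper.

```python
def pair_brands(record: dict) -> dict:
    """Map each brand id to its name, assuming the two lists are parallel."""
    ids = record.get("safegraph_brand_ids") or []
    names = record.get("brands") or []
    if len(ids) != len(names):
        # Surfaces the "data doesn't match" case instead of silently truncating.
        raise ValueError(f"{len(ids)} brand ids but {len(names)} brand names")
    return dict(zip(ids, names))


# Made-up sample record; real ids and names will differ.
sample = {
    "safegraph_brand_ids": ["SG_BRAND_example_1", "SG_BRAND_example_2"],
    "brands": ["Example Coffee", "Example Fuel"],
}
brand_lookup = pair_brands(sample)
```

If customers already do something like this, returning a single structure of paired values would remove the chance of the two lists drifting apart.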

Given the response above, it looks like this is based on actual data, so we should have some users we can talk to. Who uses each field, for what, and what does it cost us? Based on those answers we can either sunset the redundant data, with clear docs on other patterns that solve the same problems, or keep both fields and update the docs with clear use cases and what to do if the data ever doesn't match.

Question 3

Question

How would you improve this example code snippet in the docs?

[Image: code snippet.png]

Answer

In order to understand the code a little better, I decided to implement it.

Notes:

To improve the example snippet (see FIG 3C):

  1. fix the payload using the multiline GraphQL query string literal
  2. explicitly use json.dumps() to encode the string literal
  3. use response.json() instead of response.text (or use json.loads(response.text))

Taking the steps above will make the example snippet functional, easier to maintain, and more usable.
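
As a rough illustration of steps 2 and 3, the sketch below contrasts a hand-assembled JSON body with one produced by `json.dumps()`, then parses a response body instead of treating it as text. The query and the response text are made up for illustration, and no network call is made.

```python
import json

# Illustrative multiline query; the fields are placeholders.
query = """query {
  lookup(placekey: "EXAMPLE-PLACEKEY") {
    placekey
  }
}"""

# Hand-built body: the raw newlines and inner quotes make it invalid JSON.
hand_built = '{"query": "' + query + '"}'
try:
    json.loads(hand_built)
    hand_built_is_valid = True
except json.JSONDecodeError:
    hand_built_is_valid = False

# Step 2: json.dumps escapes the newlines and quotes for us.
encoded = json.dumps({"query": query})
round_trip = json.loads(encoded)

# Step 3: parse the response body rather than working with raw text.
response_text = '{"data": {"lookup": {"placekey": "EXAMPLE-PLACEKEY"}}}'  # made up
data = json.loads(response_text)
placekey = data["data"]["lookup"]["placekey"]
```

With the requests library, `response.json()` performs the same parsing as `json.loads(response.text)`.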

Ideally we would not be using string literals to build queries. Debugging can be a pain, which is why I opted for trying a workaround first.
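
One common alternative to splicing values into a query string is GraphQL variables, where the query text stays fixed and the values travel in a separate `variables` object in the request body. The sketch below assumes the API accepts variables in the usual way. The `Placekey!` type name and the selected fields are guesses rather than a verified schema, and `build_payload` is a hypothetical helper.

```python
import json

# The query text never changes; only the variables do.
# The variable type and field names are assumptions.
LOOKUP_QUERY = """
query Lookup($placekey: Placekey!) {
  lookup(placekey: $placekey) {
    placekey
  }
}
"""


def build_payload(placekey: str) -> str:
    """Serialize the fixed query plus its variables into a JSON request body."""
    return json.dumps(
        {"query": LOOKUP_QUERY, "variables": {"placekey": placekey}}
    )
```

Because the value is never pasted into the query text, quotes or other special characters in it cannot break the query.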

Figure 3A: As written

As written, the snippet doesn't work. The payload line was truncated in the screenshot, so the request fails with an error.

Figure 3B: Corrected payload string

To send the correct payload, I will replace the one in the screenshot with one from the docs. The string looked suspiciously like the one in the cURL version of the directions, so we can try that.

We still get an error saying the JSON is invalid.

Figure 3C: Updating payload to known good pattern

In Question 1 we were able to make an API request using the GraphQL example as a multiline string. This time, we use the text from the GraphQL portion of the same docs.

Looks like it works!

Figure 3D: Checking our work

Let's check the response against the same query sent using the Python section of the docs.

The formatting is a little different, but the data itself looks correct!
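
A check like this can also be done programmatically by comparing parsed values rather than raw text. The sketch below uses two made-up response bodies that differ only in whitespace and key order. `same_data` is a hypothetical helper, and real responses may need unwrapping first if one of them nests the data differently.

```python
import json


def same_data(text_a: str, text_b: str) -> bool:
    """Compare two JSON documents by parsed value, ignoring whitespace and key order."""
    return json.loads(text_a) == json.loads(text_b)


# Made-up bodies: same data, different formatting and key order.
raw_response = '{"placekey": "EXAMPLE-PLACEKEY", "location_name": "Example Cafe"}'
client_response = (
    '{\n  "location_name": "Example Cafe",\n  "placekey": "EXAMPLE-PLACEKEY"\n}'
)
matches = same_data(raw_response, client_response)
```

Note that list order still matters in this comparison, so two responses that return the same items in a different order would not match.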