Cake without eggs

How GitHub code search works. Plus how it is important to know the users' thoughts before launching a product.

Jun 24, 2023

Hi Friends,

welcome to the 28th edition of the Polymathic Engineer.

This week I want to share with you a nice story about the Betty Crocker brand that I heard from our product manager.

I found it interesting because it shows how crucial is knowing your users' thoughts before launching a product and how adding features to a product do not guarantee that the product will do better.

In addition I will talk about:

how GitHub code search works under the hood
coding challenge
interesting tweets

The Betty Crocker effect

In the 1950s, General Mills launched a line of cake mixes under the iconic Betty Crocker brand.

The package contained all the dry ingredients, including milk and eggs, in powdered form. You just needed to add water, mix, and your cake was oven-ready.

Despite its convenience and the trusted brand, sales were surprisingly low. General Mills called in psychologists to figure out what was going wrong. The results were revealing, yet unexpected: consumers were experiencing guilt.

It appeared that homemakers felt dishonest when they used this extremely convenient product. The cakes were so delicious that it seemed as if they'd spent hours in the kitchen, leading to a sense of undeserved credit.

General Mills needed to react quickly. A typical marketing strategy would be an advertising campaign to tackle this guilt, perhaps showcasing how this innovative product could free up time for other valuable family activities.

However, General Mills took a counterintuitive approach. They chose to revisit the product to make it less convenient. In the new iteration, homemakers were asked to add water and a real egg, creating the illusion that the powdered egg was removed.

Following this change, the product was relaunched with the slogan "Add an Egg," and, fascinatingly, sales took off. This simple change reduced guilt and created a sense of ownership. It made the cake-making process feel more meaningful. It could sound far-fetched, but it turned the tide for the cake mix.

This case study offers valuable insight into a consumer's psychology. It reminds us that what we perceive as a convenience only sometimes aligns with the consumer's desire.

And more in general, it reminds us that adding features or making a product more sophisticated is only sometimes beneficial. Sometimes subtracting elements from a product can increase user engagement or improve the user experience.

GitHub code search

GitHub's code search is a powerful tool. It enables the search of 200M+ public repositories in seconds.

Code search differs from the regular full text search problem, so to each this performance, GitHub couldn’t use off-the-shelf search solutions like Lucene or Elasticsearch.

With full text, you examine all the words in every document to find a search query. But code search is different and carries unique requirements: there is no stemming, punctuations are ignored and words from queries are left intact

GitHub uses a Rust-based search engine tailored to search through programming languages: Blackbird.

All text search engines uses a data structure called Inverted Index. An inverted index resembles an index section of a textbook:

- all words are ordered alphabetically

- each word is followed by page numbers where the word is found

Blackbird uses a special type of inverted index called an n-gram index. Building such data structure at GitHub scale was quite challenging. The collection of documents was too large for a single inverted index.

The only way to deal with this was by sharding the index. GitHub engineers sharded the documents by Git blob object ID. This ID is given by the SHA-1 hash of the file contents, thus identical files always share the same ID.

The above scheme helps to manage duplicate files and to distribute blobs across different shards.

The overall process of indexing goes through the following steps:

1. a users add/update a repo

2. an event is published to Kafka

3. crawlers fetch the blob content, preparing it for indexing

4. the processed content is published to another Kafka topic

5. documents are partitioned between shards based on blob object IDs

6. the indexer then inserts documents into the shard’s inverted index

7. the inverted index is flushed to the disk after a certain number of updates

Querying the search engine requires parsing the user's query and turn it to an abstract syntax tree. Concurrent search requests are then sent to each shard in the cluster and the aggregated results from all shards are returned to the frontend.

The GitHub example also shows how search system architectures for crawling and indexing moved from being primarily batch to being generally stream processing flows. This change is necessary to transform and augment the data for indexing as it is collected.

If you are interested to this topic and want to dig more I recommend reading this blog post from the GitHub engineering blog.

Coding challenge

The coding challenge of today is Binary Tree Preorder Traversal. A preorder traversal of a binary tree visit first a node and then its children. Implementing the preorder traversal using recursion is straightforward.

What’s more difficult is implementing the same solution iteratively using an explicit stack. Here is a possible iterative solution.

Notice how the call to reach the left and right children is inverted respect to the recursive solution. In both cases the space and time complexity is linear in the number of nodes in the tree.

Interesting Tweets

After years of work experience I came to the conclusion that a dream job doesn't exist. Every job ha its pros and cons. For me having a good enough salary and working with nice people on interesting stuff is more than enough. Link

Moving to management should never be only a way to advance your career and get economically rewarded. Managers doing that only for money without a genuine aspiration are rarely able to do a good job. Link

In theory, not even code coverage should be the reason. If you have private methods that remain partially untested, either the test cases are insufficient, or the private methods have unused parts that can be removed. Link

The Polymathic Engineer

Discussion about this post