Designing Product Recommendation Engines for the New Age of Digital Commerce

What is a Recommendation Engine?

In today’s world of rampant digital commerce growth, the topic on every executive’s mind is personalization – or rather, how the ecommerce experiences of tomorrow can provide customers with product guidance tailored specifically to their unique needs and tastes. While this is not a new topic in the world of ecommerce innovation, it has certainly been an evolving one over the years, as the techniques and technology – which some claim are responsible for up to 35% of Amazon’s total online revenue – have moved from a proprietary form of intellectual property to an entire studied field of machine learning, complete with open source frameworks and datasets.

While product recommendation initiatives are now pervasive throughout nearly every digital commerce touchpoint, from online discovery and search personalization to email targeting, it was not always this way. Amazon was the first digital retailer to provide a consumer-facing implementation of a product recommendation service, more commonly known to digital shoppers as the “People Also Viewed / Bought / Liked” product carousels that have appeared on Amazon product pages since as early as 1999. Since then, digital commerce has benefited from a scale of personalization that far outstrips the ability of site merchants to curate a tailored set of goods for each and every shopper – a tension that we will discuss at length later. Innovation in product recommendation is so critical that it remains an active application of emerging techniques in machine learning (such as deep learning) – Amazon, for example, has recently open-sourced the deep learning framework it uses to perform large-scale latent factor analysis, which is designed specifically to help overcome the sparsity issues that plague large-scale factor models (a topic we will cover later).

Of course, product recommendation is not relegated to physical goods retailers; any digital entity that wishes to ease the content discovery process for its users can make use of it. The most famous of these is perhaps the video-streaming service Netflix, which leverages product recommendation techniques every day to tailor its video viewing experience. Netflix has also made significant R&D investments in the field at large – going as far as to sponsor its own machine learning competition called the Netflix Prize.

Teaching a Machine to Merchandise

As I mentioned, product recommendation is now so ubiquitous that we as consumers often forget that it even exists within the digital services we use every day; however, for data scientists or digital commerce executives working to improve their online performance, it is a critical field to understand. Today, much of the field of product recommendation is common knowledge (at least the classical techniques of large-scale matrix factorization) and the software to enable it is open-source. By themselves, however, these techniques are of little use unless they are integrated into the customer experience in an elegant way that drives relevant, tailored digital experiences.

“Customer experience personalization is all about data first. Get the data right and you can shape the overall customer experience by applying data science and machine learning.” (MartechAdvisor)

In fact, because these systems are readily available, digital strategies are evolving to focus on the new critical IP – customer and product data – and developing robust new ways to collect, mine, and apply new machine learning techniques to these data sets. In this tutorial, it is my hope to help provide a thorough understanding of just how modern real-time personalization systems work, with a specific focus on how the uniqueness of customer and product data can create competitive advantage for digital projects. In the following material, we will cover:

  • Digital Strategy: We will use a framework called the Machine Learning Canvas to analyze a generic product recommendation system.
  • Data Strategy: Designing the best product recommendation system for your digital organization is about more than just technology – in fact, the form and completeness of your data can have the largest impact on the overall results.
  • Technical Implementation: We will build a simple real-time personalization engine leveraging open-source frameworks (Spark) and some simple code examples that any developer can play with on their own.

The Framework

I would like to introduce a toolkit called the Machine Learning Canvas – a template for developing new or documenting existing predictive systems based on machine learning. It was developed by Louis Dorard and it provides the necessary mental bookshelves upon which to organize our analysis:


  • Core Premise:

    • [A] Value Propositions: What are we trying to do for the end-user(s) of the predictive system? What objectives are we serving?

    • [B] Decisions: How are predictions used to make decisions that provide the proposed value to the end-user?

    • [C] Making Predictions: When do we make predictions on new inputs? How long do we have to featurize a new input and make a prediction?

  • Required Resources:

    • [D] ML Task: Input, output to predict, type of problem

    • [E] Offline Evaluation: Methods and metrics to evaluate the system before deployment.

    • [F] Data Sources: Which raw data sources can we use (internal and external)?

    • [G] Features: Input representations extracted from raw data sources

  • Model Operations:

    • [H] Collecting Data: How do we get new data to learn from (inputs and outputs)?

    • [I] Building Models: When do we create / update models with new training data? How long do we have to featurize training inputs and create a model?

    • [J] Live Evaluation and Monitoring: Methods and metrics to evaluate the system after deployment, and to quantify value creation

ML Canvas: Recommender Systems

Core Premise: The most basic function of a product recommendation engine is to help show end users products “they might want.” This value proposition is broad enough to cover a number of more tailored use cases – such as which headline to pick when crafting a newsletter email to drive a return visitor to the site, or which hero image to show to a new site visitor. However, the use case we will frame our discussion around is the most common one: the classic product page carousel, where the task at hand is to select a set of products related to the current one that a user “might also like.”

In the above “product carousel” scenario, the active decision that a machine learning model must serve is selecting the set of products to provide back to a product carousel widget that renders on page load. While there are methods for generating recommendations in an “offline” mode (i.e. – the system is not constrained by time when generating recommendations), most systems powering this use case require recommendations to be provided in “real time,” while the hundreds of other pieces of product page HTML and digital assets are being rendered by the web browser. A common problem in designing recommendation engines is determining how to produce the best personalized set of results given little to no insight into the user’s background in such a limited amount of time (less than 100ms, for example) – we will cover a mental model for thinking about this later.


Required Resources: The data required to train and optimize product recommendation models can come in a variety of unstructured (clickstream) and structured (reviews) formats; we will generally refer to the type of data that we want to capture as Product Affinities. Affinities are simply a designation of data that might provide some connection between a given user and a given product. An affinity can be as explicit as a user writing a 4-star review or as implicit as an anonymous user visiting a page and reading that previous customer’s review for a bit “longer” than the normal page visitor.

We will structure our model features in terms of User-Product affinities such that as long as the data can be transformed into a numerical relationship between a user and a product, it can be leveraged for personalization efforts. As we will see later, this data will be used specifically for the task of candidate scoring, which can be thought of as a type of classification process where the system attempts to “assign” a number to each product based on how likely the user is to have an affinity for that particular product. In our model development, we will use the product-affinity data to train a machine learning model that attempts to represent the entire population’s affinities for all products based on just a small “sample” of known product affinities.


Model Operations: While we will not cover the data collection process in this tutorial, this is perhaps the most critical area to invest in when designing your organization’s own product recommendation system and process. Today’s example will leverage user ratings collected on movies; however, the more common form of data within ecommerce is implicit / unstructured clickstream data (i.e. – the data exhaust generated from users trying to find products on your website). Nearly every digital product catalog generates the necessary information to build a well-tailored recommendation service – the data simply needs to be captured (typically via a tracking “beacon” deployed in the background as a shopper loads the page) and leveraged by a data scientist.

Data scientists working on product recommendation for an organization can leverage this dataset to perform offline training and testing (perhaps on a small portion of the data) of a User-Product affinity model – like the one we will work through today. Once it is initially developed, the model can then be re-trained in a semi-regular batch process (usually at night). While we don’t need to build our models in real time for our specific product carousel use case, we do at least need to have any relevant affinity information about the target user collected and made available in real time. This can be engineered in a variety of ways, from using a real-time preference cache served within the recommendation API service to storing the information locally in the web browser on an ecommerce site.

The Merchant Versus the Machine

Historically, there has been a lot of tension between site merchants within ecommerce organizations and digital product recommendation engines – as both view their job in a similar vein: choose a set of relevant products to show a customer. Merchants, in general, practice the coveted art of product merchandising – which I would describe broadly as the act of attempting to understand evolving customer preferences and predicting the likely set of products that will maximize their financial “bet” (with regard to a particular retailer’s financial allotment for a given category). Merchandising is so core to retailing as an industry, that the world’s leading retail organizations can trace their roots back to core merchant-driven leadership – such as Sam Walton (Walmart) and Pat Farrah (Home Depot).

Given the role merchandising plays in retailing strategy, systems that have attempted to claim competency in a “similar” function have often been received with a heavy dose of skepticism from internal teams. While those digital teams that have been able to overcome this bias have made significant gains, it is critical to re-frame the issue not in terms of who is better at what but rather whose time is better spent where. When you think about the trade-off between merchant time and financial return, manually selecting products to merchandise with other products is incredibly inefficient relative to machine-based recommendations – not to mention that as digital retail catalogs continue to scale up each year, the number of possible product pairings grows much faster than the catalog itself. Instead, the role of product recommendation engines should be viewed as a vehicle to enable merchandising organizations to apply their valuable time and mental energy to more financially efficient activities such as selecting a title for an email campaign or updating an open-to-buy order.

Data is the Fuel for Algorithms

Most digital executives might think that competitive advantage within product recommendation comes from innovation in the actual algorithms driving the product recommendation process, but in today’s world it is quite the opposite. The truth is that while there is a great deal of innovation that can happen at the algorithm level, these kinds of gains will be minimal at scale relative to off-the-shelf frameworks that are free to integrate (and oftentimes more battle-tested than experimental in-house systems). In today’s landscape, the true differentiator that digital teams can bring to the entire process of designing and improving product recommendations comes from the quality of the data that goes into the models being developed.

We will cover a few examples of this later, but take a simple scenario of a shopper who is searching for a new laptop. Having well structured, relevant product and customer data improves recommendation results in two critical steps during the process:

  • Candidate Selection: Any recommendation process starts by identifying a list of all relevant laptops that might match a user’s particular inquiry. Selecting the “right” set of products to fill the recommendation “hopper” can provide the biggest gains to any personalization initiative – this is because in most recommendation scenarios there is relatively little information about a user’s tastes with which to properly rank products and, therefore, the system must rely on raw contextual information to determine a good set of starting products. This is called the “cold start” problem in the world of product recommendation; however, it can be overcome with well-structured meta-information on both products and customers. In the example posed here, a search for “large screen laptop for gaming” should immediately filter out laptops with 13-inch screens; however, if the product catalog doesn’t have accurate or even complete screen information, the recommendation system will fall flat on its face – regardless of which algorithm is providing the ranking.

  • Candidate Ranking: Once a set of potential products has been selected, a product affinity model (like the one we’re going to build) can provide the next level of personalization by enabling the “final mile” of personalized advice that might otherwise have been delivered by an astute sales associate. In this example, the savvy sales professional might have an internal “sense” that this shopper – whom we know to be Design-Centric – would appreciate the overall aesthetics of the 3 screen Razer Stealth.
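The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Laptop` record, the product names, and the `affinity` scores are hypothetical stand-ins for a real catalog and a trained affinity model. Note how a product with missing screen data never reaches the ranking step, however high its score.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Laptop:
    name: str
    screen_inches: Optional[float]  # structured catalog attribute; None if missing
    affinity: float                 # hypothetical score from an affinity model


def recommend(catalog: List[Laptop], min_screen_inches: float, top_n: int = 2) -> List[str]:
    # Candidate selection: keep only products whose structured attributes
    # match the request. Products with missing data cannot be matched.
    candidates = [
        p for p in catalog
        if p.screen_inches is not None and p.screen_inches >= min_screen_inches
    ]
    # Candidate ranking: order the surviving candidates by predicted affinity.
    ranked = sorted(candidates, key=lambda p: p.affinity, reverse=True)
    return [p.name for p in ranked[:top_n]]


catalog = [
    Laptop("Laptop A", 13.3, 0.95),
    Laptop("Laptop B", 15.6, 0.70),
    Laptop("Laptop C", 17.3, 0.85),
    Laptop("Laptop D", 17.0, 0.40),
    Laptop("Laptop E", None, 0.99),  # incomplete catalog data
]
picks = recommend(catalog, min_screen_inches=15.0)
```

Here `picks` contains Laptop C and Laptop B; Laptop E is lost purely because of incomplete catalog data.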

Predicting Consumer Product Affinity

So, how do we actually teach a machine to predict which of the products a shop carries a customer might want? The correct way to approach this problem is to re-frame the task from predicting the abstract human emotion of desire to predicting pragmatic affinity. By using data from scenarios where a user has expressed a known affinity (i.e. – the product rating that user might give a laptop), we can construct an abstract model for how all users might perceive that laptop.

The machine learning task, then, becomes one of predicting a user’s possible product rating given their history of ratings on similar products.

Selecting the Right Model

Before we dive into the specific algorithms we will use to train our model, we first start with the general framework: the large, sparse User-Product affinity matrix. Most recommendation engines are based on a concept called Collaborative Filtering, which is simply the idea that like-minded people are likely to show strong affinity for the same products. The key to making this work in practice is how we define “like-minded people” – in most cases this resolves to the recursive reference: “people are like-minded if they like similar products” and “products are like-minded if they are liked by similar people.” This reference is similar to the recursive principles of authority that score relevant documents on the web (PageRank), where “a web page is authoritative if other highly authoritative pages link to it.”

Conceptually, we must imagine a giant matrix where one dimension is all of the possible users of a website and the other dimension is all of the possible products those users could potentially like and/or purchase. As you can imagine, this is a LARGE matrix and in reality it is very sparse – meaning only a small portion of user-to-product affinities will be known (this is called the “sparsity” problem in the field of collaborative filtering based product recommendation techniques). Our goal, from a machine learning perspective, is to build a model that can predict the full matrix (Q) using what – in reality – is often less than 0.01% of known user-product affinities.
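To make the sparsity point concrete, here is a small sketch of storing only the known affinities and computing how much of the full matrix they cover. The user and product counts are hypothetical, chosen so that the density works out to the 0.01% figure mentioned above.

```python
# Known affinities stored sparsely: only observed (user, product) pairs are kept.
known_affinities = {
    ("user_1", "product_7"): 4.0,
    ("user_1", "product_9"): 5.0,
    ("user_2", "product_7"): 3.0,
}


def density(n_known: int, n_users: int, n_products: int) -> float:
    """Fraction of the full user-by-product matrix that is actually observed."""
    return n_known / (n_users * n_products)


# Hypothetical site: 10,000 users x 5,000 products with 5,000 known affinities.
example_density = density(5_000, 10_000, 5_000)  # 0.0001, i.e. 0.01%
```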

We Need to Approximate All Preferences

The trick to solving our problem is to set up a model that will allow us to approximate the entire matrix Q based on a much smaller set of matrices called “latent factors” (see the picture on the left). The reason for this breakdown is a critical concept in recommendation engines: the user-product affinities we see in the world are actually the result of preferences being formed over a (relatively) small number of factors that are “hidden” from the world but drive end behavior (kind of like how we think of the human mind forming concepts as internal mental representations). These “latent” factors can then be divided amongst users – who have different affinities for those factors – and products, which have these factors in various degrees. The resulting large matrix Q is then simply the matrix product of the latent User factor matrix and the transposed Item factor matrix, so that each entry is the dot product of one user vector and one item vector.
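As a minimal sketch of this idea, the snippet below reconstructs every user-item affinity from two tiny factor tables. The factor values, the rank of 2, and the names are made up for illustration; in practice the factors would be learned from data.

```python
# Hypothetical rank-2 latent factors (values invented for illustration).
user_factors = {"alice": [1.0, 0.5], "bob": [0.2, 1.5]}
item_factors = {"action_film": [4.0, 0.4], "comedy_film": [0.5, 3.0]}


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def predict(user: str, item: str) -> float:
    # One entry of Q: the user's factor vector dotted with the item's.
    return dot(user_factors[user], item_factors[item])


# The full (dense) approximation of Q, rebuilt from the two small tables.
full_matrix = {(u, i): predict(u, i) for u in user_factors for i in item_factors}
```

Two users and two items are stored as four short vectors, yet every cell of the matrix can be filled in on demand; the same holds when the matrix has millions of rows and columns.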

There are a variety of algorithms that can be used to estimate these latent preference matrices; however, we will use a technique called Alternating Least Squares as it has both a clean implementation in Apache Spark’s standard machine learning libraries and also provides clean User and Item factors that can be used later on for potentially derivative machine learning tasks involving customer preference formation and usage. I will not cover the details of the algorithm or the math behind how it converges on a reasonable approximation using a small amount of training data as many other sources exist to provide that level of detail.

Our Data Source: Movie Ratings

All machine learning approaches require us to start with a defined set of known product affinities. In this example, we will be using the canonical MovieLens 20 million dataset, which is a collection of explicit user ratings on movies collected by the GroupLens project from the Social Computing Research center at the University of Minnesota. To get the dataset, simply download a copy of the ml-latest zip file and make sure it contains the CSV files called ratings and movies. The ratings file is simply a flat file in which each row contains a user ID, a product (movie) ID, and the rating that was given.
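A minimal sketch of reading such a file into (user, product, rating) triples is shown below. It assumes the MovieLens-style header `userId,movieId,rating,timestamp`; the sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io

# Illustrative rows in the assumed MovieLens column layout (not real data).
SAMPLE = """userId,movieId,rating,timestamp
1,31,2.5,1000000000
1,1029,3.0,1000000100
2,31,4.0,1000000200
"""


def load_ratings(handle):
    """Parse a ratings CSV into (user_id, movie_id, rating) triples."""
    reader = csv.DictReader(handle)
    return [
        (int(row["userId"]), int(row["movieId"]), float(row["rating"]))
        for row in reader
    ]


ratings = load_ratings(io.StringIO(SAMPLE))
```

For the real file, the same function can be applied to `open("ratings.csv", newline="")`, where the path is wherever the archive was unpacked.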

First – Determine the Correct Hyperparameters

As with many other machine learning algorithms, Alternating Least Squares (ALS) has several hyperparameters that need to be selected during the training process. One of the most important is the Rank of the resulting latent factor matrices. This is essentially the number of different “factors” by which we want to represent the large matrix Q – in other words, the number of dimensions a user might be weighing in their mind when determining their personal affinity for a film.

As you can see from the left, we break the data into a training and test set and leverage the Spark ALS API to train different models with different hyperparameter settings. The method for testing the accuracy of the resulting matrix factorization model is called Root Mean Square Error – it is a measure of how well the resulting latent matrices (when multiplied together) can reconstruct the known product affinities.
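The original code listing is not reproduced here, so the following is a small stand-in written in plain Python rather than Spark. It shows a random train/test split and the RMSE metric; the 80/20 split and the fixed seed are assumptions made for the example.

```python
import math
import random


def train_test_split(ratings, test_fraction=0.2, seed=42):
    """Shuffle the ratings and hold out a fraction of them for testing."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def rmse(predicted, actual):
    """Root mean square error between predicted and known ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(actual))


toy_ratings = [(user, movie, 3.0) for user in range(5) for movie in range(2)]
train, test = train_test_split(toy_ratings)
error = rmse([3.0, 4.0], [3.0, 2.0])  # sqrt((0 + 4) / 2)
```

A hyperparameter search then amounts to training one model per candidate setting on `train`, scoring each on `test` with `rmse`, and keeping the setting with the lowest error.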

Then – Train on the Full Dataset

Once we have a set of hyperparameter settings that reduce the RMSE the most, we use these settings to train on the entire data set. In this use case, I’ve selected a rank of 10, lambda of 0.1 and 10 iterations of approximation. We then save the model to disk for both more efficient retrieval and later re-use.
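One way this step could look with Spark’s DataFrame-based ALS API is sketched below. This is an assumption about the setup, not the author’s original code: the function name, the column names (`userId`, `movieId`, `rating`) and the `model_path` argument are all hypothetical, and `regParam` is taken to play the role of lambda.

```python
def train_and_save_als(ratings_df, model_path, rank=10, reg_param=0.1, max_iter=10):
    """Fit ALS on a Spark DataFrame of (userId, movieId, rating) rows and save it."""
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.ml.recommendation import ALS

    als = ALS(
        rank=rank,            # number of latent factors
        regParam=reg_param,   # regularization strength (lambda)
        maxIter=max_iter,     # number of alternating iterations
        userCol="userId",
        itemCol="movieId",
        ratingCol="rating",
    )
    model = als.fit(ratings_df)
    # Persist the fitted model so it can be reloaded without re-training.
    model.write().overwrite().save(model_path)
    return model
```

The fitted model exposes the learned item vectors as `model.itemFactors`, which is what the similarity search in the next section relies on.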

Now – Let’s Use the Model

With the model trained, we can now use the resulting latent item factors to find similar films and make recommendations – regardless of what we do or do not know about a user. Remember that for our use case of a product carousel on a product page in an ecommerce site we want to recommend other products a user might like. A good starting point for this recommendation would be to use the product a shopper is currently looking at to make the recommendation.

Let’s say that a shopper has been looking for some good Nicolas Cage films and stumbles upon the legendary Cage masterpiece, Con Air. While I’m sure those who have seen Cage’s southern rendition of an ex-con caught up in the wrong place at the wrong time would never desire to see anything else, let’s say – for sake of mental exercise – that we wanted to find other similar films to Con Air.

Since we have determined a latent item matrix Y where each movie is represented in a factor space of rank R that represents how the movies relate to each other in terms of known product affinities, we can use these pre-computed vectors to find similar films in the factor space using simple Cosine Similarity distance.
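A minimal sketch of this lookup in plain Python follows. The rank-3 vectors are invented for illustration (real item factors would come from the trained model), and “Period Drama” is a made-up title added for contrast.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two factor vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(target, item_factors, top_n=20):
    """Return the top_n (title, similarity) pairs closest to the target item."""
    target_vec = item_factors[target]
    scored = [
        (title, cosine_similarity(target_vec, vec))
        for title, vec in item_factors.items()
        if title != target
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical rank-3 item factors.
item_factors = {
    "Con Air": [0.9, 0.1, 0.3],
    "The Rock": [0.8, 0.2, 0.3],
    "Bad Boys": [0.7, 0.4, 0.1],
    "Period Drama": [0.1, 0.9, 0.2],
}
similar = most_similar("Con Air", item_factors, top_n=2)
```

Because the item vectors are pre-computed, this search needs no information about the current visitor, which is why it works even for anonymous users.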

Finding Similar Films to a Given Target

Leveraging the Spark SQL API and merging our recommendations with the movie list (available in the MovieLens download), we can find the top 20 most similar films to Con Air in terms of what other people have preferred. Looking at the results, we see a lot of nice blockbuster action films like Bad Boys or even another famous Nicolas Cage film, The Rock. In fact, Nicolas Cage makes an appearance in quite a few of the top recommended films – this is not surprising since our model is based on what people who like Con Air have also liked. The data here suggests that Nicolas Cage fans are quite loyal to the action star – and as a fellow fan, I must say that is quite true!

This is actually a very common starting point for most recommendation engines and in our prior model encompasses the “Candidate Selection” phase of machine based product recommendation.

Online vs Offline Computational Requirements

While we could stop here and get some great extra bang for our buck by releasing the model into the wild (which might not be a bad idea for agile development teams), this model would show the same top recommended movies for every single visitor – somewhat defeating the purpose of a “personalization” engine. The next step in our development process is to determine a way to tailor the recommendations to what we know about a particular user – enabling us to provide a true personalized experience.

This personalization process will take place in the “Candidate Ranking” phase of our system’s workflow, and there are a couple of questions we have to ask ourselves with regard to its design. The primary question is how quickly the recommendations need to be provided. There are two approaches to making predictions – either in a batch offline process or in real time upon request. If we have the luxury of creating recommendations offline (such as when we want to select personalized products to put into a customer email whose send time we control), then we can simply take the user’s collected affinities (from reviews / clickstream) and re-build the latent factor model. This is the most straightforward approach and will generally incorporate a greater deal of nuance in the recommendations than real-time models.

However, if we need to personalize in real time (or if this is the first session in which we have seen the user on our site), then we need a more creative way to approximate a specific user’s preferences P(u).

Modeling Customer Preferences from Behavior

Let’s say, for example’s sake, that we want to personalize the movies recommended when a user finally lands on the Con Air product page after a normal shopping session. Perhaps they search initially for “Nicolas Cage” and then spend varying degrees of time on different Cage films, such as Gone in 60 Seconds (a classic!), The Rock, or even Lord of War. While the user did not provide explicit ratings of each film – as did the users from whom we built our model – we can still use this implicit browsing behavior to approximate what their known ratings might have been (or rather to at least weight the recommended results towards implicit taste).

In this case, a common technique in the realm of digital commerce is to use the length of time a shopper spends on a particular product page as an indicator of relative affinity. This is known in the industry as user “dwell” time and is likely already being used by state-of-the-art personalization engines on most sites that you shop on today.
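The article does not specify how dwell time is converted into a number, so the sketch below makes a simple assumption: each page’s dwell time is scaled linearly against the longest dwell in the session and mapped onto a 0 to 5 rating scale. The session values are hypothetical.

```python
def dwell_to_affinity(dwell_seconds, max_rating=5.0):
    """Scale dwell times so the longest-viewed page gets max_rating.

    Linear scaling is an assumption made for this sketch; real systems
    may cap, log-transform, or otherwise normalize dwell time.
    """
    longest = max(dwell_seconds.values(), default=0.0)
    if longest <= 0:
        return {}
    return {
        title: max_rating * seconds / longest
        for title, seconds in dwell_seconds.items()
    }


# Hypothetical browsing session: seconds spent on each product page.
session = {"Gone in 60 Seconds": 30.0, "The Rock": 90.0, "Lord of War": 120.0}
implicit_ratings = dwell_to_affinity(session)
```

The resulting values can stand in for the known affinities P(u) of a visitor who has never rated anything explicitly.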

Some Quick Math

While capturing and quantifying dwell time can help us approximate the known product affinities P(u), in order to get the unknown product affinities P’(u) we need to do a little linear algebra. On the left, I show how we can approximate what the visitor’s User factor vector (X(u)) might be by using the identity property to substitute into our original approximation equation.
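Since the original figure is not reproduced here, the following is one standard way to write out that substitution, offered as a reconstruction rather than the author’s exact notation. It assumes P(u) is the row vector of known affinities for the items the visitor browsed, Y_s holds the item factor vectors of those same items (one row per item), and Y holds the factors of all items.

```latex
\begin{aligned}
P(u) &\approx X(u)\, Y_s^{\top}
  && \text{(original approximation, restricted to browsed items)} \\
P(u)\, Y_s &\approx X(u)\, Y_s^{\top} Y_s
  && \text{(multiply both sides on the right by } Y_s\text{)} \\
P(u)\, Y_s \,(Y_s^{\top} Y_s)^{-1} &\approx X(u)\,
  \underbrace{(Y_s^{\top} Y_s)(Y_s^{\top} Y_s)^{-1}}_{=\, I}
  && \text{(identity property)} \\
X(u) &\approx P(u)\, Y_s \,(Y_s^{\top} Y_s)^{-1} \\
P'(u) &\approx X(u)\, Y^{\top}
  && \text{(predicted affinities for all items)}
\end{aligned}
```

This requires the matrix Y_s^T Y_s to be invertible, which fails when the visitor has browsed fewer items than the rank R; a common workaround is to add a small regularization term to it before inverting.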

Computing Full Preferences (In Real-Time)

Once we have our approximation for the User factor vector X(u), we can use this in conjunction with various weightings of the movies from the user’s browsing session to personalize the movie recommendations once they land on the Con Air product page. Because this calculation is a simple vector dot product, this method can be used to personalize results in real-time without having to re-build or adjust any machine learned model.
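The sketch below puts the pieces together in plain Python: it estimates X(u) from session weights by solving a small regularized least-squares system, then ranks candidates by dot product. The rank-2 factor values and the regularization constant are assumptions chosen for illustration, and the output is not meant to reproduce the results discussed in the next section.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fold_in_user(weights, item_factors, reg=0.1):
    """Estimate X(u) by solving (Ys^T Ys + reg * I) x = Ys^T p over browsed items."""
    items = list(weights)
    rank = len(item_factors[items[0]])
    gram = [
        [
            sum(item_factors[i][r] * item_factors[i][c] for i in items)
            + (reg if r == c else 0.0)  # keeps the system solvable with few items
            for c in range(rank)
        ]
        for r in range(rank)
    ]
    rhs = [sum(item_factors[i][r] * weights[i] for i in items) for r in range(rank)]
    return solve(gram, rhs)


def rank_candidates(user_vec, candidates, item_factors):
    """Order candidates by the dot product of X(u) with each item vector."""
    def score(item):
        return sum(u * y for u, y in zip(user_vec, item_factors[item]))
    return sorted(candidates, key=score, reverse=True)


# Hypothetical rank-2 item factors (loosely: [action, comedy]).
item_factors = {
    "Lord of War": [1.0, 0.0],
    "National Treasure": [0.0, 1.0],
    "Assassins": [0.9, 0.2],
    "Blue Streak": [0.3, 0.9],
}
candidates = ["Assassins", "Blue Streak"]
action_fan = fold_in_user({"Lord of War": 5.0, "National Treasure": 1.0}, item_factors)
comedy_fan = fold_in_user({"Lord of War": 1.0, "National Treasure": 5.0}, item_factors)
action_ranking = rank_candidates(action_fan, candidates, item_factors)
comedy_ranking = rank_candidates(comedy_fan, candidates, item_factors)
```

Only the small R-by-R system depends on the visitor, so the trained model itself is never touched at request time.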

Let’s See the Results

To test this out in our toy model, I’ve simulated two very different browsing scenarios that attempt to emphasize different kinds of Cage fans.

  • [A] “Lord of War was my JAM!” – This shopper shows an affinity for fast-paced action films that highlight themes related to weapons and the military. In this case, they spent a lot of time with Lord of War (a movie where Cage plays a high-profile gun trafficker) and The Rock (a film where Cage and Connery work together to take down a madman general). The engine recommends Assassins, a Stallone classic that is (not surprisingly) very similar to Con Air in that the lead character wants to leave his life of crime but gets brought back in due to uncontrollable circumstances. The second film, Rapid Fire, is less well known but seems to also feature a reluctant character who is forced into action by uncontrollable circumstances – both films seem to provide the desired association to weapons and the military.

  • [B] “National Treasure was the best Cage ever…” – While I may not fully agree with the tag line, this browsing session is meant to showcase a user who appreciates Cage’s ability to provide humor and play non-warfare-oriented roles. As a result, Assassins is still recommended, but the second film – Blue Streak – is a much more light-hearted cop comedy with Martin Lawrence where again the protagonist is unwillingly forced into a crime situation but which – in theory – should be better tailored to the user’s implicit preference for humor over action.

How Can We Improve the Results?

Now that we’ve built a working real-time personalization prototype, the next step would be to design a way to go A/B test it on real customers. This is beyond the scope of this article; however, assuming that a digital team was able to successfully deploy and test an initial model, the next logical step any practicing data scientist is going to focus on will be how to improve the results above some measured baseline. With regard to the approach we’ve outlined here, there are two important approaches that can be taken:

  • [A] Leverage Structured Meta-Data to Collapse / Expand Matrix Q: The quality of the product recommendations results from how strong a learning signal is present in the training data that feeds the User-Product affinity matrix Q. Because this matrix, at scale, can be quite sparse, a powerful technique for improving results is to leverage structured meta-data to reshape the training data prior to matrix factorization. For example, if you have well-structured product attribute information on your products (i.e. – every product has normalized, accurate height, color, style, occasion, etc.), you can collapse the size of the matrix Q along a single dimension by consolidating categories of goods into attribute clusters. This will remove a great deal of sparsity from the overall matrix and provide richer and more accurate product recommendations. This approach is generally called “Neighborhood Methods” in the literature.
  • [B] Increase the Training Dimension Along Time: One of the key insights into product recommendations is that consumers’ tastes change over time – in fact, it was this insight that led the winning team for the Netflix Prize to introduce a model called TimeSVD++, which took into account the sequence of people’s ratings to allow for greater insight into the ratings that actually matter.
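As a rough sketch of approach [A], the snippet below collapses the product dimension by mapping each product to an attribute cluster and averaging a user’s ratings within each cluster. The SKU names, cluster labels, and the choice of averaging are all assumptions made for illustration; other aggregations (sum, max) are equally plausible.

```python
from collections import defaultdict


def collapse_by_attribute(ratings, product_to_cluster):
    """Average each user's ratings within an attribute cluster.

    ratings: iterable of (user, product, rating) triples.
    product_to_cluster: hypothetical mapping built from structured meta-data.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (user, cluster) -> [sum, count]
    for user, product, rating in ratings:
        key = (user, product_to_cluster[product])
        totals[key][0] += rating
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


ratings = [
    ("u1", "sku-1", 4.0),
    ("u1", "sku-2", 2.0),
    ("u2", "sku-3", 5.0),
]
product_to_cluster = {
    "sku-1": "gaming-17in",
    "sku-2": "gaming-17in",
    "sku-3": "ultrabook-13in",
}
collapsed = collapse_by_attribute(ratings, product_to_cluster)
```

Three product columns become two cluster columns, so the same number of observations covers a larger share of a smaller matrix before it is factorized.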