My Node.js project is finally complete, so I can turn back to Sitecore and look for interesting questions and their solutions. Fortunately, I recently got one such request.
We have an e-commerce solution integrated with Sitecore. DMS is also in place, gathering statistics about users and their behavior on the site. We could add personalization to the site to engage more clients, but it would be quite static: a limited set of prepared content blocks would be shown based on predefined rules. This is fine when you want to show a promotion, but what if you need to show relevant products or pages?
To calculate relevant products we need to match the profile of the current user against the attributes of a product. We would use the aggregated DMS values in the visitor profile as the target pattern and product page tags as weighted categorization info. To match the visitor profile with a product we would use the SOLR relevancy engine.
First, relevant products are obviously a form of content personalization. Since Sitecore’s built-in personalization engine uses weighted profile keys to tag content and patterns to match them at runtime, it would be wise to reuse parts of this engine in our solution.
Second, if you have already defined profile categories in DMS, you would want to reuse that data rather than create categories or tags for products all over again. (Thanks to Martin Davies and his video for insights).
The last part is the pattern matching itself. As we are talking about e-commerce and relevancy to a visitor, you might guess that the number of queries we need to make could be very significant, and with thousands of products the problem is even worse. Let’s see how SOLR could help us with it.
Profiling and patterns
Let’s take a look at profiling. In Sitecore a profile can have several profile keys, each with a weight scale. For the examples below, let’s assume a profile “theme” with two keys, “entertainment” and “education”, each on a scale from 1 to 10. A product, e.g. the book “The Mysterious Island”, can be assigned a score according to this profile (8 for entertainment and 4 for education). While surfing the site, a visitor hits profiled pages and gathers points in the categories the profile defines (e.g. 7 for entertainment and 7 for education). Both product and visitor now have a profile, but how do we compare them and decide whether 8 & 4 is closer to 8 & 1 or to 7 & 7?
If you think about these values, they can be represented as a point in 2-dimensional space, which gives you some math to compare them, at least by distance. Alternatively, they can be treated as a vector starting at the origin of the coordinate system, which provides a whole set of operations: merge (vector addition), calculate the distance (subtraction, length), compare directions (analyze the angle), align scales (normalization). In our case comparing the directions of vectors is the most interesting, and that is measured by cosine similarity.
Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1].
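To make the comparison concrete, here is a minimal Python sketch that applies both measures to the example profiles above. The function names and the tuple layout (entertainment, education) are my own choices for illustration and are not part of Sitecore or SOLR.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: an empty profile matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Profiles as (entertainment, education) pairs from the example above.
product = (8, 4)
candidates = {"8 & 1": (8, 1), "7 & 7": (7, 7)}

for label, profile in candidates.items():
    print(
        label,
        "cosine:", round(cosine_similarity(product, profile), 4),
        "distance:", round(euclidean_distance(product, profile), 4),
    )
```

With these particular numbers the two measures disagree: by distance 8 & 1 is nearer to the product (3.0 against roughly 3.16), while by cosine 7 & 7 is slightly closer (roughly 0.949 against 0.943). That is exactly the difference between comparing magnitudes and comparing directions.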
SOLR and pattern matching
Solr should provide you with the most relevant results for your query. The first part of query execution is Boolean logic (does a document contain a term or not), but after that comes the more interesting part: weighing and judging which results would be the most useful.
This part uses cosine similarity as its logical basis but adds other elements to the equation, as it needs to deal with boosting and term frequency.
I will not reproduce the whole description of Solr Default Similarity here (see link), but it is important to highlight the parts of the scoring function where individual term scores are calculated:
- tf(t in d) correlates to the term’s frequency: the more often a token occurs in a field, the higher the score for that field;
- t.getBoost() represents the boost of a term in the query;
- idf(t) correlates to the inverse of docFreq (the number of documents in which the term t appears).
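As a rough illustration of how these three factors interact, here is a simplified Python sketch. It assumes the classic Lucene TF-IDF definitions as I understand them (tf as the square root of the term frequency, idf as 1 + ln(numDocs / (docFreq + 1))) and deliberately leaves out the coordination factor, query normalization and field norms, so treat it as a toy model and not as the actual Solr implementation.

```python
import math


def tf(freq):
    """Term frequency factor: grows with the number of occurrences."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms get a larger factor."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def term_score(freq, doc_freq, num_docs, boost=1.0):
    """Simplified per-term contribution: tf * idf^2 * boost."""
    return tf(freq) * idf(doc_freq, num_docs) ** 2 * boost


# A term repeated 8 times scores higher than one repeated 4 times,
# and a query-time boost scales the contribution linearly.
print(term_score(freq=8, doc_freq=50, num_docs=1000))
print(term_score(freq=4, doc_freq=50, num_docs=1000))
print(term_score(freq=4, doc_freq=50, num_docs=1000, boost=7))
```

Note that idf depends only on how common a term is across the whole index, not on the visitor or the product being compared.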
Putting it all together
The first two parameters could be used to implement pattern matching, while the third might negatively affect relevancy (you do not really care how often a category occurs across the whole product catalog), but it could easily be disabled.
To use term frequency, we need to create a computed field for each profile in the Sitecore Tracking field and, during product indexing, repeat the profile key name (or ID) n times, where n is the profile key value. E.g. if you are searching for products with the “entertainment” profile key, products where this key is repeated more times would rank higher, which is obviously part of the goal.
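The value of such a computed field could be produced along these lines. In a real solution this logic would live in a Sitecore computed index field written in C#; the Python below only illustrates the resulting field value, and the function name and the `max_scale` cap are hypothetical.

```python
def profile_field_value(profile_keys, max_scale=10):
    """Repeat each profile key name as many times as its score.

    profile_keys maps a key name to its integer score, for example
    {"entertainment": 8, "education": 4}. Scores are clamped to the
    range 0..max_scale so a bad value cannot blow up the field.
    """
    tokens = []
    for key, score in sorted(profile_keys.items()):
        repeats = max(0, min(int(score), max_scale))
        tokens.extend([key] * repeats)
    return " ".join(tokens)


book = {"entertainment": 8, "education": 4}
value = profile_field_value(book)
print(value)
print(value.split().count("entertainment"), value.split().count("education"))
```

Indexed as a tokenized text field, the value carries the profile scores as term frequencies: “entertainment” occurs 8 times and “education” 4 times for this book.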
The other part is to give a higher score to terms that are more relevant to the visitor’s profile. This is where boosting comes in: we can add an individual boost value to each query term according to the corresponding profile key value in the visitor’s profile.
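A query built this way could look like the following sketch, which uses the standard `field:term^boost` query syntax. The field name `theme_profile` and the function name are hypothetical, and the sketch assumes profile key names are single tokens without characters that would need escaping.

```python
def build_boosted_query(field, visitor_profile):
    """Build a query with one boosted clause per visitor profile key.

    visitor_profile maps a profile key name to the visitor's score.
    Keys with a zero or negative score are left out of the query.
    """
    clauses = []
    for key, weight in sorted(visitor_profile.items()):
        if weight > 0:
            clauses.append(f"{field}:{key}^{weight}")
    return " OR ".join(clauses)


visitor = {"entertainment": 7, "education": 3}
print(build_boosted_query("theme_profile", visitor))
# theme_profile:education^3 OR theme_profile:entertainment^7
```

The boosts come from the visitor and the term frequencies come from the product, so each matching key contributes in proportion to both sides of the comparison.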
Strengths of the approach:
- This approach would work well with product promotion in search implemented via document boosts at index time.
- It should also be quite scalable, as we are using standard SOLR query and indexing options.
- Less computing load on the Sitecore instance, which means relevant products can be used heavily.
- Products or any other pages might be matched against the visitor’s profile, the current page profile or even a merged value.
Limitations of the approach:
- You should keep the scale of profile keys reasonable, e.g. up to 10 or 20, as you need to repeat the key name that many times during indexing.
- The scale should also be the same for all profile keys; otherwise one key could become more important just because of a different scale.
- Might require advanced updates to SOLR configuration.
And a disclaimer: this is a solution design rather than a finished implementation, but I hope it won’t take too long to write the real code for it =)
Follow me on Twitter @true_shoorik. I would be glad to discuss the ideas above in the comments.