Text Embeddings in Practice: Building a Cold-Start Content Recommender
Over the past year, text embeddings have become central to my work on improving recommender systems. Text embedding tutorials cover the theory well, but production deployment raises questions they leave unanswered: model selection, storage architecture, and validation.
This article brings together what I learned along the way. I will walk through a realistic use case—building a related content system for a video streaming platform—and show the practical choices you will face: selecting models, choosing storage, validating results, and deploying something stakeholders can actually test.
First, let's cover the fundamentals of text embeddings.
Text Embeddings: Core Concepts and Historical Context
What Are Text Embeddings?
A text embedding is, formally, a representation of a piece of text as a vector in a multidimensional latent space. Like any vector, an embedded text has a direction and a magnitude, and the space positions texts with similar meaning or characteristics close to each other. Word2Vec demonstrates this through vector arithmetic: the vector for King - Man + Woman approximates Queen, showing that the semantic relationships learned by an embedding map to geometric operations.

Evolution of Embedding Methods
Multiple approaches to creating representations that capture text features have emerged, following the evolution of computing power. They can operate on a single word, a sentence, or a full document. Each approach merits detailed exploration; here I group them to clarify their evolution and trade-offs:
- Count-based methods (1950s–2010s): These methods represent words or documents as high-dimensional sparse vectors based on word counts or co-occurrence statistics.
- Keywords: Bag of words, TF-IDF, Latent Semantic Analysis
- Predictive embeddings (early 2010s): Emerged with the resurgence of neural networks, learning dense vector representations that predict words from context.
- Keywords: Word2Vec, GloVe, FastText
- Contextual embeddings (late 2010s): Producing representations that vary depending on the surrounding text.
- Keywords: ELMo, BERT
- Document embeddings (2020s): An evolution of predictive embeddings, designed to represent entire documents rather than individual words.
- Keywords: Sentence-BERT, E5, GTE
Text-to-Embedding Pipeline
Despite their differences, all embedding approaches follow a common pipeline to transform text into vector representations, composed of the following steps:
- Text normalization: Normalizing the text standardizes input before embedding.
- Keywords: Lowercasing, dropping punctuation, stemming
- Tokenization: Machines cannot process raw text as is, so it must first be segmented into smaller units called tokens. Text can be tokenized at the character, subword, or whole-word level, each offering a different granularity of encoding (and information).
- Keywords: word/subword-level, Byte Pair Encoding (BPE), Wordpiece
- Vocabulary indexing: Each embedding model has an associated vocabulary that maps every known token to a token ID. Applied to the full text, this turns it into a sequence of token IDs.
- Keywords: lookup indexing
- Vectorization & weighting: This step transforms discrete token IDs into numerical vectors. It maps each token to a feature representation, derived from statistical frequency or learned parameters depending on the model, and refines these values to capture the token's semantic importance relative to the text or the entire dataset.
- Keywords: static embedding, lookup table
- Encoding & aggregation: Aggregate token-level representations into a final vector that represents the full text.
- Keywords: mean/max pooling, CLS token
The exact pipeline varies by approach. Count-based methods like TF-IDF require explicit text cleaning and normalization, while modern transformer-based models handle case sensitivity and punctuation internally.
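To make the steps above concrete, here is a minimal, stdlib-only sketch of the count-based end of the pipeline: normalization, word-level tokenization, vocabulary indexing, and bag-of-words vectorization. All function names are illustrative, not from any library.

```python
import re
from collections import Counter

def normalize(text: str) -> str:
    # Text normalization: lowercase and drop punctuation.
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())

def tokenize(text: str) -> list[str]:
    # Word-level tokenization: split normalized text on whitespace.
    return normalize(text).split()

def build_vocab(corpus: list[str]) -> dict[str, int]:
    # Vocabulary indexing: map each known token to a token ID.
    vocab: dict[str, int] = {}
    for doc in corpus:
        for tok in tokenize(doc):
            vocab.setdefault(tok, len(vocab))
    return vocab

def embed(text: str, vocab: dict[str, int]) -> list[float]:
    # Vectorization & aggregation: a sparse bag-of-words count
    # vector over the vocabulary (unknown tokens are ignored).
    counts = Counter(tokenize(text))
    return [float(counts.get(tok, 0)) for tok in vocab]

corpus = ["A French animated series.", "A superhero movie from 2025."]
vocab = build_vocab(corpus)
vectors = [embed(doc, vocab) for doc in corpus]
```

A transformer-based model replaces the last two steps with learned lookup tables and an encoder, but the overall shape of the pipeline is the same.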
With embedding generation covered, let’s talk about our real use case.
Cold-Start Content Discovery
For this use case, I will play the role of a data scientist who is part of the personalization and discoverability team of a video streaming service called Netprimey+. The team is working on new features for the platform, and one of the stakeholders in charge of the product page reached out with the following request:
“Hey, we need to redesign the related items section on a content page. Right now, if the content has historical data, the system shows the closest items based on embeddings from the collaborative filtering pipeline. If there is no historical data, it shows the most popular content from the last hour. We want something that does not rely on historical data and can link content to other content based on general information like cast, crew, and content details when a new item is available on the platform.”
This is a valid request, and we already have a lot of information for each content on the platform. Among all this data, there is a product description in English created when an item is added. This description contains most of the information mentioned by the stakeholder, so let’s experiment with it.
Currently, we have a catalog of more than 200k contents/items on the platform. Here are some example contents:
- Asterix and Obelix, the Big Fight: French animated limited series on Netflix
- Superman: the 2025 movie by James Gunn
These are two contents with different formats and from different countries. Below is an illustration of their associated descriptions in the system.

Now the question is: how would embeddings enable this feature? The idea is to create an embedding for each content description on the platform; the related content for a given item is then the nearest vectors in the embedding index.
With the approach defined, the next challenge is choosing which embedding model to use from the thousands available.
Model Selection
In 2025, some good places to start if you want to work with embeddings are:
- spaCy and NLTK to start with text processing in general
- scikit-learn feature_extraction and PySpark MLlib if you want to start with count-based approaches
- sentence-transformers from Hugging Face if you want to explore more modern approaches
The MTEB Benchmark
Currently there are more than 18k models available on the Hugging Face platform that are compatible with sentence-transformers. The question is how to choose the right model. To answer this, let’s take a step back and see how an embedding model is created.
Embedding models are optimized for specific tasks, in a specific context defined by languages, domains, and text sizes. For example, some models are trained on English corpora to group similar texts together, which is exactly what we want to do here.
Model selection should prioritize training data that matches your target language, domain vocabulary, and typical document length, but the task space is broad; I grouped the most common tasks into three families:
- Analysis tasks: understanding a piece of text by testing embeddings on classification, summarization, or clustering tasks
- Comparative tasks: understanding pieces of text together by testing embeddings on Semantic Textual Similarity (STS), bitext mining, or pair classification tasks
- Search and discovery tasks: finding a piece of text among a large amount of other texts, including retrieval, instruction retrieval, and reranking tasks
With that in mind, how do we still find the right model? Hugging Face and its contributors created the Massive Text Embedding Benchmark (MTEB) and its multilingual version MMTEB. They defined a pipeline to benchmark models on various datasets and tasks and expose the metrics on a public leaderboard.
If we go back to the related content page use case and look at the benchmark configuration to find the right model:
- The dataset: it is composed mostly of English words, but there are also names of actors or characters in the original language of the movie. Descriptions have different lengths but do not exceed 700 words, and most of the time they are under 100 words.

- The task: finding related content based on text descriptions is a Semantic Textual Similarity (STS) task, as we want to measure how similar two descriptions are.
Choosing a Model
Beyond pure MTEB performance, production constraints shape model size selection significantly. The right choice depends on your workload type; for reference, I define a small model as one with fewer than 500 million parameters producing embeddings of fewer than one thousand dimensions.
Live inference (user-facing search or chat)
Default to a small model. Large models introduce latency that will hurt the user experience unless you have substantial GPU capacity. Two conditions are automatic small-model decisions regardless of anything else: a sub-100ms latency requirement, and CPU-only or VRAM-constrained hardware. The only case where a large model justifies itself in a live setting is when your data contains long documents (>512 tokens) or dense technical jargon — and you have the GPU capacity to absorb the cost.
Batch / offline indexing
Again, start small. Parallelizing a small model across cheap CPUs or small GPUs will almost always beat a single large model on throughput. Move to a large model only when accuracy is your sole KPI and you index infrequently — the extra compute becomes a one-time cost for permanent quality gains.
Beyond live and batch inference, many other conditions may apply to your use case. I compiled the ones that matter most to me, with a recommended model size for each.
| Condition | Recommendation |
|---|---|
| Live — sub-100ms latency required | Small |
| Live — CPU-only or limited VRAM | Small |
| Live — long docs (>512 tokens) + GPU available | Large |
| Batch — minimize time-to-index | Small |
| Batch — storage or DB costs are a concern | Small |
| Batch — one-time index, accuracy is the only KPI | Large |
Life is not binary
Beyond the divide between small and large models (size, number of parameters, and so on), there is one more thing to keep in mind: the method used to train these models, as some are more efficient than others. One example is Matryoshka Representation Learning (MRL): MRL-trained models produce embeddings where the first N dimensions capture most of the semantic information, allowing you to truncate a 768-dimensional vector to 128 or 256 dimensions with minimal quality loss.
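Truncating an MRL embedding is mechanically trivial: keep the leading dimensions and re-normalize. A sketch with NumPy, using a random vector as a stand-in for a real MRL-trained embedding (the quality guarantee only holds for models actually trained with MRL):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions of an MRL-trained embedding,
    then re-normalize to unit length so cosine similarity stays valid."""
    head = vec[:dims]
    return head / np.linalg.norm(head)

# Toy 768-d vector standing in for a real MRL embedding.
rng = np.random.default_rng(0)
full = rng.normal(size=768)
short = truncate_embedding(full, 128)
```

The storage win is proportional: a 128-dimensional float32 vector takes one sixth of the space of a 768-dimensional one.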
My selection for content discovery exploration
For the PoC, I decided to pick two models to compare and test:
- sentence-transformers/all-MiniLM-L6-v2: a popular Hugging Face embedding model used in many tutorials. It is English only and can handle texts up to 256 word pieces, according to the model card.
- google/embeddinggemma-300m: one of the new Google Gemma models. This text embedding model is lightweight, trained on web text in more than 100 languages, and can handle a context length of 2048 tokens.
These are not the largest or best-ranked models — they are ranked 18 and 131 on the STS task in the MTEB leaderboard as of December 2025. However, they run efficiently on my laptop, and I want something quick to prototype a first iteration and do a simple comparison. Below is a representation of the selected content in their embedding space, displayed as a heatmap. (inspired by this article from Stack Overflow)

Now, we need to store 200k vectors somewhere that they can be queried efficiently.
Storing Embeddings with a Vector Database and Indexing
An embedding is an array of hundreds or thousands of floating-point values, so it can easily be stored in any modern database that supports array storage. However, in recent years, with the rise of LLMs (Large Language Models) and applications like RAG (Retrieval-Augmented Generation), a new type of database has appeared: the vector database.
A vector database is a specialized database that stores high-dimensional vectors generated by embedding models and indexes them for efficient similarity search. Unlike classic databases that use indexes for exact matches or range queries (B-tree, hash) and return results based on equality or filters, vector databases use dedicated indexes (such as HNSW or IVF) to quickly find nearest neighbors in vector space, retrieving data that is semantically similar rather than exactly matching.
Consider whether you need a vector database at all—for small datasets (under 1,000 documents) or non-search use cases, simpler solutions may suffice.
If you do need a vector database, selection should focus on the trade-off between operational simplicity and architectural performance. Many companies offer this kind of service, like Pinecone, Qdrant, or even cloud providers with their own solutions. There are many benchmarks online, like this one from Qdrant, but each can be biased, so I strongly recommend testing 2–3 different vector databases on your own use case to see which one best meets your needs.
For my prototype, I decided to start with a lightweight, vector-database-like setup using a package called Voyager from Spotify, which connects easily with sentence-transformers. Here is a code snippet to store embeddings in this index:
NB: When putting this kind of embedding storage in place, it is important to have incremental indexing to track your item catalog. Beyond storing embeddings in the database, it also helps when building recommender systems with other techniques, such as collaborative filtering algorithms that can rely on this indexing for the matrix factorization phase. There is a quick code snippet showing how to build an incremental index in PySpark; you can do something similar in pure Python or any other language.
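The linked gist shows the PySpark version; the pure-Python equivalent mentioned above might look like this (the function name and ID scheme are illustrative). The key property is stability: existing items keep their integer ID across runs, and only new items get fresh ones.

```python
def update_index(index: dict[str, int], item_ids: list[str]) -> dict[str, int]:
    """Incrementally assign stable integer IDs: existing items keep
    their ID, new items get the next free one."""
    next_id = max(index.values(), default=-1) + 1
    for item in item_ids:
        if item not in index:
            index[item] = next_id
            next_id += 1
    return index

catalog_index: dict[str, int] = {}
update_index(catalog_index, ["asterix_big_fight", "superman_2025"])
# On the next batch, only the new item receives an ID.
update_index(catalog_index, ["superman_2025", "breaking_bad"])
```

In production you would persist this mapping alongside the vector index so both survive restarts.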
Let’s go back to our use case: finding similar content, which requires distance metrics.
Similarity Metrics and Validation
Embeddings are defined by their magnitude and direction in space, so a key step is to compare embeddings together to extract information from them. The main measurements used for this comparison are:
- Cosine similarity: computes the cosine of the angle between two vectors and focuses on their direction rather than their magnitude.
- Euclidean distance: computes the straight-line distance between two points in the vector space.
- Dot product: evaluates similarity by multiplying corresponding components and summing the results. It combines both magnitude and direction information.
Metric selection depends on the task you want to solve with embeddings. For STS tasks, cosine similarity is preferred because it measures semantic alignment independent of document length, focusing purely on directional similarity in vector space.
Normalization plays a key role here: by normalizing embeddings to unit length, magnitude information is removed and cosine similarity becomes equivalent to the dot product, which is useful when comparing meaning only. However, in applications where embedding magnitude carries meaningful information (for example, content rarity, confidence, or importance), keeping embeddings unnormalized and using dot product or Euclidean distance is more appropriate, as these measures take both direction and magnitude into account.
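The three measurements and the normalization equivalence can be written out in a few lines of NumPy (helper names are my own):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Angle-based: direction only, magnitude cancels out.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Straight-line distance: uses both direction and magnitude.
    return float(np.linalg.norm(a - b))

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    # Component-wise product sum: uses both direction and magnitude.
    return float(np.dot(a, b))

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# After L2 normalization, cosine similarity equals the dot product.
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
equivalent = abs(cosine_similarity(a, b) - dot_product(a_n, b_n)) < 1e-12
```

This equivalence is why many vector indexes implement cosine search as a dot product over pre-normalized vectors: it is cheaper at query time.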
In our related content case based on the content description (a typical STS), L2 normalization is appropriate (most model libraries handle this by default) and cosine similarity is the right metric for this task. When I created the index with Voyager, I specifically chose a space defined by cosine similarity, so I can easily interact with all the points in the index using this measurement.
To showcase this, I had two validation scenarios in mind that can support the related content feature:
- Get the 5 closest contents to a given content
- Get the closest contents shared by two contents
So let’s extract the 5 closest contents using each embedding model.
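Before trusting the approximate results from an ANN index, it can be useful to have an exact brute-force baseline. A sketch with NumPy, using random vectors as stand-ins for the real description embeddings:

```python
import numpy as np

def top_k_cosine(query: np.ndarray, matrix: np.ndarray, k: int = 5) -> list[int]:
    """Exact nearest neighbours by cosine similarity; a brute-force
    sanity check for the approximate results from the ANN index."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = m @ q  # cosine similarity of the query against every row
    return np.argsort(-sims)[:k].tolist()

rng = np.random.default_rng(7)
catalog = rng.normal(size=(1000, 384))  # stand-in for 200k real embeddings
neighbors = top_k_cosine(catalog[0], catalog, k=5)
```

On 200k vectors this is still fast enough for offline validation, even if it would be too slow for a live endpoint.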

As we can see, MiniLM’s nearest neighbors seem to be more influenced by the beginning of the description, often giving more weight to the title. In contrast, the Gemma model appears to better capture the full description. For Astérix, it surfaces movies with real actors and Alain Chabat, and for the Superman example, the closest results are upcoming sequels that share part of the cast.
If we now look at the contents closest to both items in our selection, here are the top results computed with the two models.

Gemma seems to prioritize animation content related to Superman, whereas MiniLM returns a mix of results, including older animation like Superman 1941 and The first Big Fight Astérix, as well as more recent content such as the documentary Christopher Reeve: Superman Forever.
Public benchmarks like MTEB measure general performance, but they won’t tell you if your embeddings work for your specific use case. For Netprimey+, we needed to know whether “Asterix and Obelix” correctly links to other French animated series—not whether the model scores 0.85 on academic benchmarks.
Building a domain-specific validation set ensures embedding-powered features meet business requirements, which generic benchmarks cannot guarantee, and it could take different forms based on your use case. For the Netprimey+ use case, if I were to design a validation dataset, I would design it with three components in mind:
1. Positive pairs: Content that should match
- Same franchise: “Breaking Bad” ↔ “El Camino”
- Similar genre: “Stranger Things” ↔ “The Umbrella Academy”
2. Negative pairs: Content that shouldn’t match despite surface overlap
- Same actor, different genres: “Pursuit of Happyness” ↔ “Men in Black”
- Similar titles, different shows: “The Office” (US) ↔ “The Office” (UK)
3. Edge cases: Scenarios that expose model weaknesses
- Multilingual: “La Casa de Papel” ↔ “Money Heist” (same show, different languages)
- Remakes: Lion King (2019) ↔ Lion King (1994)
Beyond ranking performance, building a validation dataset could also enable fine-tuning, and stakeholder communication with concrete examples.
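Such a validation set reduces to a small evaluation harness. A sketch with NumPy, where toy 2-d vectors and an illustrative threshold stand in for real description embeddings and a tuned cut-off:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def validation_pass_rate(vectors, positive_pairs, negative_pairs, threshold=0.5):
    """Fraction of checks passed: positive pairs should score above
    the threshold, negative pairs at or below it."""
    checks = [cosine(vectors[a], vectors[b]) > threshold for a, b in positive_pairs]
    checks += [cosine(vectors[a], vectors[b]) <= threshold for a, b in negative_pairs]
    return sum(checks) / len(checks)

# Toy vectors standing in for real description embeddings.
vectors = {
    "Breaking Bad": np.array([1.0, 0.1]),
    "El Camino": np.array([0.9, 0.2]),
    "The Office (US)": np.array([0.0, 1.0]),
}
rate = validation_pass_rate(
    vectors,
    positive_pairs=[("Breaking Bad", "El Camino")],
    negative_pairs=[("Breaking Bad", "The Office (US)")],
)
```

Running this harness against every candidate model turns model selection from a leaderboard lookup into a measurement on your own data.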
Exploring and Showcasing the Embedding Space
Beyond building leaderboards and comparisons like these, it is also important to represent embeddings in a 2D space so a human can interpret them (my earlier heatmap visualisation does not scale to thousands of embeddings). To place all the embeddings in the same space, you can use a technique called dimensionality reduction, which reduces the number of dimensions while preserving structure based on inter-embedding distances. Many algorithms exist for this, such as PCA, t-SNE, or UMAP; choosing the right one would be worth a dedicated article.
These days UMAP is one of the most popular techniques, from what I see in various papers, and there is an efficient implementation in Python here. I applied a basic UMAP to my dataset; below is a 2D visualisation of the descriptions for the two models.

Being able to explore the space of your embeddings is important, and having some interactivity is also essential. You can build a tool to support this exploration, but I recently found that Apple open-sourced a package called Embedding Atlas that provides an interface for exploring embeddings. The package can handle both embedding creation and dimensionality reduction, and it offers a clean interface to explore the reduced embeddings and any extra information related to them.
A perfect start if you want to quickly explore the space and the points you have built in it. But exploring the space can still feel a bit abstract for stakeholders, so my main advice is this: deploy a standalone demo of your feature early, using your embeddings. A simple web app, accessible 24/7, is better than waiting for full UI integration or relying on notebook examples. This way, stakeholders can play with it anytime, get a clear sense of how it behaves, and provide feedback.
For the related content use case, I deployed a tool with Streamlit and Hugging Face Space to explore the catalog and simulate a content page, showing what a related content row might look like:
(In the free tier, I could not add Gemma and MiniLM indexes together, but the idea is to give the tool the ability to choose the embedding mode.)
Closing Notes
Text embeddings are simple on the surface, but finding the right model for the conversion and handling vector storage are not steps to overlook. I hope this article has given you some keys to improve your selection and go beyond the "bigger model, more resources" approach.
I am still actively experimenting with embeddings, so you can expect follow-up articles in the coming months where I dig deeper into this.
Thanks to Josiane Van Dorpe for reading a draft of this.
References
- Bag of Words model — Wikipedia
- TF-IDF — Wikipedia
- Latent Semantic Analysis — Wikipedia
- Word2Vec — Wikipedia
- Doc2Vec — Gensim documentation
- GloVe: Global Vectors for Word Representation — Stanford NLP
- FastText — Meta AI
- ELMo — Wikipedia
- BERT — Hugging Face documentation
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks — arXiv
- E5: Text Embeddings by Weakly-Supervised Contrastive Pre-training — arXiv
- GTE: General Text Embeddings with Multi-stage Contrastive Learning — arXiv
- spaCy — official site
- NLTK — official site
- scikit-learn feature_extraction — scikit-learn documentation
- PySpark MLlib — Apache Spark documentation
- Sentence Transformers — Hugging Face
- MTEB: Massive Text Embedding Benchmark — arXiv
- MMTEB: Multilingual Massive Text Embedding Benchmark — arXiv
- MTEB: Evaluating a Model — MTEB documentation
- MTEB Leaderboard — Hugging Face Spaces
- Matryoshka Representation Learning — Hugging Face blog
- sentence-transformers/all-MiniLM-L6-v2 — Hugging Face model card
- google/embeddinggemma-300m — Hugging Face model card
- MTEB (GitHub) — embeddings-benchmark
- An Intuitive Introduction to Text Embeddings — Stack Overflow blog
- Vector Search Benchmarks — Qdrant
- Embedding storage with Voyager — code snippet — GitHub Gist
- Incremental index in PySpark — code snippet — GitHub Gist
- L2-Norm — Wolfram MathWorld
- UMAP: Uniform Manifold Approximation and Projection — documentation
- embedding-atlas — Apple, GitHub
- Netprimey+ Related Items demo — Hugging Face Spaces