Semantic search, like everything related to machine learning, has become a very popular topic. It’s not only about semantic search itself: with the massive popularity of Large Language Models, many organizations using Solr are trying to implement query logic based on machine learning models, RAG techniques, or rescoring. Since it’s been relatively quiet on this front for a while, it’s time to delve into the topic. In this post, we’ll focus on search and explore what Solr has to offer for data retrieval based on vectors.
Short Problem Description
Let’s assume that our documents consist of a series of fields used for full-text search, along with a field that we want to use for semantic search. This field contains tags that describe the document.
It’s important to highlight that this is not an ideal data example—for instance, we won’t be using context, the tags themselves are single words, and we also want to build a single vector for the document, which doesn’t make the best use of tags. Nonetheless, we’ll proceed with this approach and index the prepared data in Solr.
Solr Preparation
We won’t create our own model but will instead use one available on Hugging Face, specifically https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2. This is one of many models that work with sentences and paragraphs. It’s important to remember that calculating the vector involves not only analyzing the words or the sentence itself but also the context in which they occur. The resulting embedding encodes this information, allowing algorithms to find similar documents. We’ll take a shortcut and use this model to process the tags field, but as mentioned earlier, the model isn’t the main focus today; we’re interested in Solr.
To do this, we’ll use Solr 9.7.0 and a field based on the solr.DenseVectorField class. We’ll start by creating a new field type that looks like this:
{
"name":"knn_vector_384",
"class":"solr.DenseVectorField",
"vectorDimension":384,
"similarityFunction":"cosine",
"knnAlgorithm":"hnsw"
}
Here, we have the name knn_vector_384, the class solr.DenseVectorField, a vector dimension of 384, the cosine similarity method for vector comparison, and the proximity calculation algorithm – hnsw – which is currently the only one available. The full documentation for the solr.DenseVectorField class can be found in the official documentation. It’s worth mentioning that we’ll be using a model that generates 384-dimensional vectors, which is why we define our type in this way.
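To make the similarityFunction setting more concrete, here is a minimal pure-Python sketch of cosine similarity. It only illustrates the formula; it is not how Solr or Lucene implement it internally, and the function name cosine_similarity is our own.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Vectors pointing in the same direction are maximally similar,
# regardless of their length; orthogonal vectors have similarity 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because the measure ignores vector length, only the direction of the embedding matters when documents are compared.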
Our documents will consist of five fields:
- identifier (id field),
- name (name field),
- embeddings (vector field),
- product category (category field),
- product tags (tags field).
To create the test collection and prepare it for indexing the documents we can use the following commands:
$ bin/solr create -c test -s 1 -rf 1
$ curl -XPOST -H 'Content-type:application/json' 'http://localhost:8983/solr/test/schema' --data-binary '{
"add-field-type" : {
"name":"knn_vector_384",
"class":"solr.DenseVectorField",
"vectorDimension":384,
"similarityFunction":"cosine",
"knnAlgorithm":"hnsw"
},
"add-field" : [
{
"name":"vector",
"type":"knn_vector_384",
"indexed":true,
"stored":false
},
{
"name":"name",
"type":"text_general",
"multiValued":false,
"indexed":true,
"stored":true
},
{
"name":"category",
"type":"string",
"multiValued":false,
"indexed":true,
"stored":true
},
{
"name":"tags",
"type":"string",
"multiValued":true,
"indexed":true,
"stored":true
}
]
}'
Data Preparation
Preparing the data is a bit more complex. As mentioned, instead of creating our own model, we’ll use the sentence-transformers/all-MiniLM-L6-v2 model from Hugging Face to turn the tags field of each document into a vector. The model itself isn’t our main focus today; we’re focused on Solr.
To index the documents we will use the following Python code that we store in a file called index.py:
import pysolr
import uuid
from typing import List, Dict
import torch
from transformers import AutoTokenizer, AutoModel
import numpy as np


class DocumentEmbedder:
    def __init__(self):
        self.model_name = "sentence-transformers/all-MiniLM-L6-v2"
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModel.from_pretrained(self.model_name)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def mean_pooling(self, model_output, attention_mask):
        token_embeddings = model_output[0]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    def get_embedding(self, text: str) -> np.ndarray:
        encoded_input = self.tokenizer(
            text,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors='pt'
        )
        encoded_input = {k: v.to(self.device) for k, v in encoded_input.items()}
        with torch.no_grad():
            model_output = self.model(**encoded_input)
        sentence_embeddings = self.mean_pooling(model_output, encoded_input['attention_mask'])
        sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
        return sentence_embeddings.cpu().numpy()


class SolrIndexer:
    def __init__(self, solr_url: str = 'http://localhost:8983/solr/test'):
        self.solr = pysolr.Solr(solr_url, always_commit=True)
        self.embedder = DocumentEmbedder()

    def create_document(self, name: str, tags: str, category: str) -> Dict:
        """Create a document with embeddings."""
        embedding = self.embedder.get_embedding(tags)
        doc = {
            'id': str(uuid.uuid4()),
            'name': name,
            'tags': tags.split(', '),
            'category': category,
            'vector': embedding.flatten().tolist()
        }
        return doc

    def index_documents(self, documents: List[Dict]):
        """Index multiple documents to Solr."""
        try:
            self.solr.add(documents)
            print(f"Successfully indexed {len(documents)} documents")
        except Exception as e:
            print(f"Error indexing documents: {str(e)}")


def main():
    documents = [
        {"name": "Apple iPhone 13", "tags": "phone, smartphone, screen, iOS", "category": "phone"},
        {"name": "Apple iPhone 14", "tags": "phone, smartphone, screen, iOS", "category": "phone"},
        {"name": "Apple iPhone 15", "tags": "phone, smartphone, screen, iOS", "category": "phone"},
        {"name": "Samsung Galaxy S24", "tags": "phone, smartphone, screen, Android", "category": "phone"},
        {"name": "Apple iPod", "tags": "music, screen, iOS", "category": "music player"},
        {"name": "Samsung Microwave", "tags": "kitchen, cooking, electric", "category": "household"}
    ]
    indexer = SolrIndexer()
    solr_documents = []
    for doc in documents:
        solrdoc = indexer.create_document(doc['name'], doc['tags'], doc['category'])
        solr_documents.append(solrdoc)
        print(f"Created document: {solrdoc['name']}")
        print(f"Vector length: {len(solrdoc['vector'])}")
    indexer.index_documents(solr_documents)
    embedder = DocumentEmbedder()
    embeddings = embedder.get_embedding("song player")
    print(f"Embeddings: {str(embeddings)}")


if __name__ == "__main__":
    main()
A few comments about the code:
- The DocumentEmbedder class is responsible for creating the vector; if we want to switch models, we only need to change the value of self.model_name. However, it’s important to remember that we configured Solr to work with 384-dimensional vectors, so if the new model produces vectors with different characteristics, a different Solr configuration will be required.
- The SolrIndexer class handles the indexing of documents.
- The documents to be indexed are defined in the documents variable.
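Since the vector dimension is fixed in the Solr schema, it can be worth checking each embedding before sending it. The following is a minimal sketch of such a check; the validate_vector helper and the EXPECTED_DIMENSION constant are hypothetical names introduced here for illustration and are not part of index.py.

```python
import math

# Assumption: this must match vectorDimension in the knn_vector_384 field type.
EXPECTED_DIMENSION = 384


def validate_vector(vector, expected_dimension=EXPECTED_DIMENSION):
    """Return the vector as a list of floats, or raise ValueError."""
    values = [float(v) for v in vector]
    if len(values) != expected_dimension:
        raise ValueError(
            f"expected {expected_dimension} dimensions, got {len(values)}"
        )
    if not all(math.isfinite(v) for v in values):
        raise ValueError("vector contains NaN or infinite values")
    return values


# A vector with the wrong size (for example from a different model)
# is rejected before it ever reaches Solr.
try:
    validate_vector([0.1] * 768)
except ValueError as error:
    print(f"Rejected: {error}")
```

In index.py, such a check could be applied to embedding.flatten().tolist() inside create_document, so that a model swap fails early with a clear message.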
To run the code, a few dependencies must be installed. We tested the code on macOS running on an ARM-based processor. The requirements.txt file looks as follows:
transformers==4.37.2
torch>=2.2.0
numpy>=1.24.3
pandas>=2.1.4
sentencepiece==0.1.99
--extra-index-url https://download.pytorch.org/whl/cpu
pysolr==3.9.0
uuid==1.30
You can install the dependencies and run the code using the following commands:
$ pip install -r requirements.txt
$ python index.py
Querying
Once the documents are successfully indexed, we can start running queries to retrieve them. For semantic search, we can use one of two available query parsers:
- knn
- vectorSimilarity
When using the knn parser, we get the top K documents, whereas with the vectorSimilarity parser, we retrieve documents that exceed a specified vector similarity threshold. An example query looks like this:
http://localhost:8983/solr/test/query?q={!knn f=vector topK=10}[...]
As shown, the above query uses the knn parser, operates on data from the field named vector, and retrieves up to 10 documents. The […] section is where our 384-dimensional query vector should be placed. How can we generate it? We can modify the previous code and calculate the vector for our query in the main function:
.
.
.
def main():
    embedder = DocumentEmbedder()
    embeddings = embedder.get_embedding("song player").flatten().tolist()
    print(f"Embeddings: {str(embeddings)}")


if __name__ == "__main__":
    main()
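Pasting 384 numbers into a URL by hand is error-prone, so the query string can also be assembled in code. Below is a minimal sketch; build_knn_query is a hypothetical helper of our own, and the short vector is a placeholder standing in for a real embedding.

```python
def build_knn_query(vector, field="vector", top_k=10):
    """Build the value of Solr's q parameter for the knn query parser."""
    # Solr expects the vector as a bracketed, comma-separated list of floats.
    vector_literal = "[" + ",".join(repr(float(v)) for v in vector) + "]"
    return f"{{!knn f={field} topK={top_k}}}{vector_literal}"


# Placeholder values; a real query vector would have 384 entries.
query = build_knn_query([0.5, -0.25, 0.125], top_k=3)
print(query)
```

The resulting string can be passed as the q parameter, for example with pysolr's search method on the same client used for indexing. For long vectors, sending the parameters in an HTTP POST body instead of the URL is one way to avoid URL length limits.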
The generated vector can now be used in Solr. Before we do that, let’s first try a regular full-text query against the tags field with song player, like this:
http://localhost:8983/solr/test/select?q=tags:"song player"
The results, as expected, will be empty:
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":1,
"params":{
"q":"tags:\"song player\""
}
},
"response":{
"numFound":0,
"start":0,
"numFoundExact":true,
"docs":[ ]
}
}
This is the situation we might have expected – we don’t have a document with that exact tag. Our Apple iPod document has one of its tags field values set to music. Close, but not the same as song player, at least from the perspective of full-text search, and we aren’t using any synonyms.
So, let’s see how semantic search handles this. To do this, we need to generate a vector from our phrase. Using the code we’ve seen above, our song player query, after vector generation, will look like this:
http://localhost:8983/solr/test/query?fl=name,tags,score,*&q={!knn f=vector topK=10}[-2.00312417e-02,-8.43894295e-03,4.63341828e-03,-7.11187869e-02,-5.41122817e-02,8.86119753e-02,1.43207222e-01,6.48824051e-02,-2.42583733e-02,5.58690503e-02,1.17570441e-02,-8.10808595e-03,5.87443374e-02,-8.77943709e-02,-2.66626403e-02,9.57359672e-02,-3.80967893e-02,6.07994124e-02,-2.05643158e-02,-4.67214771e-02,-1.52253941e-01,-9.71766468e-03,-6.08050935e-02,2.88447887e-02,-2.94179078e-02,1.03829138e-01,-8.31280462e-03,1.05928726e-01,-5.67206480e-02,-7.29391426e-02,8.77882075e-03,3.76622789e-02,1.22778863e-01,7.63954269e-03,-1.07958838e-01,-2.39758939e-02,-4.40737829e-02,-1.58879627e-02,-8.71627405e-02,-1.86017044e-02,-1.60255414e-02,3.48703228e-02,-3.19783390e-02,-1.32465595e-03,-6.66615218e-02,-5.59806228e-02,-3.83943357e-02,-3.26939276e-03,3.50921564e-02,8.62423256e-02,-4.67025265e-02,3.27739231e-02,1.36479884e-02,2.14965921e-02,9.94029175e-03,8.50510597e-03,6.12449907e-02,1.26747936e-01,1.20850094e-02,8.25305879e-02,-3.12836142e-03,-7.23205507e-02,-1.25491228e-02,-4.12022136e-02,8.32084790e-02,-4.56760749e-02,1.24245733e-02,2.99928896e-03,-1.72226392e-02,1.40204923e-02,-5.27366474e-02,8.62174854e-02,-1.23909842e-02,-5.11718355e-03,-5.19047305e-03,-3.26551199e-02,3.38410065e-02,-2.57184170e-02,4.93465960e-02,2.16306932e-02,5.44718355e-02,-5.13004661e-02,-6.29040822e-02,-1.14791520e-01,2.42656711e-02,-5.12294583e-02,-1.41068548e-02,-1.69403590e-02,-3.93591821e-02,2.21420880e-02,-7.29445517e-02,3.97052318e-02,4.06186096e-02,2.38778368e-02,-2.60719825e-02,4.29015867e-02,8.55240896e-02,-1.29460618e-01,-6.10361844e-02,1.73135296e-01,5.50046675e-02,2.11313833e-02,8.09541941e-02,8.49610151e-05,1.86103079e-02,-7.11245313e-02,-2.20125029e-03,8.21124241e-02,-3.17892991e-02,-5.61453328e-02,3.08496933e-02,-2.24805139e-02,-6.99905232e-02,-1.38647379e-02,2.40420792e-02,1.32726450e-02,3.10667846e-02,1.00519679e-01,-3.77691817e-03,1.09965838e-02,-2.87625156e-02,-3.69284451e-02,-4.66967747e-02,1.17862746e-02,-6.05449453e-02,1.80788971e-02,-7.05301017e-03,-2.43471990e-33,4.00696788e-03,-8.39618146e-02,4.76996899e-02,4.31838222e-02,6.59582578e-03,-2.13763528e-02,1.40098054e-02,4.72629964e-02,-7.40025416e-02,-1.17481779e-02,4.74894270e-02,-2.08981428e-02,-4.10624146e-02,6.36240691e-02,1.19205341e-01,-6.63071573e-02,-5.37127964e-02,1.66804232e-02,-5.87310642e-03,-1.00764409e-02,2.95753870e-02,7.51189962e-02,3.69122699e-02,4.02592048e-02,3.01825050e-02,-3.49015556e-02,3.05062942e-02,-5.49735427e-02,7.76632428e-02,-1.31180901e-02,1.32043369e-03,-7.29682148e-02,4.55190837e-02,-3.90759930e-02,-2.00935621e-02,2.90088002e-02,-4.90487851e-02,-4.68467623e-02,-3.78382057e-02,-8.62234011e-02,-8.21669679e-03,-1.02636190e-02,-4.22154590e-02,-1.01420805e-02,-9.29822177e-02,-4.51174043e-02,1.08679747e-02,4.72158194e-02,1.13880262e-02,2.26492099e-02,1.08083403e-02,4.73796055e-02,-2.11352482e-02,2.94824075e-02,3.32471356e-02,-6.10656738e-02,6.39934689e-02,7.52783706e-03,2.01779045e-02,8.42025783e-03,7.92021900e-02,7.49795884e-02,3.60657759e-02,-5.19200675e-02,-2.61804424e-02,4.30282280e-02,9.71625224e-02,-5.95051460e-02,9.75757018e-02,2.11117323e-02,-3.37768793e-02,-2.30997894e-02,5.96088655e-02,3.78986225e-02,-8.59187767e-02,2.64272131e-02,-5.98726794e-02,-7.63988122e-02,-8.67564604e-03,-4.02248465e-02,-7.74084628e-02,-7.17599411e-03,-6.49854094e-02,5.07407561e-02,-4.55319695e-02,-2.79121399e-02,-3.54578048e-02,-8.02342817e-02,-1.16430726e-02,-3.30321328e-03,-2.42636316e-02,4.07439955e-02,8.91774427e-03,1.66885648e-02,-5.11291474e-02,2.29379760e-33,-2.25291476e-02,-3.75496261e-02,7.23349378e-02,2.80879121e-02,1.11529857e-01,9.52087902e-03,3.09992954e-02,3.44425067e-02,-1.44962538e-02,3.70973274e-02,-3.42840664e-02,-4.15693037e-03,4.41893972e-02,-5.78483054e-03,2.27168929e-02,2.73053981e-02,-7.61534553e-03,3.05881090e-02,2.22553536e-02,-2.11785771e-02,-8.38738605e-02,-3.54573503e-02,5.55671491e-02,-3.25255133e-02,-6.17663078e-02,-3.31615508e-02,1.28180429e-01,-1.09783243e-02,-4.22303416e-02,8.85861460e-03,1.03141055e-01,-1.19012371e-02,-5.68600930e-02,-9.76827294e-02,-3.66187817e-03,6.69727027e-02,5.01095466e-02,3.45795639e-02,-4.87284325e-02,-2.88722366e-02,3.69899273e-02,4.80774455e-02,9.51483287e-03,1.05140977e-01,9.12032463e-03,7.83540029e-03,3.00276410e-02,1.10446520e-01,2.02773716e-02,6.35717139e-02,-6.48668930e-02,-3.63331586e-02,5.26458211e-02,-8.43572319e-02,-6.71359748e-02,-3.95149644e-03,-3.19298953e-02,-3.83070633e-02,-1.30710788e-02,3.45022231e-02,3.50211672e-02,-1.17928414e-02,-2.96711233e-02,-1.07177338e-02,-3.55232134e-02,1.01115681e-01,3.25268582e-02,3.25036608e-02,-1.87526215e-02,-2.26221383e-02,1.12875113e-02,5.46618700e-02,-9.37254541e-03,5.56712523e-02,-4.75139245e-02,8.30354728e-03,-1.24001637e-01,7.75059313e-02,-2.27639657e-02,-7.71744251e-02,3.50985602e-02,-1.18714431e-02,2.85322741e-02,-1.23035219e-02,3.05858813e-03,-3.02043278e-02,7.86332041e-02,2.63012294e-02,-2.10798401e-02,-2.86972001e-02,4.53165434e-02,5.47545776e-02,-1.12766981e-01,4.01742896e-03,-7.09661469e-03,-1.23211663e-08,-8.58379975e-02,1.30043477e-02,2.08473746e-02,-4.64787520e-02,5.16857654e-02,1.19483555e-02,3.52647863e-02,-1.09199435e-01,3.83973494e-02,3.78849730e-02,3.93284820e-02,-7.54635110e-02,8.52666888e-03,7.53920805e-03,2.69943215e-02,-2.04220396e-02,-5.33770137e-02,9.14978534e-02,-5.69638461e-02,4.34034877e-02,1.58696324e-02,6.27059937e-02,1.16019472e-02,-3.03725917e-02,2.53463928e-02,-1.08486209e-02,-4.15410772e-02,6.96198866e-02,1.25364019e-02,8.21540307e-04,3.80332284e-02,7.23319054e-02,2.98110046e-03,-6.22513592e-02,6.56076372e-02,2.09525451e-02,2.19415948e-02,-1.34995428e-03,-9.06020775e-02,2.12142467e-02,-1.63224037e-03,9.72291678e-02,-3.10167447e-02,-3.09801828e-02,-1.90541670e-02,-3.96139771e-02,2.09502075e-02,-9.64867976e-03,1.07652992e-02,2.92998832e-02,-6.14691451e-02,2.42477246e-02,-4.87532094e-02,2.83491425e-02,8.26478079e-02,4.19540368e-02,-3.42742465e-02,5.80971912e-02,-2.19746046e-02,1.28769483e-02,-1.10125542e-02,3.03942598e-02,2.98435502e-02,5.43411188e-02]
The Results
The query we just ran returns the following results:
{
"responseHeader":{
.
.
.
},
"response":{
"numFound":6,
"start":0,
"maxScore":0.7036504,
"numFoundExact":true,
"docs":[{
"id":"0030eb55-95ee-4042-b574-1f1f4d2d0dd4",
"name":"Apple iPod",
"tags":["music","screen","iOS"],
"_version_":1815543355046625280,
"_root_":"0030eb55-95ee-4042-b574-1f1f4d2d0dd4",
"score":0.7036504
},{
"id":"f5122858-22bf-47fe-84b4-f0fa5e608d22",
"name":"Samsung Galaxy S24",
"tags":["phone","smartphone","screen","Android"],
"_version_":1815543355045576704,
"_root_":"f5122858-22bf-47fe-84b4-f0fa5e608d22",
"score":0.5939434
},{
"id":"9b6aa3b5-f02a-4969-ae21-daaafc58ece4",
"name":"Apple iPhone 13",
"tags":["phone","smartphone","screen","iOS"],
"_version_":1815543355005730816,
"_root_":"9b6aa3b5-f02a-4969-ae21-daaafc58ece4",
"score":0.5706577
},{
"id":"bbb744ae-ca90-4663-8a8e-fdbe8e00d935",
"name":"Apple iPhone 14",
"tags":["phone","smartphone","screen","iOS"],
"_version_":1815543355042430976,
"_root_":"bbb744ae-ca90-4663-8a8e-fdbe8e00d935",
"score":0.5706577
},{
"id":"f750c960-229b-4fd2-a9f7-c891afbb19d3",
"name":"Apple iPhone 15",
"tags":["phone","smartphone","screen","iOS"],
"_version_":1815543355044528128,
"_root_":"f750c960-229b-4fd2-a9f7-c891afbb19d3",
"score":0.5706577
},{
"id":"756e14fc-6320-4fee-9713-7d0c6123fa8a",
"name":"Samsung Microwave",
"tags":["kitchen","cooking","electric"],
"_version_":1815543355047673856,
"_root_":"756e14fc-6320-4fee-9713-7d0c6123fa8a",
"score":0.5591891
}]
}
}
As we can see, we retrieved all the documents, with our Apple iPod having the highest score, which is now calculated based not on document relevance to the query but on vector similarity. In our case, this is a very limited sample of documents and a tag field that is far from ideal, but we have a functioning, relatively simple example. The knn parser returned the top 10 documents, which, in our case, means all of them.
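The scores above are not raw cosine values. To our knowledge, Lucene (on which Solr is built) maps the cosine similarity to a non-negative score using (1 + cosine) / 2; treat this mapping as an assumption and verify it against the version you run. Under that assumption, a small sketch for converting between the two looks like this (both function names are our own):

```python
def cosine_to_score(cosine):
    """Map a cosine similarity in [-1, 1] to a score in [0, 1]."""
    # Assumption: mirrors the (1 + cosine) / 2 normalization used by Lucene.
    return max((1.0 + cosine) / 2.0, 0.0)


def score_to_cosine(score):
    """Inverse of cosine_to_score for scores in (0, 1]."""
    return 2.0 * score - 1.0


# Score of the Apple iPod document from the response above.
print(score_to_cosine(0.7036504))
```

This also explains why even the clearly unrelated Samsung Microwave document receives a score above 0.5: under this mapping, any vector with a positive cosine similarity to the query scores above 0.5.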
If we only want to retrieve documents with a score above a certain threshold, we can use the vectorSimilarity parser. For example, if we want to fetch documents with a similarity score equal to or greater than 0.7, our query would look like this:
http://localhost:8983/solr/test/query?q={!vectorSimilarity f=vector minReturn=0.7}[...]
And the results of the above query would look as follows:
{
"responseHeader":{
.
.
.
},
"response":{
"numFound":1,
"start":0,
"maxScore":0.7036504,
"numFoundExact":true,
"docs":[{
"id":"0030eb55-95ee-4042-b574-1f1f4d2d0dd4",
"name":"Apple iPod",
"tags":["music","screen","iOS"],
"_version_":1815543355046625280,
"_root_":"0030eb55-95ee-4042-b574-1f1f4d2d0dd4",
"score":0.7036504
}]
}
}
Now we get only a single document, the only one with a score of at least 0.7.
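The difference between the two parsers can be illustrated on the client side using the scores from the knn response shown earlier. This is only a sketch of the threshold semantics, not of how Solr traverses the vector index internally; filter_by_min_return is a hypothetical helper.

```python
# (name, score) pairs taken from the knn response shown earlier.
knn_results = [
    ("Apple iPod", 0.7036504),
    ("Samsung Galaxy S24", 0.5939434),
    ("Apple iPhone 13", 0.5706577),
    ("Apple iPhone 14", 0.5706577),
    ("Apple iPhone 15", 0.5706577),
    ("Samsung Microwave", 0.5591891),
]


def filter_by_min_return(results, min_return):
    """Keep only the results whose score reaches the threshold."""
    return [(name, score) for name, score in results if score >= min_return]


# knn returns a fixed number of neighbors; a threshold returns however
# many documents happen to be similar enough.
print(filter_by_min_return(knn_results, 0.7))
print(len(filter_by_min_return(knn_results, 0.57)))
```

With a threshold, the size of the result set depends on the data: lowering the value from 0.7 to 0.57 would bring back the phones as well, while the microwave would still be excluded.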
Filtering
The performance of the solution will depend on the number of candidates—that is, the documents that Solr will need to process. To reduce this number, we can use filtering with the preFilter parameter. For example, if we only want to retrieve phones (where the category field is equal to phone), our query would look like this:
http://localhost:8983/solr/test/query?q={!knn f=vector topK=10 preFilter=category:phone}[...]
This time the results would look as follows:
{
"responseHeader":{
.
.
.
},
"response":{
"numFound":4,
"start":0,
"maxScore":0.5939434,
"numFoundExact":true,
"docs":[{
"id":"a0f6fe9b-e82a-437e-b8ef-c12b6ee8d4a3",
"name":"Samsung Galaxy S24",
"tags":["phone","smartphone","screen","Android"],
"category":"phone",
"_version_":1815544743947403264,
"_root_":"a0f6fe9b-e82a-437e-b8ef-c12b6ee8d4a3",
"score":0.5939434
},{
"id":"ff2a471f-6d31-4dc1-9ec0-f5eacbfc1540",
"name":"Apple iPhone 13",
"tags":["phone","smartphone","screen","iOS"],
"category":"phone",
"_version_":1815544743943208960,
"_root_":"ff2a471f-6d31-4dc1-9ec0-f5eacbfc1540",
"score":0.5706577
},{
"id":"9d2955b2-ba6d-4f5d-a477-66909725b362",
"name":"Apple iPhone 14",
"tags":["phone","smartphone","screen","iOS"],
"category":"phone",
"_version_":1815544743945306112,
"_root_":"9d2955b2-ba6d-4f5d-a477-66909725b362",
"score":0.5706577
},{
"id":"93e5bd31-78b5-45b9-9c61-a0d8e8c0badb",
"name":"Apple iPhone 15",
"tags":["phone","smartphone","screen","iOS"],
"category":"phone",
"_version_":1815544743946354688,
"_root_":"93e5bd31-78b5-45b9-9c61-a0d8e8c0badb",
"score":0.5706577
}]
}
}
This time, we only received phones, meaning the documents were properly filtered.
How Solr applies filters also depends on the query. If the preFilter parameter is absent, the knn parser falls back to the standard filter parameters known from full-text search, such as fq, and uses them to pre-filter the candidates. We can therefore construct our filtered query using the fq parameter, and it would look like this:
http://localhost:8983/solr/test/query?fq=category:phone&q={!knn f=vector topK=10}[...]
The results would be the same – four documents which we already saw.
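The two filtering variants differ only in where the filter is placed in the request. A minimal sketch that builds the request parameters for both follows; the helper names are hypothetical, and the short vector literal is a placeholder for a real 384-dimensional vector.

```python
def knn_with_prefilter(vector_literal, filter_query, field="vector", top_k=10):
    """Filter passed as a local parameter of the knn parser."""
    q = f"{{!knn f={field} topK={top_k} preFilter={filter_query}}}{vector_literal}"
    return {"q": q}


def knn_with_fq(vector_literal, filter_query, field="vector", top_k=10):
    """Filter passed as a regular fq request parameter."""
    return {"q": f"{{!knn f={field} topK={top_k}}}{vector_literal}", "fq": filter_query}


# Placeholder vector literal used only to keep the output short.
print(knn_with_prefilter("[0.1,0.2]", "category:phone"))
print(knn_with_fq("[0.1,0.2]", "category:phone"))
```

Either dictionary can be sent as the request parameters of a query to the /query handler. The fq variant has the practical advantage of being familiar from regular full-text queries.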
Summary
We’ve seen some of the capabilities Solr offers when it comes to using vectors. We’ve explored what the two parsers allow and how to use them. Of course, this example is very simple. If we want to start working on our own projects, we first need to invest time in finding or preparing a model, or alternatively, finding a model and fine-tuning it. The right model depends heavily on the use case. There are many models available, such as Snowflake/snowflake-arctic-embed-m-v1.5, Marqo/marqo-ecommerce-embeddings-B, or Marqo/marqo-ecommerce-embeddings-L. It’s important to spend time selecting the right model, testing it, and evaluating its performance. In semantic search, model selection is crucial because the model is responsible for creating the vectors that will be used for searching.