Getting Started with Embedding using Google Gemini
Embeddings are numerical representations of objects. In natural language processing (NLP), text embeddings can capture semantic meanings and relationships between words or phrases, which enables machines to understand and process human language effectively.
Many providers, including Google Gemini, OpenAI, and Azure AI, can generate embeddings from text documents. OpenAI no longer provides free-tier credit, so in this post we will show you how to generate text embeddings using the Google Gemini API.
Setting Up Google Gemini API Key
Sign up for or sign in to your Google Gemini account. Then navigate to the API key page and click “Create API Key”. Once your API key is generated, save it somewhere safe and do not share it with anyone.
Install the Google Gemini Python package
!pip install -q -U google-generativeai
Initialize the Gemini API client
import google.generativeai as genai
GOOGLE_API_KEY="<YOUR_GEMINI_API_KEY>"
genai.configure(api_key=GOOGLE_API_KEY)
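Since the key should stay private, one option is to read it from an environment variable instead of pasting it into the notebook. The sketch below assumes the key is stored in a variable named GEMINI_API_KEY. Both that name and the load_api_key helper are our own choices for illustration, not something the Gemini package requires.

```python
import os


def load_api_key(env_var="GEMINI_API_KEY"):
    # env_var is a hypothetical variable name chosen for this example.
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Usage (assumes the variable is already set in your shell or notebook):
# genai.configure(api_key=load_api_key())
```

With this approach the key never appears in the notebook itself, so sharing the notebook does not leak it.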
Generate Text Embeddings
We create a function that uses the Gemini API to generate an embedding from text.
def get_embedding(text, model="models/text-embedding-004"):
    # Task types: https://ai.google.dev/gemini-api/docs/get-started/tutorial?lang=python#use_embeddings
    # Model list: https://ai.google.dev/gemini-api/docs/embeddings#generate-embeddings
    # Replace newlines with spaces before embedding.
    text = text.replace("\n", " ")
    result = genai.embed_content(
        model=model,  # use the caller's model rather than a hard-coded name
        content=text,
        task_type="semantic_similarity")
    if result is not None:
        return result['embedding']
    else:
        print("Error generating text embedding")
        return None
Using this function, we can generate embeddings for any text (subject to the model's maximum token limit). The embeddings can be used for different tasks:
- Search and Retrieval
- Similarity Measure
- Downstream prediction
Google Gemini provides 2 embedding models and 4 task types. Specifying a different task_type results in a different embedding for the same text.
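To make the effect of task_type concrete, here is a minimal sketch for the search and retrieval case, where documents and the query are embedded with different task types. It assumes the task type names "retrieval_document" and "retrieval_query", and it takes the embedding function as a parameter (embed_fn, a name we introduce here) so the logic can be tried without an API key. In practice you would pass genai.embed_content.

```python
def embed_for_retrieval(documents, query, embed_fn, model="models/text-embedding-004"):
    """Embed documents and a query with retrieval-specific task types.

    embed_fn is assumed to behave like genai.embed_content: it accepts
    model, content and task_type keyword arguments and returns a dict
    with an "embedding" key.
    """
    doc_vectors = [
        embed_fn(model=model, content=doc, task_type="retrieval_document")["embedding"]
        for doc in documents
    ]
    query_vector = embed_fn(
        model=model, content=query, task_type="retrieval_query"
    )["embedding"]
    return doc_vectors, query_vector


# Usage with the real client (requires a configured API key):
# doc_vectors, query_vector = embed_for_retrieval(
#     ["Data engineer, Python and SQL"], "python jobs", genai.embed_content)
```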
Example with Text Embedding similarity measure
Function to calculate embedding cosine similarity:
import numpy as np
from numpy.linalg import norm

def calculate_similarity(embedding_1, embedding_2):
    norm_1 = norm(embedding_1)
    norm_2 = norm(embedding_2)
    if norm_1 == 0 or norm_2 == 0:
        return 0.0
    else:
        return np.dot(embedding_1, embedding_2)/(norm_1*norm_2)
Let’s say we want to create a job recommender system. We can use the get_embedding function to generate an embedding for each job posting and for a user's resume. Then, we calculate the similarity between each job posting's embedding and the resume's embedding. The job that is most suitable for the user should have the highest similarity score.
The “semantic_similarity” task_type and the cosine similarity measure are suitable for this use case. You can see a complete example in this Google Colab Notebook.
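The ranking step described above can be sketched as follows. This is an illustration only: the rank_jobs helper and the toy 3-dimensional vectors are made up for the example and stand in for real Gemini embeddings. The sketch also re-implements cosine similarity with the standard library so that it runs on its own.

```python
import math


def cosine(a, b):
    # Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_jobs(resume_embedding, job_embeddings):
    """Return (job_id, score) pairs sorted from most to least similar."""
    scores = [
        (job_id, cosine(resume_embedding, embedding))
        for job_id, embedding in job_embeddings.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings.
jobs = {
    "data_engineer": [0.9, 0.1, 0.0],
    "graphic_designer": [0.0, 0.2, 0.9],
}
resume = [1.0, 0.0, 0.0]
ranking = rank_jobs(resume, jobs)
print(ranking[0][0])  # data_engineer
```

With real data, the dictionary values would be the vectors returned by get_embedding for each job posting, and resume would be the embedding of the user's resume.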
What’s next
While embeddings are powerful, they might not be enough for use cases that require keyword matching. In the next post, we will talk about how to use Gemini AI to extract relevant keywords from text.