Introduction
In this blog post, you will be guided through the process of effectively extracting keywords from a given context. Before we dive into the implementation, let's first understand what keyword extraction actually is.
Here's a brief overview of keyword extraction, generated with the help of ChatGPT.
Keyword extraction is a natural language processing (NLP) technique that involves identifying and extracting the most relevant words or phrases from a given text. The goal is to capture the essential topics or themes within the content, allowing for a concise representation of the document's key information. This process is valuable in various applications such as information retrieval, document summarization, and content categorization.
Keyword extraction plays a crucial role in improving search engine results, summarizing large documents, organizing content, and facilitating information retrieval in a more efficient and meaningful manner.
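To make the definition concrete, here is a deliberately naive sketch that ranks words by how often they occur after dropping common stop words. It only illustrates the idea of "most relevant words" and is not the LLM-based approach used in the rest of this post; the stop-word list, the function name `naive_keywords`, and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word list (an assumption for this sketch)
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to",
              "in", "for", "on", "with", "that", "this", "it"}

def naive_keywords(text, top_n=3):
    """Return the top_n most frequent non-stop-words in text."""
    # Lowercase the text and keep alphabetic tokens only
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

sample = ("Search engines rank documents. "
          "Search quality depends on documents and on search terms.")
print(naive_keywords(sample, top_n=2))  # ['search', 'documents']
```

Frequency alone ignores context and phrasing, which is one reason to hand the task to a language model instead.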
To get started, authenticate with Google Cloud (needed only when running in Google Colab) and initialize Vertex AI.
import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth
    auth.authenticate_user()

# Define project information (fill in your own values)
PROJECT_ID = "<>"  # @param
LOCATION = "<>"  # @param

# Initialize Vertex AI
import vertexai
vertexai.init(project=PROJECT_ID, location=LOCATION)
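When running outside a notebook, one option is to read the project settings from the environment instead of editing placeholders by hand. The sketch below is only an illustration: the helper name `load_project_settings`, the environment variable names `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_REGION`, and the `us-central1` default are assumptions, not something this post prescribes.

```python
import os

def load_project_settings(default_location="us-central1"):
    """Read the project ID and region from environment variables.

    The variable names and the default region are assumptions for this sketch.
    """
    project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
    location = os.environ.get("GOOGLE_CLOUD_REGION", default_location)
    if not project_id:
        # Fail early with a clear message rather than at the first API call
        raise ValueError("Set GOOGLE_CLOUD_PROJECT before initializing Vertex AI.")
    return project_id, location

# Possible usage:
# PROJECT_ID, LOCATION = load_project_settings()
# vertexai.init(project=PROJECT_ID, location=LOCATION)
```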
Keyword extraction with an LLM depends chiefly on a carefully designed prompt. Here's the code snippet that builds it.
def get_keyword_extraction_prompt(content):
    # Embed the text to analyze right after the opening instruction
    prompt = f"""Extract key keywords or phrases from the following text:

{content}

"""
    # The instructions and schema sit in a plain (non-f) string so the JSON braces stay literal
    prompt = prompt + """1. Identify and list the most important keywords or key phrases in the text. These keywords should capture the main topics, concepts, or subjects discussed in the text.
2. If there are subtopics or secondary themes mentioned in the text, list them as well. Ensure that the extracted keywords accurately represent the content's context.
3. Include the exact text span or sentence where each keyword or phrase is found in the original text.
4. If there are any ambiguous keywords or phrases, indicate the uncertainty and provide possible interpretations or context that might clarify the intended meaning.
5. Consider the context, relevance, and frequency of the keywords when determining their significance.
6. If the text suggests any actions, decisions, or recommendations related to the extracted keywords, provide a brief summary of these insights.
Ensure that your keyword extraction results are relevant, concise, and capture the essential topics within the text.
Here's the output schema:
```
{
    "KeywordExtraction": [
        {
            "Keyword": "",
            "Context": "",
            "TextSpan": ""
        }
    ]
}
```
Do not respond with your own suggestions or recommendations or feedback.
"""
    return prompt
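One detail worth keeping in mind when building a prompt like this: an f-string treats `{` and `}` as placeholders, so a literal JSON schema is easiest to keep in a separate plain string. The following is a minimal sketch of that split; the sample text and the variable names `header` and `schema` are made up for illustration.

```python
import json

# Hypothetical sample text standing in for the real input
content = "Vertex AI provides access to Gemini models."

# The f-string carries only the variable part of the prompt
header = f"Extract key keywords or phrases from the following text:\n{content}\n"

# The schema stays in a plain string, so its braces are taken literally
schema = '{"KeywordExtraction": [{"Keyword": "", "Context": "", "TextSpan": ""}]}'

prompt = header + "Here's the output schema:\n" + schema

# The input text made it into the prompt, and the schema is still valid JSON
assert content in prompt
assert json.loads(schema)["KeywordExtraction"][0]["Keyword"] == ""
```

The alternative is to double every brace inside the f-string, which works but makes the schema harder to read.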
Now let's look at some generic code for executing the above keyword extraction prompt using the Google Gemini Pro model. Here's the code snippet.
import vertexai
from vertexai.preview.generative_models import GenerativeModel

def execute_prompt(prompt, max_output_tokens=8192):
    model = GenerativeModel("gemini-pro")
    # Stream the response; temperature 0 keeps the output as repeatable as possible
    responses = model.generate_content(
        prompt,
        generation_config={
            "max_output_tokens": max_output_tokens,
            "temperature": 0,
            "top_p": 1
        },
        stream=True,
    )
    # Collect the text of each streamed chunk
    final_response = []
    for response in responses:
        final_response.append(response.candidates[0].content.parts[0].text)
    # Join without a separator so the chunks recombine into the original text
    return "".join(final_response)
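Streamed chunks can be cut anywhere, even in the middle of a word, so whatever separator is used when joining them ends up inside the final text. The sketch below uses a hypothetical list of chunk strings in place of a live model call (the chunk contents and the `join_chunks` helper are invented for illustration) to show how the pieces recombine.

```python
import json

# Hypothetical chunks, split mid-word the way a streaming call might deliver them
chunks = ['{"Keyword": "key', 'word extraction", "Context": "NLP tech', 'nique"}']

def join_chunks(parts):
    """Recombine streamed text chunks without adding any separator."""
    return "".join(parts)

text = join_chunks(chunks)
record = json.loads(text)
print(record["Keyword"])  # keyword extraction

# Any non-empty separator would be injected at each chunk boundary
assert ".".join(chunks) != text
```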
Now it's time to execute the prompt and apply a small JSON transformation to extract the keywords.
The code block below pulls the JSON out of the LLM response. Please note that, at the time of writing, Google Gemini Pro has only recently been released to the public and has some known issues with producing well-formed structured JSON responses, hence the need for a little tweaking.
import re
import json

def extract_json(input_string):
    # Extract the JSON enclosed in a ``` block
    matches = re.findall(r'```(.*?)```', input_string, re.DOTALL)
    if matches:
        # Join the matches into a single string
        json_content = ''.join(matches)
        # Remove periods (a workaround for stray periods in the response;
        # note that it also strips legitimate periods inside values)
        json_content = re.sub(r'\.', '', json_content)
        return json_content
    else:
        print("No ``` block found.")
        return None

keywords = []
# `summary` holds the input text to analyze and must be defined beforehand
prompt = get_keyword_extraction_prompt(summary)
response = execute_prompt(prompt)
extracted_json = extract_json(response)
if extracted_json is not None:
    keywords.append(extracted_json)
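The extracted value is still a plain string at this point. As a possible next step, the sketch below parses a fenced block into Python objects so that individual keywords can be read by field name. The helper `parse_keywords`, the optional `json` tag after the opening fence, and the sample response are assumptions made for illustration; a real model response may be shaped differently.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built indirectly to keep this example readable

def parse_keywords(response_text):
    """Return the list under "KeywordExtraction", or [] if nothing parses."""
    # Allow an optional "json" language tag right after the opening fence
    pattern = re.escape(FENCE) + r"(?:json)?(.*?)" + re.escape(FENCE)
    match = re.search(pattern, response_text, re.DOTALL)
    if match is None:
        return []
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    return data.get("KeywordExtraction", [])

# Hypothetical response, shaped like the output schema from the prompt
sample_response = (
    "Here is the result:\n" + FENCE + "json\n"
    '{"KeywordExtraction": [{"Keyword": "NLP", "Context": "technique", '
    '"TextSpan": "NLP technique"}]}\n'
    + FENCE
)

for item in parse_keywords(sample_response):
    print(item["Keyword"], "->", item["TextSpan"])  # NLP -> NLP technique
```

Returning an empty list on any failure keeps the calling code simple, at the cost of hiding why a response could not be parsed; logging the raw response in that branch would be a reasonable addition.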