Content Similarity Clustering
Cluster (group) similar content together from a query.
The Content Similarity Clustering model groups pieces of content in query results that are similar to each other. This makes query results easier to read and organize.
Statistics
Type | Speed | Partner Type |
---|---|---|
Post Processing | Instant | Datastreamer Internal |
Example Use Cases
- In a query for news articles, multiple articles may describe the same news event with the same names, places, and details while originating from different news sources. Some may be duplicates or near-duplicates. Each document is assigned a cluster ID, and documents with similar content share the same cluster ID, which makes the associations between them clear.
- In large-scale analysis, a single result set may contain multiple events. Applying content similarity clustering lets you group similar events within the search results.
Compatible Data Sources
As a post-processing operation, it can be run on demand on English-language content from any source.
Optimal Usage
Content Similarity Clustering was trained on long-form data, so the model performs best on news and similar content.
Example Query
{
"query": {
"data_sources": [
"opoint_news"
],
"query": "content.body: cats",
"sort": [
{
"field": "content.published",
"order": "desc"
}
]
},
"operations": [
{
"destination_path": "enrichment.content_similarity_clustering",
"name": "content_similarity_clustering",
"parameters": {
"main": "content.body",
"language": "enrichment.language"
},
"condition": {
"operator": "and",
"conditions": [
{
"path": "content.body",
"operator": "exists"
},
{
"path": "enrichment.language",
"value": "en",
"operator": "eq"
}
]
}
}
]
}
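The request body above can also be assembled programmatically. The following is a minimal Python sketch, not an official client: the helper name `build_clustering_operation` and its arguments are hypothetical, and only the field names and values shown in the example query are taken from this page. It builds the same payload and serializes it to JSON; how the payload is sent to the API is not covered here.

```python
import json


def build_clustering_operation(main_path="content.body",
                               language_path="enrichment.language",
                               language="en"):
    """Hypothetical helper: build the clustering operation from the example query."""
    return {
        "destination_path": "enrichment.content_similarity_clustering",
        "name": "content_similarity_clustering",
        "parameters": {"main": main_path, "language": language_path},
        # Only run the operation on documents that have a body and are in English.
        "condition": {
            "operator": "and",
            "conditions": [
                {"path": main_path, "operator": "exists"},
                {"path": language_path, "value": language, "operator": "eq"},
            ],
        },
    }


# Same query as the example above, with the operation attached.
payload = {
    "query": {
        "data_sources": ["opoint_news"],
        "query": "content.body: cats",
        "sort": [{"field": "content.published", "order": "desc"}],
    },
    "operations": [build_clustering_operation()],
}

body = json.dumps(payload, indent=2)
print(body)
```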
Example Output
{
"results": [
{
...
"enrichment": {
"content_similarity_clustering": {
"cluster_id": 5
}
}
}
]
}
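Once each document carries a cluster ID, grouping the results is a matter of bucketing by that ID. The sketch below assumes that every entry in `results` is a dictionary holding the ID at `enrichment.content_similarity_clustering.cluster_id`, matching the `destination_path` in the example query. The function name `group_by_cluster` and the sample documents (including their `id` field) are made up for illustration.

```python
from collections import defaultdict


def group_by_cluster(results):
    """Bucket result documents by their cluster_id (hypothetical helper)."""
    clusters = defaultdict(list)
    for doc in results:
        cluster_id = (
            doc.get("enrichment", {})
            .get("content_similarity_clustering", {})
            .get("cluster_id")
        )
        # Documents that were not enriched (for example, non-English) are skipped.
        if cluster_id is not None:
            clusters[cluster_id].append(doc)
    return dict(clusters)


# Illustrative documents only; real results contain many more fields.
sample_results = [
    {"id": "a", "enrichment": {"content_similarity_clustering": {"cluster_id": 5}}},
    {"id": "b", "enrichment": {"content_similarity_clustering": {"cluster_id": 5}}},
    {"id": "c", "enrichment": {"content_similarity_clustering": {"cluster_id": 7}}},
    {"id": "d", "enrichment": {}},
]

grouped = group_by_cluster(sample_results)
for cluster_id, docs in grouped.items():
    print(cluster_id, [doc["id"] for doc in docs])
```

Keeping one document per cluster, for instance the first in sort order, is a simple way to collapse duplicates and near-duplicates in a result set.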