Content Similarity Clustering

The Content Similarity Clustering model groups similar pieces of content together in query results, which makes the results easier to read and organize.

Statistics

Type	Speed	Partner Type
Post Processing	Instant	Datastreamer Internal

Example Use Cases

In a query for news articles, for example, multiple articles from different news sources may describe the same event with the same names, places, and details. Some may be duplicates or near-duplicates. The model assigns a cluster ID to each document, and documents with similar content share the same cluster ID, which makes the association between them clear.

In large-scale analysis, a single result set may cover multiple events. Applying Content Similarity Clustering lets you group the documents about each event within the search results.
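
As an illustration of the grouping described above, here is a minimal Python sketch that buckets result documents by cluster ID. It assumes each document carries the ID at enrichment.content_similarity_clustering.cluster_id, as in the Example Output on this page. The "id" field and the sample documents are hypothetical and exist only for this illustration.

```python
from collections import defaultdict


def group_by_cluster(results):
    """Group documents by cluster ID; documents without one go under None."""
    groups = defaultdict(list)
    for doc in results:
        cluster = (
            doc.get("enrichment", {})
            .get("content_similarity_clustering", {})
            .get("cluster_id")
        )
        groups[cluster].append(doc)
    return dict(groups)


# Hypothetical documents; only the enrichment path mirrors the Example Output.
sample = [
    {"id": "a", "enrichment": {"content_similarity_clustering": {"cluster_id": 5}}},
    {"id": "b", "enrichment": {"content_similarity_clustering": {"cluster_id": 5}}},
    {"id": "c", "enrichment": {"content_similarity_clustering": {"cluster_id": 2}}},
    {"id": "d"},  # e.g. a document that the operation's condition skipped
]

groups = group_by_cluster(sample)
for cluster, docs in groups.items():
    print(cluster, [d["id"] for d in docs])
```

Documents that were never clustered are collected under None rather than dropped, so nothing disappears silently from the grouped view.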

Compatible Data Sources

As a post-processing operation, it runs on demand on English-language content from any source.

📘
Optimal Usage
Content Similarity Clustering was trained on long-form data, so the model performs best on news and similar long-form content.

Example Query

{
    "query": {
        "data_sources": [
            "opoint_news"
        ],
        "query": "content.body: cats",
        "sort": [
            {
                "field": "content.published",
                "order": "desc"
            }
        ]
    },
    "operations": [
        {
            "destination_path": "enrichment.content_similarity_clustering",
            "name": "content_similarity_clustering",
            "parameters": {
                "main": "content.body",
                "language": "enrichment.language"
            },
            "condition": {
                "operator": "and",
                "conditions": [
                    {
                        "path": "content.body",
                        "operator": "exists"
                    },
                    {
                        "path": "enrichment.language",
                        "value": "en",
                        "operator": "eq"
                    }
                ]
            }
        }
    ]
}
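
The request body above can also be assembled programmatically. The following Python sketch builds the same structure from a few arguments. The function name build_clustering_request and its parameters are hypothetical helpers rather than part of any client library, and the sketch only prints the JSON because this page does not specify the endpoint or authentication.

```python
import json


def build_clustering_request(data_sources, query_string,
                             body_path="content.body",
                             language_path="enrichment.language",
                             language="en"):
    """Build a request body mirroring the Example Query (hypothetical helper)."""
    return {
        "query": {
            "data_sources": list(data_sources),
            "query": query_string,
            "sort": [{"field": "content.published", "order": "desc"}],
        },
        "operations": [
            {
                "destination_path": "enrichment.content_similarity_clustering",
                "name": "content_similarity_clustering",
                "parameters": {"main": body_path, "language": language_path},
                # Only cluster documents that have a body and are in the target language.
                "condition": {
                    "operator": "and",
                    "conditions": [
                        {"path": body_path, "operator": "exists"},
                        {"path": language_path, "value": language, "operator": "eq"},
                    ],
                },
            }
        ],
    }


payload = build_clustering_request(["opoint_news"], "content.body: cats")
print(json.dumps(payload, indent=4))
```

Keeping the body path and language path as parameters means the operation's "parameters" and its "condition" always refer to the same fields.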

Example Output

{
    "results": [
        {
            ...
            "enrichment": {
                "content_similarity_clustering": {
                    "cluster_id": 5
                }
            }
        }
    ]
}
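
One way to use this output is to collapse near-duplicates by keeping a single document per cluster. The Python sketch below keeps the first document seen for each cluster ID. It assumes the response has the shape shown above and that results arrive in the order requested by "sort", so with the Example Query the kept document would be the most recently published one in its cluster. The "id" field and the sample response are hypothetical.

```python
def first_per_cluster(results):
    """Keep the first document of each cluster; keep unclustered documents as-is."""
    seen = set()
    kept = []
    for doc in results:
        cluster = (
            doc.get("enrichment", {})
            .get("content_similarity_clustering", {})
            .get("cluster_id")
        )
        if cluster is None:
            kept.append(doc)  # no cluster ID, so nothing to deduplicate against
        elif cluster not in seen:
            seen.add(cluster)
            kept.append(doc)
    return kept


# Hypothetical response; only the enrichment path mirrors the Example Output.
response = {
    "results": [
        {"id": "n1", "enrichment": {"content_similarity_clustering": {"cluster_id": 5}}},
        {"id": "n2", "enrichment": {"content_similarity_clustering": {"cluster_id": 5}}},
        {"id": "n3", "enrichment": {"content_similarity_clustering": {"cluster_id": 7}}},
        {"id": "n4"},
    ]
}

unique_docs = first_per_cluster(response["results"])
print([doc["id"] for doc in unique_docs])
```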