Blacklist Filtering
Remove unwanted web content through blacklist filtering.
Helpful Resource
This resource is a supporting guide for use cases in or around the Platform. It may not reflect platform features and is meant to serve only as a guide.
What is blacklist filtering?
A common strategy for removing low-value content from data pipelines is to apply a keyword-based blacklist filter early in your Dynamic Pipeline. Many organizations remove this content early to avoid additional costs in storage, processing, enrichment, or other downstream activities.
A blacklist is generally a collection of terms.
How do I apply a blacklist filter?
In a dynamic pipeline, a blacklist filter is most commonly applied with the JSON Routing component. It can be configured to use Lucene query logic to search content as it is ingested and to route any content that contains blacklist terms.
Some use cases route the matching content into the "Trash Can" component. Others apply a tag and route the content so that it merges back in at the end of the pipeline.
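The two routing options above can be illustrated outside the platform with a minimal Python sketch. It is not the JSON Routing component or its configuration. The `BLACKLIST` set, the `content` field, the route names `main` and `trash`, and the `blacklisted` tag are all assumptions chosen for illustration.

```python
import re

# Hypothetical blacklist; in practice this would come from one of the starting lists.
BLACKLIST = {"giveaway", "followback", "retweet"}


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def route(doc, field="content", mode="trash"):
    """Return (destination, document) for a single document.

    mode="trash": matching documents are sent to the "trash" route.
    mode="tag":   matching documents are tagged and stay on the "main" route.
    """
    matched = sorted(tokenize(doc.get(field) or "") & BLACKLIST)
    if not matched:
        return "main", doc
    if mode == "trash":
        return "trash", doc
    # Tag a copy so the original document is left unchanged.
    tagged = dict(doc)
    tagged["tags"] = list(doc.get("tags", [])) + ["blacklisted"]
    tagged["blacklist_matches"] = matched
    return "main", tagged
```

Tagging instead of trashing keeps the content available downstream, which helps when you want to review what the blacklist is catching before discarding it.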
Apply blacklist filtering after schema application
If you apply blacklist filtering before the schemas for all data sources have been unified, you may not have a common field to use as the basis for routing. We suggest applying Unify or another schema transformation first, even if you are using a single known source of data.
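To see why a common field matters, consider the sketch below. It is a plain Python illustration, not the Unify component. The source names and the per-source field names in `TEXT_FIELDS` are hypothetical, as is the shared `content` field.

```python
# Hypothetical per-source text fields; real sources will use different names.
TEXT_FIELDS = {
    "forum": "post_body",
    "news": "article_text",
    "social": "message",
}


def unify(doc, source):
    """Copy the source-specific text field into a shared "content" field."""
    field = TEXT_FIELDS[source]
    unified = dict(doc)  # work on a copy; the input document is not modified
    unified["content"] = doc.get(field) or ""
    unified["source"] = source
    return unified
```

Once every document carries the same `content` field, a single blacklist query can target that field regardless of where the document came from.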
How can I create a blacklist?
You can create a blacklist by extending a starting list or by analyzing the content flowing through the pipeline. While there is no perfect blacklist, using the JSON Router with keywords lets you adapt and extend the list in real time.
To start, try using one of the starting lists in the next section.
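If you prefer to derive terms by analysis, one simple approach is to look for words that recur in content you have already judged low-value but never appear in content you keep. The sketch below assumes you have two small labeled samples of text. The function names, the thresholds, and the sample-based method are illustrative assumptions, not how the lists in this guide were produced.

```python
import re
from collections import Counter


def document_frequency(docs):
    """Count how many documents each term appears in."""
    counts = Counter()
    for text in docs:
        counts.update(set(re.findall(r"[a-z0-9]+", text.lower())))
    return counts


def candidate_terms(low_value_docs, kept_docs, top_n=5, min_count=2):
    """Rank terms seen in low-value documents but never in kept documents."""
    low = document_frequency(low_value_docs)
    kept = document_frequency(kept_docs)
    candidates = [t for t, n in low.items() if n >= min_count and t not in kept]
    # Most frequent first; ties are broken alphabetically for stable output.
    candidates.sort(key=lambda t: (-low[t], t))
    return candidates[:top_n]


def to_or_query(terms):
    """Join terms with OR, dropping duplicates while keeping their order."""
    return " OR ".join(dict.fromkeys(terms))
```

Review the candidates by hand before adding them, since frequent words in a small sample can be harmless. The resulting string uses the same OR-separated format as the starting lists.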
Starting Exclusion Lists
"Low-Quality" Content Keyword List
This list was generated from a detailed study at Nanyang Technological University in Singapore. The analysis found that the following terms were "hot words" in low-quality content.
weather OR updates OR theweatherchannel OR channel OR transponder OR snail OR although OR automatically OR follow OR followed OR libra OR unfollowed OR practical OR tug OR aries OR profess OR capricorn OR conflicting OR checked OR virgo OR embol OR maintaining OR pragma OR prowess OR readily OR gemini OR scorpio OR sides OR apparent OR capable OR strategic OR foresee OR imagination OR unfolding OR approach OR measurable OR taurus OR comprehend OR stellar OR aquarius OR enables OR highly OR pisces OR det OR sagittarius OR leo OR emotions OR financial OR somethin OR fully OR follows OR understanding OR calm OR closest OR planning OR witness OR clearly OR convince OR found OR begin OR creative OR matters OR followers OR huaraches OR presentation OR attitude OR earning OR seem OR gucci OR cancer OR gain OR giants OR benefits OR checkout OR giveaway OR challenge OR encounters OR custom OR monsters OR tos OR coins OR pips OR wild OR collected OR current OR mgwv OR harvested OR candid OR changes OR enter OR retweet OR straw OR null OR smart OR stats OR unfollowers OR bayonet OR followtrick OR mbf OR camel OR limitless OR hats OR click OR followback OR teamfollowback OR unf OR anotherfollowtrain OR positively OR eurusd OR flashiest OR adurabyhenshawblaze OR csgorumble OR fade OR mhmm OR safaree OR supporters OR allegedly OR newborn OR samuels OR thumb OR alubarna OR healthier OR reflective OR useless OR wers OR badboy OR baths OR decay OR loaner OR sail
Financial & Political Content Keyword List
This list was generated from keyword analysis at Datastreamer. The analysis found that the following terms were most present in content discussing financial or political topics.
Government OR Policy OR Election OR Vote OR Democracy OR President OR Congress OR Senate OR Protest OR Diplomacy OR Sanctions OR Immigration OR Security OR Trade OR Rights OR Military OR Defense OR Law OR Crisis OR Scandal OR NATO OR EU OR China OR Russia OR Taiwan OR Korea OR Economy OR Inflation OR Stocks OR Bonds OR Trading OR Shares OR Dow OR Nasdaq OR Investment OR Investor OR Returns OR Commodities OR Gold OR Oil OR Interest OR Central OR Federal OR Recession OR Growth OR Jobs OR Tax OR Capital OR Estate OR Hedge OR IPO OR Crypto OR Forex OR Debt OR Treasury OR Budget OR Deficit OR Tariff OR Regulation OR Pension OR Stimulus OR Welfare OR Healthcare OR Education
Sexual Terms Blacklist
We have a starting template for a blacklist of sexual terms. If you require that blacklist please reach out.
By Request Only
Hate Speech Exclusion List
This blacklist contains terms that people would use in hateful or violent content. This list was generated from keyword analysis at Datastreamer.
hate OR kill OR useless OR vermin OR scum OR invade OR infestation OR worthless OR plague OR trash OR inferior OR enemy OR disgusting OR degenerate OR threat OR parasite OR filthy OR mongrel OR savage OR barbaric OR unworthy OR extermination OR cancer OR cockroach OR filth OR subhuman OR disease OR traitor OR impure OR rat OR annihilate OR unwanted OR reject OR backstabber OR slave OR impostor OR toxic OR pollution OR garbage OR snake OR swamp OR bastard
Product Promotion Exclusion List
This blacklist contains terms that are often used in the selling or promotion of products on web or social platforms.
promo OR promotion OR checkout OR shipping OR order OR product OR store OR purchase OR item OR delivery OR exclusive OR bargain OR inventory OR cart OR wholesale OR clearance OR stock OR brand OR collection OR bundle OR flashsale OR giveaway OR newarrival OR handmade OR custom OR preorder OR gift OR guarantee OR refund OR quality OR fashion OR trend OR luxury OR ecommerce OR dropship OR reseller OR marketplace OR savings OR affordable OR trustpilot OR g2 OR reviews OR oem