Language Detection (Datastreamer)

Detecting language from any field of a given input

About

The language detection component detects the language used in any field of the given input. The detected language is returned as a short language code, in most cases a two-letter ISO 639-1 code. The component can be used in both real-time and batch processing.

Adding to your Dynamic Pipeline

Add language detection to your Dynamic Pipelines by selecting the "Language Detect" component. It takes the following configuration fields:

  • Destination Path (Required): the metadata field to which the detected language code (ISO 639-1 in most cases) is written. This can be an existing field, or the component can create a new field.
  • Main (Required): the input field that Language Detection uses as its source. By default, the content.body field is used, but any field can be chosen as the input.
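
The relationship between the two fields can be sketched in Python as follows. This is only an illustration of the read-from-Main, write-to-Destination-Path behaviour described above, not the component's implementation: the helper names (get_path, set_path, apply_language_detect) and the stand-in detector are hypothetical, and the dotted-path handling is an assumption.

```python
from typing import Any, Callable


def get_path(doc: dict, path: str, default: Any = None) -> Any:
    """Read a dotted path such as 'content.body' from a nested dict."""
    node: Any = doc
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def set_path(doc: dict, path: str, value: Any) -> None:
    """Write a dotted path, creating intermediate dicts when they are missing."""
    *parents, leaf = path.split(".")
    node = doc
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def apply_language_detect(
    doc: dict,
    detect: Callable[[str], str],
    main: str = "content.body",
    destination_path: str = "operations.language",
) -> dict:
    """Detect the language of the 'main' field and store it at 'destination_path'."""
    text = get_path(doc, main)
    if isinstance(text, str) and text.strip():
        set_path(doc, destination_path, detect(text))
    return doc


def fake_detect(text: str) -> str:
    """Hypothetical stand-in detector; the real component uses its own model."""
    return "en"


doc = {"content": {"body": "Hello world"}}
apply_language_detect(doc, fake_detect)
print(doc)
```

The destination field does not exist in the sample document, so the sketch creates it, mirroring the "existing or new field" behaviour of Destination Path.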


Compatible Languages

The component covers the following 144 language entries.

Language          Language ID (ISO 639-1 code)
Afar              AA
Afrikaans         AF
Akan              AK
Albanian          SQ
Amharic           AM
Arabic            AR
Armenian          HY
Assamese          AS
Aymara            AY
Azerbaijani       AZ
Bashkir           BA
Basque            EU
Belarusian        BE
Bengali           BN
Bihari            BH
Bokmal            NB
Bosnian           BS
Bulgarian         BG
Burmese           MY
Catalan           CA
Chinese           ZH
Chinese           ZH-HANT
Corsican          CO
Croatian          HR
Czech             CS
Danish            DA
Dhivehi           DV
Dutch             NL
English           EN
Esperanto         EO
Estonian          ET
Ewe               EE
Fijian            FJ
Finnish           FI
French            FR
Frisian           FY
Galician          GL
Ganda             LG
Georgian          KA
German            DE
Greek             EL
Guarani           GN
Gujarati          GU
Hausa             HA
Hebrew            HE
Hindi             HI
Hungarian         HU
Icelandic         IS
Igbo              IG
Indonesian        ID
Interlingua       IA
Interlingue       IE
Inupiak           IK
Irish             GA
Italian           IT
Japanese          JA
Javanese          JV
Kannada           KN
Kashmiri          KS
Kazakh            KK
Khmer             KM
Kinyarwanda       RW
Korean            KO
Kurdish           KU
Kyrgyz            KY
Laothian          LO
Latin             LA
Latvian           LV
Lingala           LN
Lithuanian        LT
Luxembourgish     LB
Macedonian        MK
Malagasy          MG
Malay             MS
Malayalam         ML
Maltese           MT
Maori             MI
Marathi           MR
Mongolian         MN
Montenegrin       CNR
Ndebele           NR
Nepali            NE
Norwegian         NO
Norwegian_n       NN
Nyanja            NY
Occitan           OC
Oriya             OR
Oromo             OM
Ossetian          OS
Pashto            PS
Persian           FA
Polish            PL
Portuguese        PT
Punjabi           PA
Quechua           QU
Rhaeto_romance    RM
Romanian          RO
Rundi             RN
Russian           RU
Samoan            SM
Sango             SG
Sanskrit          SA
Scots_gaelic      GD
Serbian           SR
Sesotho           ST
Shona             SN
Sindhi            SD
Sinhalese         SI
Siswant           SS
Slovak            SK
Slovenian         SL
Somali            SO
Spanish           ES
Sundanese         SU
Swahili           SW
Swedish           SV
Tagalog           TL
Tajik             TG
Tamil             TA
Tatar             TT
Telugu            TE
Thai              TH
Tibetan           BO
Tigrinya          TI
Tonga             TO
Tsonga            TS
Tswana            TN
Turkish           TR
Turkmen           TK
Twi               TW
Uighur            UG
Ukrainian         UK
Urdu              UR
Uzbek             UZ
Venda             VE
Vietnamese        VI
Volapuk           VO
Welsh             CY
Wolof             WO
Xhosa             XH
Yiddish           YI
Yoruba            YO
Zhuang            ZA
Zulu              ZU
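
Downstream code often needs to turn a detected code back into a language name. The sketch below shows one way to do that with a lookup built from a few rows of the list above. The dictionary is deliberately partial, the function name is hypothetical, and the case-insensitive matching is an assumption, since the casing of the component's output is not specified here.

```python
from typing import Optional

# Partial lookup copied from the coverage list; extend it with the rows you need.
LANGUAGE_NAMES = {
    "EN": "English",
    "FR": "French",
    "DE": "German",
    "ES": "Spanish",
    "ZH": "Chinese",
    "ZH-HANT": "Chinese",   # longer ID, not a two-letter code
    "CNR": "Montenegrin",   # three-letter ID, not a two-letter code
}


def language_name(code: str) -> Optional[str]:
    """Return the language name for a detected code, or None if it is not in the lookup."""
    return LANGUAGE_NAMES.get(code.strip().upper())


print(language_name("en"))
print(language_name("zh-Hant"))
print(language_name("xx"))
```

Note that not every ID is two letters long, so a fixed-width check on the output field would reject ZH-HANT and CNR.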

Usage in Search API

In the Search API, language detection runs as the "langdetect" operation. It allows a user to specify the destination field, the source fields, and a separator.

{
    "query": {
        ...
    },
    "operations": [
        {
            "name": "langdetect",
            "destination_path": "operations.language",
            "parameters": {
                "main": "content.body"
            }
        }
    ]
}
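
As a rough sketch, the same request body can be assembled programmatically. The function name build_langdetect_request is hypothetical, and the query is passed in by the caller because its contents are elided in the example above; the empty placeholder query used here is an assumption for illustration, not a valid search.

```python
import json


def build_langdetect_request(
    query: dict,
    main: str = "content.body",
    destination_path: str = "operations.language",
) -> dict:
    """Build a Search API request body that includes the langdetect operation."""
    return {
        "query": query,
        "operations": [
            {
                "name": "langdetect",
                "destination_path": destination_path,
                "parameters": {"main": main},
            }
        ],
    }


# Placeholder query: replace with a real Search API query.
request_body = build_langdetect_request(query={})
print(json.dumps(request_body, indent=4))
```

Keeping the source and destination fields as parameters makes it easy to run detection on a field other than content.body without editing the JSON by hand.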