System Architecture
The models target two types of content. The first is general internet content, mostly high-volume slang and internet trolls, which can be moderated by fast APIs deployed in production. The second is toxic users and imposters, which requires more machine learning and knowledge to filter properly. Although this second type is lower in volume, it is potentially much more damaging to the app and its users. Low-volume content is therefore processed asynchronously with deeper algorithms, whereas high-volume content needs fast-responding APIs.
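The split between the fast path and the asynchronous path can be pictured with a minimal Python sketch. All names here (`route`, `fast_filter`, `deep_review_queue`, the `kind` field) are hypothetical placeholders, not the production API, and the "model" is a trivial stand-in for a real low-latency classifier.

```python
import queue

# Hypothetical backlog for low-volume, high-risk items (toxic users, imposters).
# In a real system this would likely be a message broker, not an in-process queue.
deep_review_queue = queue.Queue()


def fast_filter(text):
    """Stand-in for a low-latency model: returns True if the message is allowed."""
    return "badword" not in text.lower()


def route(item):
    """Send high-volume messages to the fast path, everything else to async review.

    Returns the fast-path verdict for messages, or None when the item was
    queued for deeper, asynchronous analysis.
    """
    if item["kind"] == "message":
        return fast_filter(item["text"])
    deep_review_queue.put(item)
    return None
```

A chat message gets an immediate verdict, while a user-level report is only enqueued and analyzed later by the heavier models.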
Once the content type is identified, the content enters a data flow made up of different models. Each language, category, and feature has different constraints. There are about 120 models in production, based mainly on CBOW + fastText and NBSVM. For example, a chat-like message passes through at least three models. A typical pipeline involves:
1. Stemming, lemmatization, deobfuscation
∂σ уαℓℓ ωαηηα вє ƒяιєη∂ѕ -> (do, you, want, be, friend)
2. Language detection
3. Personal information detection
4. Profanity filtering
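
The four steps above can be sketched as a chain of small functions. This is an illustration under stated assumptions, not the production pipeline: the homoglyph table, the lemma lookup, the English hint words, the blocklist, and the email pattern are all toy placeholders standing in for the trained models (CBOW + fastText, NBSVM) mentioned earlier.

```python
import re
from dataclasses import dataclass, field

# Toy homoglyph table covering only the characters in the example above (assumption).
HOMOGLYPHS = str.maketrans({
    "∂": "d", "σ": "o", "у": "y", "α": "a", "ℓ": "l", "ω": "w", "η": "n",
    "в": "b", "є": "e", "ƒ": "f", "я": "r", "ι": "i", "ѕ": "s",
})
# Toy lookup standing in for a real stemmer/lemmatizer and slang dictionary.
LEMMAS = {"yall": "you", "wanna": "want", "friends": "friend"}
# Toy stand-ins for a language-detection model and a profanity model.
ENGLISH_HINTS = {"do", "you", "be", "the", "is", "to"}
BLOCKLIST = {"darn"}
# Only emails are checked here; real PII detection covers far more.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass
class Result:
    tokens: list
    language: str
    has_pii: bool
    flagged: list = field(default_factory=list)


def deobfuscate(text):
    """Step 1a: map look-alike characters back to plain lowercase letters."""
    return text.translate(HOMOGLYPHS).lower()


def normalize(clean_text):
    """Step 1b: tokenize and reduce each word to a base form."""
    return [LEMMAS.get(w, w) for w in re.findall(r"[a-z]+", clean_text)]


def detect_language(tokens):
    """Step 2: crude heuristic in place of a trained language detector."""
    return "en" if ENGLISH_HINTS & set(tokens) else "unknown"


def moderate(text):
    """Run one message through all four steps and collect the findings."""
    clean = deobfuscate(text)
    tokens = normalize(clean)
    return Result(
        tokens=tokens,
        language=detect_language(tokens),
        has_pii=bool(EMAIL_RE.search(clean)),           # Step 3
        flagged=[t for t in tokens if t in BLOCKLIST],  # Step 4
    )
```

Calling `moderate` on the obfuscated example message yields the token list shown in step 1. Keeping each step a separate function reflects the idea that every stage can be backed by a different model per language or category.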