linguist.link calculates insights about the content of a web page using various natural language processing techniques.
Using linguist.link, you can find the following insights about a web page:
- "Surprising" words
- The most common words
- Common bigrams, trigrams, and quadgrams
- Named entities (e.g. people, organizations)
linguist.link only supports English articles at this time.
Surprisal is a metric that measures how unusual a word is relative to a baseline corpus of text.
linguist.link calculates surprisal using a corpus of New York Times articles. This corpus was chosen since the text in news articles is grammatically varied. The less common a word is in that corpus, the higher the surprisal.
The full corpus of word surprisals is available as JSON data.
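As a rough illustration of the idea, the sketch below computes surprisal with the standard information-theoretic definition, the negative base-2 logarithm of a word's relative frequency in the baseline corpus. This is an assumption: the exact formula, tokenization, and handling of unseen words used by linguist.link are not described here, and the function names (tokenize, build_surprisals, most_surprising) and the tiny baseline text are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_surprisals(baseline_text):
    """Map each baseline word to -log2 of its relative frequency.

    Assumption: this is the textbook definition of surprisal, not
    necessarily the formula linguist.link uses.
    """
    counts = Counter(tokenize(baseline_text))
    total = sum(counts.values())
    return {word: -math.log2(count / total) for word, count in counts.items()}


def most_surprising(page_text, surprisals, top_n=3):
    """Rank the distinct page words found in the baseline by surprisal.

    Words missing from the baseline are skipped in this sketch.
    Ties are broken alphabetically so the result is deterministic.
    """
    words = {w for w in tokenize(page_text) if w in surprisals}
    return sorted(words, key=lambda w: (-surprisals[w], w))[:top_n]


# Hypothetical stand-in for the New York Times baseline corpus.
baseline = "the cat sat on the mat and the dog sat on the rug"
surprisals = build_surprisals(baseline)
print(most_surprising("the dog and the cat", surprisals))
```

In this toy baseline, "the" appears four times out of thirteen tokens, so it scores lower than words that appear once, which matches the rule that rarer baseline words have higher surprisal.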
Add a Language
linguist.link references an English language corpus of New York Times articles to calculate surprisal. Furthermore, the Flesch-Kincaid metric for evaluating reading time and reading level is only applicable to English.
We would love to have support for more languages on linguist.link!
To add a language, you can submit a PR to the linguist.link repository. You will need to:
- Add a corpus of text in the language you want to add to the repository. Then, add logic to the readability.py file to load the corpus.
- Add logic to the readability.py file to calculate surprisal for the language you want to add using the corpus.
- Add a subpath (e.g. /uk/, /fr/) that will serve the language you want to add. You can copy the index subpath as a template.
- Add any readability metrics that apply to the language you want to add (optional).
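The corpus-loading step above could be sketched as follows. This is only an assumption about how per-language loading might be organized: the load_surprisals function, the one-JSON-file-per-language layout, and the word-to-surprisal mapping format are all hypothetical and do not reflect the actual contents of readability.py.

```python
import json
import tempfile
from pathlib import Path


def load_surprisals(language, corpus_dir):
    """Load the word-to-surprisal mapping for a language code.

    Hypothetical layout: one file per language, e.g. <corpus_dir>/fr.json.
    """
    path = Path(corpus_dir) / f"{language}.json"
    if not path.is_file():
        raise ValueError(f"No corpus available for language {language!r}")
    with path.open(encoding="utf-8") as f:
        return json.load(f)


# Demonstration with a throwaway directory and made-up surprisal values.
with tempfile.TemporaryDirectory() as corpus_dir:
    sample = {"le": 1.5, "chat": 9.2}
    (Path(corpus_dir) / "fr.json").write_text(json.dumps(sample), encoding="utf-8")
    french = load_surprisals("fr", corpus_dir)
    print(french["chat"])
```

Keying the corpus on the same language code as the subpath (for example fr for /fr/) would let a single lookup serve every supported language, and an unsupported code fails with a clear error rather than silently falling back to English.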