Question 1

What is vocabulary extraction and who is it for?

Accepted Answer

Vocabulary extraction is the process of taking a Japanese sentence or paragraph and pulling out every unique word — together with its reading, dictionary form (lemma), and part of speech — so you can study the words instead of re-reading the same text. It is the core technique behind sentence mining, the study method made popular by polyglots and JLPT learners who build their vocabulary list from native material rather than from a generic textbook word list. Paste a sentence from a manga panel, a news headline, a YouTube subtitle, or a textbook reading, and the extractor returns a clean study-ready table.
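The "every unique word" step can be sketched as a deduplication pass over tokens. This is only an illustration: the `Token` class, its field names, and the English part-of-speech labels are hypothetical, and the sample tokens are written by hand rather than produced by the real extractor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    surface: str  # the word as it appears in the text
    reading: str  # pronunciation in hiragana
    lemma: str    # dictionary form
    pos: str      # part of speech (illustrative English label)


def unique_by_lemma(tokens):
    """Keep the first token seen for each (lemma, pos) pair, in text order."""
    seen = set()
    result = []
    for tok in tokens:
        key = (tok.lemma, tok.pos)
        if key not in seen:
            seen.add(key)
            result.append(tok)
    return result


# Hand-written sample: the verb 食べる appears twice in different forms.
tokens = [
    Token("食べ", "たべ", "食べる", "verb"),
    Token("寿司", "すし", "寿司", "noun"),
    Token("食べる", "たべる", "食べる", "verb"),
]
vocab = unique_by_lemma(tokens)
```

Keying on the lemma together with the part of speech means an inflected form and its dictionary form collapse into one study row, while homographs with different parts of speech stay separate.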

Question 2

How does the tokenizer work — is it the same engine as IMEs?

Accepted Answer

Not exactly. The extractor uses kuromoji, an open-source Japanese morphological analyser that is widely used in search engines, dictionaries, and learning apps. IMEs rely on the same kind of dictionary-based analysis, but they generally ship their own conversion engines. Kuromoji segments the input into tokens and attaches to each one a reading in katakana (which the tool converts to hiragana for readability), a base form (lemma), and a part-of-speech tag. The engine handles inflected verbs (食べました → 食べる), conjugated adjectives (寒かった → 寒い), and compound forms accurately for everyday modern Japanese. Rare, archaic, or brand-new slang may still produce a less precise lemma; the table marks those entries so you can verify them in the dictionary before adding them to your study list.
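The katakana-to-hiragana conversion mentioned above can be sketched in a few lines. It relies on the fact that the Unicode katakana letters U+30A1 to U+30F6 sit exactly 0x60 code points above their hiragana counterparts. The function name is hypothetical, and this is not necessarily how the tool itself does the conversion.

```python
def katakana_to_hiragana(text: str) -> str:
    """Shift katakana letters down to hiragana; leave everything else as-is."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0x30A1 <= code <= 0x30F6:  # ァ .. ヶ
            out.append(chr(code - 0x60))
        else:
            # Kanji, existing hiragana, and the long-vowel mark ー pass through.
            out.append(ch)
    return "".join(out)


reading = katakana_to_hiragana("タベル")
```

The long-vowel mark ー has no hiragana equivalent in this range, so a reading such as コーヒー keeps it unchanged.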

Question 3

What is sentence mining and how do I use this tool for it?

Accepted Answer

Sentence mining is a study technique where you collect i+1 sentences (sentences where you understand everything except one word) and turn each unknown word into a flashcard. The classic workflow has four steps: read native content, find a sentence with one new word, extract that word into Anki with its reading and meaning, then drill it. This extractor speeds up the second and third steps: paste the sentence, scan the table, identify the word you do not know, and export the row to your study list. Doing this consistently with 5–10 sentences a day is an efficient way to progress from N3 toward N1, because every word you study is one you have already encountered in context.
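If you keep a list of words you already know, the i+1 check can be automated. The sketch below assumes you have the lemmas for a sentence (for example, the base-form column of the extractor table) and a personal set of known lemmas. Both function names and the sample data are hypothetical.

```python
def unknown_lemmas(sentence_lemmas, known):
    """Return the distinct lemmas in a sentence that are not in the known set."""
    found = []
    for lemma in sentence_lemmas:
        if lemma not in known and lemma not in found:
            found.append(lemma)
    return found


def is_i_plus_one(sentence_lemmas, known):
    """True when exactly one distinct lemma in the sentence is unknown."""
    return len(unknown_lemmas(sentence_lemmas, known)) == 1


known = {"猫", "好き", "食べる"}
sentence = ["猫", "魚", "食べる"]  # lemmas of the content words in 猫は魚を食べる
target = unknown_lemmas(sentence, known)
```

Here `target` holds the single new word, which is the one worth turning into a flashcard.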

Question 4

Can I export the vocabulary list to Anki or a spreadsheet?

Accepted Answer

Yes. Click the "Copy CSV" button and the entire table is copied to your clipboard in the format word,reading,base,pos. Paste it into a spreadsheet (Excel, Google Sheets, Numbers), or save it as a .csv file and load it into Anki via File → Import. From there you can add your own English meanings, example sentences, or audio. The CSV text also pastes cleanly into Notion, Obsidian, and most flashcard apps that accept tabular imports. Tip: combine this with the Japanese Dictionary tool to look up English meanings for any row that needs more context.
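For readers who want to produce the same four-column layout themselves, here is a minimal sketch using Python's standard `csv` module. Whether the tool's own output includes a header row is an assumption here, and the function name and sample rows are hypothetical.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialise (word, reading, base, pos) rows as CSV text with a header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["word", "reading", "base", "pos"])  # header is an assumption
    writer.writerows(rows)
    return buf.getvalue()


rows = [
    ("食べ", "たべ", "食べる", "verb"),
    ("寒かっ", "さむかっ", "寒い", "adjective"),
]
csv_text = rows_to_csv(rows)
```

Using `csv.writer` rather than joining strings by hand means any field that happens to contain a comma is quoted automatically, so the columns stay aligned on import.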

Question 5

Does the extractor show JLPT level for each word?

Accepted Answer

No, not in the extractor table itself. JLPT level data is shown in the Japanese Dictionary and Vocabulary Explorer tools (both linked below), where each word has a dedicated entry with its JLPT band, example sentences, and related vocabulary. The extractor focuses on speed: it identifies the word, gives you the reading and lemma so you can search the dictionary, and lets you triage 20–50 words from a passage in seconds. To check the JLPT level for a specific word, open the Japanese Dictionary tool and search by the base form returned here.

Question 6

Why are particles and punctuation skipped from the output?

Accepted Answer

Particles like は, を, が, に, で are grammar function words rather than vocabulary you "learn" the way you learn nouns or verbs. Including them in a sentence-mining export would flood your study list with the same 10 particles in every sentence and dilute the signal. The extractor filters them automatically, along with punctuation and symbols, so the table contains only content words — nouns, verbs, adjectives, adverbs, and proper nouns. If you want to study particle usage specifically, use the Particle Quiz tool linked below instead.
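The filtering described above can be sketched as an allow-list on the part-of-speech tag. The English labels and the `(surface, pos)` tuple shape are hypothetical and chosen for readability; a real analyser typically emits its own tag set (kuromoji with the IPADIC dictionary, for instance, uses Japanese tags such as 助詞 for particles).

```python
# Parts of speech treated as content words, following the list in the answer.
CONTENT_POS = {"noun", "verb", "adjective", "adverb", "proper noun"}


def content_words(tokens):
    """Keep only tokens whose part of speech is on the allow-list."""
    return [(surface, pos) for surface, pos in tokens if pos in CONTENT_POS]


# Hand-tagged tokens for 私は寿司を食べた。
tokens = [
    ("私", "noun"),
    ("は", "particle"),
    ("寿司", "noun"),
    ("を", "particle"),
    ("食べ", "verb"),
    ("た", "auxiliary verb"),
    ("。", "symbol"),
]
kept = content_words(tokens)
```

An allow-list drops particles, punctuation, and symbols in one pass, and it also excludes any tag not explicitly listed, which is the safer default for a study export.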

Question 7

How long can the input text be?

Accepted Answer

The tool accepts up to about 2,000 characters per run — roughly a long paragraph or a short news article. Inside that limit the tokenizer runs in well under a second on a typical broadband connection. For longer documents (chapters, full articles), split the text into 1–2 paragraph chunks and run the extractor several times — this also produces more manageable study batches of 20–40 unique words rather than 200+ at once. The character counter under the textarea shows you exactly how much of the budget you have used.
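Splitting a long document into chunks that fit the limit can be done before pasting. The sketch below packs whole paragraphs into chunks and hard-splits only a paragraph that is too long on its own. The exact figure of 2,000 is an assumption based on the "about 2,000 characters" stated above, and the function name is hypothetical.

```python
def chunk_text(text, limit=2000):
    """Group paragraphs into chunks of at most `limit` characters each."""
    chunks = []
    current = ""
    for para in text.splitlines():
        para = para.strip()
        if not para:
            continue  # skip blank lines between paragraphs
        # Hard-split any single paragraph that exceeds the limit by itself.
        pieces = [para[i:i + limit] for i in range(0, len(para), limit)]
        for piece in pieces:
            candidate = piece if not current else current + "\n" + piece
            if len(candidate) <= limit:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


document = "あ" * 1500 + "\n" + "い" * 1500 + "\n" + "う" * 300
chunks = chunk_text(document)
```

Each element of `chunks` can then be pasted into the extractor as a separate run.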

Vocabulary Extractor

How the Vocabulary Extractor Works

Morphological Tokenization

Smart Deduplication

Particle Filtering

Frequently Asked Questions

Build your N3 vocabulary the smart way

Related Tools

Japanese Sentence Analyzer

Furigana Generator

Japanese Dictionary

Vocabulary Explorer
