NP Annotator Documentation
A structured annotation tool for noun phrases across multiple languages — designed for linguists, corpus researchers, and language data curators.
What is NP Annotator?
Noun Phrase Annotator is a browser-based tool for annotating noun phrases in linguistic datasets. You upload a file containing source sentences and their extracted noun phrases, then work through each row assigning grammatical tags, glosses, and translations — all stored locally in your browser with no server required.
The tool supports multiple languages in any script that can be encoded as UTF-8 (for example Turkish, Chinese, Russian, and Greek), a three-level tag hierarchy (Category → Subcategory → Type), per-token grammatical glossing with preset chips, and a persistent lexicon that carries gloss suggestions across phrases.
All data is stored in your browser's localStorage. Nothing is sent to any server. Export your data manually when you are ready.
Workflow overview
The annotation process follows a linear five-step flow:
Progress is auto-saved to localStorage after every phrase. Close the tab, reopen the file, and click Resume session to continue exactly where you left off.
Upload a file
NP Annotator accepts CSV (.csv), Excel (.xlsx), and legacy Excel (.xls) files. CSV files are read with explicit UTF-8 encoding, so Turkish characters such as ş, ğ, ı, ü, ö, and ç, as well as Arabic, Persian, and other scripts, are handled correctly.
Header row detection
The tool auto-detects whether the first row is a header or data by checking whether all values in row 1 are non-numeric. If the first row looks like data, column names are generated automatically (Column 1, Column 2, …). You can override this with the "First row treated as header" checkbox that appears after upload.
File structure requirements
Your file must contain at least a Context column (required) and ideally a Data column and a Data Code column:
Map columns
After uploading, you assign a role to each column in your file. All fields in Step 2 are required. The preview table colour-codes your selections so you can visually confirm the mapping before proceeding.
| Field | Status | Description |
|---|---|---|
| context | Required | The full source sentence the NP was extracted from. Shown above the phrase in the annotate workspace. The NP is highlighted within it if a match is found. |
| data | Required | The noun phrase to annotate. Pre-fills the editable NP field. Can be corrected inline if the source data contains errors. |
| data code | Required | A per-row corpus or data identifier read from the file (e.g. EN-01, TR-BNC-003). Stored alongside every saved annotation. |
| Language | Required | Select the language of the entire dataset from the dropdown. This is a session-level constant — it applies to every row in the file. Used as the key in the lexicon. |
| Source name | Required | A free-text label for the source corpus or document collection (e.g. BNC Corpus, OPUS-TR). Not read from a column — typed once for the whole session. |
| Dataset code | Required | A short session-level identifier (e.g. TR-01). Appears in all exports and saved phrase records. |
Column roles are auto-detected from common header names (context, sentence, np, phrase, code, etc.). Check the preview table to confirm the auto-detection is correct before clicking Start.
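A rough sketch of how header-based auto-detection could work is shown below. The keywords come from the list above, but which keyword maps to which role, the substring matching, and the names `ROLE_KEYWORDS` and `guess_roles` are all assumptions made for illustration.

```python
from typing import Optional

# Assumed keyword-to-role mapping; the tool's real rules may differ.
ROLE_KEYWORDS = {
    "context": ("context", "sentence"),
    "data": ("np", "phrase"),
    "data code": ("code",),
}


def guess_roles(headers: list[str]) -> dict[str, Optional[str]]:
    """Guess which header fills each role; None means no guess."""
    guesses: dict[str, Optional[str]] = {role: None for role in ROLE_KEYWORDS}
    for header in headers:
        name = header.strip().lower()
        for role, keywords in ROLE_KEYWORDS.items():
            # Each role takes the first header that mentions one of its keywords.
            if guesses[role] is None and any(k in name for k in keywords):
                guesses[role] = header
                break
    return guesses
```

Substring matching of this kind is easily fooled (a header such as "input" contains "np"), which is one reason to check the preview table before starting.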
Session settings
Once all column fields are filled and Start annotating is clicked, your session becomes active. The header shows a strip of session chips displaying the current language, dataset code, and context column.
Clicking ↺ New session reloads the page. Your auto-saved data remains in localStorage and can be resumed at any time; a Resume session banner appears automatically on the next page load.
Clicking the NP Annotator logo returns you to the setup screen without discarding saved data. Any unsaved progress on the current phrase is lost, but all previously saved phrases remain intact.
Annotate page
The annotate workspace is divided into four columns:
1. Rows: a scrollable list of every row in your file. Each row shows the context sentence. Completed rows are marked with a green done badge. A progress bar at the bottom shows overall completion.
2. Noun phrase: shows the source sentence with the NP highlighted in purple, and an editable text field below it. If the NP in your data contains an error, edit the field and click Reload ↺ to re-tokenise. The context sentence is never affected by edits.
3. Tag selection: click tokens in the phrase stage to select them (one or more), then navigate the three-step tag picker: Category → Subcategory → Type. Previously tagged token sequences show an amber suggestion banner.
4. Gloss: one input per token for word-by-word glossing, plus grammatical chips (PL, SG, GEN, POSS, LOC, REL…) that append to whichever field is focused. A phrase translation field at the bottom captures the full English rendering.
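The NP highlighting in the Noun phrase column depends on finding the phrase inside the context sentence. A minimal sketch of that lookup follows; the function name `find_np_span` is hypothetical, and exact, case-sensitive matching is an assumption (the tool may match more leniently).

```python
from typing import Optional


def find_np_span(context: str, np_text: str) -> Optional[tuple[int, int]]:
    """Return (start, end) offsets of the NP in the context, or None."""
    needle = np_text.strip()
    if not needle:
        return None
    # Assumption: exact substring match; no case folding or normalisation.
    start = context.find(needle)
    if start == -1:
        return None  # no match, so nothing would be highlighted
    return start, start + len(needle)
```

When the function returns None, the context sentence would simply be shown without a highlight, matching the "if a match is found" behaviour described for the context field.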
Tag selection
Tags are organised into a three-level hierarchy. Clicking a token (or multiple tokens) opens Step 1 of the picker. Each step narrows the choice:
1. Category: the broad grammatical class (Noun, Adjective, Article, Possessive…)
2. Subcategory: the subdivision within the category (e.g. Animate / Inanimate for Noun; Intersective / Non-intersective for Adjective)
3. Type: the specific type within the subcategory (e.g. Shape, Color, Material for Intersective; Object, Event for Inanimate). Subcategories that have no types are applied immediately at Step 2.
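The three levels can be pictured as a nested mapping. The sketch below uses only the examples named in the steps above and is not the tool's full default hierarchy; the empty type lists for Animate and Non-intersective are placeholders (the text does not say whether they have types), and the names `TAG_HIERARCHY`, `needs_type_step`, and `tag_path` are hypothetical.

```python
from typing import Optional

# Category -> Subcategory -> list of Types (an empty list means "no Step 3").
# Only the examples mentioned in this section; empty lists are placeholders.
TAG_HIERARCHY: dict[str, dict[str, list[str]]] = {
    "Noun": {
        "Animate": [],
        "Inanimate": ["Object", "Event"],
    },
    "Adjective": {
        "Intersective": ["Shape", "Color", "Material"],
        "Non-intersective": [],
    },
}


def needs_type_step(category: str, subcategory: str) -> bool:
    """True when the picker has to show Step 3 for this subcategory."""
    return bool(TAG_HIERARCHY[category][subcategory])


def tag_path(category: str, subcategory: str,
             type_: Optional[str] = None) -> tuple[str, ...]:
    """Build the final tag, enforcing the two-step or three-step rule."""
    types = TAG_HIERARCHY[category][subcategory]
    if types:
        if type_ not in types:
            raise ValueError(f"{subcategory} requires one of {types}")
        return (category, subcategory, type_)
    if type_ is not None:
        raise ValueError(f"{subcategory} has no types")
    return (category, subcategory)
```

For example, `tag_path("Adjective", "Intersective", "Color")` yields a three-part tag, while `tag_path("Noun", "Animate")` stops at two parts in this placeholder data.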
Default tag hierarchy
Tag suggestions
When you select a sequence of tokens that has been tagged before in the current session, an amber banner appears with the previous tag(s). For a single suggestion, one Confirm button applies it instantly. For multiple suggestions, each is shown on its own row with an individual ✓ Confirm button — pick the right one or dismiss the banner and tag manually.
Suggestions are based on exact phrase match from previously saved annotations. Selecting a token like very alone will not inherit a suggestion from a phrase like very big — both tokens must be selected together.
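The exact-match behaviour can be illustrated with a small index keyed by the whole token sequence. This is a sketch under assumptions: the class name `SuggestionIndex` and its methods are hypothetical, and tags are represented as plain tuples.

```python
from collections import defaultdict


class SuggestionIndex:
    """Remembers which tags were applied to which exact token sequences."""

    def __init__(self) -> None:
        self._by_sequence: dict[tuple[str, ...], list[tuple[str, ...]]] = defaultdict(list)

    def record(self, tokens: list[str], tag: tuple[str, ...]) -> None:
        """Store a tag for a saved token sequence, skipping duplicates."""
        key = tuple(tokens)
        if tag not in self._by_sequence[key]:
            self._by_sequence[key].append(tag)

    def suggest(self, tokens: list[str]) -> list[tuple[str, ...]]:
        """Return earlier tags for exactly this sequence (possibly several)."""
        return list(self._by_sequence.get(tuple(tokens), []))
```

Because the key is the full sequence, selecting only "very" finds nothing after "very big" has been tagged; both tokens have to be selected together to get the suggestion.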
Custom categories
The Categories tab lets you add, rename, and delete categories, subcategories, and types. Changes take effect immediately in the tag picker. The ✎ Edit categories link at the top of the Tag selection card is a shortcut to that tab.
Gloss panel
The Gloss card sits to the right of the Tag selection card. It contains one input field per token for word-by-word glossing, a palette of preset grammatical chips, and a phrase translation field.
Grammatical gloss chips
Chips append a grammatical abbreviation to whichever gloss input is currently focused: click a field first, then click a chip. The result is dot notation, for example cat.PL.POSS.
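The append behaviour amounts to joining the existing gloss and the chip with a dot. A minimal sketch, with a hypothetical function name and an assumed rule for empty fields (the chip is inserted on its own):

```python
def append_chip(gloss: str, chip: str) -> str:
    """Append a grammatical abbreviation to a gloss using dot notation."""
    gloss = gloss.strip()
    # Assumption: an empty field receives the chip without a leading dot.
    return f"{gloss}.{chip}" if gloss else chip
```

Clicking PL and then POSS on a field containing "cat" corresponds to `append_chip(append_chip("cat", "PL"), "POSS")`, which gives `cat.PL.POSS`.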
You can also type freely into any gloss field — the chips are shortcuts, not restrictions. A Turkish example:
Gloss suggestions
If a token has been glossed before in the current session, its gloss is pre-filled from the Lexicon. When a word has multiple senses (different glosses in different contexts), the field shows the first recorded sense — edit it freely. The Lexicon tab shows all senses for every word and lets you edit them inline.
Phrase translation
The Translation field at the bottom of the Gloss card captures the full English rendering of the entire noun phrase (e.g. red cat for the Turkish noun phrase kırmızı kedi). A translation is required before the phrase can be saved.
Saving a phrase
Click Save phrase → (the primary button below the Tag and Gloss cards) to commit a phrase to the session. The tool validates all three requirements before saving:
- ✓ All tokens tagged: every token chip must show a tag label and a green border. Untagged tokens pulse red.
- ✓ All tokens glossed: every gloss input must be non-empty. Empty fields are highlighted in red.
- ✓ Phrase translation provided: the translation field at the bottom of the Gloss card must be filled.
After saving, the row in the Rows list is marked done, the phrase workspace resets, and the progress bar advances. The saved phrase immediately appears in the Data tab.
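The three save requirements can be expressed as a single validation function. The sketch below is illustrative only: the `Token` dataclass, the function name `validate_phrase`, and the error messages are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    tag: tuple[str, ...] = ()  # empty tuple means "not tagged yet"
    gloss: str = ""


def validate_phrase(tokens: list[Token], translation: str) -> list[str]:
    """Return a list of problems; an empty list means the phrase can be saved."""
    errors: list[str] = []
    untagged = [t.text for t in tokens if not t.tag]
    if untagged:
        errors.append("Untagged tokens: " + ", ".join(untagged))
    unglossed = [t.text for t in tokens if not t.gloss.strip()]
    if unglossed:
        errors.append("Tokens without a gloss: " + ", ".join(unglossed))
    if not translation.strip():
        errors.append("Phrase translation is missing")
    return errors
```

Returning all problems at once, rather than stopping at the first, mirrors how the interface flags untagged tokens and empty gloss fields at the same time.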
Data tab
The Data tab shows all saved phrases in the current session. Each entry displays the phrase ID, the phrase text, language and code chips, the context sentence, phrase translation, and the full ordered tag sequence.
Individual entries can be removed with the Remove button. Clear all wipes the entire session's annotations (with confirmation). Exports are available at the top of the tab.
Lexicon
The Lexicon tab maintains a persistent vocabulary of every word form encountered during annotation.
Each entry has a unique code (LEX-0001, LEX-0002…) keyed by word form and language. Words can have multiple senses (e.g. akıllı → "smart" in one context, "clever" in another), each with its own gloss and list of phrase IDs where that sense was used.
The Lexicon can be exported separately as JSON, CSV, or XLSX from the Lexicon tab. Exporting the lexicon JSON and re-importing it in a future session carries over all codes and glosses, so the same word always receives the same LEX-XXXX code across datasets.
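One way to picture the lexicon is a mapping from (word form, language) to a code and a set of senses. The sketch below is an assumption-laden illustration: the `Lexicon` class, its internal layout, the language code "tr", and the phrase IDs used in the usage note are all hypothetical, and codes are simply numbered in order of first appearance.

```python
from typing import Optional


class Lexicon:
    """Word forms keyed by (word, language), each with a code and senses."""

    def __init__(self) -> None:
        # (word, language) -> {"code": str, "senses": {gloss: [phrase ids]}}
        self.entries: dict[tuple[str, str], dict] = {}

    def add_sense(self, word: str, language: str, gloss: str, phrase_id: str) -> str:
        """Record that a word was glossed a certain way in a phrase."""
        key = (word, language)
        entry = self.entries.get(key)
        if entry is None:
            # Assumption: codes are assigned sequentially and never reused.
            entry = {"code": f"LEX-{len(self.entries) + 1:04d}", "senses": {}}
            self.entries[key] = entry
        phrase_ids = entry["senses"].setdefault(gloss, [])
        if phrase_id not in phrase_ids:
            phrase_ids.append(phrase_id)
        return entry["code"]

    def first_gloss(self, word: str, language: str) -> Optional[str]:
        """Return the first recorded sense, as used for pre-filling."""
        entry = self.entries.get((word, language))
        if not entry or not entry["senses"]:
            return None
        return next(iter(entry["senses"]))
```

Adding "smart" and then "clever" for akıllı keeps a single code with two senses, and `first_gloss` returns "smart", the first sense recorded.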
Categories
The Categories tab lets you customise the full three-level tag hierarchy. All changes apply immediately to the tag picker in the Annotate tab.
- Add main category: type a name in the input at the bottom and click Add. An ID is generated automatically from the name.
- Add subcategory: click + Subcategory next to any category and enter a name. The ID is prefixed with the parent's ID.
- Add type: click + Type next to any subcategory. Types are the third and final level.
- Delete: any level can be deleted with the × button. Existing annotations that used a deleted tag are not affected.
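The exact ID format is not specified here, so the following is only one plausible scheme: lowercase the name, replace runs of non-word characters with hyphens, and prefix the parent's ID. The function name `make_id` and the hyphen separator are assumptions.

```python
import re


def make_id(name: str, parent_id: str = "") -> str:
    """Derive an ID from a display name, optionally prefixed by the parent ID."""
    # Assumption: a lowercase, hyphen-separated slug; the real format may differ.
    slug = re.sub(r"[^\w]+", "-", name.strip().lower()).strip("-")
    return f"{parent_id}-{slug}" if parent_id else slug
```

Under this scheme a category named "Adjective" gets the ID `adjective`, and its subcategory "Non-intersective" gets `adjective-non-intersective`.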
Auto-save
NP Annotator automatically saves the full session state to localStorage after every state-changing action: saving a phrase, deleting an annotation, editing the lexicon, or modifying categories. No manual save step is required.
The auto-save bar below the header shows a green status dot, the timestamp of the last save, and counts of saved phrases and lexicon entries. The dot briefly flashes amber when a save is in progress.
Resuming a session
On the next page load, if saved data is found, a blue banner appears at the top of the setup screen: "Saved session found — N phrases annotated." Click Resume session to restore everything instantly, or Start fresh to begin a new session (the old data remains in storage until cleared).
Clearing saved data
Clearing saved data permanently deletes the session from localStorage. This cannot be undone. Always export your annotations first.
Storage limits vary by browser but are typically 5–10 MB per origin. Sessions with thousands of rows and hundreds of annotations stay well within this limit.
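To get a feel for how a session compares with that limit, the serialized size of an exported JSON session can be estimated offline. This is a rough sketch: it assumes the session is stored as a single JSON string and that storage is counted at two bytes per UTF-16 code unit, which may not match every browser. The names `estimated_storage_bytes` and `fits_in_quota` are hypothetical.

```python
import json

# The lower end of the typical range quoted above.
TYPICAL_QUOTA_BYTES = 5 * 1024 * 1024


def estimated_storage_bytes(session: dict) -> int:
    """Estimate the stored size of a session serialized as one JSON string."""
    serialized = json.dumps(session, ensure_ascii=False)
    # Assumption: strings are measured in UTF-16 code units (2 bytes each).
    return len(serialized.encode("utf-16-le"))


def fits_in_quota(session: dict, quota: int = TYPICAL_QUOTA_BYTES) -> bool:
    """True if the estimate stays under the given quota."""
    return estimated_storage_bytes(session) < quota
```

Running `fits_in_quota` on a parsed JSON export gives a quick check before continuing a long annotation session in the same browser profile.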
Export formats
Exports are available from the Data tab (annotations) and the Lexicon tab (vocabulary). All exports use UTF-8 with a BOM so files open correctly in Excel with non-Latin scripts.
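When reading an exported CSV outside Excel, the BOM should be stripped so that it does not end up glued to the first column name. In Python the `utf-8-sig` codec handles this. The sketch below works on raw bytes; the function name `read_export` and the column names in the usage note are hypothetical.

```python
import csv
import io


def read_export(raw: bytes) -> list[dict[str, str]]:
    """Parse CSV bytes that may start with a UTF-8 BOM into row dicts."""
    # "utf-8-sig" drops a leading BOM if present and is harmless if absent.
    text = raw.decode("utf-8-sig")
    return list(csv.DictReader(io.StringIO(text, newline="")))
```

For a file on disk, the equivalent is `open(path, encoding="utf-8-sig", newline="")`. With plain `utf-8`, the first header would come back as `"\ufeffphrase"` instead of `"phrase"`.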
- XLSX: type column. Column widths auto-sized.
- CSV: np_phrases_*.csv, np_tokens_*.csv, np_annotations_*.csv. Primary/foreign key relationships for direct DB import.
- JSON: session, phrases, tokens, and annotations arrays.
- Lexicon: LEX-XXXX codes and glosses.
Database normalisation (CSV)
The three CSV files form a relational schema:
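The actual column names of the three files are not listed in this section, so the sketch below uses hypothetical ones (`phrase_id`, `token_id`, `annotation_id`, and so on) and a simplified one-tag-per-token layout, purely to show how primary and foreign keys let the tables be joined after import. It uses an in-memory SQLite database with made-up rows.

```python
import sqlite3

# Hypothetical schema: column names and relationships are assumptions.
SCHEMA = """
CREATE TABLE phrases (
    phrase_id   TEXT PRIMARY KEY,
    phrase      TEXT,
    translation TEXT
);
CREATE TABLE tokens (
    token_id  TEXT PRIMARY KEY,
    phrase_id TEXT REFERENCES phrases(phrase_id),
    position  INTEGER,
    token     TEXT,
    gloss     TEXT
);
CREATE TABLE annotations (
    annotation_id TEXT PRIMARY KEY,
    token_id      TEXT REFERENCES tokens(token_id),
    category      TEXT
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create the hypothetical schema and load one illustrative phrase."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO phrases VALUES ('P1', 'kırmızı kedi', 'red cat')")
    conn.executemany(
        "INSERT INTO tokens VALUES (?, ?, ?, ?, ?)",
        [("T1", "P1", 1, "kırmızı", "red"), ("T2", "P1", 2, "kedi", "cat")],
    )
    conn.executemany(
        "INSERT INTO annotations VALUES (?, ?, ?)",
        [("A1", "T1", "Adjective"), ("A2", "T2", "Noun")],
    )
    return conn


def tokens_with_tags(conn: sqlite3.Connection, phrase_id: str) -> list[tuple]:
    """Join tokens to their annotations for one phrase, in token order."""
    query = """
        SELECT t.token, t.gloss, a.category
        FROM tokens AS t
        JOIN annotations AS a ON a.token_id = t.token_id
        WHERE t.phrase_id = ?
        ORDER BY t.position
    """
    return conn.execute(query, (phrase_id,)).fetchall()
```

With real exports, the same idea applies once the CSV headers are known: load each file into its own table and join on the shared key columns.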
Tips & shortcuts
Export the lexicon JSON and re-import it in future sessions to keep LEX-XXXX codes consistent.
Interaction reference
| Action | How |
|---|---|
| Select a token | Click the token chip in the phrase stage |
| Select multiple tokens | Click each token individually (no modifier key needed) |
| Deselect a token | Click the selected token again |
| Clear all selections | Clear selection button in the Noun Phrase card header |
| Reset entire phrase | Reset button — clears tokens and all annotations |
| Apply tag (no types) | Category → Subcategory — applies immediately |
| Apply tag (with types) | Category → Subcategory → Type |
| Navigate back in picker | ← Back link at the top of Step 2 or Step 3 |
| Confirm suggestion | Confirm button in the amber banner |
| Dismiss suggestion | Change or Dismiss button in the banner |
| Append grammatical gloss | Focus a gloss field, then click a teal chip |
| Save phrase | Save phrase → button (all validation must pass) |
| Edit categories | ✎ Edit categories link in Tag selection card header |