NP Annotator Documentation

A structured annotation tool for noun phrases across multiple languages — designed for linguists, corpus researchers, and language data curators.

What is NP Annotator?

Noun Phrase Annotator is a browser-based tool for annotating noun phrases in linguistic datasets. You upload a file containing source sentences and their extracted noun phrases, then work through each row assigning grammatical tags, glosses, and translations — all stored locally in your browser with no server required.

The tool supports multiple languages and scripts (any UTF-8 text, including Turkish, Chinese, Russian, and Greek), a three-level tag hierarchy (Category → Subcategory → Type), per-token grammatical glossing with preset chips, and a persistent lexicon that carries gloss suggestions across phrases.

No data leaves your browser. All annotation work is stored in localStorage. Nothing is sent to any server. Export your data manually when you are ready.

Workflow overview

The annotation process follows a linear five-step flow:

Step 1: Upload (CSV or Excel)
Step 2: Map columns (Context, Data, Code)
Step 3: Configure (language and codes)
Step 4: Annotate (tag, gloss, save)
Step 5: Export (XLSX, CSV, JSON)

Resume at any time. Your progress is auto-saved to localStorage after every phrase. Close the tab, reopen the file, and click Resume session to continue exactly where you left off.

Upload a file

NP Annotator accepts CSV (.csv), Excel (.xlsx), and legacy Excel (.xls) files. CSV files are read explicitly as UTF-8, so Turkish characters such as ş, ğ, ı, ü, ö, and ç, as well as Arabic, Persian, and other scripts, are handled correctly.

Header row detection

The tool auto-detects whether the first row is a header or data by checking whether all values in row 1 are non-numeric. If the first row looks like data, column names are generated automatically (Column 1, Column 2, …). You can override this with the "First row treated as header" checkbox that appears after upload.
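
As a rough illustration of the heuristic described above, the following Python sketch treats the first row as a header only when none of its cells parse as a number. It is not the tool's actual code: the function names and the use of `float()` as the numeric test are assumptions made for this example.

```python
def is_numeric(cell: str) -> bool:
    """Return True if the cell parses as a number (assumed test: float())."""
    try:
        float(cell.strip())
        return True
    except ValueError:
        return False


def first_row_is_header(first_row: list[str]) -> bool:
    """Header heuristic: every value in row 1 must be non-numeric."""
    return all(not is_numeric(cell) for cell in first_row)


def column_names(first_row: list[str]) -> list[str]:
    """Use row 1 as names if it looks like a header, else generate them."""
    if first_row_is_header(first_row):
        return [cell.strip() for cell in first_row]
    return [f"Column {i}" for i in range(1, len(first_row) + 1)]
```

Under this heuristic, a headerless file whose first row is entirely text (a sentence, a noun phrase, and a code such as EN-01) would still be classified as a header. That is the kind of case the override checkbox is there for.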

Column 0 issue. If your data column is the very first column in the file, make sure to select it explicitly in the column mapping step. Column index 0 is valid and fully supported.

File structure requirements

Your file must contain a Context column, a Data column (the noun phrase), and a Data Code column. All three are required in the column mapping step:

context,np,code
"The old man walked past the red building.","the red building","EN-01"
"Benim kırmızı kedilerim çok aktif.","kırmızı kedilerim","TR-01"
"Der kleine schwarze Hund bellt laut.","Der kleine schwarze Hund","DE-01"

Sample files available. If you don't have your own data yet, download one of the sample .csv files shown on the upload screen. Each sample is pre-structured with the three required columns.
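
To check a file against this layout before uploading it, a short Python sketch along the following lines may help. It parses the sample rows, verifies that the three columns exist, and reports whether each noun phrase occurs verbatim in its context sentence. The function names and the `REQUIRED` tuple are hypothetical and are not part of NP Annotator.

```python
import csv
import io

# Sample rows copied from the layout shown above.
SAMPLE = '''context,np,code
"The old man walked past the red building.","the red building","EN-01"
"Benim kırmızı kedilerim çok aktif.","kırmızı kedilerim","TR-01"
"Der kleine schwarze Hund bellt laut.","Der kleine schwarze Hund","DE-01"
'''

REQUIRED = ("context", "np", "code")  # header names used in the sample


def load_rows(text: str) -> list[dict[str, str]]:
    """Parse CSV text and check that the three expected columns are present."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [name for name in REQUIRED if name not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def np_found_in_context(row: dict[str, str]) -> bool:
    """True if the noun phrase occurs verbatim in its source sentence."""
    return row["np"] in row["context"]


rows = load_rows(SAMPLE)
for row in rows:
    print(row["code"], np_found_in_context(row))
```

When reading a real file from disk, opening it with `open(path, encoding="utf-8-sig", newline="")` decodes UTF-8 and drops a leading byte-order mark if one is present.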

Map columns

After uploading, you assign a role to each column in your file. All fields in Step 2 are required. The preview table colour-codes your selections so you can visually confirm the mapping before proceeding.

Field | Status | Description
context | Required | The full source sentence the NP was extracted from. Shown above the phrase in the annotate workspace. The NP is highlighted within it if a match is found.
data | Required | The noun phrase to annotate. Pre-fills the editable NP field. Can be corrected inline if the source data contains errors.
data code | Required | A per-row corpus or data identifier read from the file (e.g. EN-01, TR-BNC-003). Stored alongside every saved annotation.
Language | Required | Select the language of the entire dataset from the dropdown. This is a session-level constant: it applies to every row in the file. Used as the key in the lexicon.
Source name | Required | A free-text label for the source corpus or document collection (e.g. BNC Corpus, OPUS-TR). Not read from a column; typed once for the whole session.
Dataset code | Required | A short session-level identifier (e.g. TR-01). Appears in all exports and saved phrase records.

Auto-detection. Column selects are pre-populated by scanning header names for common patterns (context, sentence, np, phrase, code, etc.). Check the preview table to confirm the auto-detection is correct before clicking Start.
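
The header scan can be pictured with the Python sketch below. The keyword lists are guesses based only on the patterns named above, and `guess_mapping` and `is_mapped` are hypothetical names. The tool's real matching rules may differ.

```python
from typing import Optional

# Hypothetical keyword lists, built from the examples named in the text:
# context, sentence, np, phrase, code.
PATTERNS = {
    "context": ("context", "sentence"),
    "data": ("np", "phrase"),
    "data_code": ("code",),
}


def guess_mapping(headers: list[str]) -> dict[str, Optional[int]]:
    """Map each role to the index of the first unused header matching a keyword."""
    mapping: dict[str, Optional[int]] = {}
    used: set[int] = set()
    for role, keywords in PATTERNS.items():
        mapping[role] = None
        for index, header in enumerate(headers):
            name = header.strip().lower()
            if index not in used and any(key in name for key in keywords):
                mapping[role] = index
                used.add(index)
                break
    return mapping


def is_mapped(mapping: dict[str, Optional[int]], role: str) -> bool:
    """Compare against None so that column index 0 counts as a valid choice."""
    return mapping.get(role) is not None
```

Substring matching of this kind is deliberately loose (a header such as "input" would match "np"), which is one reason to confirm the result in the preview table. The `is not None` check shows why index 0 needs explicit handling: a plain truthiness test would treat the first column as unselected.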

Session settings

Once all column fields are filled and Start annotating is clicked, your session becomes active. The header shows a strip of session chips displaying the current language, dataset code, and context column.

Clicking ↺ New session reloads the page. Your auto-saved data remains in localStorage and can be resumed at any time — a Resume session banner appears automatically on the next page load.

Clicking the NP Annotator logo returns you to the setup screen without discarding saved data. Unsaved progress on the current phrase is lost, but all previously saved phrases remain intact.


Annotate page

The annotate workspace is divided into four columns:

  1. Rows: a scrollable list of every row in your file. Each row shows the context sentence. Completed rows are marked with a green done badge. A progress bar at the bottom shows overall completion.
  2. Noun phrase: shows the source sentence with the NP highlighted in purple, and an editable text field below it. Edit the field and click Reload ↺ to re-tokenise if the NP in your data contains an error. The context sentence is never affected by edits.
  3. Tag selection: click tokens in the phrase stage to select them (one or more), then navigate the three-step tag picker: Category → Subcategory → Type. Previously tagged token sequences show an amber suggestion banner.
  4. Gloss: one input per token for word-by-word glossing, plus grammatical chips (PL, SG, GEN, POSS, LOC, REL…) that append to whichever field is focused. A phrase translation field at the bottom captures the full English rendering.

All fields must be complete before saving. Every token must be tagged, every token must have a gloss, and a phrase translation is required. Incomplete tokens are highlighted in red when you try to save.

Tag selection

Tags are organised into a three-level hierarchy. Clicking a token (or multiple tokens) opens Step 1 of the picker. Each step narrows the choice:

  1. Category: the broad grammatical class (Noun, Adjective, Article, Possessive…)
  2. Subcategory: the subdivision within the category (e.g. Animate / Inanimate for Noun; Intersective / Non-intersective for Adjective)
  3. Type: the specific type within the subcategory (e.g. Shape, Color, Material for Intersective; Object, Event for Inanimate). A subcategory that has no types is applied immediately at Step 2.

Default tag hierarchy

NOUN Noun
    Animate NOUN-ANIM
    Inanimate NOUN-INANIM
        Object NOUN-INANIM-OBJ
        Event NOUN-INANIM-EVT
ADJ Adjective
    Intersective ADJ-INT
        Shape · Color · Material
    Non-intersective ADJ-NINT
        Size · Age · Qualifier
ART Article: Definite · Indefinite
POSS Possessive: Genitive · PP-Genitive
NUM Numeral: Ordinal · Cardinal
DEM Demonstrative: Proximal · Distal
QUANT Quantifier: Existential · Universal
RC Relative Clause: Restrictive · Non-restrictive
PP Prepositional Phrase
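
One way to picture the hierarchy is as a nested mapping from category to subcategory to a list of types. The Python sketch below models only the NOUN and ADJ branches and assumes that an empty type list means "no third step". The structure and function names are illustrative and say nothing about how the tool stores its categories internally.

```python
# Partial copy of the default hierarchy: category -> subcategory -> types.
# An empty list means the subcategory has no third level (an assumption).
HIERARCHY: dict[str, dict[str, list[str]]] = {
    "NOUN": {
        "Animate": [],
        "Inanimate": ["Object", "Event"],
    },
    "ADJ": {
        "Intersective": ["Shape", "Color", "Material"],
        "Non-intersective": ["Size", "Age", "Qualifier"],
    },
}


def needs_type_step(category: str, subcategory: str) -> bool:
    """True if the picker would have to show Step 3 for this subcategory."""
    return bool(HIERARCHY[category][subcategory])


def tag_paths(category: str) -> list[tuple[str, ...]]:
    """List every complete path that can be applied under a category."""
    paths: list[tuple[str, ...]] = []
    for subcategory, types in HIERARCHY[category].items():
        if types:
            paths.extend((category, subcategory, t) for t in types)
        else:
            paths.append((category, subcategory))
    return paths
```

For example, `tag_paths("NOUN")` yields one two-step path (Noun, Animate) and two three-step paths under Inanimate.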

Tag suggestions

When you select a sequence of tokens that has been tagged before in the current session, an amber banner appears with the previous tag(s). For a single suggestion, one Confirm button applies it instantly. For multiple suggestions, each is shown on its own row with an individual ✓ Confirm button — pick the right one or dismiss the banner and tag manually.

Suggestions are based on exact phrase match from previously saved annotations. Selecting a token like very alone will not inherit a suggestion from a phrase like very big — both tokens must be selected together.
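
The exact-match behaviour can be sketched as a lookup keyed by the whole token sequence. The Python class below is a hypothetical model, not the tool's implementation. Its name and methods are invented for illustration.

```python
from collections import defaultdict


class SuggestionIndex:
    """Remembers which tags were saved for each exact token sequence."""

    def __init__(self) -> None:
        # Key: tuple of tokens exactly as selected. Value: tags seen so far.
        self._tags = defaultdict(list)

    def record(self, tokens: list[str], tag: str) -> None:
        """Store a tag for a saved token sequence, skipping duplicates."""
        known = self._tags[tuple(tokens)]
        if tag not in known:
            known.append(tag)

    def suggest(self, tokens: list[str]) -> list[str]:
        """Return tags recorded for exactly this sequence, or an empty list."""
        return list(self._tags.get(tuple(tokens), []))
```

Because the key is the full tuple, recording a tag for `["very", "big"]` produces no suggestion for `["very"]` on its own, which matches the behaviour described above. A sequence with more than one recorded tag returns several suggestions, corresponding to the multi-row banner.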

Custom categories

The Categories tab lets you add, rename, and delete categories, subcategories, and types. Changes take effect immediately in the tag picker. The ✎ Edit categories link at the top of the Tag selection card is a shortcut to that tab.


Gloss panel

The Gloss card sits to the right of the Tag selection card. It contains one input field per token for word-by-word glossing, a palette of preset grammatical chips, and a phrase translation field.

Grammatical gloss chips

Chips append a grammatical abbreviation to whichever gloss input is currently focused — click a field first, then click a chip. The result is dot-notation: cat.PL.POSS.

PL SG FEM MASC NEUT 1SG 2SG 3SG 1PL 2PL 3PL GEN POSS LOC REL

You can also type freely into any gloss field — the chips are shortcuts, not restrictions. A Turkish example:

Phrase: benim kırmızı kedilerim
Gloss: 1SG.GEN red cat.PL.POSS
Translation: my red cats
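
The dot notation produced by the chips can be reproduced with a small helper. This Python sketch rebuilds the gloss line of the Turkish example. `append_chip` and `gloss_line` are hypothetical names, and the behaviour for an empty field (the chip becomes the whole gloss) is an assumption.

```python
def append_chip(gloss: str, chip: str) -> str:
    """Append a grammatical abbreviation to a gloss using dot notation."""
    gloss = gloss.strip()
    # Assumption: a chip clicked on an empty field becomes the whole gloss.
    return f"{gloss}.{chip}" if gloss else chip


def gloss_line(glosses: list[str]) -> str:
    """Join per-token glosses into a single space-separated gloss line."""
    return " ".join(glosses)


tokens = ["benim", "kırmızı", "kedilerim"]
glosses = [
    append_chip("1SG", "GEN"),
    "red",
    append_chip(append_chip("cat", "PL"), "POSS"),
]
print(gloss_line(glosses))  # 1SG.GEN red cat.PL.POSS
```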

Gloss suggestions

If a token has been glossed before in the current session, its gloss is pre-filled from the Lexicon. When a word has multiple senses (different glosses in different contexts), the field shows the first recorded sense — edit it freely. The Lexicon tab shows all senses for every word and lets you edit them inline.

Phrase translation

The Translation field at the bottom of the Gloss card captures the full English rendering of the entire noun phrase (e.g. red cat for the Turkish noun phrase kırmızı kedi). A phrase translation is required before saving.


Saving a phrase

Click Save phrase → (the primary button below the Tag and Gloss cards) to commit a phrase to the session. The tool validates all three requirements before saving:

  1. All tokens tagged — every token chip must show a tag label and a green border. Untagged tokens pulse red.
  2. All tokens glossed — every gloss input must be non-empty. Empty fields are highlighted in red.
  3. Phrase translation provided — the translation field at the bottom of the Gloss card must be filled.
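
The three checks above amount to a simple validation pass. The following Python sketch shows one way to express them. The `Token` class and the error strings are hypothetical, and the tool's own messages and data model may look different.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    tag: str = ""    # e.g. "ADJ-INT"; empty means untagged
    gloss: str = ""  # e.g. "cat.PL.POSS"; empty means not glossed


def validation_errors(tokens: list[Token], translation: str) -> list[str]:
    """Return one message per unmet requirement; an empty list means savable."""
    errors = [f"untagged token: {t.text}" for t in tokens if not t.tag.strip()]
    errors += [f"missing gloss: {t.text}" for t in tokens if not t.gloss.strip()]
    if not translation.strip():
        errors.append("missing phrase translation")
    return errors


phrase = [
    Token("benim", "POSS", "1SG.GEN"),
    Token("kırmızı", "ADJ-INT", "red"),
    Token("kedilerim"),  # neither tagged nor glossed yet
]
print(validation_errors(phrase, ""))
```

In this example the third token fails both token checks and the translation is empty, so three problems are reported and the phrase would not be saved.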

After saving, the row in the Rows list is marked done, the phrase workspace resets, and the progress bar advances. The saved phrase immediately appears in the Data tab.


Data tab

The Data tab shows all saved phrases in the current session. Each entry displays the phrase ID, the phrase text, language and code chips, the context sentence, phrase translation, and the full ordered tag sequence.

Individual entries can be removed with the Remove button. Clear all wipes the entire session's annotations (with confirmation). Exports are available at the top of the tab.


Lexicon

The Lexicon tab maintains a persistent vocabulary of every word form encountered during annotation. Each entry has a unique code (LEX-0001, LEX-0002…) keyed by word form and language. Words can have multiple senses (e.g. akıllı → "smart" in one context, "clever" in another), each with its own gloss and list of phrase IDs where that sense was used.

Empty glosses block export. If any lexicon entry has an empty gloss, the export buttons are disabled and a warning badge appears in the toolbar. Fill in all glosses before exporting.

The Lexicon can be exported separately as JSON, CSV, or XLSX from the Lexicon tab. Exporting the lexicon JSON and re-importing it in a future session carries over all codes and glosses, so the same word always receives the same LEX-XXXX code across datasets.
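
A lexicon of this shape can be modelled as entries keyed by word form and language, each holding a list of senses. The Python sketch below is a hypothetical model only: the class names, the phrase ID format, the language value "tr", and the way codes are numbered are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Sense:
    gloss: str
    phrase_ids: list[str] = field(default_factory=list)


@dataclass
class Entry:
    code: str
    form: str
    language: str
    senses: list[Sense] = field(default_factory=list)


class Lexicon:
    """Hypothetical in-memory lexicon keyed by (word form, language)."""

    def __init__(self) -> None:
        self.entries: dict[tuple[str, str], Entry] = {}

    def add(self, form: str, language: str, gloss: str, phrase_id: str) -> str:
        """Record a sense for a word and return its LEX code."""
        key = (form, language)
        if key not in self.entries:
            # Assumed numbering: next code follows the current entry count.
            code = f"LEX-{len(self.entries) + 1:04d}"
            self.entries[key] = Entry(code, form, language)
        entry = self.entries[key]
        for sense in entry.senses:
            if sense.gloss == gloss:
                sense.phrase_ids.append(phrase_id)
                return entry.code
        entry.senses.append(Sense(gloss, [phrase_id]))
        return entry.code

    def suggestion(self, form: str, language: str) -> Optional[str]:
        """Return the first recorded gloss for a word, if any."""
        entry = self.entries.get((form, language))
        return entry.senses[0].gloss if entry and entry.senses else None

    def export_blocked(self) -> bool:
        """True if any sense has an empty gloss."""
        return any(
            not sense.gloss.strip()
            for entry in self.entries.values()
            for sense in entry.senses
        )
```

In this model, adding akıllı twice with the glosses "smart" and "clever" keeps a single code and two senses, and `suggestion` returns "smart" because it was recorded first.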


Categories

The Categories tab lets you customise the full three-level tag hierarchy. All changes apply immediately to the tag picker in the Annotate tab.

  • Add main category: type a name in the input at the bottom and click Add. An ID is generated automatically from the name.
  • Add subcategory: click + Subcategory next to any category and enter a name. The ID is prefixed with the parent's ID.
  • Add type: click + Type next to any subcategory. Types are the third and final level.
  • Delete: any level can be deleted with the × button. Existing annotations that used a deleted tag are not affected.

Auto-save

NP Annotator automatically saves the full session state to localStorage after every state-changing action: saving a phrase, deleting an annotation, editing the lexicon, or modifying categories. No manual save step is required.

The auto-save bar below the header shows a green status dot, the timestamp of the last save, and counts of saved phrases and lexicon entries. The dot briefly flashes amber when a save is in progress.

Resuming a session

On the next page load, if saved data is found, a blue banner appears at the top of the setup screen: "Saved session found — N phrases annotated." Click Resume session to restore everything instantly, or Start fresh to begin a new session (the old data remains in storage until cleared).

Clearing saved data

Export before clearing. The 🗑 Clear saved data button in the auto-save bar permanently deletes all session data from localStorage. This cannot be undone. Always export your annotations first.

Storage limits vary by browser but are typically 5–10 MB per origin. Sessions with thousands of rows and hundreds of annotations stay well within this limit.
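
For a back-of-the-envelope check against such a limit, the sketch below estimates the footprint of a session serialised as JSON. It assumes that the session is stored as a JSON string and that the browser counts two bytes per UTF-16 code unit. Neither assumption is confirmed by this documentation, and the session structure shown is invented for the example.

```python
import json

QUOTA_BYTES = 5 * 1024 * 1024  # lower end of the 5-10 MB range quoted above


def utf16_bytes(text: str) -> int:
    """Size of a string encoded as UTF-16 without a byte-order mark."""
    return len(text.encode("utf-16-le"))


def session_size(session: dict) -> int:
    """Estimated storage footprint of a session serialised as JSON."""
    return utf16_bytes(json.dumps(session, ensure_ascii=False))


def fits_quota(session: dict, quota: int = QUOTA_BYTES) -> bool:
    """True if the estimated size stays within the given quota."""
    return session_size(session) <= quota


# Hypothetical session with 1000 short phrase records.
session = {"phrases": [{"id": i, "np": "kırmızı kedilerim"} for i in range(1000)]}
print(session_size(session), fits_quota(session))
```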


Export formats

Exports are available from the Data tab (annotations) and the Lexicon tab (vocabulary). All exports use UTF-8 with a BOM so files open correctly in Excel with non-Latin scripts.

.xlsx (annotations)
Three sheets: phrases, tokens, annotations. Includes a type column. Column widths auto-sized.
.csv (annotations)
Three normalised files: np_phrases_*.csv, np_tokens_*.csv, np_annotations_*.csv. Primary/foreign key relationships for direct DB import.
.json (annotations)
A single structured object with session, phrases, tokens, and annotations arrays.
.xlsx / .csv / .json (lexicon)
Exported from the Lexicon tab. Re-import lexicon JSON in a future session to carry forward LEX-XXXX codes and glosses.
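
If you post-process the CSV exports in Python, the byte-order mark mentioned above is handled by the standard `utf-8-sig` codec. The sketch below writes and reads BOM-prefixed CSV data in memory. The helper names and the sample row are illustrative only.

```python
import codecs
import csv
import io


def to_csv_bytes(header: list[str], rows: list[list[str]]) -> bytes:
    """Serialise rows as CSV and encode them as UTF-8 with a leading BOM."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8-sig")


def has_bom(data: bytes) -> bool:
    """True if the data starts with the UTF-8 byte-order mark."""
    return data.startswith(codecs.BOM_UTF8)


def from_csv_bytes(data: bytes) -> list[list[str]]:
    """Decode UTF-8 CSV bytes, dropping the BOM if present, into rows."""
    text = data.decode("utf-8-sig")
    return list(csv.reader(io.StringIO(text, newline="")))
```

Decoding with plain `utf-8` instead would leave the BOM attached to the first header name, a common cause of "column not found" errors when loading such files.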

Database normalisation (CSV)

The three CSV files form a relational schema:

phrases.csv → phrase_id (Primary Key)
tokens.csv → tokenId, phraseId (FK → phrases.phrase_id)
annotations.csv → annotation_id, phrase_id (FK → phrases.phrase_id)
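
Before importing the three files into a database, it can be worth checking that every foreign key resolves. The Python sketch below does this on tiny in-memory stand-ins. Only the key column names are taken from the schema above. The other columns ("phrase", "token", "tag"), the ID values, and the helper names are hypothetical.

```python
import csv
import io

# Stand-ins for the exported files; non-key columns are invented.
PHRASES = "phrase_id,phrase\nP1,kırmızı kedilerim\n"
TOKENS = "tokenId,phraseId,token\nT1,P1,kırmızı\nT2,P1,kedilerim\n"
ANNOTATIONS = "annotation_id,phrase_id,tag\nA1,P1,ADJ-INT\n"


def read_table(text: str) -> list[dict[str, str]]:
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def orphans(children, foreign_key, parents, primary_key="phrase_id"):
    """Return child rows whose foreign key matches no parent row."""
    known = {row[primary_key] for row in parents}
    return [row for row in children if row[foreign_key] not in known]


def group_by_phrase(children, foreign_key):
    """Group child rows by the phrase they belong to."""
    grouped: dict[str, list[dict[str, str]]] = {}
    for row in children:
        grouped.setdefault(row[foreign_key], []).append(row)
    return grouped


phrases = read_table(PHRASES)
tokens = read_table(TOKENS)
annotations = read_table(ANNOTATIONS)
print(orphans(tokens, "phraseId", phrases))  # []
```

An empty result from `orphans` for both child tables means every row points at an existing phrase, so foreign-key constraints should hold on import.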

Tips & shortcuts

Select multiple tokens: Click each token in turn, adjacent or not, to build a multi-token selection (e.g. kapıyı açan as one RC unit). All selected tokens receive the same tag.
Fix NP errors inline: If the noun phrase in your data is wrong, edit it in the NP field and click Reload. The context sentence is never changed.
Use gram chips efficiently: Click a gloss field to focus it, then click chips in sequence. Tab moves to the next token's field.
Reuse the lexicon across sessions: Export Lexicon JSON after finishing a dataset. Re-import it at the start of the next session to keep LEX-XXXX codes consistent.
Always export before clearing: The Clear saved data button is permanent. Download your XLSX or JSON from the Data tab first.
Back to main via logo: Click NP Annotator in the top-left to return to the setup screen at any time. A Resume button appears immediately.

Interaction reference

Action | How
Select a token | Click the token chip in the phrase stage
Select multiple tokens | Click each token individually (no modifier key needed)
Deselect a token | Click the selected token again
Clear all selections | Clear selection button in the Noun Phrase card header
Reset entire phrase | Reset button (clears tokens and all annotations)
Apply tag (no types) | Category → Subcategory (applies immediately)
Apply tag (with types) | Category → Subcategory → Type
Navigate back in picker | ← Back link at the top of Step 2 or Step 3
Confirm suggestion | Confirm button in the amber banner
Dismiss suggestion | Change or Dismiss button in the banner
Append grammatical gloss | Focus a gloss field, then click a teal chip
Save phrase | Save phrase → button (all validation must pass)
Edit categories | ✎ Edit categories link in Tag selection card header