Common Vocabulary Translator¶
The Common Vocabulary Translator (CVT) is a software program that translates the nearly 15,000 Nomenclature 4.0 terms into the simpler Common Vocabulary terms used in the Digital Archive. It also adds terms to the Common Vocabulary that do not exist in Nomenclature.
Advanced Topic
This information is for anyone who wants to know how the Common Vocabulary gets created, but it is written for someone who will be working with the CVT software and therefore is fairly technical.
Terminology¶
To understand the CVT, become familiar with the terminology below.
Nomenclature hierarchy¶
The Nomenclature term hierarchy can be up to six levels deep, though not every term uses all six levels. The levels are shown below, but note that Nomenclature uses the word term to mean both the entire term and each of the last three levels of the complete term. This can be confusing, which is why this documentation only uses term to mean the entire set of words for a complete term. It refers to the last three levels of a complete Nomenclature term as the Primary, Secondary, and Tertiary parts.
1 Category
2 Class
3 Sub Class
4 Primary term
5 Secondary term
6 Tertiary term
Here are three examples of Nomenclature terms:
Category 01: Built Environment Objects
Building Components
Construction Materials
Building Stone
Dimension Stone
Dressed Stone
Category 07: Distribution & Transportation Objects
Land Transportation T&E
Animal-Powered Vehicles
Carriage
Buckboard
Category 08: Communication Objects
Documentary Objects
Graphic Documents
Photograph
Negative
Common Vocabulary hierarchy¶
The CVT translates Nomenclature terms into simpler Common Vocabulary terms based on a set of translation rules that will be explained later. The rules tell the CVT how to translate the terms shown in the previous section into the terms shown below.
Object, Building Stone, Dimension Stone, Dressed Stone
Transportation, Carriage, Buckboard
Image, Photograph, Negative
As you can see, Common Vocabulary terms are typically much simpler and easier to read than Nomenclature terms. In fact, approximately 95% of Common Vocabulary terms have four or fewer levels in their hierarchy. Only about 5% have five levels and none have six.
Leaf¶
Leaf means the words at the deepest level in a hierarchy. In the examples in both sections above, the leaf words are `Dressed Stone`, `Buckboard`, and `Negative`.
The CVT preserves Nomenclature leaf words so that they are identical to, and can be matched with, the leaf words used in other collection software that uses Nomenclature, such as PastPerfect.
Nomenclature 4.0 supports leaf words in both inverted and natural order. Examples of each follow.
Inverted Order | Natural Order |
---|---|
Negative, Glass Plate | Glass Plate Negative |
Negative, Roll Film | Roll Film Negative |
Negative, Sheet Film | Sheet Film Negative |
The Common Vocabulary uses natural order because it's easier to read.
Important
Commas are not allowed in Common Vocabulary leaf words because the comma is reserved as the hierarchy level indicator. This is particularly important in place names, where a comma is a commonly used separator. For example, in the Common Vocabulary use `Bangor ME` as the leaf, not `Bangor, ME`.
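
To see why the comma restriction matters, here is a minimal sketch that assumes a Common Vocabulary term is split into hierarchy levels at each comma. The function name `split_levels` and the sample place terms are hypothetical and are not taken from the CVT source.

```python
def split_levels(term: str) -> list[str]:
    """Split a Common Vocabulary term into hierarchy levels at each comma."""
    return [level.strip() for level in term.split(",")]


# Hypothetical sample terms, used only to illustrate the rule.
print(split_levels("Places, Bangor ME"))   # ['Places', 'Bangor ME']
print(split_levels("Places, Bangor, ME"))  # ['Places', 'Bangor', 'ME']
```

With the comma in place, `ME` is read as a separate level and becomes the leaf instead of the place name.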
Tail¶
Tail is a CVT term used in the translation rules. The tail consists of a Nomenclature term's Primary, Secondary, and Tertiary parts, if all three exist. If a term has no Tertiary part, the tail consists of the Primary and Secondary parts. If a term has no Secondary part, the tail is just the Primary part. Some higher-level terms have no Primary part but still have a leaf, which is either the Sub Class or Class part. In those cases, the tail and the leaf are the same.
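
The leaf and tail definitions can be sketched in a few lines of Python. This is an illustration only, not the CVT's actual code: the function name `leaf_and_tail` is hypothetical, and joining the tail parts with a comma and space is an assumption based on the comma being the Common Vocabulary level indicator.

```python
def leaf_and_tail(nom_class="", sub_class="", primary="", secondary="", tertiary=""):
    """Return (leaf, tail) for one Nomenclature term.

    Joining tail parts with ", " is an assumption made for this sketch.
    """
    tail_parts = [part for part in (primary, secondary, tertiary) if part]
    if tail_parts:
        # The leaf is the deepest part that exists.
        return tail_parts[-1], ", ".join(tail_parts)
    # No Primary part: the leaf is the Sub Class if present, otherwise the
    # Class, and the tail is the same as the leaf.
    leaf = sub_class or nom_class
    return leaf, leaf


print(leaf_and_tail("Building Components", "Construction Materials",
                    "Building Stone", "Dimension Stone", "Dressed Stone"))
# ('Dressed Stone', 'Building Stone, Dimension Stone, Dressed Stone')

print(leaf_and_tail("Land Transportation T&E", "Animal-Powered Vehicles"))
# ('Animal-Powered Vehicles', 'Animal-Powered Vehicles')
```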
Translation rationale¶
The purpose of the Common Vocabulary is to fulfil the need for a rich, practical, and easy to read set of vocabulary terms that archivists can use in cataloging collections which not only contain physical objects like those found in museums, but also contain items about people, places, structures, events, and organizations.
Nomenclature has thousands of terms for naming human-made objects, but none for things like plants or animals, businesses or organizations, or places like towns and villages. Object names alone are not sufficient for cataloging many Digital Archive items, especially Reference Items which serve as stand-ins for real-world entities like people, boats, and houses.
While Nomenclature by itself does not fill the need, it does provide the hierarchical organization, and most of the leaf words, used by the Common Vocabulary.
The rationale for how Nomenclature gets translated to Common Vocabulary is provided in the following sections.
Common Vocabulary Type and Subject¶
Before continuing, it is important to explain the proper use of Type and Subject. Note that throughout this documentation, when the words Type and Subject appear in small caps, they refer to an item's Type and Subject metadata fields.
Every item must have a Type. There are no exceptions.
Every item must also have a Subject, except in certain cases explained below. The Subject further classifies the Type. For example, an item of Type `Image, Photograph` must have a Subject to indicate the nature of the picture, since `Photograph` alone is too vague. The Subject of a photograph could be `People` to indicate that it's a photo of humans.
When a Subject is optional¶
When an item's Type begins with `Object`, a Subject is not required, meaning that you can save the item without choosing a Subject. For example, an item with Type `Object, Cup, Teacup` needs no further classification and requires no Subject. However, the Type `Object, Art, Sculpture, Carving` is vague, so a Subject like `Nature, Animals, Birds` is recommended.
When a Subject is required, but not needed¶
The rule that a Subject is required unless the Type begins with `Object` works well in general, but some non-object types are self-evident and don't need a Subject. For example, Type `Document, Log, Ship's Log` doesn't need further classification, but if you omit the Subject, you'll get an error when you attempt to save the item. In that case, you can choose the special Subject `none` to override the requirement so that you can save the item.
Note that whenever you edit that item, you'll have to choose `none` again to save it. That's because the item gets saved without a Subject; the Subject is not set to `none`. Use this feature judiciously.
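
The Type and Subject rules above can be summarized in a short sketch. It only restates the rules as described on this page. The function names and error messages are hypothetical, and the Digital Archive's actual validation code may work differently.

```python
NONE_OVERRIDE = "none"  # special Subject that bypasses the requirement


def subject_is_required(type_term: str) -> bool:
    """A Subject is required unless the Type begins with Object."""
    top_level = type_term.split(",")[0].strip()
    return top_level != "Object"


def subjects_to_save(type_term: str, chosen_subjects: list[str]) -> list[str]:
    """Return the Subjects that would be stored, or raise if the item is invalid."""
    if not type_term.strip():
        raise ValueError("Every item must have a Type")
    # "none" only overrides the check; it is never stored as a Subject.
    stored = [s for s in chosen_subjects if s != NONE_OVERRIDE]
    overridden = NONE_OVERRIDE in chosen_subjects
    if subject_is_required(type_term) and not stored and not overridden:
        raise ValueError("A Subject is required for this Type")
    return stored


print(subjects_to_save("Object, Cup, Teacup", []))              # []
print(subjects_to_save("Document, Log, Ship's Log", ["none"]))  # []
print(subjects_to_save("Image, Photograph", ["People"]))        # ['People']
```

Because `none` is dropped before saving, the stored item has no Subject, which is why it has to be chosen again on the next edit.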
Top level types and subjects¶
The table below shows the top level Type and Subject terms in the Common Vocabulary.
Type | Subject |
---|---|
Document | Businesses |
Image | Events |
Map | Nature |
Object | Object |
Publication | Organizations |
Reference | Other |
Set | People |
| | Places |
| | Recreation |
| | Structures |
| | Transportation |
| | Vessels |
Archivists at the Southwest Harbor Public Library derived the terms based on four years of working with the collections of cultural heritage organizations on Mount Desert Island in Maine. Together, the collections contain more than 20,000 items, nearly 25,000 images, and over 3,000 documents. They chose simple words that a) reflect the focus of the collections and b) have obvious meaning. While the choices are subjective, they are meeting the cataloging needs of many organizations.
Top level translations¶
The table below gives a general sense of which Nomenclature Categories translate to which Common Vocabulary top level terms. To see how specific Classes, Subclasses, and individual terms translate, you can look at the translation rules that the CVT follows.
Nomenclature Category | Common Vocabulary |
---|---|
Category 01: Built Environment Objects | All translate to Object except for terms with Nomenclature Class Structures which translates to Subject Structures and to Type Object, Structures |
Category 02: Furnishings | All translate to Object |
Category 03: Personal Objects | All translate to Object |
Category 04: Tools & Equipment for Materials | All translate to Object |
Category 05: Tools & Equipment for Science & Technology | All translate to Object |
Category 06: Tools & Equipment for Communication | All translate to Object except Picture Postcard which translates to Image |
Category 07: Distribution & Transportation Objects | All translate to Object except for transport vehicles like cars, boats, and trains which are translated to Subject Transportation or Vessels and to Type Object, Transportation or Object, Vessels |
Category 08: Communication Objects | Most translate to Document, Image, Map, or Publication. Terms that are none of those translate to Object. |
Category 09: Recreational Objects | All translate to Object |
Category 10: Unclassifiable Objects | All translate to Object |
Note: Although vessels are a form of transportation, `Vessels` is elevated to a top level Subject term because boats and ships are so prominent in the collections of coastal communities.
Rationale for how Nomenclature terms get translated¶
There are many ways that Nomenclature could be morphed into a Common Vocabulary.
Here is an explanation of, and the reasoning behind, the way the CVT does it.
Type vocabulary¶
All Nomenclature terms are translated to the Type vocabulary. In other words, every term in Nomenclature can be found in the Type vocabulary.
The top-level terms `Structures`, `Transportation`, and `Vessels` are included in the Type vocabulary as sub-types of `Object` (e.g. `Object, Structures`) so that they are not top-level Type terms, which would rarely be used in most collections. They are, however, top-level terms in the Subject vocabulary.
For instance, most collections won't have any items of Type `Transportation, Automobile`, but many will have photograph or document items with the Subject `Transportation, Automobile`. If a collection does contain a car, the item's Type would be `Object, Transportation, Automobile`.
Subject vocabulary¶
Many, but not all, Nomenclature terms are also translated to the Subject vocabulary.
Translated: Nomenclature terms for physical, three-dimensional objects are translated to the Subject vocabulary.
Not translated: Objects that are more or less two-dimensional are not translated to the Subject vocabulary. They are translated to the Type vocabulary as `Document`, `Image`, `Map`, or `Publication`.
Rationale¶
Terms for physical objects become both Type and Subject terms because collections generally have either an actual object such as a hat, or they have a photograph of, or documents about, an object such as a picture of a person wearing a hat.
For an actual hat that's in the collection, the item's Type would be `Object, Clothing, Hat` with no Subject. For a photograph of someone wearing a hat, the item's Type would be `Image, Photograph` and the Subjects would be `People` and `Object, Clothing, Hat`.
The idea is that a term used to classify a physical object can usually be used as either a Type or a Subject (but not both), depending on whether the physical object itself is in the collection (Type) or the item in the collection depicts, or is about, the object (Subject).
Terms not in Nomenclature¶
Terms for things like people, animals, businesses, organizations, and events, which do not exist in Nomenclature, have been added to the Subject vocabulary.
Terms for the names of towns and villages are in the Place vocabulary.
The term `Reference` has been added to the Type vocabulary. Learn about Reference Items.
The term `Set` has been added to the Type vocabulary. Learn about Item Sets.
To see exactly what non-Nomenclature terms the CVT adds to the Common Vocabulary, you can look at the additional terms file which tells the CVT what terms to add.
Translation process¶
This section explains how the CVT actually performs the translation from Nomenclature terms to Common Vocabulary terms. In the diagram below, the black box in the middle represents the CVT software. The boxes in the top row represent data files that the CVT reads, and the boxes in the bottom row represent data files that the CVT creates.
Translation Rules¶
The translation rules are contained in a translations file, which is a spreadsheet of rows and columns in CSV format. Each row in the translations file defines one translation rule.
Every rule column is optional except for Category and Translation. If the Category or Translation column is blank, the CVT ignores the entire row. This allows you to place blank rows between groups of rules.
The CVT processes terms in the Nomenclature file one at a time to translate them into corresponding terms in the Common Vocabulary file. For each Nomenclature term, the CVT looks for a matching translation rule by starting with the first rule in the translation file, and going to the next rule, until it finds a match.
The CVT ignores Nomenclature rows having a level column value of 1 or 2 because those rows have no leaf words and therefore do not represent a Nomenclature term. A level 1 row only has a Category value. A level 2 row has a Category and a Class.
To find a matching rule, the CVT interprets each rule by comparing the rule's A - F column values, from left to right, against the corresponding values from the Nomenclature row being processed. If every non-blank rule column value matches the corresponding Nomenclature row value, the CVT applies the rule, otherwise it skips to the next rule. The CVT only applies one rule to a row (it does not keep looking for matching rules once it finds the first match).
Important: More restrictive rules must appear in the translations file above less restrictive rules; otherwise, the CVT will apply a less restrictive rule before it encounters the more restrictive one. For example, a rule that applies to any Nomenclature row having a certain Class must appear after a more restrictive rule that applies only to rows of that Class that have a specific Sub Class or Primary part.
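
The first-match behavior described above can be sketched as follows. This is a simplified illustration, not the CVT's actual code. The function names are hypothetical, exact string comparison is an assumption, and the sample rules and row values are made up for the example rather than copied from the real translations file.

```python
# Rule column -> Nomenclature column, in left-to-right (A - F) order.
RULE_TO_NOMENCLATURE = {
    "Category": "Natural_Order_EN_Category",
    "Class": "Natural_Order_EN_Class",
    "Sub_Class": "Natural_Order_EN_Sub_Class",
    "Primary": "Natural_Order_EN_Primary_Term",
    "Secondary": "Natural_Order_EN_Secondary_Term",
    "Identifier": "Identifier",
}


def rule_matches(rule: dict, row: dict) -> bool:
    """True if every non-blank rule value equals the row's corresponding value."""
    for rule_col, nom_col in RULE_TO_NOMENCLATURE.items():
        wanted = rule.get(rule_col, "").strip()
        if wanted and wanted != row.get(nom_col, "").strip():
            return False
    return True


def find_rule(rules: list, row: dict):
    """Return the first matching rule, or None if no rule matches."""
    for rule in rules:
        # Rows with a blank Category or Translation are ignored entirely.
        if not rule.get("Category", "").strip() or not rule.get("Translation", "").strip():
            continue
        if rule_matches(rule, row):
            return rule
    return None


# Hypothetical rules: a specific one, a blank separator row, a general one.
category = "Category 07: Distribution & Transportation Objects"
rules = [
    {"Category": category, "Class": "Land Transportation T&E",
     "Translation": "Transportation|{tail}"},
    {},
    {"Category": category, "Translation": "Object|{tail}"},
]
row = {
    "Natural_Order_EN_Category": category,
    "Natural_Order_EN_Class": "Land Transportation T&E",
    "Natural_Order_EN_Sub_Class": "Animal-Powered Vehicles",
    "Natural_Order_EN_Primary_Term": "Carriage",
    "Natural_Order_EN_Secondary_Term": "Buckboard",
}

print(find_rule(rules, row)["Translation"])                  # Transportation|{tail}
print(find_rule(list(reversed(rules)), row)["Translation"])  # Object|{tail}
```

The second call shows why ordering matters: with the general rule first, the more restrictive rule is never reached.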
Translation rule columns¶
The table below explains each of the translation rule columns.
Column | Name | Usage |
---|---|---|
A | Category | Match a value from the Nomenclature Natural_Order_EN_Category column |
B | Class | Match a value from the Nomenclature Natural_Order_EN_Class column |
C | Sub_Class | Match a value from the Nomenclature Natural_Order_EN_Sub_Class column |
D | Primary | Match a value from the Nomenclature Natural_Order_EN_Primary_Term column |
E | Secondary | Match a value from the Nomenclature Natural_Order_EN_Secondary_Term column |
F | Identifier | Match a value from the Nomenclature Identifier column. Use this column when the rule should only apply to a single Nomenclature term. |
G | Translation | A pattern specifying the Common Vocabulary term. The pattern contains one or more of the substitution elements explained in the next section. |
H | Replace | Zero or more pairs of values in double quotes, separated by a comma, specifying before and after values. The CVT applies the replacement to the translated text after performing the translation. |
J | Notes | Comments (ignored by the CVT) |
As an example, if a rule's Replace column contained the pair `"Lodging Facility", "Lodging"`, the before value would match within the term `Structures, Commercial, Lodging Facility, Hotel`, and the term would be changed to `Structures, Commercial, Lodging, Hotel`.
Substitution Elements¶
The Translation column of a translation rule must contain one or more of the following substitution elements:
Substitution | Meaning |
---|---|
{class} | The value of the Natural_Order_EN_Class column of the Nomenclature row. Used when the Nomenclature class should be included as-is in the resulting Common Vocabulary term. |
{sub_class} | The value of the Natural_Order_EN_Sub_Class column of the Nomenclature row. Used when the Nomenclature sub class should be included as-is in the resulting Common Vocabulary term. |
{tail} | The tail of the Nomenclature term. Used for most rules. |
{leaf} | The leaf of the Nomenclature term. Used instead of {tail} only when the tail value contains more levels of information than are necessary for the resulting Common Vocabulary term. |
The `{tail}` and `{leaf}` elements are mutually exclusive, so a rule should use only one of them. These two elements are only ever used at the end of the translation text.
Examples of rules in the Translation column:
Transportation|{sub_class}|{tail}
Object|Tools & Equipment|{class}|{tail}
Object|Armaments|{leaf}
Translation Process¶
The CVT processes one Nomenclature row at a time. When the CVT finds a matching rule for a row, it applies the rule to the row as follows:
- Perform the substitutions listed in the table above
- Make any replacements specified in the Replace column
- Emit the resulting string as the Common Vocabulary term
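
The three steps above can be sketched as a single function. This is an illustration under stated assumptions, not the CVT's actual implementation: the function name `apply_rule` is hypothetical, the sketch assumes that pipe characters in the Translation pattern separate hierarchy levels and are emitted as a comma and space, and the sample values are chosen to mirror the examples on this page.

```python
import re


def apply_rule(translation: str, replace: str, values: dict) -> str:
    """Apply one rule's Translation pattern and Replace pairs to a row's values."""
    # Step 1: perform the substitutions.
    term = translation
    for name in ("class", "sub_class", "tail", "leaf"):
        term = term.replace("{" + name + "}", values.get(name, ""))
    # Assumption: "|" in the pattern marks a level boundary, written as ", ".
    term = ", ".join(part.strip() for part in term.split("|"))

    # Step 2: make the replacements. The Replace column holds pairs of
    # double-quoted values: "before", "after", "before", "after", ...
    quoted = re.findall(r'"([^"]*)"', replace)
    for before, after in zip(quoted[0::2], quoted[1::2]):
        term = term.replace(before, after)

    # Step 3: emit the resulting string.
    return term


print(apply_rule("Transportation|{tail}", "",
                 {"tail": "Carriage, Buckboard", "leaf": "Buckboard"}))
# Transportation, Carriage, Buckboard

print(apply_rule("Structures|{sub_class}|{tail}", '"Lodging Facility", "Lodging"',
                 {"sub_class": "Commercial", "tail": "Lodging Facility, Hotel"}))
# Structures, Commercial, Lodging, Hotel
```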
Translation software¶
The CVT is implemented as a Python program that was developed by George Soules of AvantLogic Corporation. The source code and data files are available as open source in the AvantCommonVocabulary repository on GitHub.
Usage¶
>>> build_common_vocabulary.py
Input Files¶
The program reads these files from the `/data` folder located in the Python script's folder.
input-nomenclature-sortEn_2020-05-18.csv
input-translations.csv
input-additional-terms.csv
input-previous-digital-archive-vocabulary.csv
Output Files¶
The program creates these files in the `/data` folder located in the Python script's folder. It also uploads the files to digitalarchive.us/vocabulary.
digital-archive-vocabulary.csv
digital-archive-diff.csv
Notes¶
- The `digital-archive-diff.csv` file will contain update instructions if the CVT found differences between `input-previous-digital-archive-vocabulary.csv` and the newly created `digital-archive-vocabulary.csv`. Learn how to apply the updates to Digital Archive sites.
- If `input-previous-digital-archive-vocabulary.csv` does not exist as an input file:
    - `digital-archive-diff.csv` will be empty because there can be no differences without a previous vocabulary to compare to.
    - The CVT will create a new copy of `input-previous-digital-archive-vocabulary.csv` (the dashed line in the CVT diagram above) by copying the newly generated `digital-archive-vocabulary.csv` file.
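
The comparison logic in these notes can be sketched roughly as below. The function name and the `ADD` and `DELETE` labels are hypothetical. The actual format of the update instructions in `digital-archive-diff.csv` is not described on this page, and the real CVT may detect changes differently, for example by matching on identifiers.

```python
def diff_vocabularies(previous_terms, new_terms):
    """Return a list of (action, term) pairs describing what changed.

    previous_terms is None when no previous vocabulary file exists,
    in which case there is nothing to compare and the diff is empty.
    """
    if previous_terms is None:
        return []
    previous, new = set(previous_terms), set(new_terms)
    changes = [("ADD", term) for term in sorted(new - previous)]
    changes += [("DELETE", term) for term in sorted(previous - new)]
    return changes


old = ["Object, Cup", "Object, Cup, Teacup"]
new = ["Object, Cup", "Object, Cup, Mug"]
print(diff_vocabularies(old, new))
# [('ADD', 'Object, Cup, Mug'), ('DELETE', 'Object, Cup, Teacup')]
print(diff_vocabularies(None, new))
# []
```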
Nomenclature data¶
To get the latest version of Nomenclature 4.0:
- Download it as an Excel file from https://www.nomenclature.info/api/download/dataset/nom/nomenclature-sortEn.xlsx
- Open the `.xlsx` file and save it as a CSV file
- If Excel displays a `We found a problem` error, click `YES` to recover the file
- Save the file as `CSV UTF-8`
- Delete the `.xlsx` file since it's no longer needed
To compare old and new versions of Nomenclature using the Beyond Compare program, do a text compare, not a table compare.