This page describes the file formats
of two downloadable versions of the 2011 CoLAG domain. A more
detailed description of the domain can be found in Sakas
& Fodor (2012). A technical report that gives a fully
specified description of the multi-language ‘supergrammar’ that
generated the domain is available here.
A description of the original version of the domain (circa 2003) and
results from early investigations can be found in Sakas (2003),
Fodor and Sakas (2004), and elsewhere.
Any questions or difficulties concerning the downloadable files
should be directed to William Sakas: sakas@hunter.cuny.edu.
The CoLAG domain is downloadable in two formats:
Format 1) as a single (large) 'flat' file.
Download: COLAG_2011_flat.zip (~56 MB)
Format 2) as four separate (smaller) files.
Download: COLAG_2011_relational.zip (~13 MB)
The information contained in each
format is identical. Format 1) is intended for batch (beginning to
end) processing and Format 2) is suitable for importing into a
relational database system for interactive querying.
Format 1) contains much redundancy. For example, the CoLAG
sentence S Verb O1 (subject verb direct-object)
exists in over 400 CoLAG languages and is repeated for each
language it exists in, albeit often with a different tree
structure. But Format 1) is easy to handle if your preferred
programming style is to process from beginning to end collecting
summary data for future analysis. Format 2) has most of the
redundancy removed and is importable into a relational database
system for interactive querying (though we have also found that a
single non-relational SQL table created by importing the large
flat file works quite well if the proper indices are generated
after the data has been imported). The specification of the file
formats is given below.
Sentences in CoLAG consist of sequences of non-null lexical items (e.g., S, O1, Adv, Aux, Verb, etc.) and non-null features (e.g., DEC, Q, WH, etc.). Some example sentences are:
S Aux[+FIN] Verb Adv [ILLOC DEC]
Verb[+FIN] S P O3 [ILLOC Q] (P is preposition, O3 is object of preposition)
O1[+WH] S O2 Verb[+FIN] ka [ILLOC Q] (ka is a question marker)
Verb[-FIN] O1 [ILLOC IMP]
Note that in the downloadable files, the illocutionary force feature (e.g., [ILLOC Q]) is maintained in its own column, separate from the rest of the sentence. Also, for readability, the finiteness feature ([+FIN]) is not shown in the sentences in these files; if needed, it can easily be generated for a sentence as follows:
if [ILLOC DEC] or [ILLOC Q]:    # a declarative or a question
    if there is an Aux in the sentence:
        attach [+FIN] to the Aux
    else:
        attach [+FIN] to the Verb
# Note: the Verb in an imperative does not receive the [+FIN] feature
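The rule above can be sketched in Python as follows. This is a minimal illustration, not part of the CoLAG distribution: the function name add_finiteness is hypothetical, and it assumes that the sentence tokens are separated by single spaces and that the auxiliary and the verb appear exactly as the tokens Aux and Verb.

```python
def add_finiteness(sent, illoc):
    """Re-attach [+FIN] to a sentence string from the 'sent' column.

    Assumes (hypothetically) that tokens are space-separated and that the
    auxiliary and verb are spelled exactly 'Aux' and 'Verb'.
    """
    tokens = sent.split()
    if illoc not in ("DEC", "Q"):
        # Imperatives: the Verb does not receive [+FIN]; leave unchanged.
        return " ".join(tokens)
    # [+FIN] goes on the Aux if there is one, otherwise on the Verb.
    target = "Aux" if "Aux" in tokens else "Verb"
    return " ".join(t + "[+FIN]" if t == target else t for t in tokens)


print(add_finiteness("S Aux Verb Adv", "DEC"))  # S Aux[+FIN] Verb Adv
print(add_finiteness("Verb S P O3", "Q"))       # Verb[+FIN] S P O3
```

The sketch only adds [+FIN]; it does not add the [-FIN] shown on the imperative verb in the example sentences above.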
For both formats, CoLAG grammars are
formulated in a principles and parameters framework and are
represented in the downloadable files as a string of 13 zeros
and/or ones which correspond to one or the other values of the
thirteen binary parameters that distinguish the grammars in the
domain. Individual grammars all combine a universal component (UG)
with their relevant parameter values (see the Supergrammar).
The parameters and their values are listed in the table below. The value of the first
parameter (P1) is the leftmost character in the string and the
value of the last parameter (P13) is the rightmost. For example,
the grammar 0001001100011 licenses a language that is
subject-initial, head-initial, and complementizer-initial, and that has
optional topics, no null subjects, no null topics, wh-movement,
preposition stranding, no topic marking, no V to I movement, no I
to C movement, affix hopping, and question inversion.
It is important to note that the use of 0's and 1's to designate
parameter values is for notational compactness. Actual parameter
values take the form of 'treelets' (small fragments of tree
structure); see the discussion in Fodor (1998a) and in Sakas
& Fodor (2012), around Figure 2.
Also to be noted is that the choices of which linguistic
phenomena are coded here as 0 and which as 1 are somewhat
arbitrary; 0 does not necessarily imply a default value. On the
role of default values in learning, see discussion in Sakas
& Fodor (2012).
Parameter | Name | Value 0 | Value 1
P1 | Subject Position | Subject Initial | Subject Final
P2 | Headedness in IP, NegP, VP, PP | Head Initial | Head Final
P3 | Headedness in CP | Complementizer Initial | Complementizer Final
P4 | Optional (versus obligatory) Topic | Obligatory Topic | Optional Topic
P5 | Null Subject | No Null Subject | Optional Null Subject
P6 | Null Topic | No Null Topic | Optional Null Topic
P7 | Wh-movement | No Wh-Movement | Obligatory Wh-Movement
P8 | Preposition Stranding | Pied Piping | Preposition Stranding
P9 | Topic Marking | No Topic Marking | Topic Marking
P10 | VtoI Movement | No VtoI Movement | Obligatory VtoI Movement
P11 | ItoC Movement | No ItoC Movement | Obligatory ItoC Movement
P12 | Affix Hopping | No Affix Hopping | Affix Hopping
P13 | Q-Inversion (ItoC in questions) | No Question Inversion | Obligatory Question Inversion
These conventional names for the parameters are
used for convenient reference, but please bear in mind that the
actual linguistic consequences of these parameters are not fully
self-evident because they depend to various extents on how they
interact with each other and with the ‘universal grammar’ (the UG)
of the CoLAG domain.
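As an illustration of how a grammar string maps onto the parameter table, the following Python sketch decodes a 13-character string into named values. The names PARAMETERS and describe_grammar are hypothetical, and the sketch assumes that the first value listed for each parameter in the table corresponds to 0 and the second to 1, which is consistent with the worked example 0001001100011 above.

```python
# (name, value for 0, value for 1), in order P1..P13, copied from the table.
PARAMETERS = [
    ("Subject Position", "Subject Initial", "Subject Final"),
    ("Headedness in IP, NegP, VP, PP", "Head Initial", "Head Final"),
    ("Headedness in CP", "Complementizer Initial", "Complementizer Final"),
    ("Optional (versus obligatory) Topic", "Obligatory Topic", "Optional Topic"),
    ("Null Subject", "No Null Subject", "Optional Null Subject"),
    ("Null Topic", "No Null Topic", "Optional Null Topic"),
    ("Wh-movement", "No Wh-Movement", "Obligatory Wh-Movement"),
    ("Preposition Stranding", "Pied Piping", "Preposition Stranding"),
    ("Topic Marking", "No Topic Marking", "Topic Marking"),
    ("VtoI Movement", "No VtoI Movement", "Obligatory VtoI Movement"),
    ("ItoC Movement", "No ItoC Movement", "Obligatory ItoC Movement"),
    ("Affix Hopping", "No Affix Hopping", "Affix Hopping"),
    ("Q-Inversion (ItoC in questions)", "No Question Inversion",
     "Obligatory Question Inversion"),
]


def describe_grammar(gramm):
    """Map a 13-character 0/1 string to {parameter name: value label}."""
    if len(gramm) != len(PARAMETERS) or set(gramm) - {"0", "1"}:
        raise ValueError("expected a string of thirteen 0s and 1s")
    # The leftmost character is P1, the rightmost is P13.
    return {name: (zero, one)[int(bit)]
            for bit, (name, zero, one) in zip(gramm, PARAMETERS)}


for name, value in describe_grammar("0001001100011").items():
    print(f"{name}: {value}")
```

As noted above, these labels are only convenient names; the sketch says nothing about how the parameters interact.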
Syntactic tree structures are identical in both Format 1) and Format 2). CoLAG uses bracketed tree notation, which deviates from standard bracketed notation in that parentheses rather than square brackets are used to demarcate constituents. Square brackets are used to demarcate features. Terminals are surrounded by double quotes, though some terminals or features (e.g., [+NULL], [SLASH S]) may not be realized in the surface sentence. For example, one bracketed tree structure (of several) for the sentence S Aux Verb O1 O2 [ILLOC DEC] in CoLAG is:
(CP[ILLOC DEC][+FIN]"S[+TOPIC]"(Cbar[ILLOC DEC][+FIN][SLASH S](C[ILLOC DEC][+FIN]"THAT[+NULL]")(IP[ILLOC DEC][+FIN][SLASH S]"S[+NULL][SLASH S]"(Ibar[ILLOC DEC][+FIN](I[ILLOC DEC][+FIN]"Aux[+FIN]")(VP(Vbar(V"Verb")"O1""O2"))))))
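Because terminals are the only double-quoted items in a structure, they can be pulled out with a regular expression. The Python sketch below is one possible way to do this, under two assumptions that are ours rather than part of the specification: that unrealized terminals are exactly those carrying [+NULL], and that all bracketed features can be stripped for comparison with the surface sentence. The name overt_terminals is hypothetical.

```python
import re

TERMINAL = re.compile(r'"([^"]*)"')   # text between a pair of double quotes
FEATURE = re.compile(r"\[[^\]]*\]")   # a bracketed feature such as [+FIN]


def overt_terminals(struct):
    """Return the pronounced terminals of a bracketed tree, features removed."""
    terminals = TERMINAL.findall(struct)
    # Assumption: a terminal marked [+NULL] is not realized on the surface.
    overt = [t for t in terminals if "[+NULL]" not in t]
    return [FEATURE.sub("", t) for t in overt]


# The example tree from the text, joined into a single string.
EXAMPLE = (
    '(CP[ILLOC DEC][+FIN]"S[+TOPIC]"(Cbar[ILLOC DEC][+FIN][SLASH S]'
    '(C[ILLOC DEC][+FIN]"THAT[+NULL]")(IP[ILLOC DEC][+FIN][SLASH S]'
    '"S[+NULL][SLASH S]"(Ibar[ILLOC DEC][+FIN](I[ILLOC DEC][+FIN]"Aux[+FIN]")'
    '(VP(Vbar(V"Verb")"O1""O2"))))))'
)

print(" ".join(overt_terminals(EXAMPLE)))  # S Aux Verb O1 O2
```

Stripping every feature is a simplification; features such as [+WH] that appear on overt tokens would need separate handling.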
In the large flat file, each row contains
data relevant for a single sentence as generated with a particular
tree structure by a particular grammar (an ambiguous sentence will
appear on more than one row with different tree structures). Fields
are separated by tabs (i.e., the file is 'tab-delimited'). Lines
are terminated in MS Windows style, i.e., carriage return/line
feed: ASCII codes 13 and 10 respectively. There are 7
columns. There is no header line, though the column headers/field
names we use are shown in the table below, after each column
number. Using these names is not required, but we
encourage users of the domain to be consistent with us in this
regard.
Columns 1-4 contain the linguistic data. Columns 5, 6 and 7 are
integer identifications (IDs) for the grammar, sentences and tree
structures. These are included for efficiency: most programming
languages are faster at comparing/manipulating numbers than
strings, and for certain queries the string representation is
unneeded (e.g., Which CoLAG languages are subsets of other CoLAG
languages?).
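A subset query of this kind can be answered from the integer IDs alone. The following Python sketch is one possible approach, not the procedure used to build the domain: it treats a language as the set of sentID values licensed by a grammID (ignoring tree structures) and compares every pair of languages. The function names are hypothetical, and the pairwise loop is quadratic in the number of grammars.

```python
from collections import defaultdict


def languages_by_grammar(rows):
    """Group (grammID, sentID) integer pairs into {grammID: set of sentIDs}."""
    langs = defaultdict(set)
    for gramm_id, sent_id in rows:
        langs[gramm_id].add(sent_id)
    return langs


def proper_subset_pairs(langs):
    """Return sorted (a, b) pairs where language a is a proper subset of b."""
    return sorted((a, b) for a in langs for b in langs
                  if langs[a] < langs[b])   # '<' on sets: proper subset


# Toy data for illustration only; these are not real CoLAG IDs.
toy_rows = [(1, 10), (1, 11), (2, 10), (3, 12)]
print(proper_subset_pairs(languages_by_grammar(toy_rows)))  # [(2, 1)]
```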
Column 1 (gramm):
A principles and parameters grammar consisting of a string of
thirteen zeros and/or ones, e.g., 0001001100011.
Column 2 (illoc):
The illocutionary force of the sentence, consisting of one of Q,
DEC or IMP (question, declarative or imperative).
Column 3 (sent):
The overt tokens and features that make up the sentence, e.g.,
S Aux Verb O1 O2.
Column 4 (struct):
A bracketed representation of one tree structure for the sentence,
given the grammar.
Column 5 (grammID):
An integer ID for the grammar in Column 1. This is simply the
decimal value (base 10) of the grammar’s binary (base 2)
representation.
Column 6 (sentID):
An arbitrarily assigned integer ID for the combination of the
illocutionary force in Column 2 and the sentence in Column 3. For
example, Aux Verb O1 [ILLOC DEC] would have a different ID than
Aux Verb O1 [ILLOC Q].
Column 7 (structID):
An arbitrarily assigned
integer ID for the structure in Column 4.
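Since grammID is the base-10 value of the 13-bit grammar string, converting between the two is a one-liner in most languages. A small Python sketch (the function names gramm_to_id and id_to_gramm are hypothetical):

```python
def gramm_to_id(gramm):
    """Convert a 13-character 0/1 grammar string to its integer grammID."""
    return int(gramm, 2)


def id_to_gramm(gramm_id):
    """Convert a grammID back to a zero-padded 13-character string."""
    return format(gramm_id, "013b")


print(gramm_to_id("0001001100011"))  # 611
print(id_to_gramm(611))              # 0001001100011
```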
Columns 1-4 contain text data. The maximum character widths of the
columns are:
Column 1: 13 characters
Column 2: 3 characters
Column 3: 50 characters
Column 4: 550 characters
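For beginning-to-end processing, the flat file can be read row by row. The Python sketch below shows one way to do so under the specification above (tab-delimited, 7 columns, no header line). The names Row and read_flat are hypothetical, and the name of the text file inside the zip archive is not given here, so the path must be supplied by the user. Quoting is switched off because the struct column contains literal double quotes.

```python
import csv
from typing import NamedTuple


class Row(NamedTuple):
    gramm: str
    illoc: str
    sent: str
    struct: str
    grammID: int
    sentID: int
    structID: int


def read_flat(lines):
    """Yield one Row per line of a tab-delimited, headerless flat file."""
    # QUOTE_NONE: the double quotes around terminals in 'struct' are data,
    # not CSV quoting, so they must be passed through untouched.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for rec in reader:
        if not rec:
            continue  # skip blank lines, if any
        gramm, illoc, sent, struct, gramm_id, sent_id, struct_id = rec
        yield Row(gramm, illoc, sent, struct,
                  int(gramm_id), int(sent_id), int(struct_id))


# Usage sketch (path is a placeholder for the extracted flat file):
# with open(path, newline="") as handle:
#     for row in read_flat(handle):
#         ...
```

Opening the file with newline="" lets the csv module deal with the carriage return/line feed terminators itself.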
The description of Format 2)
makes reference to the columns and column headers outlined in
Format 1) immediately above.
This format manages grammars, sentences and
structures in three separate files together with a fourth file
relating the three data files together. Columns are
tab-delimited, and there are no header lines.
The grammars file, COLAG_2011_gramms.txt, contains two columns:
grammID and gramm.
The sentences file, COLAG_2011_sents.txt,
contains three columns: sentID, illoc and sent.
The tree structures file, COLAG_2011_structs.txt,
contains two columns: structID and struct.
The IDs file, COLAG_2011_IDs.txt, contains three columns:
grammID, sentID and structID.
The grammars file, the sentences file and the tree structures file
contain relevant linguistic information with redundancies removed:
in each file there is one row for each unique element (grammar,
sentence, or tree structure) in CoLAG. The IDs file "ties" the
relationships between the grammars, sentences and structures
together (i.e., which sentences and corresponding tree structures
are licensed by which CoLAG grammars). The IDs file has exactly
the same number of rows as the flat file described in Format 1),
but without linguistic information. Database "joins" or "views"
can be used to extract the linguistic information when
required.
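As an illustration of such a join, the following Python sketch loads the four tables into an in-memory SQLite database and defines a view that reproduces the seven columns of the flat format. The table names (gramms, sents, structs, ids), the view name flat, the index name, and the function build_db are our own choices for this example, not part of the distribution; the rows passed in are toy data.

```python
import sqlite3


def build_db(gramms, sents, structs, ids):
    """Create an in-memory database from the four relational tables."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE gramms  (grammID INTEGER PRIMARY KEY, gramm TEXT);
        CREATE TABLE sents   (sentID INTEGER PRIMARY KEY, illoc TEXT, sent TEXT);
        CREATE TABLE structs (structID INTEGER PRIMARY KEY, struct TEXT);
        CREATE TABLE ids     (grammID INTEGER, sentID INTEGER, structID INTEGER);
        -- A view with the same seven columns as the flat format.
        CREATE VIEW flat AS
            SELECT g.gramm, s.illoc, s.sent, t.struct,
                   i.grammID, i.sentID, i.structID
            FROM ids AS i
            JOIN gramms  AS g ON g.grammID  = i.grammID
            JOIN sents   AS s ON s.sentID   = i.sentID
            JOIN structs AS t ON t.structID = i.structID;
    """)
    con.executemany("INSERT INTO gramms VALUES (?, ?)", gramms)
    con.executemany("INSERT INTO sents VALUES (?, ?, ?)", sents)
    con.executemany("INSERT INTO structs VALUES (?, ?)", structs)
    con.executemany("INSERT INTO ids VALUES (?, ?, ?)", ids)
    # Create the index after loading the data, as suggested earlier.
    con.execute("CREATE INDEX ids_by_gramm ON ids (grammID)")
    con.commit()
    return con


# Toy rows for illustration only; sentID and structID values are made up.
con = build_db(
    gramms=[(611, "0001001100011")],
    sents=[(1, "DEC", "S Aux Verb O1 O2")],
    structs=[(1, '(VP(Vbar(V"Verb")"O1""O2"))')],
    ids=[(611, 1, 1)],
)
for row in con.execute("SELECT gramm, illoc, sent FROM flat WHERE grammID = 611"):
    print(row)
```

In practice the rows would come from the four tab-delimited files rather than from Python literals.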