CoLAG Language Domain Downloadables

This page describes the file formats of two downloadable versions of the 2011 CoLAG domain. A more detailed description of the domain can be found in Sakas & Fodor (2012). A technical report that gives a fully specified description of the multi-language ‘supergrammar’ that generated the domain is available here. A description of the original version of the domain (circa 2003) and results from early investigations can be found in Sakas (2003), Fodor and Sakas (2004), and elsewhere.

Any questions or difficulties concerning the downloadable files should be directed to William Sakas: sakas@hunter.cuny.edu.

The Formats

The information contained in each format is identical. Format 1) is intended for batch (beginning to end) processing and Format 2) is suitable for importing into a relational database system for interactive querying.

Format 1) contains much redundancy. For example, the CoLAG sentence S Verb O1 (subject verb direct-object) exists in over 400 CoLAG languages and is repeated for each language in which it exists, albeit often with a different tree structure. But Format 1) is easy to handle if your preferred programming style is to process from beginning to end, collecting summary data for future analysis. Format 2) has most of the redundancy removed and can be imported into a relational database system for interactive querying (though we have also found that a single non-relational SQL table created by importing the large flat file works quite well, provided the proper indices are generated after the data has been imported). The specification of the file formats is given below.

The Sentences

Sentences in CoLAG consist of sequences of non-null lexical items (e.g., S, O1, Adv, Aux, Verb, etc.) and non-null features (e.g., DEC, Q, WH, etc.). Some example sentences are:

Note that in the downloadable files, the illocutionary force feature (e.g., [ILLOC Q]) is kept in its own column, separate from the rest of the sentence. Also, for readability, the finiteness feature ([+FIN]) is not shown in the sentences in these files; if needed, it can easily be generated for a sentence by:

else [+FIN] is attached to the Verb
# Note: Verb in an imperative does not receive the [+FIN] feature

All features (including [+FIN] and surface-null features, e.g., SLASH) appear in the bracketed tree structures of the sentences described below.

The Grammars

For both formats, CoLAG grammars are formulated in a principles and parameters framework and are represented in the downloadable files as a string of 13 zeros and/or ones which correspond to one or the other values of the thirteen binary parameters that distinguish the grammars in the domain. Individual grammars all combine a universal component (UG) with their relevant parameter values (see the Supergrammar).

The parameters and their values are listed in the table below. The value of the first parameter (P1) is the leftmost character in the string and the value of the last parameter (P13) is the rightmost. For example, the grammar 0001001100011 licenses a language that is subject-initial, head-initial and complementizer-initial, and that has optional topics, no null subjects, no null topics, wh-movement, preposition stranding, no topic marking, no V to I movement, no I to C movement, affix hopping and question inversion.

It is important to note that the use of 0's and 1's to designate parameter values is for notational compactness. Actual parameter values take the form of 'treelets' (small fragments of tree structure); see discussion in Fodor 1998a and Sakas & Fodor 2012 - around Figure 2.

Also to be noted is that the choices of which linguistic phenomena are coded here as 0 and which as 1 are somewhat arbitrary; 0 does not necessarily imply a default value. On the role of default values in learning, see discussion in Sakas & Fodor (2012).

Parameter	Name	Value 0	Value 1
P1	Subject Position	Subject Initial	Subject Final
P2	Headedness in IP, NegP, VP, PP	Head Initial	Head Final
P3	Headedness in CP	Complementizer Initial	Complementizer Final
P4	Optional (versus obligatory) Topic	Obligatory Topic	Optional Topic
P5	Null Subject	No Null Subject	Optional Null Subject
P6	Null Topic	No Null Topic	Optional Null Topic
P7	Wh-movement	No Wh-Movement	Obligatory Wh-Movement
P8	Preposition Stranding	Pied Piping	Preposition Stranding
P9	Topic Marking	No Topic Marking	Topic Marking
P10	VtoI Movement	No VtoI Movement	Obligatory VtoI Movement
P11	ItoC Movement	No ItoC Movement	Obligatory ItoC Movement
P12	Affix Hopping	No Affix Hopping	Affix Hopping
P13	Q-Inversion (ItoC in questions)	No Question Inversion	Obligatory Question Inversion

These conventional names for the parameters are used for convenient reference, but please bear in mind that the actual linguistic consequences of these parameters are not fully self-evident because they depend to various extents on how they interact with each other and with the ‘universal grammar’ (the UG) of the CoLAG domain.
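
As an illustration, the mapping from a grammar string to the parameter values in the table above can be sketched in Python as follows. This is only a sketch: the names PARAMETERS and decode_grammar are ours and are not part of the CoLAG distribution, and the labels are simply the conventional names from the table, with the first label of each pair taken as the 0 value and the second as the 1 value.

```python
# (name, label for value 0, label for value 1), copied from the table above.
# PARAMETERS and decode_grammar are illustrative names, not part of CoLAG.
PARAMETERS = [
    ("Subject Position", "Subject Initial", "Subject Final"),
    ("Headedness in IP, NegP, VP, PP", "Head Initial", "Head Final"),
    ("Headedness in CP", "Complementizer Initial", "Complementizer Final"),
    ("Optional (versus obligatory) Topic", "Obligatory Topic", "Optional Topic"),
    ("Null Subject", "No Null Subject", "Optional Null Subject"),
    ("Null Topic", "No Null Topic", "Optional Null Topic"),
    ("Wh-movement", "No Wh-Movement", "Obligatory Wh-Movement"),
    ("Preposition Stranding", "Pied Piping", "Preposition Stranding"),
    ("Topic Marking", "No Topic Marking", "Topic Marking"),
    ("VtoI Movement", "No VtoI Movement", "Obligatory VtoI Movement"),
    ("ItoC Movement", "No ItoC Movement", "Obligatory ItoC Movement"),
    ("Affix Hopping", "No Affix Hopping", "Affix Hopping"),
    ("Q-Inversion (ItoC in questions)", "No Question Inversion",
     "Obligatory Question Inversion"),
]


def decode_grammar(gramm):
    """Return the 13 value labels for a grammar string such as '0001001100011'."""
    if len(gramm) != len(PARAMETERS) or set(gramm) - {"0", "1"}:
        raise ValueError(f"expected 13 characters, each 0 or 1: {gramm!r}")
    # P1 is the leftmost character of the string, P13 the rightmost.
    return [PARAMETERS[i][1 + int(bit)] for i, bit in enumerate(gramm)]


if __name__ == "__main__":
    for number, label in enumerate(decode_grammar("0001001100011"), start=1):
        print(f"P{number}\t{label}")
```

The labels returned are for reference only; as noted above, the linguistic consequences of a parameter value depend on its interaction with the other parameters and with UG.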

The Trees

Syntactic tree structures are identical in both Format 1) and Format 2). CoLAG uses bracketed tree notation, which deviates from standard bracketed notation in that parentheses rather than square brackets are used to demarcate constituents. Square brackets are used to demarcate features. Terminals are surrounded by double quotes, though some terminals or features (e.g., [+NULL], [SLASH S]) may not be realized in the surface sentence. For example, one bracketed tree structure (of several) for the sentence S Aux Verb O1 O2 [ILLOC DEC] in CoLAG is:

(CP[ILLOC DEC][+FIN]"S[+TOPIC]"(Cbar[ILLOC DEC][+FIN][SLASH S](C[ILLOC DEC][+FIN]"THAT[+NULL]")(IP[ILLOC DEC][+FIN][SLASH S]"S[+NULL][SLASH S]"(Ibar[ILLOC DEC][+FIN](I[ILLOC DEC][+FIN]"Aux[+FIN]")(VP(Vbar(V"Verb")"O1""O2"))))))
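
For users who wish to process the bracketed structures programmatically, the notation can be read with a small recursive parser. The Python sketch below is ours, not part of the CoLAG distribution; the names parse_tree, terminals and surface_items are illustrative. It assumes only what is described above: parentheses demarcate constituents, a constituent label (with its features) runs up to the first double quote or parenthesis, and terminals are surrounded by double quotes. The surface_items helper additionally assumes that a terminal is unpronounced exactly when it carries [+NULL], and it strips all features from the remaining terminals, which may discard features that are overt in the sentence column.

```python
import re


def parse_tree(text):
    """Parse a CoLAG bracketed structure into nested (label, children) tuples.

    Children are either terminal strings (the text between double quotes)
    or further (label, children) tuples. Malformed input raises an exception.
    """
    text = text.strip()
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] != "(":
            raise ValueError(f"expected '(' at position {pos}")
        pos += 1
        start = pos
        # The label, including features such as [ILLOC DEC], ends at the
        # first quote or parenthesis.
        while text[pos] not in '()"':
            pos += 1
        label = text[start:pos]
        children = []
        while text[pos] != ")":
            if text[pos] == "(":
                children.append(parse_node())
            else:  # a double quote opens a terminal
                end = text.index('"', pos + 1)
                children.append(text[pos + 1:end])
                pos = end + 1
        pos += 1  # consume ')'
        return (label, children)

    tree = parse_node()
    if pos != len(text):
        raise ValueError("unexpected text after the closing parenthesis")
    return tree


def terminals(node):
    """Yield the terminals of a parsed tree from left to right."""
    _label, children = node
    for child in children:
        if isinstance(child, str):
            yield child
        else:
            yield from terminals(child)


def surface_items(node):
    """Return the non-null terminals with their features stripped (see assumptions)."""
    return [re.sub(r"\[[^\]]*\]", "", t) for t in terminals(node) if "[+NULL]" not in t]


EXAMPLE = ('(CP[ILLOC DEC][+FIN]"S[+TOPIC]"(Cbar[ILLOC DEC][+FIN][SLASH S]'
           '(C[ILLOC DEC][+FIN]"THAT[+NULL]")(IP[ILLOC DEC][+FIN][SLASH S]'
           '"S[+NULL][SLASH S]"(Ibar[ILLOC DEC][+FIN](I[ILLOC DEC][+FIN]"Aux[+FIN]")'
           '(VP(Vbar(V"Verb")"O1""O2"))))))')

if __name__ == "__main__":
    print(" ".join(surface_items(parse_tree(EXAMPLE))))
```

Applied to the example structure above, surface_items recovers the items of the sentence S Aux Verb O1 O2.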

Flat file (Format 1)

The large flat file contains one row for each sentence as generated with a particular tree structure by a particular grammar (an ambiguous sentence will appear on more than one row, with different tree structures). Data fields are separated by tabs (i.e., the file is 'tab-delimited'). Lines are terminated in MS Windows style, i.e., carriage return/line feed: ASCII codes 13 and 10 respectively. There are 7 columns. There is no header line, though the column headers/field names we use are shown in parentheses after each column number in the table below. Though it is not a requirement, we would encourage users of the domain to be consistent with us in this regard.

Columns 1-4 contain the linguistic data. Columns 5, 6 and 7 are integer identifications (IDs) for the grammars, sentences and tree structures. These are included for efficiency; most programming languages are faster at comparing/manipulating numbers than strings, and for certain queries the string representation is unneeded (e.g., Which CoLAG languages are subsets of other CoLAG languages?).

Column 1 (gramm):	A principles and parameters grammar consisting of a string of thirteen zeros and/or ones, e.g., 0001001100011.
Column 2 (illoc):	The illocutionary force of the sentence consisting of one of Q, DEC or IMP (question, declarative or imperative).
Column 3 (sent):	The overt tokens and features that make up the sentence, e.g., S Aux Verb O1 O2.
Column 4 (struct):	A bracketed representation of one tree structure for the sentence, given the grammar.
Column 5 (grammID):	An integer ID for the grammar in Column 1. This is simply the decimal value (base 10) of the grammar’s binary (base 2) representation.
Column 6 (sentID):	An arbitrarily assigned integer ID of the combination of the illocutionary force in Column 2 combined with the sentence in Column 3. For example Aux Verb O1 [ILLOC DEC] would have a different ID than Aux Verb O1 [ILLOC Q].
Column 7 (structID):	An arbitrarily assigned integer ID for the structure in Column 4.
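
A minimal Python sketch of beginning-to-end processing of the flat file is given below. It is an illustration under stated assumptions, not a reference implementation: the function names are ours, the path of the flat file must be supplied by the user (its file name is not given on this page), and the file is assumed to be readable as UTF-8 (a superset of ASCII). Quote handling is switched off because the struct column contains literal double quotes. The last two functions show how the integer IDs alone suffice for the subset question mentioned above.

```python
import csv
from collections import defaultdict

# Field names as given in the column descriptions above.
FIELDS = ["gramm", "illoc", "sent", "struct", "grammID", "sentID", "structID"]


def read_flat_file(path):
    """Yield one dict per row of the tab-delimited, header-less flat file."""
    # newline="" lets the csv module deal with the CR/LF line endings itself.
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if not row:  # skip blank lines, e.g. at the end of the file
                continue
            record = dict(zip(FIELDS, row))
            for key in ("grammID", "sentID", "structID"):
                record[key] = int(record[key])
            yield record


def languages_by_grammar(records):
    """Map each grammID to the set of sentIDs its grammar licenses."""
    languages = defaultdict(set)
    for record in records:
        languages[record["grammID"]].add(record["sentID"])
    return languages


def proper_subset_pairs(languages):
    """Return (a, b) pairs of grammIDs where language a is a proper subset of b."""
    return [(a, b) for a in languages for b in languages
            if languages[a] < languages[b]]
```

Since grammID is the decimal value of the binary grammar string, int(record["gramm"], 2) == record["grammID"] should hold for every row; for example, int("0001001100011", 2) is 611.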

Relational database (Format 2)

The description of this format makes reference to the columns and column headers outlined in Format 1) immediately above.

This format manages grammars, sentences and structures in three separate files, together with a fourth file that relates the three data files to one another. Columns are tab-delimited and there are no header lines.

The grammars file, COLAG_2011_gramms.txt, contains two columns: grammID and gramm.
The sentences file, COLAG_2011_sents.txt, contains three columns: sentID, illoc and sent.
The tree structures file, COLAG_2011_structs.txt, contains two columns: structID and struct.

The IDs file, COLAG_2011_IDs.txt, contains three columns: grammID, sentID and structID.

The grammars file, the sentences file and the tree structures file contain the relevant linguistic information with redundancies removed: in each file there is one row for each unique element (grammar, sentence, or tree structure) in CoLAG. The IDs file "ties" the relationships between the grammars, sentences and structures together (i.e., which sentences and corresponding tree structures are licensed by which CoLAG grammars). The IDs file has exactly the same number of rows as the flat file described in Format 1), but without linguistic information. Database "joins" or "views" can be used to extract the linguistic information when required.
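
One way to set this up is sketched below using Python's built-in sqlite3 module. This is an illustration only: the table, index and view names (gramms, sents, structs, ids, flat) and the function names are our own choices, SQLite is just one of many suitable database systems, and the primary keys assume that each ID occurs once in its data file, as the description above implies. The view reproduces the seven columns of Format 1) by joining the three data tables through the IDs table.

```python
import csv
import sqlite3

# Column order in each table follows the column order of the corresponding file.
SCHEMA = """
CREATE TABLE gramms  (grammID INTEGER PRIMARY KEY, gramm TEXT);
CREATE TABLE sents   (sentID INTEGER PRIMARY KEY, illoc TEXT, sent TEXT);
CREATE TABLE structs (structID INTEGER PRIMARY KEY, struct TEXT);
CREATE TABLE ids     (grammID INTEGER, sentID INTEGER, structID INTEGER);
CREATE INDEX ids_by_gramm ON ids (grammID);
CREATE INDEX ids_by_sent  ON ids (sentID);
CREATE VIEW flat AS
    SELECT g.gramm, s.illoc, s.sent, t.struct, i.grammID, i.sentID, i.structID
    FROM ids AS i
    JOIN gramms  AS g ON g.grammID  = i.grammID
    JOIN sents   AS s ON s.sentID   = i.sentID
    JOIN structs AS t ON t.structID = i.structID;
"""

# Table name -> (file name as given above, number of columns).
FILES = {
    "gramms": ("COLAG_2011_gramms.txt", 2),
    "sents": ("COLAG_2011_sents.txt", 3),
    "structs": ("COLAG_2011_structs.txt", 2),
    "ids": ("COLAG_2011_IDs.txt", 3),
}


def create_database(path=":memory:"):
    """Create the four tables, two indices and the joining view."""
    connection = sqlite3.connect(path)
    connection.executescript(SCHEMA)
    return connection


def load_table(connection, table, path):
    """Load one tab-delimited, header-less file into the named table."""
    _file_name, width = FILES[table]
    placeholders = ", ".join("?" * width)
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        rows = (row for row in reader if row)  # skip blank lines
        connection.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})", rows)
    connection.commit()


def sentences_of(connection, gramm_id):
    """Return the distinct (illoc, sent) pairs licensed by one grammar."""
    query = "SELECT DISTINCT illoc, sent FROM flat WHERE grammID = ? ORDER BY sentID"
    return connection.execute(query, (gramm_id,)).fetchall()
```

A typical session would call create_database with a file path, then load_table once for each entry in FILES with the location of the downloaded file, after which the flat view can be queried interactively.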
