ISO standardisation

link this post written on 05/02/2010
Hi Keith,

The link to the latest working copy in your previous post seems to be incomplete?
http://www.xces.org/ns/GrAF/0.99.
does not exist. I found
http://www.xces.org/ns/GrAF/0.99.1
which links to a dtd and a schema, but I cannot relate these resources to your description (no "layer", no "as" at the graph level etc)?

Han
link this post written on 05/02/2010
Hi Han,

The period at the end of the sentence is being included as part of the link... It should be

http://www.xces.org/ns/GrAF/0.99
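
For anyone hitting the same problem when pulling links out of running text, here is a minimal sketch of one way to drop sentence punctuation that gets glued onto a URL. The function name `extract_urls` and the set of characters treated as trailing punctuation are assumptions for illustration; this is not how the forum software itself works.

```python
import re

# Characters that usually end a sentence rather than a URL (an assumption).
TRAILING_PUNCTUATION = ".,;:!?"

def extract_urls(text):
    """Find http(s) URLs in text and strip sentence punctuation stuck to the end."""
    candidates = re.findall(r"https?://\S+", text)
    return [url.rstrip(TRAILING_PUNCTUATION) for url in candidates]

print(extract_urls("The latest working copy is at http://www.xces.org/ns/GrAF/0.99."))
```

A version segment such as the final `.1` in `0.99.1` is untouched, because only punctuation at the very end of the match is stripped.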
link this post written on 05/02/2010
Ah, apologies.
I did look at that location but saw no changes. I must have forgotten to refresh the page(s).

Han
link this post written on 06/02/2010
Keith Suderman:


The biggest change, as Han noted, is that nodeSet and edgeSet are gone; they did not seem to be the most appropriate way to represent layers/tiers of annotations, as they grouped "things that can be annotated" rather than the annotations themselves. The new specification allows for two methods of grouping/layering annotations:

1. annotation set: a set of annotations, possibly with some meta-data (a feature structure) attached.
2. layer: a collection of annotation sets.


Just trying to understand this. From the schema, an annotationSet element contains zero or more feature structures (for metadata) and has a 'layers' attribute which presumably indicates one or more layers that this annotation set is part of. It isn't a container as such.

A layer element again isn't a container but has a 'parents' attribute which presumably refers to the hierarchical organisation of layers. Layers don't have any associated meta-data.

I can't see anywhere that references either of these so I can't see how they then 'contain' anything. When you say annotationSet contains a set of annotations - do you mean nodes or 'a' elements?

I think it might be more useful to discuss the UML diagram than the schema. Can we get the abstraction right before the serialisation?

Keith Suderman:

How the above are used by an application to implement what they call layers or tiers will be up to the application. For example, after looking at the Elan documentation I would likely implement a "tier" as an annotation set:


As Han pointed out, unless we agree on a way to represent tiers with layers and what meta-data names to use, we won't have any interoperability. LAF needs to say something about this, if only to refer to a type registry for meta-data names.

Steve


The author has edited this post (on 06/02/2010)
link this post written on 07/02/2010
I found an online diagram editor (Gliffy) and drew an approximation to the most recent UML model that I've seen. You should be able to access it here:

http://www.gliffy.com/gliffy/#d=1982067&t=GrAF

(you might be able to edit it via that link - if so please do)

Or as an image at http://corpho.mixxt.org/networks/images/image.162009

I've not put AnnotationSet in here as I don't understand what it is a container for. The main question is how Annotations are linked to Nodes. Annotation seems to be just a thin wrapper around a feature structure - why is it needed in that case? It adds only a label, not an id. What is the label in an annotation?

I assume that there's either a 1..1 or 1..* mapping between Nodes and Annotations.

If each Node has exactly one Annotation, what is the case for them being different objects, why not just associate a feature structure with a Node? I assume this is not the case.

If each Node can have many Annotations, what does that mean? What does an Annotation model - is it a single label (eg. a POS tag) applied to a Node or is it a structured collection of all the info describing the Node? Is having many Annotations semantically different to having a single Annotation with many features?

If each Node has many Annotations, they are differentiated only by their label, should that be a type? How do I find the POS Annotation if one exists? (Give me all the POS annotations on this Node).

Once I'm clear on these points I think it might be easier to see what a Layer should contain.

Steve

link this post written on 07/02/2010
Hi Steve, hi everyone,

Thanks for initiating this discussion. I would very much like to contribute but have to confess that I lack some of the very basics. Could I ask some VERY fundamental questions?

1. What is the LAF standard and what is its purpose?
2. Why should we modify it to include the requirements of corpus compilers and users rather than set up a meta model first? Can the concepts of linguists be included? [Right now, I can see very little that seems to describe my concept of a phonological corpus in the UML diagram and a lot whose purpose I cannot imagine (nodes, edges...)]

Let me explain: What I as a corpus compiler and user require is a data model that can describe the following:

- a corpus consists of raw data and annotations
- raw data (audio and video files and the like) contain speech and refer to a timeline
- annotations can either be linguistic information or meta data
[I cannot see any of this represented in the UML diagrams that have been posted so far]
- the linguistic information is contained in tiers or subtiers, which are composed of elements
- tiers have or refer to meta data (who is the speaker, what is being transcribed), are composed of elements and refer to one and only one timeline; they either contain point-in-time elements or time-interval elements or both (this would be necessary for example for the transcription of gestures which have both phases and points in time - the stroke); they can contain subtiers
- subtiers have or refer to meta data and inherit the timeline of their parent tier; subtiers can contain subtiers
- tiers and subtiers contain either elements or a set of subtiers
- elements contain the linguistic transcriptions; they have a start point and (in the case of time-interval elements) a duration and refer to the timeline
- meta data can either refer to individual tiers or to a raw data file as a whole (when was it recorded, by whom..)
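
The requirements listed above can be sketched as a small set of classes. This is only one possible reading of the list, not a proposal for LAF/GrAF: the class and attribute names (`Timeline`, `Tier`, `Element`, `resolved_timeline`, and so on) are invented for illustration, and times are assumed to be seconds.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Timeline:
    media_file: str  # the raw data (audio/video) this timeline belongs to

@dataclass
class Element:
    transcription: str
    start: float
    duration: Optional[float] = None  # None marks a point-in-time element

    @property
    def is_point(self) -> bool:
        return self.duration is None

@dataclass
class Tier:
    name: str
    metadata: dict = field(default_factory=dict)  # e.g. speaker, what is transcribed
    timeline: Optional[Timeline] = None           # set on top-level tiers only
    parent: Optional[Tier] = field(default=None, repr=False, compare=False)
    elements: list = field(default_factory=list)
    subtiers: list = field(default_factory=list)

    def add_element(self, element: Element) -> None:
        if self.subtiers:
            raise ValueError("a tier holds either elements or subtiers, not both")
        self.elements.append(element)

    def add_subtier(self, sub: Tier) -> Tier:
        if self.elements:
            raise ValueError("a tier holds either elements or subtiers, not both")
        sub.parent = self
        self.subtiers.append(sub)
        return sub

    def resolved_timeline(self) -> Timeline:
        """Subtiers inherit the timeline of the nearest ancestor that has one."""
        tier = self
        while tier.timeline is None and tier.parent is not None:
            tier = tier.parent
        if tier.timeline is None:
            raise ValueError("no timeline found for tier " + self.name)
        return tier.timeline

recording = Timeline("interview01.wav")
gestures = Tier("gestures", {"speaker": "A"}, timeline=recording)
phases = gestures.add_subtier(Tier("phases"))
phases.add_element(Element("preparation", start=1.0, duration=0.4))
phases.add_element(Element("stroke", start=1.4))  # a point in time
print(phases.resolved_timeline().media_file)
```

The sketch enforces the "either elements or a set of subtiers" rule at insertion time; whether that rule should be strict is exactly the kind of question the list leaves open.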

I'd be very happy if the other linguists joined this discussion and added to/modified my views! There should, for example, be a concept how to derive time information from one tier to another.

Best wishes,
Ulrike
The author has edited this post (on 07/02/2010)
link this post written on 08/02/2010


>>1. What is the LAF standard and what is its purpose?

LAF is the Linguistic Annotation Framework developed in ISO TC37 SC4, Language Resource Management. Its purpose is to provide a general purpose system for representing linguistic annotations of data.

>>2. Why should we modify it to include the requirements of corpus compilers and users rather than set up a meta model first?

LAF is based on a meta model that was developed several years ago in the ISO group. The model consists of a graph of feature structures.

>> [quote from Ulrike]
Let me explain: What I as a corpus compiler and user require is a data model that can describe the following:

- a corpus consists of raw data and annotations
- raw data (audio and video files and the like) contain speech and refer to a timeline
- annotations can either be linguistic information or meta data
[I cannot see any of this represented in the UML diagrams that have been posted so far]
- the linguistic information is contained in tiers or subtiers, which are composed of elements
- tiers have or refer to meta data (who is the speaker, what is being transcribed), are composed of elements and refer to one and only one timeline; they either contain point-in-time elements or time-interval elements or both (this would be necessary for example for the transcription of gestures which have both phases and points in time - the stroke); they can contain subtiers
- subtiers have or refer to meta data and inherit the timeline of their parent tier; subtiers can contain subtiers
- tiers and subtiers contain either elements or a set of subtiers
- elements contain the linguistic transcriptions; they have a start point and (in the case of time-interval elements) a duration and refer to the timeline
- meta data can either refer to individual tiers or to a raw data file as a whole (when was it recorded, by whom..)
>>> [end of quote from Ulrike]

I think LAF and its XML serialization GrAF address all of this, at least in the latest incarnation. Our main two areas of concern as far as the needs of this group are concerned are as follows:

(1) Currently, we have an object "region" which uses "anchor" objects, and in our view, a region may be defined in terms of one or more anchor objects that specify locations (defined in terms appropriate for the medium) in the data. Steve suggests that we need a separate "instant" object, rather than regarding a region defined with a single anchor as an instant. Is this really necessary, and if so, what is the argument for it?

(2) There is a lot of confusion about layers and tiers and annotation sets. For many types of linguistic annotation, we conceive of annotation layers as potentially containing what might be regarded as several annotation sets--for example, a morphosyntactic layer may contain several different part of speech annotations, each of which could be a separate annotation set; or a discourse layer could contain both discourse relations and coreference annotations, each of which could be a separate annotation set. We are very unclear about what exactly is a "tier" in your work and how it relates to layers and annotation sets, and we would like some clarification so we can do something sensible in GrAF.
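
To make point (1) concrete, here is a minimal sketch of the view that an instant is simply a region with a single anchor, so no separate "instant" object is required. The class names, the `offset` field and the `is_instant` property are assumptions for illustration and are not taken from the GrAF schema.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass(frozen=True)
class Anchor:
    offset: float  # a location in the data, here assumed to be seconds

@dataclass(frozen=True)
class Region:
    anchors: tuple[Anchor, ...]

    def __post_init__(self):
        if not self.anchors:
            raise ValueError("a region needs at least one anchor")

    @property
    def is_instant(self) -> bool:
        # A region defined by exactly one anchor is read as a point in time.
        return len(self.anchors) == 1

stroke = Region((Anchor(1.4),))
phase = Region((Anchor(1.0), Anchor(1.4)))
print(stroke.is_instant, phase.is_instant)
```

The argument for a separate instant type would have to rest on something this representation cannot express, for example a need to distinguish a zero-length interval from a point.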


>>>There should, for example, be a concept how to derive time information from one tier to another.

I am not sure where, or if, LAF/GrAF fits into this kind of requirement, except insofar as GrAF should make it possible to do this. But I admit I am a bit fuzzy on all this!

Best,
Nancy

PS I obviously do not get how to quote notes from others and comment...I will figure it out!
The author has edited this post (on 08/02/2010)
link this post written on 09/02/2010
Just some remarks on the latest schema (hopefully this time I have the right version, at least I can relate this one to Keith's comments: 'as' and 'a' are now children of 'graph'). There seems to be some confusion about annotation sets: there is an 'annotationSet' (in the header) and an 'as' element, documented as annotation set. What are we referring to when we write annotation set? E.g. I can imagine that 'as' elements could be used for a 'tier' (Keith, is that your suggestion?), but I don't see a feature structure element in 'as' (for 'metadata'). But there is a feature structure in 'annotationSet'.

A node is now only a container for a sequence of links, each of which can link to multiple regions. 'as' can contain a sequence of 'a' elements, which are required to refer to either a node or an edge. So, an annotation can be linked to a time interval via a node. I can't see how an annotation can refer to another annotation (in case that might be needed). An 'edge' links two nodes, which now comes down to linking two (complex) regions.
As Steve already indicated, it is unclear how 'as' sets are connected to a 'layer'.
The 'a' element has an 'as' attribute which seems to be superfluous; either the 'a' element is contained in an 'as' element and belongs to that set or it is a direct child of 'graph' and is not part of a set?
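
The chain described above (annotation to node or edge, node to links, link to regions) can be sketched with plain dictionaries. All ids and time values below are invented, and the in-memory layout is an assumption made for illustration; it is not a parser for the actual schema.

```python
# Hypothetical in-memory tables; ids and time values are invented for illustration.
regions = {"r1": (0.0, 0.5), "r2": (0.5, 1.2), "r3": (1.2, 1.8)}  # id -> (start, end) in seconds
nodes = {"n1": [["r1", "r2"]], "n2": [["r3"]]}  # node id -> links; each link lists region ids
edges = {"e1": ("n1", "n2")}                    # edge id -> (from node, to node)
annotations = [
    {"label": "word", "ref": "n1"},     # an annotation on a node
    {"label": "follows", "ref": "e1"},  # an annotation on an edge
]

def regions_for_ref(ref):
    """Return the regions reachable from a node id or an edge id."""
    if ref in edges:
        source, target = edges[ref]
        return regions_for_ref(source) + regions_for_ref(target)
    return [regions[rid] for link in nodes[ref] for rid in link]

for annotation in annotations:
    print(annotation["label"], regions_for_ref(annotation["ref"]))
```

Nothing in this layout lets one annotation refer to another annotation, which matches the gap noted above.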

Han

link this post written on 10/02/2010
Hi Nancy,

Thanks a lot for your reply.

Nancy Ide:


LAF is based on a meta model that was developed several years ago in the ISO group. The model consists of a graph of feature structures.

But this models only linguistic annotations, not entire corpora, right?

Nancy Ide:

I think LAF and its XML serialization GrAF address all of this, at least in the latest incarnation. Our main two areas of concern as far as the needs of this group are concerned are as follows:


Could you please provide a link to this latest incarnation of the meta model?
I am afraid that I find it impossible to comment on your questions below without having seen it.

Nancy Ide:

(1) Currently, we have an object "region" which uses "anchor" objects, and in our view, a region may be defined in terms of one or more anchor objects that specify locations (defined in terms appropriate for the medium) in the data. Steve suggests that we need a separate "instant" object, rather than regarding a region defined with a single anchor as an instant. Is this really necessary, and if so, what is the argument for it?

(2) There is a lot of confusion about layers and tiers and annotation sets. For many types of linguistic annotation, we conceive of annotation layers as potentially containing what might be regarded as several annotation sets--for example, a morphosyntactic layer may contain several different part of speech annotations, each of which could be a separate annotation set; or a discourse layer could contain both discourse relations and coreference annotations, each of which could be a separate annotation set. We are very unclear about what exactly is a "tier" in your work and how it relates to layers and annotation sets, and we would like some clarification so we can do something sensible in GrAF.


May I repeat myself? In my view:

- tiers have or refer to meta data, are composed of elements and refer to one and only one timeline; they either contain point-in-time elements or time-interval elements or both; they can contain subtiers
- subtiers have or refer to meta data and inherit the timeline of their parent tier; subtiers can contain subtiers
- tiers and subtiers contain either elements or a set of subtiers
- elements contain the linguistic transcriptions; they have a start point and (in the case of time-interval elements) a duration and refer to the timeline



Best,
Ulrike
The author has edited this post (on 10/02/2010)
link this post written on 10/02/2010
Hi All,

I have uploaded an image of the GrAF data model to:

http://www.xces.org/ns/GrAF/0.99/GrafDataModel.png

The image that Steve posted is almost exactly right, minus a few relations. Unfortunately I wasn't able to edit the image Steve posted.

Steve Cassidy:

Just trying to understand this. From the schema, an annotationSet element contains zero or more feature structures (for metadata) and has a 'layers' attribute which presumably indicates one or more layers that this annotation set is part of. It isn't a container as such.

The <annotationSet> element is only used in the header to define the annotation sets. Annotation elements can then be grouped inside <as> elements, or they can use the @as attribute to specify the annotation set they belong to. Similarly, the <layer> elements are only used in the header and are used to group annotation sets.
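
As a rough illustration of the two membership mechanisms just described, the sketch below groups <a> elements by annotation set, whether they are nested in an <as> element or carry an @as attribute. The sample document is invented: it omits the header and the GrAF namespace, and the `name` attribute used to identify an <as> element is an assumption, since the thread does not say how <as> elements are named.

```python
import xml.etree.ElementTree as ET

# Invented sample; not a valid GrAF document (no header, no namespace).
SAMPLE = """
<graph>
  <as name="tokens">
    <a label="token" ref="n1"/>
    <a label="token" ref="n2"/>
  </as>
  <a label="pos" ref="n1" as="penn"/>
  <a label="note" ref="n2"/>
</graph>
"""

def group_by_annotation_set(graph):
    """Map annotation set name -> list of (label, ref) pairs."""
    groups = {}
    for child in graph:
        if child.tag == "as":
            # Membership by nesting: every <a> inside this <as> belongs to it.
            for a in child.findall("a"):
                groups.setdefault(child.get("name"), []).append((a.get("label"), a.get("ref")))
        elif child.tag == "a":
            # Membership by attribute; None means the annotation is in no set.
            groups.setdefault(child.get("as"), []).append((child.get("label"), child.get("ref")))
    return groups

groups = group_by_annotation_set(ET.fromstring(SAMPLE.strip()))
for set_name, members in groups.items():
    print(set_name, members)
```

An <a> element that is a direct child of <graph> and has no @as attribute ends up under the key `None`, which is one way to read "not part of a set".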

Steve Cassidy:

As Han pointed out, unless we agree on a way to represent tiers with layers and what meta-data names to use, we won't have any interoperability. LAF needs to say something about this, if only to refer to a type registry for meta-data names.

I assume LAF/GrAF will take a similar approach to types and data category registries (DCR) as Relax NG does (another ISO standard). Relax NG only provides two types, "string" and "token", but provides a way to use other type libraries by specifying the library's URI. I think that is about as far as LAF/GrAF can go as well: specify a default DCR (i.e. http://www.isocat.org) and provide a mechanism for users to use alternatives if needed.

Steve Cassidy:

The main question is how Annotations are linked to Nodes. Annotation seems to be just a thin wrapper around a feature structure - why is it needed in that case? It adds only a label, not an id. What is the label in an annotation?

Yes, annotations are just thin wrappers around a feature structure; think of the label as an XML element name and the feature structure as the element's attributes. The reasons for including the annotation element are:

1. The <fs> element doesn't allow for any attributes other than @type. It is unlikely we could convince the ISO Feature Structure people to add any other attributes we needed. GrAF already "tweaks" the ISO FS standard to allow the f element to contain a @value attribute, and that was a huge fight. Using a simple wrapper around feature structures eliminates that problem.
2. Consistency and ease of processing. When parsing a feature structure we do not have to distinguish between features that are annotations (this is a token) and features that are features of annotations (that token's ID is X).
3. Without the wrapper, having multiple annotations with the same name is difficult to represent.
4. It provides for a simpler representation when an annotation does not have any features.

Annotations are now linked to a node/edge with the @ref attribute. The annotation element doesn't contain an @id because nothing in the XML serialization ever "points" to an annotation element. If an "annotation" has an ID value that would be represented in the feature structure, not the GrAF XML representation.

Steve Cassidy:

If each Node has many Annotations, they are differentiated only by their label, should that be a type? How do I find the POS Annotation if one exists? (Give me all the POS annotations on this Node).

Do you mean replace @label with @type? GrAF tries to limit the use of @type attributes because GrAF itself has no concept of "annotation type"; not all annotation formats have a concept of annotation type. Any type information should come from an external schema, DCR, or type system description.

However, annotations are also distinguished by the annotation set they belong to. So, for example, a node may have multiple "token" annotations with each belonging to a different annotation set.
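
Steve's query ("give me all the POS annotations on this Node") could be answered along the following lines if annotations are selected by label and, optionally, by annotation set. The `Annotation` class, its field names and the sample values are assumptions for illustration; they mirror the @label, @ref and @as attributes discussed above but are not a GrAF API.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    label: str             # plays the role of an element name, e.g. "token"
    ref: str               # id of the node or edge being annotated
    annotation_set: str    # the set this annotation belongs to
    features: dict = field(default_factory=dict)  # the wrapped feature structure

def find_annotations(annotations, ref, label=None, annotation_set=None):
    """Return annotations on `ref`, optionally filtered by label and set."""
    return [
        a for a in annotations
        if a.ref == ref
        and (label is None or a.label == label)
        and (annotation_set is None or a.annotation_set == annotation_set)
    ]

# Invented sample data: two "pos" annotations on n1 from different sets.
sample = [
    Annotation("pos", "n1", "set-a", {"tag": "NN"}),
    Annotation("pos", "n1", "set-b", {"tag": "NOUN"}),
    Annotation("token", "n1", "set-a"),
    Annotation("pos", "n2", "set-a", {"tag": "VB"}),
]

for a in find_annotations(sample, "n1", label="pos"):
    print(a.annotation_set, a.features)
```

Filtering on the label alone returns both "pos" annotations on n1; adding the annotation set narrows the result to one, which is the distinction drawn in the reply above.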


Ulrike Gut:

- annotations can either be linguistic information or meta data

This is another open question (for me at least): how best to represent application-specific meta-data. I believe that most meta-data should be represented in the header, and I should point out that the "header" in the above schema is only for the standoff annotation file itself; each document will also have an XCES/TEI header modified to include the LAF/GrAF-specific meta-data.

However, perhaps there is a need to add another element/attribute to provide a mechanism to attach arbitrary meta-data, say a <md> element which would be a subclass of annotation, or add a @type attribute to the <a> element, where the @type would be one of 'data' or 'meta-data' (default 'data').

The author has edited this post (on 11/02/2010)

Network details

  • Network name

    Corpus Phonology
    Creating, searching, archiving and sharing spoken language corpora for phonological research

  • Your host is

    Ulrike Gut

  • Created on

    02/08/2009

  • Members

    178

  • Language

    English