Extractor

The extractor is a simple command-line Python program built on the BioInfer basic classes framework. The program can be used to extract different aspects of the corpus annotation in a simple human-readable format. In addition to providing simplified access to the corpus annotation, the extractor serves as an example of how to use the basic classes to transform the corpus annotation into other formats.

Class documentation

Detailed class documentation of the extractor in Epydoc format is available at this address; select "extractor" from the Module Hierarchy.

Usage

The extractor is a command-line program, and is invoked as

python extract.py -b CORPUS OPTS

where CORPUS is the corpus data file and OPTS specify the extraction options. A full list of the extraction options can be viewed by running

python extract.py -h

which prints the extractor help.
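For scripted use, the invocation above can be wrapped in a small helper. The following is a minimal sketch and not part of the extractor: the function names build_command and run_extractor are hypothetical, the script extract.py is assumed to be in the working directory, and the interpreter name "python" is assumed to resolve to a version that can run the extractor.

```python
import subprocess


def build_command(corpus, *options, script="extract.py", interpreter="python"):
    """Return the argument list for one extractor run (assumed layout)."""
    return [interpreter, script, "-b", corpus, *options]


def run_extractor(corpus, *options):
    """Run the extractor and return its standard output as a list of lines."""
    result = subprocess.run(
        build_command(corpus, *options),
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode the output to str
        check=True,           # raise if the extractor exits with an error
    )
    return result.stdout.splitlines()
```

Passing the options as separate strings avoids any shell quoting issues with the corpus file name.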

Example: extracting corpus text

The extractor help specifies, for example, that the "-t" option can be used to output the original, untokenized text of the sentences. Running

python extract.py -b CORPUS -t

results in the following type of output:

[...]
14:TXT: Acanthamoeba profilin was cross-linked to actin via a zero-length isopeptide bond using carbodiimide.
15:TXT: Accordingly, beta-catenin is also found in these structures, again in the absence of alpha-catenin.
16:TXT: A chimera consisting of the alpha-catenin-binding region of beta-catenin linked to the amino terminus of alpha-catenin 57-264 behaves as a monomer in solution, as expected, since beta-catenin binding disrupts the alpha-catenin dimer.
[...]

This general output format is the same for all extraction options. For each sentence, the output contains a single line for each extracted category of data, and each line has the structure

SENTENCE:TYPE:DATA

where SENTENCE is a unique identifying number given to the sentence, TYPE is a short unique textual identifier for the type of output, and DATA is the extracted data corresponding to the specified sentence and data type.
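Because every output line shares this structure, it can be split with a few lines of code. The following is a minimal sketch: the function name parse_line is hypothetical, and the sentence number is assumed to be an integer. The split is limited to the first two colons because DATA may itself contain colons.

```python
def parse_line(line):
    """Split one extractor output line into (sentence number, type, data)."""
    # maxsplit=2 keeps any colons inside DATA intact.
    sentence, out_type, data = line.split(":", 2)
    return int(sentence), out_type, data.strip()


line = "15:TXT: Accordingly, beta-catenin is also found in these structures, again in the absence of alpha-catenin."
sentence, out_type, data = parse_line(line)
print(sentence, out_type)
```

A line that does not contain at least two colons raises a ValueError, which makes malformed input easy to spot.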

Several output options can be specified on a single invocation of the extractor. For example, the "-t" and "-k" options (the latter extracts tokenized text with tokens separated by whitespace) can be combined as

python extract.py -b CORPUS -t -k

and result in output of the following type:

[...]
22:TXT: ActA appears to control at least four functions that collectively lead to actin-based motility: (1) initiation of actin polymerization, (2) polarization of ActA function, (3) transformation of actin polymerization into a motile force and (4) acceleration of movement mediated by the host protein profilin.
22:TOK: ActA appears to control at least four functions that collectively lead to actin-based motility : ( 1 ) initiation of actin polymerization , ( 2 ) polarization of ActA function , ( 3 ) transformation of actin polymerization into a motile force and ( 4 ) acceleration of movement mediated by the host protein profilin .
[...]

Here different output types for one sentence are given on consecutive lines, with the original text lines identified by TXT and the tokenized text lines by TOK.
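When several output types are requested, it is often convenient to collect all lines that belong to one sentence. The following is a minimal sketch under the same assumptions as the output format described above; the function name group_by_sentence is hypothetical.

```python
from collections import defaultdict


def group_by_sentence(lines):
    """Map each sentence number to a {TYPE: DATA} dictionary."""
    grouped = defaultdict(dict)
    for line in lines:
        # Split on the first two colons only; DATA may contain colons.
        sentence, out_type, data = line.split(":", 2)
        grouped[int(sentence)][out_type] = data.strip()
    return dict(grouped)
```

With the "-t" and "-k" options combined, the entry for a sentence would then hold both its TXT and its TOK data.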

Extracting entities

A key extraction option is "-e", which extracts the entities annotated in the text as a space-separated list in the format

TYPE(ID, [CHAR_OFFSETS], 'TEXT')

where TYPE is the type of the entity (as defined in the entity type ontology), ID is the unique identifier of the entity, CHAR_OFFSETS is a comma-separated list of FROM-TO character offsets specifying where in the untokenized sentence text the entity is found, and TEXT is the text of the entity. The offsets are given as a list because an entity may be discontinuous in the input text. Judging from the example below, offsets start from 0 and include both end points: the 21-character text 'Acanthamoeba profilin' spans 0-20.

Example output:

10:ENT: Individual_protein(10.a, [0-20], 'Acanthamoeba profilin') Physical_property(10.b, [45-57,74-78], 'properties of actin') Individual_protein(10.a1, [13-20], 'profilin') Individual_protein(10.b1, [74-78], 'actin')
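The entity list can be read back into structured form with a regular expression. The following is a minimal sketch based only on the example above: the names ENTITY_RE, parse_offsets and parse_entities are hypothetical, and the pattern assumes that the entity text does not itself contain the sequence "')" followed by a space.

```python
import re

# TYPE(ID, [CHAR_OFFSETS], 'TEXT'), followed by a space or the end of the data
ENTITY_RE = re.compile(r"(\w+)\(([^,]+), \[([^\]]*)\], '(.*?)'\)(?= |$)")


def parse_offsets(text):
    """Turn '45-57,74-78' into [(45, 57), (74, 78)]."""
    return [tuple(int(n) for n in span.split("-")) for span in text.split(",")]


def parse_entities(data):
    """Parse the DATA part of an ENT line into a list of dictionaries."""
    return [
        {
            "type": m.group(1),
            "id": m.group(2),
            "offsets": parse_offsets(m.group(3)),
            "text": m.group(4),
        }
        for m in ENTITY_RE.finditer(data)
    ]
```

A discontinuous entity such as 10.b in the example yields two offset pairs, one per fragment.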

Extracting relationships

Another important option is "-r", which extracts the relationships annotated in the sentences as a space-separated list in the format

TYPE([CHAR_OFFSETS], ARGS)

where TYPE is the type of the relationship (as defined in the relationship type ontology), CHAR_OFFSETS is a comma-separated list of FROM-TO character offsets specifying the text binding of the relationship in the untokenized sentence text, and ARGS is a comma-separated list of the arguments of the relationship. The arguments of a relationship can be either entities or other relationships. Entity arguments are represented by their unique identifiers; relationship arguments are written out in full, in the same TYPE([CHAR_OFFSETS], ARGS) format.

Example output:

16:REL: BIND([42-48], 16.f, 16.a) CROSS-LINK([73-81], 16.f, 16.e) DISRUPT([200-207], 16.c, 16.d)
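Since relationship arguments may themselves be relationships, a regular expression alone is not enough to read this format; a small recursive parser handles the nesting. The following is a minimal sketch based on the format description above, not the extractor's own code. All function names are hypothetical, and the input is assumed to be well formed.

```python
import re

DELIMITER_RE = re.compile(r"[(),]")


def parse_offsets(text):
    """Turn '45-57,74-78' into [(45, 57), (74, 78)]."""
    return [tuple(int(n) for n in span.split("-")) for span in text.split(",")]


def parse_relationship(s, pos=0):
    """Parse one TYPE([OFFSETS], ARGS) item at s[pos]; return (node, next_pos)."""
    open_paren = s.index("(", pos)
    rel_type = s[pos:open_paren].strip()
    close_bracket = s.index("]", open_paren)
    offsets = parse_offsets(s[s.index("[", open_paren) + 1:close_bracket])
    pos = close_bracket + 1
    args = []
    while True:
        # Skip the ", " separators between arguments.
        while pos < len(s) and s[pos] in ", ":
            pos += 1
        if pos >= len(s):
            raise ValueError("unterminated relationship: %r" % s)
        if s[pos] == ")":
            return {"type": rel_type, "offsets": offsets, "args": args}, pos + 1
        delimiter = DELIMITER_RE.search(s, pos)
        if delimiter is None:
            raise ValueError("unterminated relationship: %r" % s)
        if delimiter.group() == "(":
            # An opening parenthesis comes first: the argument is a relationship.
            node, pos = parse_relationship(s, pos)
            args.append(node)
        else:
            # Otherwise the argument is an entity identifier such as 16.f.
            args.append(s[pos:delimiter.start()])
            pos = delimiter.start()


def parse_relationships(data):
    """Parse the DATA part of a REL line into a list of nested dictionaries."""
    data = data.strip()
    relationships, pos = [], 0
    while pos < len(data):
        node, pos = parse_relationship(data, pos)
        relationships.append(node)
    return relationships
```

Entity arguments stay as identifier strings, so they can be looked up against the output of the "-e" option for the same sentence.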

Extracting syntactic annotation

Using the "-d", "-p" and "-s" options, the syntactic annotation of the sentences can be extracted as a space-separated list of dependencies in the format FROM-TO[TYPE], where FROM and TO are indices into the tokenized text of the sentence (tokens are indexed sequentially, beginning from 0) and TYPE is the type of the dependency. The syntactic annotation follows the Link Grammar dependency scheme; the dependency types are documented on the Link Grammar Documentation page.

The "-d" option extracts the "basic", unexpanded token-token links, in which NP macro-dependencies are identified by the type "macro". The "-p" option extracts annotation where NP macro-dependencies have been expanded in "parallel", so that each of the noun premodifiers spanned by an NP macro-dependency connects directly to the head word. By contrast, the "-s" option extracts annotation where NP macro-dependencies have been expanded "serially", so that each noun premodifier connects to the next noun premodifier in sequence (with the last connecting to the head).
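The difference between the two expansions can be illustrated with a small sketch. It is not the extractor's implementation: the function names, the representation of a link as a pair of token indices, and the direction of each pair (premodifier first) are all assumptions made for illustration.

```python
def expand_parallel(premodifiers, head):
    """Link every noun premodifier directly to the head word."""
    return [(p, head) for p in premodifiers]


def expand_serial(premodifiers, head):
    """Link each premodifier to the next one, and the last one to the head."""
    chain = list(premodifiers) + [head]
    return list(zip(chain, chain[1:]))


# Hypothetical noun phrase: premodifiers at tokens 3, 4 and 5, head at token 6.
print(expand_parallel([3, 4, 5], 6))  # [(3, 6), (4, 6), (5, 6)]
print(expand_serial([3, 4, 5], 6))    # [(3, 4), (4, 5), (5, 6)]
```

With a single premodifier the two expansions coincide.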

Example output:

13:DEP: 6-7[macro] 3-4[MVp] 3-10[Osn] 9-10[COORD] 17-18[UT] 12-15[Jp] 2-3[Em] 15-16[Mp] 8-10[Dsu] 1-3[Ss] 4-7[Jp] 18-19[A] 0-1[macro] 16-19[Js] 13-15[macro] 10-11[COORD] 10-12[Mp] 5-7[Dsu]
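A dependency list in the FROM-TO[TYPE] format is straightforward to read back into tuples. The following is a minimal sketch based on the example output above; the names DEP_RE and parse_dependencies are hypothetical.

```python
import re

# FROM-TO[TYPE], e.g. 3-10[Osn]
DEP_RE = re.compile(r"(\d+)-(\d+)\[([^\]]+)\]")


def parse_dependencies(data):
    """Parse the DATA part of a DEP line into (from, to, type) tuples."""
    return [(int(a), int(b), dep_type) for a, b, dep_type in DEP_RE.findall(data)]


deps = parse_dependencies("6-7[macro] 3-4[MVp] 3-10[Osn]")
print(sorted(deps))
```

As the example output shows, the dependencies are not listed in token order, so sorting the parsed tuples is a convenient first step when the links are inspected alongside the tokenized text.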

Specifying your own extraction actions

As the Python code of the extractor is built on the BioInfer basic classes framework, many of the extraction functions are short and easy to modify to produce other output formats. For example, the function for printing the tokenized text of a sentence is defined simply as follows:

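def printTokenizedText(sentence):
    # Print the text of each token on one line, separated by spaces.
    # (Python 2 print statement: the trailing comma suppresses the newline.)
    for t in sentence.tokens:
        print t.getText(),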

Modification and adaptation to other purposes is straightforward and should be possible even for users with little prior knowledge of Python. The extensive class documentation can be used to discover the different ways of accessing the basic classes object hierarchy representing the BioInfer corpus annotation.
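As an illustration of such an adaptation, the sketch below builds the tokenized text as a string with a configurable separator instead of printing it. It relies only on what the function above uses, namely the tokens attribute of a sentence and the getText() method of a token. The function name formatTokenizedText and the stand-in classes FakeToken and FakeSentence are hypothetical and exist only so that the sketch runs without the corpus. The sketch is written for Python 3, whereas the extractor code shown above uses the Python 2 print statement.

```python
def formatTokenizedText(sentence, separator="\t"):
    """Return the token texts of a sentence joined by the given separator."""
    return separator.join(t.getText() for t in sentence.tokens)


# Hypothetical stand-ins for the basic classes, for demonstration only.
class FakeToken:
    def __init__(self, text):
        self._text = text

    def getText(self):
        return self._text


class FakeSentence:
    def __init__(self, words):
        self.tokens = [FakeToken(w) for w in words]


sentence = FakeSentence(["profilin", "binds", "actin", "."])
print(formatTokenizedText(sentence, separator=" | "))  # profilin | binds | actin | .
```

Returning a string rather than printing it lets the same function feed a file writer or another formatter.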