Text preprocessing for Trend Mining

Preprocessing Texts by Parsing and Chunking

The aim of the preprocessing component is to automatically generate triplets (ternary syntactic relations) from natural-language text, so that RDF triples can then be generated from these triplets. We evaluated several tools and methods for their ability to fulfil this goal. We started by evaluating scientific reports and implementations concerning information retrieval and concept mining from texts. These approaches generally use methods from computational linguistics. Almost all of these tools and papers apply to English, which has a different syntax than German. German is not as well explored as English, although several works are now in progress. We therefore cannot use these approaches directly, only as a guide towards our goal. Currently only proofs of concept exist, though with interesting approaches. They distinguish between statistical and syntactic processes, and between noun-based and verb-based processes.

In statistical processing, the text is mostly classified with the TF-IDF method (Salton) to make an approximate statement about the content and to find the most significant words. After that, the text is processed sentence by sentence, and noun-noun, noun-adjective or noun-verb tuples are extracted. If such a tuple occurs very often, the two words can be assumed to belong together. This method can be improved by including synonyms for the words of the tuples. These methods do not take the syntax of the processed sentences into account, however, which can lead to wrong results. On the other hand, they find many connections between the words (low precision, high recall). The biggest advantage of statistical methods is their language independence. Furthermore, entities can be recognized with regular expressions. Call numbers or salutations usually have the same structure, so you can create an expression to find all such entities in a specific domain. These entities are given a higher weight, and the tuples are extracted from the relations among these words.

Syntactic methods are the other approach. They use techniques from computational linguistics such as POS tagging. Syntax-tree parsers are based on POS taggers, because they need information about the individual words. In the resulting syntax trees you can identify dependencies between the words of a sentence and assign attributes to nouns (adjectives) or verbs (adverbs). Regular expressions created for the particular language (e.g. "X is part of Y"; "X, Y, Z are related to A") are also used to find relations between concepts. The biggest problem of syntactic methods is their dependence on the language to be parsed: a parser for English cannot be used to parse German texts. The parsers often fail to build the whole tree, but the relations they do find are usually accurate. In contrast to purely statistical algorithms, they therefore have high precision but low recall. Syntactic concept extraction is based on a so-called tree tagger.
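
The statistical tuple extraction described above can be sketched roughly as follows. This is a minimal illustration in Python, not the implementation used in the project: the sentence splitter is deliberately naive, and the `vocabulary` argument is an assumed stand-in for the significant words that a TF-IDF or POS-based filter would select.

```python
import re
from collections import Counter
from itertools import combinations


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def cooccurrence_counts(text, vocabulary):
    """Count how often two vocabulary words occur in the same sentence.

    `vocabulary` is a set of lower-case words (an assumption of this sketch).
    """
    counts = Counter()
    for sentence in split_sentences(text):
        tokens = set(re.findall(r"\w+", sentence.lower()))
        present = sorted(tokens & vocabulary)
        # Every unordered pair of vocabulary words in the sentence is one tuple.
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


def frequent_pairs(counts, min_count=2):
    """Keep only tuples that occur at least `min_count` times."""
    return [pair for pair, n in counts.most_common() if n >= min_count]


text = "The bank raised the interest rate. The interest rate fell. The bank closed."
counts = cooccurrence_counts(text, {"bank", "interest", "rate"})
pairs = frequent_pairs(counts, min_count=2)
```

Because the sketch ignores syntax entirely, it will also pair words that merely share a sentence, which is the low-precision, high-recall behaviour mentioned above.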

Our biggest problem during the evaluation was the limited support for German in computational linguistics. We found only two suitable parsers. The first comes from a Stanford University project and is freely available.

This tagger was originally developed for English, but there are projects that port it to other languages such as Chinese or German. These ports do not yet work satisfactorily. The other result was a commercial product from Connexor. This tagger delivers far better results than the Stanford parser and enriches the output with much more information. It determines the tense and the case of a word if it is not in its base form. Connexor provides a web-based form for testing its product, and the results look sophisticated. Because the statistical and syntactic approaches each have strengths and weaknesses, they are combined to give better results than either one alone. You can try to find entities with statistical methods and then create parse trees to find relations among those entities.

It is very hard to test the right order of processing steps or to connect other sources of information such as ontologies or databases. Frameworks such as UIMA and GATE were developed for this purpose. Both provide interfaces for which you can write your own plug-ins. These plug-ins and the built-in methods can be arranged in a GUI to create a processing pipeline.

Architecture of our preprocessing component:

To generate RDF triples from natural-language texts, the texts have to be processed in several steps. For this purpose we are developing the preprocessing component, which uses several of the methods mentioned above.

The texts are in XML format, so they are read with Xerces and the significant tags are saved. We use an SQL database, in our case HSQLDB, for later access to the texts and for other caching purposes. When a current version of our program is released, no database has to be set up, because everything necessary is bundled with the release. The database is also used to store the parse trees and the base forms of the words found, because creating them takes a very long time.

When we read a file, it runs through several steps. First, all abbreviations are replaced by their long forms to avoid errors, e.g. those caused by the trailing period. This component is very rudimentary, because not many abbreviations are stored yet. In a later release, more of them will be stored in the database. After this first step, another component processes the text: it recognizes proper nouns and replaces them with pseudonyms, because proper nouns lead to failures in tree taggers. Once the text has been prepared in this way, the semantic analysis can start.

At this stage we use two different taggers: the Stanford tagger, which is free, and the demo version of Connexor. The Stanford tagger is slow: for a simple 20-word sentence it needs about 30 seconds and about 1 GB of RAM. Even then, it often fails to recognize the sentence structure properly. Because it cannot normalize words, we use the web service provided by Wortschatz (Universität Leipzig) for this. The results of these queries are stored in the database mentioned above. As the other tagger we use the demo version of Connexor. This tagger provides far better results than the Stanford tagger, performs much better and normalizes the words it finds, but it is not free to use. Again, the calculated tree is stored in the database for later access. Once all texts of the corpus have been read, the IDF of each word is calculated for use in a later processing step.
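
The two preparation steps described above (abbreviation expansion and proper-noun replacement) might look roughly like the following Python sketch. The abbreviation table, the `ENTITY<n>` pseudonym scheme and all function names are assumptions made for illustration; the original component's abbreviation list and its proper-noun recognizer are not given here, so the list of proper nouns is simply passed in.

```python
import re

# Hypothetical abbreviation table; the real list is not part of this text.
ABBREVIATIONS = {
    "z.B.": "zum Beispiel",
    "usw.": "und so weiter",
    "ca.": "circa",
}


def expand_abbreviations(text, table=ABBREVIATIONS):
    """Replace each known abbreviation by its long form."""
    # Longest keys first, so a longer abbreviation wins over a shorter one.
    for short in sorted(table, key=len, reverse=True):
        # (?<!\w) keeps us from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(short)
        text = re.sub(pattern, lambda m, long=table[short]: long, text)
    return text


def mask_proper_nouns(text, proper_nouns):
    """Replace proper nouns by pseudonyms and remember the mapping."""
    mapping = {}
    ordered = sorted(proper_nouns, key=len, reverse=True)
    for i, name in enumerate(ordered):
        pseudonym = f"ENTITY{i}"
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        text, replaced = re.subn(pattern, pseudonym, text)
        if replaced:
            mapping[pseudonym] = name
    return text, mapping


def restore_proper_nouns(text, mapping):
    """Undo the masking, e.g. after the parse tree has been built."""
    for pseudonym, name in mapping.items():
        text = text.replace(pseudonym, name)
    return text


raw = "Die Deutsche Bank AG zahlt z.B. ca. 5 Euro."
expanded = expand_abbreviations(raw)
masked, mapping = mask_proper_nouns(expanded, ["Deutsche Bank AG"])
```

Keeping the pseudonym mapping makes it possible to put the original names back into the extracted triplets once the tagger has run.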

To show the results, we use Timeline and Timeplot, both provided by SIMILE. Timeplot is useful because we want to show the documents by date together with the share price of a company. When a specific text is to be displayed, the TF of its terms is calculated and combined with the IDF of the terms to create a TF-IDF ranking of the content. The parse tree belonging to the document is created dynamically using JavaScript.
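
The split between a corpus-wide IDF pass and a per-document TF pass can be illustrated with the following Python sketch. The exact weighting used in the project is not stated, so the common variant (relative term frequency times log(N / document frequency)) is assumed here, together with a simple regular-expression tokenizer.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case word tokenizer (an assumption of this sketch)."""
    return re.findall(r"\w+", text.lower())


def inverse_document_frequencies(corpus):
    """IDF per term, computed once after the whole corpus has been read."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_ranking(document, idf):
    """Rank the terms of one document by TF-IDF, highest score first."""
    tokens = tokenize(document)
    tf = Counter(tokens)
    total = len(tokens)
    scores = {t: (c / total) * idf.get(t, 0.0) for t, c in tf.items()}
    # Ties are broken alphabetically to keep the ranking deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


corpus = [
    "the share price rises",
    "the share price falls",
    "the company reports profit",
]
idf = inverse_document_frequencies(corpus)
ranking = tfidf_ranking(corpus[0], idf)
```

Terms that occur in every document, such as "the" in this toy corpus, get an IDF of zero and therefore end up at the bottom of the ranking.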





  • Iavor Jelev: Preprocessing of documents for Emergent Trend Detection in text Collections, Diploma Thesis, FU Berlin
  • Mike Rohland: Generierung von semantischen Relationen aus Tags innerhalb der Folksonomien, Master Thesis, FU Berlin
  • Ievgeniia Ozeran: Trends und Web: Themen, Zeit und Texte, Master Thesis, FU Berlin
  • Diana Olivera: GUI für wissensbasierte Trendanalysen, Bachelor Thesis, FU Berlin
  • Lars Wißler: Trendontologien für wissensbasierte Trenderkennung - Erweiterung und Test, Bachelor Thesis, FU Berlin
  • Joachim Daiber: Candidate Selection and Evaluation in the Entity Extraction System DBPedia Spotlight, Bachelor Thesis, FU Berlin
  • Olga Streibel: Knowledge Based Trend Mining, PhD Thesis, FU Berlin

1st International Workshop on Mining the Future Internet!

MIFI2010 workshop

Trend Mining corpus (in cooperation with neofonie GmbH)

Texts from our Trend Mining corpus are stored in two databases: finance and mafo. Finance has 21 tables named after the document sources (chats, information boards, etc.); the finance corpus consists of 276,587 documents. Mafo has 27 tables named after the document sources (chats, information boards, etc.); the mafo corpus consists of 74,145 documents.

Timeline with parsed text stream (Mozilla Firefox)

Timeline plotted with stock values (Mozilla Firefox)

Chemisches Zentralblatt corpus (in cooperation with FIZ Chemie GmbH)

We develop approaches for the semantic preprocessing of chemical texts from the 19th century.

Timeline plotted with text stream from Chemisches Zentralblatt

Address extraction using a Hidden Markov Model


Olga Streibel (contact)
Dennis Hartrampf

© 2008 FU Berlin
This work has been partially supported by the  InnoProfile-Corporate Semantic Web project funded by the German Federal Ministry of Education and Research (BMBF) and the BMBF Innovation Initiative for the New German Länder - Entrepreneurial Regions.