Sedna – Native XML Database System
Full-featured native XML DBMS with support for the W3C XQuery language.
XML is a standard for storing and exchanging information on the Web. To make it easier to work with large amounts of XML data, we developed a dedicated DBMS called Sedna.
Sedna is a free native XML database that provides a full range of core database services: persistent storage, ACID transactions, security, indices, and hot backup. Its flexible XML processing facilities include a W3C XQuery implementation, tight integration of XQuery with full-text search, and a node-level update language.
To see Sedna at work, have a look at our WikiXMLDB project.
Texterra – Text Mining Toolkit
Texterra is a toolkit for text mining based on novel text processing methods that exploit semantics extracted from Wikipedia. Texterra delivers a solution for organizing and monitoring collections of documents without the expensive customization that is present in contemporary systems.
We use Wikipedia as a knowledge base to facilitate text mining and semantic search in arbitrary documents (not Wikipedia itself, but news, blogs, etc.). We mine the graph of Wikipedia links to compute the semantic similarity between all Wikipedia terms. The result is a semantic graph of terms with more than 3 million nodes (for comparison, Britannica contains 65,000 terms). We then exploit this graph to interpret the meanings and relationships of terms in text documents. Using the Wikipedia-based semantic similarity, we can build a semantic graph for a text document and analyze it with community detection algorithms to select keywords, infer topics, etc. This lets us go beyond improving existing techniques and provide new functionality, such as thematically grouped keywords and a meaningful hierarchical ontology (tag cloud) that describes a collection of texts. It also lets us significantly improve the search and navigation experience through semantic-aware result ranking (which considers the meaning of terms and the number of related terms in a document) and query-related faceted navigation.
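The pipeline described above (link-based similarity, then a document graph, then grouping of related terms) can be illustrated with a minimal sketch. It is not Texterra's implementation: the toy link table `LINKS`, the Dice coefficient as the similarity measure, the 0.5 threshold, and connected components as a stand-in for community detection are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical toy link graph: term -> set of terms it links to.
# A real system would derive this from the Wikipedia link structure.
LINKS = {
    "XQuery": {"XML", "W3C", "Query language"},
    "XPath": {"XML", "W3C", "Query language"},
    "XML database": {"XML", "Database", "Query language"},
    "Jaguar": {"Cat", "South America"},
    "Puma": {"Cat", "South America", "Mountain"},
}

def dice_similarity(a, b, links=LINKS):
    """Dice coefficient over the link sets of two terms (0.0 to 1.0)."""
    na, nb = links.get(a, set()), links.get(b, set())
    if not na or not nb:
        return 0.0
    return 2 * len(na & nb) / (len(na) + len(nb))

def document_graph(terms, threshold=0.5):
    """Connect each pair of document terms whose similarity reaches the threshold."""
    graph = {t: set() for t in terms}
    for a, b in combinations(terms, 2):
        if dice_similarity(a, b) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def topic_groups(graph):
    """Connected components, used here as a crude stand-in for community detection."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

doc_terms = ["XQuery", "XPath", "XML database", "Jaguar", "Puma"]
groups = topic_groups(document_graph(doc_terms))
```

With this toy data, the XML-related terms and the animal-related terms end up in separate groups, which is the kind of thematic keyword grouping the paragraph describes.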
Blognoon – Content Exploration System
Blognoon (http://blognoon.com) is an innovative content exploration system based on Texterra technology. Blognoon provides the following functionality:
- semantic-powered search
- navigation and exploration
- automatic resource description
The Blognoon data model consists of two concepts: content and sources. Content is a set of information items; sources produce content. For instance, the Blognoon demo uses a number of predefined web logs that we crawl from the web as sources, and the posts in these blogs as content. Other examples of content-source pairs could be songs and singers, movies and studios, books and writers, and so on.
Users can search both content and sources. Results are ranked using semantic similarity, which makes them highly relevant to the user's query. The system also lets users navigate through search results and refine the original query through a facet-enhanced interface.
In addition, the system makes it possible to explore domains using automatic resource description and recommendation tools based on the semantic graphs of documents.
Content and Knowledge Management Framework
Customer: Great Russian Encyclopedia Publishing Company
The framework provides full life-cycle content and knowledge management services that are used to develop advanced information products based on encyclopedias and reference works. Our Sedna XML Database is the core component of the framework. It provides single-source publishing, powerful content reuse, superior search and navigation, and great flexibility in customizing information products.
NLP Tools for Taming Information Explosion
The main goal of the project is to develop new tools for natural language processing (NLP) and to apply them to the analysis of text documents. Current research is devoted to (a) information extraction over tables, lists and other enumerations and (b) inference of attribute and entity hierarchies. This project is supported by the HP Labs Innovation Research Program.
D-test – WSD Tests Creation System based on Wikipedia
The main goal of this system is to create a corpus for word sense disambiguation (WSD) by letting users mark up text through a user-friendly interface. This project uses Texterra to preprocess documents.
This project provides a way of querying Wikipedia with XQuery. We have parsed Wikipedia content into a well-structured XML representation, loaded it into the Sedna XML database and implemented an XQuery Web interface.
The toolset implements functional techniques for processing XML data in the Scheme programming language.
The micro-blogosphere has unique features: it is a source of extremely up-to-date information about what is happening in the world, and it captures the wisdom of millions of people across a broad range of domains, from a US presidential inauguration to an album release by a little-known band.
Twitter is the most popular micro-blogging tool, so we developed a Twitter stream analysis system called TweetSieve.
In the TweetSieve demo, the user expresses a subject of interest as an arbitrary search string. The system shows the periods in which events related to the subject occurred and outputs the tweets that best describe each event. The demonstration illustrates the potential of micro-blogging analysis to bring news acquisition to a new level of promptness and coverage. This system is described in the paper "Sifting Micro-blogging Stream for Events of User Interest".
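The text above does not say how TweetSieve detects events, so the following is only a generic illustration of the idea of finding event periods in a tweet stream, not the method from the paper. It assumes tweets already filtered by the search string, flags days whose tweet count exceeds the mean by `k` standard deviations, and picks the most frequent tweet text as a naive "best description". The data, the function names and the parameter `k` are all hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical (day, text) pairs that already match the user's search string.
TWEETS = (
    [(1, "nothing new"), (2, "rumours about the album")]
    + [(3, "album released today")] * 4
    + [(3, "release party tonight")] * 2
    + [(4, "still listening"), (5, "tour dates soon?"),
       (6, "old interview"), (7, "fan art")]
)

def daily_counts(tweets):
    """Number of matching tweets per day."""
    return Counter(day for day, _ in tweets)

def burst_days(counts, days, k=1.5):
    """Days whose tweet count exceeds mean + k * standard deviation."""
    series = [counts.get(d, 0) for d in days]
    threshold = mean(series) + k * pstdev(series)
    return [d for d in days if counts.get(d, 0) > threshold]

def representative_tweet(tweets, day):
    """Most frequent tweet text on the given day (a naive notion of 'best')."""
    texts = Counter(text for d, text in tweets if d == day)
    return texts.most_common(1)[0][0]

counts = daily_counts(TWEETS)
events = burst_days(counts, days=range(1, 8))
summaries = {day: representative_tweet(TWEETS, day) for day in events}
```

On this toy stream only day 3 stands out, and its summary is the most repeated message of that day.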
BizQuery – XML-based Virtual Data Integration System (2000 – 2003)
BizQuery is a package of servers and tools for developing applications in the presence of heterogeneous data sources. The main component of the package is the BizQuery Integration Server, which supports querying across multiple heterogeneous databases in the XQuery language. The BizQuery Integration Server supports the notion of a global schema defined in XML. A global schema is created to represent a particular application domain, and data sources are mapped as views on the global schema. BizQuery follows the virtual approach: the user poses a query over the global schema, and the data integration system reformulates it into queries over the data sources and executes them. For more information, read the BizQuery overview and visit our publications page.
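The virtual approach can be sketched in a few lines. This is a simplified illustration rather than BizQuery's mechanism: the two in-memory sources, the field-renaming mappings and the equality-only queries are assumptions, whereas BizQuery itself works with XML global schemas and XQuery.

```python
# Hypothetical sources that store the same kind of data under different layouts.
SOURCE_A = [{"name": "Dune", "writer": "Herbert"},
            {"name": "Emma", "writer": "Austen"}]
SOURCE_B = [{"title_txt": "Solaris", "auth": "Lem"}]

# Each mapping relates the global "book" schema to one source:
# global field -> source-local field.
MAPPINGS = [
    (SOURCE_A, {"title": "name", "author": "writer"}),
    (SOURCE_B, {"title": "title_txt", "author": "auth"}),
]

def reformulate(global_query, mapping):
    """Rewrite a {global_field: value} filter into a source-local filter."""
    return {mapping[field]: value for field, value in global_query.items()}

def run_query(global_query):
    """Answer a global-schema query without building an integrated copy of the data."""
    results = []
    for rows, mapping in MAPPINGS:
        local_query = reformulate(global_query, mapping)
        for row in rows:
            if all(row.get(k) == v for k, v in local_query.items()):
                # Translate the matching row back into global-schema terms.
                results.append({g: row[l] for g, l in mapping.items()})
    return results

books_by_lem = run_query({"author": "Lem"})
all_books = run_query({})
```

The caller only ever sees the global field names `title` and `author`; the data stays in the sources and is translated at query time.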
ISP C++ ORB is a free tool for developing distributed software. The ORB acts as a communicator between the components of a distributed application, which may run on different platforms. ISP C++ ORB is compliant with the OMG Common Object Request Broker Architecture 2.0 (CORBA 2.0) standard. It also includes an implementation of the IDL/C++ mapping superstructure developed by our group. This superstructure can be applied to any CORBA 2.0 compliant C++ ORB and makes the IDL/C++ mapping more reliable and more convenient to use. The implementation of ISP C++ ORB does not rely on newer features of the C++ language (such as exception handling and namespaces), so it can be compiled with various C++ compiler versions (checked for g++ 2.7.2, 2.8, egcs). It can be installed on the main Unix platforms and on Windows 95/98/NT under Cygwin (checked for version 20.1). The implementation supports both single-threaded and multi-threaded environments (Cygwin does not support POSIX threads, so only the single-threaded variant is available on Windows platforms). ISP C++ ORB can be installed and used either as a shared or as a non-shared library. You can download ISP ORB here.
GNU SQL Server is a free, portable, multi-user relational database management system. It supports the full SQL89 dialect and has some extensions from SQL92. GNU SQL Server implements highly isolated transactions and both static and dynamic query compilation. Both the client and server sides of the system run on Unix-like systems. Client/server interaction is based on an RPC mechanism. The server's subprocess facility requires message-passing and memory-sharing facilities. More information on the system and downloads can be found at the GNU SQL Server home page.