Enterprises handle increasing amounts of unstructured data: electronic data that are not stored in a predefined structure, such as office documents, e-mail and web content. Such data are frequently kept in repositories whose structure limits efficiency and accessibility. Moreover, the internal structure of files is usually not standardised and may be inefficient in terms of information retrieval and reusability. According to international studies, more than 85% of business data are unstructured.
The advent of web content, and the need to use the web channel
proactively in the market, have further increased the need to manage
unstructured information content efficiently. The volume of information
is increasing rapidly, to the point of becoming unmanageable
(information glut). The growing need to handle business information
efficiently in a highly competitive environment has driven business
efforts to improve ways of storing, retrieving, analyzing and reusing
unstructured data. All such efforts aim to develop a meaningful
structure that can accommodate unstructured data; in other words, to
convert unstructured data to semi-structured data. Semi-structured data
have a higher degree of structure than unstructured data: they do not
use a structure as granular as the fields of a relational database
table, but neither are they stored in a loosely and ineffectively
structured data repository.
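
As a rough illustration of the distinction drawn above, the sketch below holds the same piece of information in three forms: unstructured free text, a semi-structured record (a few descriptive fields wrapped around the free text) and a fully structured relational-style row. The memo text, the field names and the `find_by_customer` helper are hypothetical and chosen only for this example.

```python
import json

# Unstructured: free text with no predefined structure (hypothetical memo).
memo_text = "Meeting with ACME on 2024-03-05. They asked for a quote on product line B."

# Semi-structured: descriptive fields wrapped around the free text.
# The field names are illustrative assumptions, not a standard.
memo_record = {
    "type": "memo",
    "customer": "ACME",
    "date": "2024-03-05",
    "product_line": "B",
    "body": memo_text,
}

# Structured: a fixed, highly granular row such as a relational table would hold.
# Column order: (customer, request_date, product_line)
quote_request_row = ("ACME", "2024-03-05", "B")


def find_by_customer(records, customer):
    """Return the records whose 'customer' field equals the given name."""
    return [r for r in records if r.get("customer") == customer]


records = [
    memo_record,
    {"type": "memo", "customer": "Initech", "body": "General enquiry."},
]
matches = find_by_customer(records, "ACME")
print(json.dumps(matches, indent=2))
```

The free text alone can only be searched as a string, whereas the semi-structured record can be filtered by field while the original wording stays in `body`. Note that the second record omits some fields, which a relational row could not do without a schema change.
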
Techniques and technologies used to handle structured data (DBMS, SQL)
were incompatible with those used to handle unstructured data (file
servers, content management systems, collaboration tools). The term
business intelligence stems from the structured world, while the terms
knowledge management and content management stem from the unstructured
world. The combined retrieval and analysis of information (e.g. for a
customer) from both structured and unstructured data has traditionally
been carried out manually. However, the term business intelligence no
longer refers exclusively to the structured data world: structured and
unstructured data technologies are currently converging. The
introduction of a central data repository can mitigate the negative
effect of information silos. This applies to both structured and
unstructured data assets.
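
The combined retrieval described above, traditionally done by hand, can be sketched in a few lines. The example below is a minimal illustration and not a reference architecture: it assumes an in-memory SQLite table named `orders` for the structured side and a small list of text documents for the unstructured side. The `customer_view` function and all sample data are hypothetical.

```python
import sqlite3

# Structured side: a relational table queried with SQL (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("ACME", "Widget", 120.0),
        ("ACME", "Gadget", 80.0),
        ("Initech", "Widget", 45.0),
    ],
)

# Unstructured side: free-text documents such as e-mails or memos (hypothetical).
documents = [
    {"name": "mail-001.txt", "text": "ACME complained about a late Widget delivery."},
    {"name": "memo-007.txt", "text": "Initech asked for a product demo."},
]


def customer_view(conn, documents, customer):
    """Gather the order rows and the documents that mention the customer."""
    orders = conn.execute(
        "SELECT product, amount FROM orders WHERE customer = ? ORDER BY product",
        (customer,),
    ).fetchall()
    # A naive case-insensitive substring match stands in for real text search.
    mentions = [d["name"] for d in documents if customer.lower() in d["text"].lower()]
    return {"customer": customer, "orders": orders, "documents": mentions}


view = customer_view(conn, documents, "ACME")
print(view)
```

Keeping both lookups behind one function is the point of the sketch: the caller asks about a customer once and does not need to know which store each piece of information came from.
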
In order to develop a structure for handling unstructured data, an
information model needs to be developed. This model has to accommodate
the needs of different user groups (customers, information users,
content authors) while being structured meaningfully, e.g. per product
line or per business process. The use of DTDs (Document Type
Definitions) or XML schemas to structure content internally by
introducing semantic tags can enhance the capability to retrieve and
reuse information hidden in documents. The use of sitemaps, meta tags
and RSS feeds has been expanding on the Web to describe the content of
sites, especially content that is frequently updated (e.g. news
content). RSS allows site syndication, an approach to sharing content on
the web, thus increasing its accessibility and diffusion.
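
To make the role of semantic tags and RSS more concrete, the sketch below reads two small XML documents with Python's standard `xml.etree.ElementTree` module. The tag vocabulary of the manual (`manual`, `procedure`, `warning`, `step`), the feed contents and the helper names `warnings_in` and `feed_items` are assumptions made for illustration. Validation against a DTD or XML schema is not shown, since the standard-library parser does not validate.

```python
import xml.etree.ElementTree as ET

# A document whose content is marked up with semantic tags (hypothetical vocabulary).
MANUAL_XML = """
<manual product-line="B">
  <procedure id="p1">
    <title>Replace the filter</title>
    <warning>Disconnect power first.</warning>
    <step>Open the cover.</step>
    <step>Swap the filter.</step>
  </procedure>
  <procedure id="p2">
    <title>Clean the housing</title>
    <step>Wipe with a dry cloth.</step>
  </procedure>
</manual>
"""

# A minimal RSS 2.0 feed (hypothetical site and items).
FEED_XML = """
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://example.com/</link>
    <description>Latest updates</description>
    <item>
      <title>Product line B released</title>
      <link>https://example.com/news/1</link>
    </item>
    <item>
      <title>Maintenance window announced</title>
      <link>https://example.com/news/2</link>
    </item>
  </channel>
</rss>
"""


def warnings_in(manual_xml):
    """Return (procedure title, warning text) pairs located via the semantic tags."""
    root = ET.fromstring(manual_xml.strip())
    pairs = []
    for procedure in root.iter("procedure"):
        title = procedure.findtext("title")
        for warning in procedure.findall("warning"):
            pairs.append((title, warning.text))
    return pairs


def feed_items(feed_xml):
    """Return (title, link) pairs for every item in an RSS 2.0 channel."""
    root = ET.fromstring(feed_xml.strip())
    return [(item.findtext("title"), item.findtext("link")) for item in root.iter("item")]


print(warnings_in(MANUAL_XML))
print(feed_items(FEED_XML))
```

Because the warning is tagged as a warning rather than formatted as bold text, it can be pulled out and reused elsewhere (for example in a safety summary) without parsing prose. The feed works the same way: any consumer that knows the RSS item structure can list the site's latest content.
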