Wednesday, March 10, 2010

Updated Profile: Calais

[Originally published 12/10/09]
The Calais Initiative

Company: Thomson Reuters, Inc.
URL: http://www.opencalais.com/
HQ: New York, USA
Products (Primary): The Calais Initiative
Survey Respondents: Tom Tague, Krista Thomas
Vendor Category: NLP

Employees: 50,000
Revenue: US$13.94 Billion
Calais installed base: 7,000 developers, 2,000,000 pieces of content processed per day

Primary Offering:

The Calais Initiative (Calais) comprises several tools for processing text, but the core product is a Natural Language Processing (NLP) engine. When presented with a body of text, the Calais Web service returns the “named entities” it discovers within the document (the key terms and concepts, each assigned to a category such as person, company, or place), along with facts and events. The relationships between these items are also identified and embedded in the results. Essentially, the results are the semantic metadata of the document and can be thought of as the document’s “knowledge content,” which can be published and made available for searching and navigation.
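To make this more concrete, here is a minimal Python sketch of what submitting text to a service like Calais might look like. The endpoint URL, header names, and response shape are illustrative assumptions rather than the documented interface; consult opencalais.com before relying on any of them.

    # Minimal sketch of submitting raw text to an entity-extraction service.
    # The endpoint URL and header names below are assumptions for illustration.
    import json
    import urllib.request

    API_URL = "https://api.opencalais.com/tag/rs/enrich"   # assumed endpoint
    API_KEY = "YOUR_LICENSE_KEY"                            # hypothetical key

    def extract_semantic_metadata(text: str) -> dict:
        """Send raw text to the service and return its semantic metadata."""
        request = urllib.request.Request(
            API_URL,
            data=text.encode("utf-8"),
            headers={
                "x-calais-licenseID": API_KEY,   # assumed header name
                "Content-Type": "text/raw",      # assumed content type
                "Accept": "application/json",
            },
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)

    # Assumed response shape: a dictionary of items, each describing an
    # entity, fact, or event plus the relationships between them.
    metadata = extract_semantic_metadata(
        "Thomson Reuters launched a new release of Calais in New York."
    )
    for item_id, item in metadata.items():
        print(item_id, item.get("_type"), item.get("name"))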

On its own, and applied to one or two small, short documents, this might not seem terribly valuable. But deployed on the Web and made available as a free service, Calais is in a position to process massive amounts of data (text, quantitative, graphic, etc.) and extract their knowledge content. Once this task is complete, this content can be searched individually or combined with other similar content and searched in a larger context. This larger context can be based on other Web content, proprietary Thomson Reuters content, a combination of the two or the context of select data sources that may address a specific area of interest.

Ultimately, Calais’s goal is to be the world’s best tool for extracting the structure of any kind of content: recognizing its type, the concepts it contains, and their relationships, and doing so not just within a single file but across a span of files that could be as large as the Web itself.

Key Differentiators:

Demand from large organizations, including well established publishers, has grown at an unexpectedly high rate. This has led Thomson Reuters to introduce three contract-based versions of Calais in addition to the original free service:
  • Calais Professional - the same as the free service, but backed by an SLA and with higher transaction limits.
  • Calais Professional for Publishers - Calais Professional tailored to meet the needs of large-scale publishers and tied to an annual contract.
  • ClearForest On-Premise Solutions - ClearForest is the original name of the technology that makes Calais work. Now that it's available as a standalone application, enterprises can closely tailor the service to their needs, ensure the privacy of their proprietary content, and gain access to what's under the hood for even further customization.
Thomson Reuters itself is another key differentiator: the fact that Calais is sponsored by a global information giant suggests that this entrant will be with us for a long time. Furthermore, at this time Calais is in the final stages of testing its “infinite scalability” initiative (e.g., cloud computing), designed to address growth in demand and/or spikes in utilization.

Another distinguishing characteristic is the rate at which the service has been adopted (the fact that it’s free is worth repeating). The net effect has been to discard the original projections for usage because demand has so vastly exceeded expectations. Note that until very recently, demand for Calais has existed almost entirely outside of any Thomson Reuters media property. This state of affairs is changing rapidly, with internal inquiries arriving with greater frequency.

Deploying Calais against the vast, professionally developed and controlled content in the Thomson Reuters empire would be a remarkable step in the company’s evolution. After 150 years as a traditional news wire service and publisher, Thomson Reuters’ content could quickly become something not yet fully defined, but possibly far more powerful and useful than what traditional publishers have offered before.

Six/Twelve Month Plans:

In January ’09, Calais is scheduled to launch Release 4, which will open the door to the world of “Linked Data,” a critical step toward fulfilling the promise of the Semantic Web. Essentially, URIs (Uniform Resource Identifiers) allow for the linking of individual data elements, a concept that goes much further than linking containers like files, pages, documents, or databases as we’re accustomed to on the WWW. The Semantic Web term for each pointer that leads to a datum is “dereferenceable URI”.

Wikipedia does a nice job of explaining references and dereferencing using house addresses and houses. In this case, a house address is the reference, or pointer. Using this pointer to find the actual house is the same as dereferencing the address.

In Calais’s case, after extracting the entities (e.g., people, places, companies, etc.) from your content you could then link to (or retrieve for processing by an application) relevant data on DBpedia, The CIA World Fact Book, Freebase, or a rapidly growing number of other compatible data sources. If you’re a talented content producer, the additional leverage that comes from linking to these “external” data could make your offering substantially more useful and in turn, much more valuable.
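As a rough illustration (assuming DBpedia’s conventional resource URIs and its JSON data view, which may differ in practice), the sketch below builds a pointer for an extracted entity and then dereferences it:

    # Sketch: build the conventional DBpedia URI for an extracted entity and
    # dereference it. The /data/<name>.json view is an assumption about how
    # DBpedia exposes its dereferenceable URIs.
    import json
    import urllib.parse
    import urllib.request

    def dbpedia_uri(label: str) -> str:
        """Turn an entity label into a DBpedia resource URI (the pointer)."""
        return "http://dbpedia.org/resource/" + urllib.parse.quote(label.replace(" ", "_"))

    def dereference(resource_uri: str) -> dict:
        """Follow the pointer: fetch the structured data behind the URI."""
        name = resource_uri.rsplit("/", 1)[-1]
        with urllib.request.urlopen(f"http://dbpedia.org/data/{name}.json") as resp:
            return json.load(resp)

    entity = "Thomson Reuters"          # e.g., an entity Calais extracted
    uri = dbpedia_uri(entity)           # the reference (like a house address)
    data = dereference(uri)             # dereferencing (finding the house)
    print(uri, "describes", len(data), "linked resources")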

Let’s build on the example above, where the entities in an original document have been linked to data residing on DBpedia and The CIA World Fact Book. The idea is that the entities extracted from each source can be linked manually, through search results, or as a result of processing by an application. Simply knowing that these entities have an association can be valuable, but the key is that the URI provides a pointer to the specific data – not the file, not the document, and not the database, but to the actual datum, value, or record that’s stored in one of these containers. There’s no longer a need to call an entire file or database, read it to find what you’re looking for and then put it to use. Instead, you call just what you need – the specific data that matter to you.  

This process is faster (read: cheaper in computer processing terms) and those URIs you’ve amassed can be reused by other people and applications because these pointers are durable and they persist – if the data remain in place, then each datum will keep the same individual URI (again: cheaper, highly reliable, and standardized to ensure universal access and use). It’s simply easier to exchange pointers to specific data (dereferenceable URIs) than it is to exchange potentially huge data files or documents.
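Under the same assumptions about DBpedia’s JSON view, “calling just what you need” might look like the following sketch, which reads a single (assumed) property for a single resource rather than pulling back a whole file or database:

    # Fetch exactly one datum instead of an entire container. The property
    # URI below is an assumption made for illustration.
    import json
    import urllib.request

    POPULATION = "http://dbpedia.org/ontology/populationTotal"   # assumed property

    def fetch_one_value(resource_uri: str, property_uri: str):
        """Dereference a single URI and return one property value, if present."""
        name = resource_uri.rsplit("/", 1)[-1]
        with urllib.request.urlopen(f"http://dbpedia.org/data/{name}.json") as resp:
            data = json.load(resp)
        values = data.get(resource_uri, {}).get(property_uri, [])
        return values[0]["value"] if values else None

    # The URI is durable: anyone holding this pointer can reuse it later.
    print(fetch_one_value("http://dbpedia.org/resource/Slovenia", POPULATION))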

Once documents and information assets are connected to the Linked Data cloud, deep connections can be made between the entities, facts, and events therein. This can, for instance, enable the resolution of complex queries such as: “Which company boards of directors include CEOs that have been involved in the sub-prime mortgage meltdown?”
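For illustration only, a query along those lines could be expressed in SPARQL and submitted to a public endpoint, as sketched below. The ex: vocabulary is entirely hypothetical and the endpoint and parameter names are assumptions, so this shows the shape of such a query rather than a working recipe.

    # Purely illustrative SPARQL query for the question above, posed against
    # an assumed public endpoint with a hypothetical ex: vocabulary.
    import json
    import urllib.parse
    import urllib.request

    ENDPOINT = "http://dbpedia.org/sparql"   # assumed public SPARQL endpoint

    QUERY = """
    PREFIX ex: <http://example.org/ontology/>
    SELECT ?company ?ceo WHERE {
      ?company ex:boardMember ?ceo .
      ?ceo ex:isCEOOf ?otherCompany ;
           ex:involvedIn ex:SubPrimeMortgageMeltdown .
    }
    """

    params = urllib.parse.urlencode(
        {"query": QUERY, "format": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(f"{ENDPOINT}?{params}") as resp:
        results = json.load(resp)

    for row in results["results"]["bindings"]:
        print(row["company"]["value"], "includes board member", row["ceo"]["value"])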

[Linked Data cloud diagram omitted. The diagram (not the datasets) is CC-BY-SA licensed; comments to Richard Cyganiak at richard@cyganiak.de.]

Analysis:

Let’s start with the premise that Thomson Reuters has 150 years of experience creating, managing, and presenting content that people want. Over this period, the company has amassed a body of high quality content that’s possibly the largest in the world. This content will continue to grow, but the advent of the Web has unleashed a torrent of content on a genuinely planetary scale. Since this content is outside Thomson Reuters’ editorial and production controls, the company considers it to be “wild” content. This doesn’t mean it’s bad; some of it is exceedingly good.

Based on the environmental factors below, Calais puts Thomson Reuters in a position to extend its core competencies to include content it controls as well as wild content because:
  • The fundamental nature of publishing and using content is changing.
  • “World Wild Content” will dwarf the content Thomson Reuters controls.
  • Professionally produced content will continue to merit a premium.
  • The Open Access movement and similar efforts by academics, researchers, and other content authors seeking to retain control of their work will continue and grow.
  • Thomson Reuters has extensive experience in every aspect of the content industry.
  • Flexible integration/interoperation of different types of content may provide powerful added value.
Calais is a free service that stands to significantly benefit people and organizations around the world. The terms of use vary in whether they grant Calais the right to utilize the content’s metadata, but unless you’re a major publisher, this won’t be much of an issue. What matters, at least to Thomson Reuters, is that Calais is a very concrete step toward organizing and integrating the vast span of wild content with its own high quality content. Offering customers your own content, combined with the very best of free, Web-based content, in an easily searched, highly flexible, and exceptionally expansive product is a strong competitive advantage that may ensure another 150 years of operation. This is the strategic thrust of The Calais Initiative.

Thursday, January 7, 2010

Company Profile: IYOUIT

Company: IYOUIT
URL: http://www.iyouit.eu/portal/
HQ: Munich, Germany
Products (Primary): IYOUIT
Survey Respondents: Matthias Wagner
Vendor Category: R&D Project

Employees: --
Revenue: --
Installed base: --

Primary Offering:

The only reason IYOUIT isn’t a runaway global success is that it’s still a research project supported by NTT DOCOMO and the Telematica Instituut.

IYOUIT is a very deliberate effort to explore the use of Semantic Web (SW) technology in mobile environments. It integrates a wide range of services such as GPS location, location-based points of interest, picture sharing, local weather, messaging, and more. Some of these data are user-generated, while others are generated automatically, and the application goes even further by connecting to services like Flickr and Twitter. Furthermore, the rich mobile experience is complemented by a Web site (https://www.iyouit.eu/portal/) that displays real-time updates from IYOUIT users around the world.

The IYOUIT client is made to run on mobile phones that use the S60 operating system, which means just about any high-end phone made by Nokia, LG, or Samsung, along with a few models made by Lenovo and Panasonic. The client is lightweight, and its interaction with the network has been tuned to minimize the amount of data passed back and forth; this decision was made deliberately to reduce the impact on subscription plans that charge based on device throughput. Processing demands at the device level have been calibrated to reduce overhead, while reasoning, ontology management, and other processing-intensive functions occur on the network.
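The division of labor described above can be sketched generically. This is not IYOUIT’s actual protocol; the endpoint and message format are hypothetical. The idea is simply that the handset batches a few small context readings and posts compact JSON, leaving reasoning and ontology management to the network.

    # Generic sketch of the thin-client pattern (not IYOUIT's actual protocol).
    import json
    import urllib.request

    SERVER = "https://example.org/context"   # hypothetical context service

    def push_context(user_id: str, readings: list) -> None:
        """Send a small batch of context readings in a single request."""
        payload = json.dumps({"user": user_id, "readings": readings}).encode("utf-8")
        request = urllib.request.Request(
            SERVER, data=payload, headers={"Content-Type": "application/json"}
        )
        # A real client would also read back the server's derived context here.
        urllib.request.urlopen(request)

    push_context("alice", [
        {"type": "gps", "lat": 48.137, "lon": 11.575},   # Munich
        {"type": "weather", "celsius": 4},
    ])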

Launched in June 2008, IYOUIT still has a small user base as these things go: as of December 2008, the project had roughly 1,000 users distributed across 50 countries, with most concentrated in Europe.

Key Differentiators:

When you’re one of a kind, it’s difficult to contrast yourself with existing products, but some fundamental (and remarkable) qualities stand out: this application works, it’s available for download right now, it genuinely uses SW technology, and it’s made for mobile devices. IYOUIT seamlessly combines the mobile experience with context-based enhancements delivered by the network, and users can even set “triggers” to be alerted when specified conditions are met; for example, while you’re at your favorite coffee shop you can be alerted when one of your IYOUIT buddies arrives.

Six/Twelve Month Plans:

As a research project, IYOUIT serves as a learning environment and isn’t necessarily tied to commercial delivery schedules. Nonetheless, the team behind IYOUIT certainly has plans and one that could be discussed is the creation of a developer connection. If this effort succeeds, it’s easy to imagine the creation of more applications and in turn, growth in the user base. In fact, the IYOUIT team is counting on open participation and they’re looking forward to new discoveries.

Analysis:

IYOUIT is much more than an intriguing mobile SW application, so let’s broaden our context (fitting, isn’t it?).

  • While the application is presently geared for relatively high-end phones, all those phones use the S60 operating system originally created by Nokia.
  • Presently, Nokia holds about 40% of the global mobile device market and even if this figure is adjusted to reflect just the higher end of Nokia’s product line, that’s still a lot of phones.
  • Samsung, LG, and others combine to increase the potential user base even further.
  • Nokia has a history of SW research and development that dates back to roughly 1996 and equally, the company has a long history of participating in the open source community.
  • Nokia’s recent SW research seems to focus on the creation of application development tools (http://research.nokia.com/research/projects/), which would play into the promise of IYOUIT very nicely.
  • Nokia’s stated corporate strategy is based on its device business, mobile content, and network infrastructure. Offerings like IYOUIT could be a big win for NTT DOCOMO, Nokia, and just about anyone else who can get involved.
  • NTT DOCOMO is based in Japan and while it’s a cliché at this point, the Asian countries are probably still well ahead of the rest of the world when it comes to developing, deploying, and using mobile technology.

Put these factors together and IYOUIT begins to look like the tip of an iceberg – one that will mean big wins for NTT DOCOMO and other global companies and likely, big wins for innovative startups that create valuable products and services for an environment that’s increasingly ready-made to receive them. Wow!

Monday, January 12, 2009

Company Profile: Nstein

Company: Nstein
URL: http://www.nstein.com/
HQ: Montreal, Canada
Products (Primary): Web Content Management, Text Mining Engine, Digital Asset Management, Picture Management Desk
Survey Respondents: David Crouy, Christopher Hill
Vendor Category: Vendor

Employees: 200
Revenue: $24M
Installed base: 115+

Primary Offering:

For the past two years Nstein has been on a mission to integrate its Text Mining Engine (TME) with the Content Management System (CMS) it gained in its acquisition of Eurocortex. After honing TME for several years prior to acquiring Eurocortex, Nstein discovered that many customers either didn’t know what to do with the resulting metadata or had no way to use the metadata in existing CMS products. By combining Natural Language Processing (NLP) and CMS, Nstein believed it could pursue significant business opportunities, and the company’s client list certainly supports this intuition.

TME is a full-blown NLP product capable of extracting and categorizing the metadata contained within documents, specifically the people or organizations, places, and events that are mentioned in these files. Add-on modules are available that provide document summaries, sentiment detection, similar-document search, and topic clustering. The net effect is that customers can process their documents, categorize and provide search capability within the results, and support their Search Engine Optimization (SEO) efforts.
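To give a sense of what this looks like downstream, here is a purely hypothetical illustration (not Nstein’s actual output format) of the kind of metadata such an engine might attach to a single document:

    # Hypothetical illustration of per-document metadata from an NLP engine.
    article_metadata = {
        "entities": {
            "people":        ["John Doe"],
            "organizations": ["Nstein", "Eurocortex"],
            "places":        ["Montreal"],
            "events":        ["acquisition"],
        },
        "summary":   "Nstein integrates its Text Mining Engine with the CMS "
                     "it acquired from Eurocortex.",
        "sentiment": "positive",                        # sentiment add-on module
        "similar_documents": ["doc-0042", "doc-0117"],  # similarity add-on
        "topics":    ["content management", "text mining"],
    }

    # Downstream, a CMS can index these fields for search, navigation, and SEO.
    for category, names in article_metadata["entities"].items():
        print(category, "->", ", ".join(names))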

For publishers, using metadata to create links to relevant content (their own or third party) and a search-engine-friendly Web site is important in capturing incremental revenue. In fact, Nstein actively positions itself as a company that seeks to create additional revenue opportunities for its customers; more on this in a moment. It’s not unusual for publishers to rely on human editors to create categories and relevant links and to manage the posting of these links to a title’s Web site. Obviously, paying people to perform this task can become expensive at a time when the publishing industry is already experiencing a high degree of turmoil.

Solutions like Nstein’s can help to reduce the expense of human tagging or even introduce tagging where there’s been no human available to perform this task. Ultimately, though, costs can only be reduced to zero, and businesses rely on revenues and profits for success. Nstein is cognizant of this fact and tries to point its customers in the right direction. For example, once a publisher’s content has been tagged, it can be tailored to produce a feed based on a person, place, or thing; for some publishers this can represent a new and very welcome revenue stream. Another example is a common application of NLP technology: publishing links to additional content related to the primary article on a given page. Again, some publishers will find the resulting performance an improvement over their current state of affairs.
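The feed idea is simple to picture: once articles carry entity tags, the archive can be filtered into a feed for any person, place, or thing. A tiny sketch, using made-up article records:

    # Sketch of an entity-based feed over tagged articles (records are made up).
    tagged_articles = [
        {"title": "Markets rally",          "entities": ["New York", "Dow Jones"]},
        {"title": "Local election results", "entities": ["Montreal"]},
        {"title": "Tech firm expands",      "entities": ["Montreal", "Nstein"]},
    ]

    def entity_feed(articles, entity):
        """Return the titles of all articles tagged with the given entity."""
        return [a["title"] for a in articles if entity in a["entities"]]

    print(entity_feed(tagged_articles, "Montreal"))
    # ['Local election results', 'Tech firm expands']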

Key Differentiators:

Aside from being the only CMS platform (or at least one of the very few) to have integrated NLP, and aside from the company’s active focus on creating revenue opportunities for its customers, Nstein’s products include a “Picture Management Desk” designed to manage, tag, and categorize high volumes of images. In fact, a series of virtual desks can be set up to process images depending on the inbound channel.

Six/Twelve Month Plans:

For the moment, Nstein will continue its focus on improving its text mining and associated performance. The company has additional goals to extract even more information where possible, and it plans to begin exploring the use of “analytical” queries such as “Who was involved in car crashes over the last six months?” or “How many times was John Doe mentioned in articles related to crime?”

Analysis:

Nstein clearly has a head start on most, if not all other vendors in the Content Management System game. Its combination of CMS and NLP is a natural evolution in the management and delivery of Web content and any customer seeking a CMS solution generally would do well to take a close look at Nstein’s products. Far from being an early stage company, Nstein has proven itself during the lull after the dotcom bust, which makes a very positive statement about the fundamental value the company offers. Add to this track record the combination of CMS, NLP, Picture Management, and Digital Asset Management, and a broad range of possibilities become quite clear – both for Nstein and its customers.

Monday, January 5, 2009

Company Profile: Zemanta Ltd.

URL: http://www.zemanta.com/
HQ: Ljubljana, Slovenia
Products (Primary): Zemanta Web Service
Survey Respondents: Andraž Tori
Vendor Category: Deployer

Employees: --
Revenue: --
Installed base: --

Primary Offering:

Zemanta has leapt onto the Semantic Web stage by launching its NLP-based service for bloggers and other content producers. The net effect is that for users, it’s like having a co-writer constantly suggesting related articles, links, and images, and then wrapping up with a set of recommended tags designed to increase search engine discovery. The steady stream of suggestions provides an abundant stock of references that can be used to enhance an article, blog post, etc. It’s easy to imagine how anyone would find these helpful, and equally, it’s easy to imagine a productivity boost as well.

Zemanta presently comes in several different flavors: add-ons for Firefox and Internet Explorer; a plug-in for Windows Live Writer (a Microsoft desktop application designed to publish content directly to many popular blogging services); a server-side plug-in for those hosting WordPress, Movable Type, or Drupal implementations; and finally, as of December ’08, fee-based access to its API (http://www.zemanta.com/api/) for automatic in-text linking, categorization, related news, related images, tagging, and linking to other semantic databases. By covering each of these bases, Zemanta has positioned itself to reach just about anyone, anywhere in the world, who has an interest in blogging, writing, or content creation generally.
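For developers curious about the API, a rough Python sketch follows. The endpoint, method name, and parameters reflect assumptions about the REST interface described at zemanta.com/api/ and may differ from the real thing.

    # Rough sketch of asking Zemanta for suggestions; names are assumptions.
    import json
    import urllib.parse
    import urllib.request

    API_URL = "http://api.zemanta.com/services/rest/0.0/"   # assumed endpoint
    API_KEY = "YOUR_API_KEY"                                 # hypothetical key

    def suggest(text: str) -> dict:
        """Ask the service for links, images, and tags relevant to a draft."""
        payload = urllib.parse.urlencode({
            "method": "zemanta.suggest",   # assumed method name
            "api_key": API_KEY,
            "format": "json",
            "text": text,
        }).encode("utf-8")
        with urllib.request.urlopen(API_URL, data=payload) as resp:
            return json.load(resp)

    suggestions = suggest("A short draft blog post about the Semantic Web.")
    print(list(suggestions.keys()))   # e.g., related articles, images, tags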

Recognizing the broad nature of its potentially vast user base, Zemanta’s solution is tuned for casual writers who may have looser writing styles when compared to professional columnists or authors. In fact, Zemanta’s focus on individual content producers may prove to be a key strategic decision – most, if not all, NLP products are geared toward team or institutional environments where there may be larger goals and needs to be served. Becoming a valued tool for individual content producers is a very different strategic aim and Zemanta may have wisely selected a market with few, if any, entrants.

Key Differentiators:

Just trying Zemanta is enough of a differentiator: it’s one thing to have the desire to enhance your content, but it’s very different, and far better, to have ready-made suggestions at hand and immediately usable. If you’re a content producer, a writer, or just someone looking for helpful suggestions when you write, you may be delighted to have Zemanta’s assistance.

Other differentiators include plans to broaden the delivery of Zemanta’s service through more familiar applications and the ability to harvest thoughts and suggestions from emerging repositories of linked data, but these two points require more time and development prior to their general availability.

A critical factor in setting Zemanta apart is that even the most computer-challenged user can take advantage of this solution. There’s no need for corporate IT personnel to get involved, no approval process, etc. – simply download the add-on to your browser, visit your favorite blogging site and start writing. This kind of simplicity destroys a number of barriers to adoption and Zemanta deserves credit for taking this approach.

Six/Twelve Month Plans:

A commercial version of Zemanta is planned, which will be made available on a Software as a Service (SaaS) basis. This version will be targeted toward professional content producers, likely those found in the publishing industry. Once it launches, more specialized tools for the publishing industry will soon follow. Zemanta also has plans to apply its solution to more than blogging, although for the time being the company prefers to keep these plans private.

Analysis:

Zemanta is an extremely practical and useful tool that writers of all kinds may find helpful. It only takes a few moments to recognize the potential value of this service, not to mention the sheer helpfulness of having something as tedious as tagging performed automatically.

Returning to the company’s market entry point, the decision to pursue individuals initially allows it to acquire recognition, a user base, valuable knowledge from real-world experience, and time to hone its offering to razor sharpness before entering professional/corporate markets. These markets will be pursued primarily in the US and UK, which should certainly keep Zemanta busy for some time to come. Writing may never be quite the same again.

Wednesday, December 3, 2008

Changes At The Semantic Business Blog

If you're a regular reader of Semantic Business you'll recognize that things have been changing - I've added a jobs feed, some of the formatting is slightly off (to my chagrin), and I'm planning to start serving ads in a relatively unobtrusive way. All of this is a precursor to my move over to TypePad (preview here), which looks much better suited to my needs going forward. Please bear with me during this time - this isn't a random decision and I'm aware that links are likely to be broken, things may be lost (hopefully just temporarily), and I may spill my coffee. Or not.
I'm also planning to forward my domain www.davidprovost.com and consolidate my online presence. We'll see how this goes...