Content discovery tools for commercial publishers
Identifying content in a CMS relevant to content of a published electronic document
Commercial publishers commonly research content that others have published. Such research is particularly common for content published on the internet, or “online”, such as news and entertainment. For example, commercial publishers of news and entertainment need to ensure that the content they make available online is up to date and interesting to potential viewers.
Each major commercial news outlet typically has a “home page” on the internet on which it publishes headlines for major stories of the moment. The home page generally has many headlines, which typically are hypertext links that can be used to access a full story. In some cases, a few sentences of a story may also be provided on the home page. The organization of the headlines on the home page generally changes several times per day as new stories become available and older stories become less frequently viewed. Thus, online content from commercial publishers, particularly in news and entertainment, can change very quickly.
Identifying content published online by other publishers, and comparing that content to its own resources, is a challenging task for a commercial publisher because of the high volume of content, the rapid change of content, and limited access to content. Users can consume a large amount of time and computer resources in reviewing online content and content stored in their content management systems.
Several technical challenges arise for a commercial publisher to compare content published online by other publishers to its own resources. In particular, the content published by other publishers is only available to the commercial publisher in its published format, such as through a “home page” of a web site.
Thus, any analysis of the published content available from another publisher is based on the structure and content of a published electronic document. A commercial publisher generally does not have access to the databases of content, and the various metadata about that content, owned by competing commercial publishers. Thus, a computer-based analysis of another publisher’s content involves extracting information from the structure and content of a published electronic document, typically a home page.
The extracted information is used to generate queries to find relevant content in a content management system. Results from such queries are processed to communicate to a user whether the content management system has content available corresponding to the query and relevant to the published electronic document, and whether the available content is published in electronic documents available from a second source, namely the publisher whose content management system is queried.
When the query results are processed based on the relative importance of the information extracted from the published document, the communication to users can include indications of the relative importance of the query results, thus allowing the users to focus their attention on the more important content, and reduce consumption of computer resources, such as processing and network bandwidth.
Accordingly, in one aspect, a computer system receives a published electronic document from a first source of published electronic documents. The computer system analyzes structure and content of the published electronic document to extract information, and data indicative of relative importance of the extracted information. Such extracted information can include keywords, based on content, and information indicative of relative importance of those keywords, based on structure.
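As an illustration of this extraction step, the following is a minimal sketch, not the disclosed implementation. It assumes the published electronic document is HTML, that headlines are hypertext links inside heading elements, and that heading level is a usable proxy for relative importance. The names `HeadlineExtractor`, `extract_keywords`, `HEADING_WEIGHT` and `STOP_WORDS`, and the weight values, are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: a larger heading marks a more prominent story on the page.
HEADING_WEIGHT = {"h1": 3.0, "h2": 2.0, "h3": 1.0}
# Assumption: a tiny illustrative stop-word list; a real one would be larger.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "for", "as"}


class HeadlineExtractor(HTMLParser):
    """Collects (headline text, weight) pairs for links inside headings."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._heading = None    # heading tag currently open, if any
        self._in_link = False   # True while inside a link within a heading
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_WEIGHT:
            self._heading = tag
        elif tag == "a" and self._heading:
            self._in_link = True
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            # Collapse whitespace in the link text to get the headline.
            text = " ".join("".join(self._text).split())
            if text:
                self.headlines.append((text, HEADING_WEIGHT[self._heading]))
            self._in_link = False
        elif tag == self._heading:
            self._heading = None
            self._in_link = False


def extract_keywords(html):
    """Return {keyword: weight}, keeping the highest weight seen per keyword."""
    parser = HeadlineExtractor()
    parser.feed(html)
    parser.close()
    weights = {}
    for text, weight in parser.headlines:
        for word in text.lower().split():
            word = word.strip(".,:;!?\"'()")
            if word and word not in STOP_WORDS:
                weights[word] = max(weights.get(word, 0.0), weight)
    return weights
```

In this sketch the keywords come from the content (the headline text), while the weights come from the structure (the enclosing heading element). Other structural cues, such as position on the page, could be substituted for heading level.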
The computer system generates queries based on the extracted information to query a content management system of a second source of published electronic documents. The results can indicate whether the content management system has content available corresponding to the query, and whether the content is published in electronic documents available from the second source.
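The query step might be sketched as follows, under stated assumptions. The content management system is modeled here as an in-memory list of records, and `CmsItem`, `build_queries` and `run_query` are hypothetical names. A real content management system would expose its own search interface, which is not described in this text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CmsItem:
    """Hypothetical record in the second source's content management system."""
    item_id: str
    title: str
    published: bool


def build_queries(keyword_weights):
    """Turn {keyword: weight} into queries, most important first."""
    ranked = sorted(keyword_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"term": term, "weight": weight} for term, weight in ranked]


def run_query(cms_items, query):
    """Stand-in for a CMS search: case-insensitive word match on the title."""
    term = query["term"].lower()
    matches = [item for item in cms_items if term in item.title.lower().split()]
    return {
        "query": query,
        # Whether the CMS has any content corresponding to the query.
        "available": bool(matches),
        # Whether that content already appears in published documents.
        "published": [item for item in matches if item.published],
        "unpublished": [item for item in matches if not item.published],
    }
```

Each result carries both indications named above: whether matching content is available at all, and which of the matching items are already published by the second source.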
The computer system can process the results received from these queries, using the relative importance of the extracted information, to communicate information indicative of content that is available in the content management system, relevant to the published document, and not yet published in electronic documents available from the second source. This information can be used for several purposes: to reduce consumption of computer resources, to otherwise improve productivity of users, and to reduce the amount of time for making content available for distribution.
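One way to picture this result processing is the sketch below. It assumes each query result is a dictionary with a `weight` (the relative importance of the extracted keyword) and a list of `items`, each with a `title` and a `published` flag. Summing the weights of all keywords that match an item is an illustrative scoring choice, not one stated in this text, and `rank_unpublished` is a hypothetical name.

```python
def rank_unpublished(results):
    """Return (score, title) rows for content that is available but unpublished.

    Rows are ordered from most to least important, so a user can review the
    top of the list first.
    """
    scores = {}
    for result in results:
        for item in result["items"]:
            if not item["published"]:
                # Assumption: an item matched by several keywords
                # accumulates the weights of those keywords.
                title = item["title"]
                scores[title] = scores.get(title, 0.0) + result["weight"]
    rows = [(score, title) for title, score in scores.items()]
    return sorted(rows, key=lambda row: (-row[0], row[1]))
```

For example, an unpublished item matched by a keyword of weight 3.0 and another of weight 1.0 would be listed with a score of 4.0, ahead of an unpublished item matched only by the second keyword.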
In the following description, reference is made to the accompanying drawings which form a part hereof, and in which are shown, by way of illustration, specific example implementations. Other implementations may be made without departing from the scope of the disclosure.