A Search Engine for Product Design That Clicks
WEB EXCLUSIVE

By Joseph Engler and Andrew Kusiak


Andrew Kusiak is a professor of mechanical and industrial engineering at the University of Iowa. Joseph Engler is a senior software engineer at Rockwell Collins Inc. and an industrial engineering Ph.D. student at the University of Iowa.


1. Introduction

Innovation is the process of developing a new product or service that is market-worthy. This process can take different forms. Continuous innovation, for instance, is concerned with the incremental improvement of existing products or services by the modification of certain attributes of that product or service—for example, development of the next lawn mower doubling the sales volume of the previous-generation mower. Disruptive innovation is the creation of a totally new product or service. A cell phone and an iPod are recent examples of such products. Any product, service, or process that becomes innovative is deemed so only through acceptance in the global marketplace. Companies spend vast amounts of money to ensure that a new product becomes a success. From focus groups to market research, the aim is to predict the success of a concept before it is released to the general public.

Determining the market acceptability of a new concept during its early development stage is difficult and time-consuming. There is no well-established methodology or tool that would guarantee such success. Ulwick [1] states that the traditional approach of asking customers for solutions tends to undermine the innovation process, because most customers have a limited frame of reference. Yet corporations spend millions on increasing the market acceptance of their products and services. While Ulwick [1] makes a justifiable case for limiting reliance on customer input, there is an obvious advantage to judging whether a product or service is market-worthy before it is released. In contrast to Ulwick [1], von Hippel and Katz [2] argue that research in innovation supports involving customers. Blazevic and Lievens [3] suggest that electronic interaction with customers allows for a mutual understanding during the innovation process.

Focus groups and market research are often limited to a select group chosen either randomly or with segmentation in mind. This process is suboptimal and may offer a false sense of security when releasing a product or service because of the limited sample size. As in data mining, a larger sample generally yields a better model. This article extends the scope of focus groups and market research to the entire World Wide Web.

The World Wide Web can be seen as a massive search space. According to Cyveillance [4], as of 2005 there were over 63 million registered Web domains, 25 million of which were alive at any given time and 1.2 million of which were commerce centered. Additionally, they reported over 8 billion Web pages with some 7.3 million unique pages being added daily. When the comparison is made with the Web, it is easy to see the limitations of a focus group of 1,000 individuals.

This article proposes a method to mine data from the World Wide Web to discover customer and expert reviews of products and services. These reviews are used to create a transactional database from which frequent requirements for innovation may be mined. A data mining algorithm known as an association rule algorithm is utilized to analyze the frequent requirements. The data-mined requirements can serve as a road map to success, in the knowledge that a larger understanding of market conditions has been surveyed. Having such a road map of requirements is especially helpful during the ideation phase of innovation when new, possibly valuable, ideas are being put forth. Ideas not meeting the requirements mined from the Web can be viewed as risky and should undergo some evolution to ensure success.



2. Mining User and Expert Reviews From the Web

It has become common practice on the World Wide Web for commerce sites to host user reviews. The practice is so ingrained in the Web that an entire field of study known as Collaborative Intelligence, with subfields of study such as Collaborative Filtering [5], has been established. Collaborative Intelligence involves the collective reasoning of multiple users to achieve some goal. Netflix, an Internet commerce site aimed at movie rentals, utilizes Collaborative Filtering to offer recommendations to its users. Netflix is so involved in this type of Collaborative Intelligence that it has offered a substantial prize for any team that develops an algorithm at least 10 percent more accurate than its current tools [6]. Much can be gleaned from user reviews to assist in the formulation of requirements for innovation.

User reviews often contain statements of opinion that indicate the strengths and weaknesses of a given product. When they are combined with user reviews for products of similar type, a collective interpretation of the attributes that make a product a success or a failure can be obtained. Figures 1 and 2 illustrate this point for two different portable music players. Note that while the devices may be different, one being an MP3 player and the other a Walkman-style radio, they share similar review attributes, including the need for good sound quality. Reviews of this sort assist in forming the requirements for innovation.

Pros: Good sound quality. Can record normal conversation. The menu system is easy to use.

Cons: The case scratches easily. The ear buds are uncomfortable.


Pros: Stylish case and screen. It is nice to be able to customize the screen to my desires.

Cons: The battery life is a bit short.

Figure 1. Sample MP3 player review (Authors’ construct based on multiple review sites)

 

Pros: I liked the sound quality that this radio puts out.

Cons: Loses stations easily.


Pros: I dropped this while riding a bike and it still plays like a charm.

Cons: None that I can see.

Figure 2. Sample Walkman review (Authors’ construct based on multiple review sites)


To extract expert and user reviews of existing products and services from the Web it is necessary to discover the pages that contain reviews. There are two main methodologies for automated searching of the Web. Standard, or unfocused, crawling of the Web is intended to fetch a page, parse out-links, and repeat [3]. Standard crawling does not consider a specific topic of inquiry; rather, its job is to index all pages available on the Internet. Focused crawling seeks only those pages related to a specific query string.

Even with the assistance of algorithms such as PageRank developed by Google [7], successful standard crawling requires massive hardware and bandwidth. This drawback prevents most corporations from performing this type of crawl internally. Focused crawling requires far less hardware and bandwidth, but does require some sophistication of algorithms to weed out links that are unrelated to the given query. Often, data and text mining algorithms can be used to develop a classifier for organizing reviews presented on Web pages. A classifier is not always necessary, especially when the crawl is restricted to a specific domain of product types. In that case, a heuristic may be used as a replacement for the classifier. Experimental results, though, have shown that the classifier system is more accurate at discovering correct pages while the heuristic discovers a greater number of correct pages faster.
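The difference between the two crawling styles can be made concrete with a minimal sketch of a heuristic-based focused crawl. Everything specific here is an assumption made for illustration: the in-memory PAGES dictionary stands in for real HTTP fetches, the "Pros:"/"Cons:" test in looks_like_review is a hypothetical domain heuristic, and the rule of expanding only pages that mention the query is one simple way to keep the crawl focused. It is not the authors' crawler.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects the href targets of <a> tags (the page's out-links)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def looks_like_review(html):
    """Hypothetical domain heuristic: a review page lists both pros and cons."""
    text = html.lower()
    return "pros:" in text and "cons:" in text


def focused_crawl(pages, seed, query, max_pages=100):
    """Breadth-first crawl that expands only pages mentioning the query.

    `pages` is an in-memory dict {url: html} standing in for HTTP fetches.
    Returns (visited_urls, review_urls).
    """
    queue, seen = deque([seed]), {seed}
    visited, reviews = [], []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:
            continue  # dead link
        visited.append(url)
        if looks_like_review(html):
            reviews.append(url)
        if query.lower() not in html.lower():
            continue  # off-topic page: do not follow its out-links
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited, reviews


# Toy Web: "/sports" is reachable only through the off-topic "/news" page.
PAGES = {
    "/start": '<a href="/mp3-a">MP3 player A</a> <a href="/news">News</a>',
    "/mp3-a": 'MP3 player A. Pros: good sound. Cons: short battery. <a href="/mp3-b">B</a>',
    "/mp3-b": "MP3 player B. Pros: stylish case. Cons: none.",
    "/news": '<a href="/sports">Sports</a>',
    "/sports": "Scores. Pros: none. Cons: none.",
}
visited, reviews = focused_crawl(PAGES, "/start", "mp3")
print(visited)
print(reviews)
```

An unfocused crawl would drop the query test and follow every out-link, which is why it would also reach "/sports" in this toy example.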

The authors performed an experiment to test a classifier against the heuristic for the domain of MP3 players. The two algorithms were each run for one hour and set to discover user reviews related to the domain. Figure 3 shows the results for the heuristic-based focused crawl and the classifier-based one. Each chart plots the total pages discovered during the run against the number of pages containing actual reviews, with the page count on the y axis and the run time of the crawl on the x axis. The charts show that the classifier-based crawl produces a larger percentage of correct pages, while the heuristic-based crawl discovers correct pages faster but with lower accuracy.


Figure 3. Heuristic versus classifier results



The two crawls were run simultaneously. Figure 4 directly compares the correct results of the two methodologies and shows that a domain-specific heuristic is sufficient for rapid review detection.

 

Figure 4. Comparison of heuristic to classifier



Once the reviews are discovered on the Web, they are parsed using standard text mining methods and stored in a database of reviews. The database is structured similarly to a transactional database in a supermarket, with one column for each review attribute (e.g., good sound quality or rugged case). This database, once populated with the other types of requirements sources discussed below, will be utilized to create frequent itemsets (or records) of requirements for the product type being considered.

 

2.1 Population of Transactional Database

Population of the transactional database from the reviews that were discovered in the Web crawl is a twofold process. The first step is to filter out those reviews that offer little or no value. A simple classification system, discussed below, is used to remove the unwanted reviews. The second step is to perform a semantic analysis of each review to identify the individual requirements that form the attributes of the database.

To filter the unwanted reviews from the dataset, a simple decision tree classifier was built from a training set. The training set was culled from the dataset, and each review in it was manually inspected and assigned a class label of either good review (that is, useful to our purposes) or “MP” for “moaner/praiser” review (too effusive or extreme to be reliable). The resulting decision tree turned out to be extremely shallow, with over 75 percent of the data properly classified by the first node. The feature space of the decision tree was the words of the reviews with stop words, such as “I,” “the,” and “and,” removed. Each word was given an integer count of the number of times it occurred in the corpus of the training set.

The nodes that made up the decision tree were, in order of importance, “none,” “nothing,” “awesome,” “awful,” “junk,” and “kidding.” As previously stated, the first node correctly classified the moaner and praiser reviews over 75 percent of the time. This is not surprising, since a praiser will most commonly have nothing negative to say about the product and a moaner will have nothing good to say about it. The reviews were all formatted to the style of pros and cons through a simple parsing manipulation. Thus, reviews of the type “Pros: I just love this MP3 player, Cons: none” were then easily filtered with the decision tree algorithm (a widely used data mining algorithm).
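The filtering step can be illustrated with a rough sketch that uses the reported node words directly. This is a deliberate simplification and not the trained tree: a real decision tree would split on word counts with learned thresholds, whereas the classify function below flags a review as soon as any marker word appears. The STOP_WORDS set is an abbreviated, hypothetical list.

```python
import re

# Words reported as the decision tree's nodes, in order of importance.
MP_WORDS = ["none", "nothing", "awesome", "awful", "junk", "kidding"]

# Abbreviated, hypothetical stop-word list for illustration only.
STOP_WORDS = {"i", "the", "and", "a", "this", "is", "it", "to"}


def tokenize(review):
    """Lower-case the review, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", review.lower())
    return [w for w in words if w not in STOP_WORDS]


def classify(review):
    """Return "MP" (moaner/praiser) if any marker word occurs, else "good".

    Simplification: a flat keyword test stands in for the learned tree.
    """
    tokens = set(tokenize(review))
    for word in MP_WORDS:
        if word in tokens:
            return "MP"
    return "good"


praiser = "Pros: I just love this MP3 player, Cons: none"
useful = "Pros: Good sound quality. Cons: The case scratches easily."
print(classify(praiser))
print(classify(useful))
```

On these two samples the praiser review is labeled "MP" because of the word "none," and the second review is kept as "good."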

The task of breaking apart the reviews into individual requirements or attributes for the database is a matter of part-of-speech (POS) tagging. To accomplish this task, each review was first split into its pros and cons. Each part was then further broken into segments at natural stops such as commas or periods. Finally, these low-level segments were tagged with their POS tags. For the task of POS tagging, the WordNet application programming interface [12] was utilized. This API takes a string and returns the part-of-speech tag for that string.
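The segmentation that precedes tagging might look like the following sketch. It assumes reviews have already been normalized to a "Pros: ... Cons: ..." layout, and the function name split_review and the choice of commas, periods, and semicolons as natural stops are assumptions for illustration. POS tagging itself is not shown.

```python
import re


def split_review(review):
    """Split a "Pros: ... Cons: ..." review into lists of low-level segments."""
    match = re.search(r"pros:(.*?)cons:(.*)", review,
                      flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return {"pros": [], "cons": []}

    def segments(text):
        # Break at natural stops and drop empty fragments.
        parts = re.split(r"[.,;]", text)
        return [p.strip() for p in parts if p.strip()]

    return {"pros": segments(match.group(1)), "cons": segments(match.group(2))}


sample = ("Pros: Good sound quality. Can record normal conversation. "
          "The menu system is easy to use. "
          "Cons: The case scratches easily. The ear buds are uncomfortable.")
result = split_review(sample)
print(result["pros"])
print(result["cons"])
```

Each resulting segment, such as "Good sound quality," is then small enough to be tagged and checked for a noun or noun phrase.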

Following the lead of Hu and Liu [13], only those segments that contain nouns or noun phrases are used to generate attributes for the database. The reason is that other components of the segment are unlikely to contain product features. In this vein, a segment such as “good battery life” would be captured as an attribute to store. Prior to the POS tagging, stop words are removed utilizing a standard stop-word list. Con-type reviews are changed to pro-type reviews so that the attributes all reflect a desire on behalf of the consumer. Thus, a review such as “the battery life is too short” undergoes a transformation to “long battery life,” which becomes a requirement in the mining task.

Some amount of fuzzy matching is also involved in the attribute population. The system must have the ability to correlate, as effectively as possible, attributes that mean the same thing but are expressed with different words. The WordNet API [12] also accomplishes this task. This time, the API is used to determine the definition of words. This action is performed on the nouns rather than the adjectives. Through a list of definitions, it is easy to determine the similarity of words. Words such as “display” and “screen” are found to contain the same root meanings and thus are correlated to the same word. The word that is chosen to be the attribute is the one first encountered. The fuzzy matching is also used to account for word similarities such as “auto focus” and “auto-focus.”
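A small sketch can show the two matching rules described above: spelling variants collapse to one form, and the first word encountered in a group of equivalent nouns becomes the attribute name. The SYNONYM_GROUPS table is a hypothetical stand-in for the comparison of WordNet definitions, which is not reproduced here, and the class and function names are assumptions.

```python
import re

# Hypothetical synonym groups standing in for the WordNet definition lookup.
SYNONYM_GROUPS = [{"display", "screen"}, {"case", "casing"}]


def normalize(term):
    """Collapse case, hyphens, and spacing: "Auto-Focus" -> "auto focus"."""
    return re.sub(r"[\s\-]+", " ", term.strip().lower())


class AttributeRegistry:
    """Maps each noun to the first-encountered member of its synonym group."""

    def __init__(self):
        self.canonical = {}  # normalized noun -> canonical noun

    def resolve(self, noun):
        key = normalize(noun)
        if key in self.canonical:
            return self.canonical[key]
        for group in SYNONYM_GROUPS:
            if key in group:
                for other in group:
                    if other in self.canonical:
                        # Reuse the name chosen when the group was first seen.
                        self.canonical[key] = self.canonical[other]
                        return self.canonical[key]
        self.canonical[key] = key  # first encounter: this word is canonical
        return key


registry = AttributeRegistry()
first = registry.resolve("screen")
second = registry.resolve("Display")
print(first, second)
print(registry.resolve("auto-focus") == registry.resolve("Auto Focus"))
```

Because "screen" is seen first in this run, "Display" resolves to "screen"; reversing the order of the two calls would make "display" the attribute name instead.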

In some cases, the noun may be separated by one or more words from the adjective that describes it. Such is the case of “the battery life overall is really good.” In this case, a simple heuristic is used to mate the adjective with the noun: The noun closest to the adjective in the same sentence is used. This heuristic accounts for most cases and introduces little risk.
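The nearest-noun rule can be sketched as follows. The input is assumed to be already POS-tagged with a simplified tag set (NOUN, ADJ, OTHER), and runs of adjacent nouns are grouped into a single phrase so that "battery life" is treated as one unit. Both choices, and the function names, are assumptions made for the example.

```python
def noun_phrases(tagged):
    """Group runs of adjacent NOUN tokens into (start, end, phrase) spans."""
    spans, start = [], None
    # A sentinel token closes a noun run that reaches the end of the sentence.
    for i, (word, tag) in enumerate(tagged + [("", "END")]):
        if tag == "NOUN" and start is None:
            start = i
        elif tag != "NOUN" and start is not None:
            phrase = " ".join(w for w, _ in tagged[start:i])
            spans.append((start, i - 1, phrase))
            start = None
    return spans


def pair_adjectives(tagged):
    """Pair each adjective with the closest noun phrase in the sentence."""
    spans = noun_phrases(tagged)
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        if tag != "ADJ" or not spans:
            continue

        def distance(span):
            start, end, _ = span
            return min(abs(i - start), abs(i - end))

        nearest = min(spans, key=distance)
        pairs.append((word, nearest[2]))
    return pairs


sentence = [("the", "OTHER"), ("battery", "NOUN"), ("life", "NOUN"),
            ("overall", "OTHER"), ("is", "OTHER"), ("really", "OTHER"),
            ("good", "ADJ")]
pairs = pair_adjectives(sentence)
print(pairs)
```

For the example sentence the adjective "good" is paired with "battery life," which would then be stored as the attribute "good battery life."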

Once the reviews are segmented and the attributes discovered, a transactional database is created. Each row of the database is formed from exactly one review. Each column of the database is formed from exactly one attribute that was discovered in the dataset. Each column is an integer data type and represents the number of times that the given review contained the given attribute. To allow for dynamic running of this system, columns may be added to the database as new attributes are encountered. All new columns are given a default value of 0 as the attribute was not found in the reviews that make up the data rows. Table 1 illustrates the use of the transactional database for storing the reviews. An entry 1 in Table 1 indicates that an attribute (column) applies to the corresponding review (row), while a 0 entry implies otherwise.

Table 1. Example of a review transactional database. Note that the last column contains a new attribute discovered in review 127. Thus, all previous rows take on the default value of 0 since this requirement was not found in their reviews.

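| Review ID | Long Battery Life | Case does not scratch | Good Sound Quality | Easy to use interface | Comes with ear buds |
|-----------|-------------------|-----------------------|--------------------|-----------------------|---------------------|
| 124       | 1                 | 0                     | 1                  | 1                     | 0                   |
| 125       | 0                 | 1                     | 1                  | 0                     | 0                   |
| 126       | 1                 | 1                     | 0                  | 0                     | 0                   |
| 127       | 1                 | 0                     | 1                  | 1                     | 1                   |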
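The dynamic growth of the transactional database can be sketched in a few lines. The example rebuilds the contents of Table 1 from per-review attribute lists; the in-memory lists and the function name build_transactions are assumptions, and a real system would presumably use a database table instead. Columns appear in order of discovery, so their order differs from the table above.

```python
def build_transactions(reviews):
    """Build a review-by-attribute count matrix.

    reviews: dict {review_id: list of attribute strings}
    Returns (columns, rows), where rows[review_id] is a list of integer counts.
    """
    columns = []
    rows = {}
    for review_id, attributes in reviews.items():
        for attr in attributes:
            if attr not in columns:
                columns.append(attr)
                for row in rows.values():
                    row.append(0)  # earlier reviews default to 0 for a new column
        row = [0] * len(columns)
        for attr in attributes:
            row[columns.index(attr)] += 1
        rows[review_id] = row
    return columns, rows


REVIEWS = {
    124: ["Long Battery Life", "Good Sound Quality", "Easy to use interface"],
    125: ["Case does not scratch", "Good Sound Quality"],
    126: ["Long Battery Life", "Case does not scratch"],
    127: ["Long Battery Life", "Good Sound Quality", "Easy to use interface",
          "Comes with ear buds"],
}
columns, rows = build_transactions(REVIEWS)
print(columns)
for review_id, row in rows.items():
    print(review_id, row)
```

When review 127 introduces "Comes with ear buds," the new column is appended and rows 124 to 126 are backfilled with 0, matching the behavior described in the table caption.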


It is from the transactional database that the mining of the frequent requirements for innovation takes place. Each review represents the desires of an individual consumer. As such, it is important to ensure that any new innovation meets a substantial share of these requirements.

 

3. Patent Database Mining

While user and expert reviews offer a wealth of information about the needs and desires of the market, they are not the only source of requirements for innovation. Patent databases can also suggest requirements in a data-driven innovation approach. Patent databases offer complete and summary descriptions of inventions that have been previously envisioned. These text documents can offer a great deal of evidence for the trends in current innovation. Shih et al. [8] have devised a patent trend monitoring system that autonomously searches for patents, dissects them through semantic analysis, and reports them side by side to show recent trends in specific technological areas.

Patent databases are often more effective for requirements gathering than publications and thesis information. Wen [9] states that patent gazettes reveal over 90 percent of research results for the patents, while more than 80 percent of that information does not appear in academic theses and publications. This wealth of information must be tapped in order to formulate true requirements for innovation.

In addition to the gathering of requirements, this research information assists in the ideation phase of innovation. To that end, Wen et al. [10] have researched utilization of patents for the invigoration of the creativity processes. Figure 5 illustrates a simple advancement that was discovered in a patent database. The use of extending arms to aid in stability of the Christmas tree stand could be utilized in multiple areas during the ideation phase of an innovation that would require balance and stability.

Figure 5. Christmas tree stand patent. U.S. Patent Number 5492301


Mining Web-based patent databases is similar to the mining of user reviews, given that both may be viewed as text documents with HTML markup added. A classifier is not necessarily needed to discover which documents are patents, as a simple heuristic may suffice to mine sites such as Google Patents [11] because of the domain specificity of these sites. Similar to review mining, searching patents requires semantic analysis to determine which text represents a possible requirement for innovation.

Determining which text represents a possible requirement can be more challenging in patent documents than in reviews because there is more text present in the documents. To isolate the requirements in a patent, the text is segmented at the sentence breaks. Each sentence is then run through a trained classifier to determine its relevance to requirements. This classifier, previously trained by domain experts, automatically sorts the sentences, or other segmentations, into categories of relevance. Once sorted, each segment is dissected to extract the requirements for innovation.

Requirements discovered in the patent documents are stored in the same transactional database as the review requirements. Storing the requirements in one transactional database simplifies the frequent itemset generation discussed in Section 4 below. Additionally, the user may choose to place these requirements in a creativity database for use during the ideation phase of an innovation project.


 

4. Mining Frequent Requirements

Once the requirements-gathering process is complete for a given period of time, the next task is to determine which requirements are pertinent to making the innovation more market-acceptable. The transactional database will contain some requirements that are not good indicators of market success, since they may have been mentioned only once or twice in the entire set of reviews or patents. To determine the requirements that may increase market acceptance of an innovation, we used an Apriori-type market basket analysis. The Apriori analysis discovers groups of requirements (itemsets) that frequently appear together across the records of the transactional database.

The transactional database forms a sparse dataset from which the frequent requirements are mined. The dataset is sparse in the sense that each entry contains a very small number of requirements. Sparsity is not a concern for the frequent itemset generation algorithm used here.

To mine the frequent requirements, a support metric is defined that measures how frequently an itemset appears in the database. Support, similar to that defined by Tan et al. [18], determines how often an itemset is applicable to a given dataset. Given an itemset X = {x1, x2, …, xn} and a dataset D with |D| transactions, the support σ of the itemset is defined as the number of transactions k in D that contain the itemset divided by the cardinality of the dataset, as in Eq. (1).

$$\sigma(X) = \frac{k}{|D|} \qquad (1)$$

 

For an itemset to be considered frequent, the support of that itemset must be greater than some threshold τ, which is a user-defined parameter. The use of the minimum support threshold ensures that generated itemsets can truly be considered a requirement and not simply an anomalous entry.
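As a worked example, the support of an itemset can be computed over the four reviews of Table 1. The sketch below represents each review as a set of the attributes marked 1 in the table; the threshold value of 0.4 is an arbitrary choice for illustration, not a value taken from the experiment.

```python
# Reviews from Table 1, each as the set of attributes marked 1.
TRANSACTIONS = {
    124: {"Long Battery Life", "Good Sound Quality", "Easy to use interface"},
    125: {"Case does not scratch", "Good Sound Quality"},
    126: {"Long Battery Life", "Case does not scratch"},
    127: {"Long Battery Life", "Good Sound Quality", "Easy to use interface",
          "Comes with ear buds"},
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    k = sum(1 for attrs in transactions.values() if itemset <= attrs)
    return k / len(transactions)


def is_frequent(itemset, transactions, tau):
    """An itemset is frequent when its support is greater than tau."""
    return support(itemset, transactions) > tau


pair = {"Long Battery Life", "Good Sound Quality"}
print(support(pair, TRANSACTIONS))  # reviews 124 and 127: 2 / 4 = 0.5
print(is_frequent({"Comes with ear buds"}, TRANSACTIONS, tau=0.4))  # 1 / 4 = 0.25
```

With τ = 0.4 the pair is frequent, while "Comes with ear buds" appears in only one of four reviews and is treated as an anomalous entry.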

Generation of the frequent itemsets is performed in accordance with the standard Apriori algorithm [17]. Candidate one-item itemsets are generated for each of the attributes in the transactional dataset. Each candidate is checked to see if its support is above the minimum support threshold τ. Candidates whose support does not exceed τ are discarded. This pruning is justified by the Apriori property, which states that all nonempty subsets of a frequent itemset must also be frequent, so no itemset containing an infrequent candidate can itself be frequent.

Once the frequent one-item itemsets are generated, they are combined to form candidate two-item itemsets. These candidates are then checked for support, that is, how often the items in the set appear together in the same review. This process of candidate creation and checking is repeated with progressively larger itemsets until no further candidates can be formed.
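The level-wise procedure can be sketched compactly. The code below is a plain, unoptimized rendering of the general Apriori idea under stated assumptions: transactions are Python sets built from Table 1, the threshold 0.4 is arbitrary, and candidates are formed by uniting pairs of frequent itemsets from the previous level and keeping those whose smaller subsets are all frequent. It is not the authors' implementation.

```python
from itertools import combinations


def apriori(transactions, tau):
    """Return {itemset: support} for every itemset with support > tau."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent one-item itemsets.
    items = sorted({item for t in transactions for item in t})
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s > tau:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Candidate generation: unite pairs of frequent (k-1)-itemsets and
        # keep unions of size k whose (k-1)-subsets are all frequent.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Support check for the surviving candidates.
        current = {}
        for candidate in candidates:
            s = support(candidate)
            if s > tau:
                current[candidate] = s
        frequent.update(current)
        k += 1
    return frequent


TRANSACTIONS = [
    {"Long Battery Life", "Good Sound Quality", "Easy to use interface"},
    {"Case does not scratch", "Good Sound Quality"},
    {"Long Battery Life", "Case does not scratch"},
    {"Long Battery Life", "Good Sound Quality", "Easy to use interface",
     "Comes with ear buds"},
]
frequent = apriori(TRANSACTIONS, tau=0.4)
largest = max(frequent, key=len)
print(sorted(largest), frequent[largest])
```

On the Table 1 data with τ = 0.4, the largest frequent itemset is {Easy to use interface, Good Sound Quality, Long Battery Life} with support 0.5.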

The generation of the frequent itemsets from the transaction database forms the requirements for innovation. Starting with the largest frequent itemset, one can determine the requirements that are described by the reviews simply by observing the items in the itemset. Thus, an itemset generated from the review of MP3 players may be represented as: Long battery life, good sound quality, tough case, and accepts standard accessories.

While frequent itemset generation takes a large step forward in determining the requirements for innovation, it does not offer a complete understanding of how the requirements interact. For example, there may exist five sets of frequent requirements of length 5 (each containing five items) in the transactional database as determined by the Apriori mining. The requirements “long battery life” and “good sound quality” may be found in each of the itemsets. This indicates that these requirements are special in relation to the others in the itemsets. This quality of being special, though, is not described by the frequent itemset generation.

The use of an AND-OR tree can help us understand the unique relationship each of the requirements has within the frequent itemsets. AND-OR trees have been applied in areas from logic representation to Web mining [14] [15] [16]. The AND-OR tree here is constructed somewhat differently, though. Its purpose is to aid in determining the significance of each attribute within the frequent itemsets.

The AND-OR tree is constructed iteratively over the sets of itemsets in descending order of cardinality (the number of items in an itemset). A root node, named requirements, is generated first. Starting with the set of itemsets whose cardinality is the largest and which contains more than one itemset, a node named after the cardinality of the itemsets is appended to the root node. The process of relationship determination is then performed for each attribute in the itemsets.

Relationship determination is performed by iterative search through the set of itemsets to find the frequent attributes in the set. To be considered frequent, each attribute must appear in at least ρ number of itemsets in the set of itemsets. Each frequent attribute forms a node that is appended to the cardinality node in the AND-OR tree. These nodes are then joined with arcs to represent AND conditions. The remaining attributes, which are not frequent in the set of itemsets, form child nodes to the last frequent node in the current branch of the tree. These nodes form the OR conditions of the relationship. This process is continued iteratively for each successive smaller set of itemsets until a stopping criterion (user determined) is met or the remaining itemsets are of cardinality 1.

For example, given the set of three frequent itemsets of cardinality 3 illustrated in Figure 6, the AND-OR tree representing the relationship of these itemsets, with ρ = 2, is presented in Figure 7.

Long Battery Life, Good Sound Quality, Clear Display

Tough Case, Long Battery Life, Good Accessories

Good Sound Quality, FM Tuner, Long Battery Life

Figure 6. Three frequent itemsets of cardinality 3

 


Figure 7. Partial AND-OR tree for requirements in Figure 6 (only the branch for cardinality of 3 is shown; the full AND-OR tree would have branches for other cardinalities as well). In this example, long battery life and good sound quality would be important, perhaps essential, requirements for a product’s success. Considerations of display, case, accessories, and tuner appear to be less important.


Thus, the AND-OR tree in Figure 7 defines the relationship of “long battery life” and “good sound quality” as the most important requirements within the set of frequent itemsets of cardinality 3. Additionally, the children of “good sound quality” determine less frequent requirements of importance to the innovation process. The children are attached in the tree under the least frequent AND node. These items represent requirements that would be desirable, but are not as important to the requirements assessment as the AND nodes. Thus, in the case of the tree in Figure 7, Long Battery Life and Good Sound Quality are requirements that carry a great deal of importance to the innovation generation, while requirements such as Clear Display and Tough Case represent requirements that will assist in increasing the success of an innovation as long as the AND requirements are met. As one can see, the AND-OR tree visualizes the frequent itemsets, in order of importance in the database, and makes it easy to determine the most important factors when considering an incremental innovation.
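The relationship determination for the itemsets of Figure 6 can be reproduced with a short sketch. The dictionary layout of the tree, the function names, and the tie-breaking by order of first appearance are assumptions for illustration. The sketch also reads the construction rule as skipping any cardinality that has only one itemset, which is one interpretation of the description above.

```python
from collections import Counter


def cardinality_branch(itemsets, rho):
    """Split the attributes of same-size itemsets into AND and OR nodes.

    An attribute is an AND node if it appears in at least rho itemsets.
    The rest are OR nodes, attached under the last (least frequent) AND node.
    """
    counts = Counter(attr for itemset in itemsets for attr in itemset)
    first_seen = {}
    for itemset in itemsets:
        for attr in itemset:
            first_seen.setdefault(attr, len(first_seen))
    # Most frequent first; ties keep the order of first appearance.
    ranked = sorted(counts, key=lambda a: (-counts[a], first_seen[a]))
    return {
        "cardinality": len(itemsets[0]),
        "and": [a for a in ranked if counts[a] >= rho],
        "or": [a for a in ranked if counts[a] < rho],
    }


def build_and_or_tree(frequent_itemsets, rho):
    """Root node 'requirements' with one branch per cardinality, largest first."""
    by_size = {}
    for itemset in frequent_itemsets:
        by_size.setdefault(len(itemset), []).append(list(itemset))
    root = {"name": "requirements", "branches": []}
    for size in sorted(by_size, reverse=True):
        group = by_size[size]
        if size > 1 and len(group) > 1:  # assumed reading of the stopping rule
            root["branches"].append(cardinality_branch(group, rho))
    return root


FIGURE_6 = [
    ["Long Battery Life", "Good Sound Quality", "Clear Display"],
    ["Tough Case", "Long Battery Life", "Good Accessories"],
    ["Good Sound Quality", "FM Tuner", "Long Battery Life"],
]
tree = build_and_or_tree(FIGURE_6, rho=2)
branch = tree["branches"][0]
print(branch["and"])
print(branch["or"])
```

With ρ = 2, "Long Battery Life" (three itemsets) and "Good Sound Quality" (two itemsets) become the AND nodes, and the four remaining attributes become OR children, in agreement with Figure 7.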

The use of the AND-OR tree offers more than just visualization. With the requirements for innovation formed in such a structure, evolutionary computation algorithms could be utilized to compare and evolve current innovation ideas into those that would be more market-acceptable.

 

5. Conclusion

Incremental innovation is an evolutionary process. The requirements that an idea must meet to be deemed innovative through market acceptance change over time. To keep up with these changes and to aid in the market acceptance of an idea, the mining of current requirements for innovation is crucial. Knowing the current requirements for innovation in a given domain helps both in the ideation process and in reducing the risk that an idea will fail.

This article has presented a methodology for mining the frequent requirements of innovation in an automated fashion. Once the requirements have been gathered, the process of dissection and determination of the frequent requirements is achieved by data mining algorithms. The output of the data mining process, the AND-OR tree, can be used to evaluate and evolve current innovations. Through the continued use of the methodology presented in this article, the changes in requirements may be monitored and a company utilizing the methodology may be better positioned for market success.

 

References

[1] Ulwick, A.W. “Turn Customer Input Into Innovation.” Harvard Business Review, 80 (1), 2002, pp. 91-97.

[2] von Hippel, E. and Katz, R. “Shifting Innovation to Users via Toolkits.” Management Science, 48 (7), 2002, pp. 821-833.

[3] Blazevic, V., and Lievens, A. “Managing Innovation Through Customer Coproduced Knowledge in Electronic Services: An Exploratory Study.” Journal of the Academy of Marketing Science, 36, 2008, pp. 138-151.

[4] Cyveillance Press Resource Center Web site: http://web.archive.org/web/20060110054200/http://www.cyveillance.com/web/newsroom/stats.htm

[5] Segaran, T. Programming Collective Intelligence, O’Reilly, Sebastopol, CA, 2007.

[6] The Netflix Challenge Web site: http://www.netflixprize.com

[7] Zhai, Y., and Liu, B. “Web Data Extraction Based on Partial Tree Alignment.” Proceedings of the 2005 International World Wide Web Conference, May 10-14, 2005. Chiba, Japan, pp. 76-85.

[8] Shih, M., Liu, D., and Hsu, M. “Mining Changes in Patent Trends for Competitive Intelligence.” PAKDD 2008, LNAI 5012, pp. 999-1005, Springer 2008.

[9] Wen, G. Rotating Dynamics for Computational Creativity. National Defense Industry Press, Beijing 2005.

[10] Wen, G., Jiang, L., Wen, J., and Shadbolt, N. “Generating Creative Ideas Through Patents.” PRICAI 2006, LNAI 4009, pp. 681-690, Springer 2006.

[11] Google Patent Search Web site: http://www.google.com/patents

[12] WordNet API Web site: http://wordnet.princeton.edu/links

[13] Hu, M. and Liu, B. “Mining Opinion Features in Customer Reviews.” Proceedings of the 19th National Conference on Artificial Intelligence, San Jose, CA, 2004.

[14] Bruynooghe, M. “A Framework for the Abstract Interpretation of Logic Programs.” Technical Report CW62, Department of Computer Science, Katholieke Universiteit Leuven, October 1987.

[15] Muthukumar, K. and Hermenegildo, M. “Compile-Time Derivation of Variable Dependency Using Abstract Interpretation.” Journal of Logic Programming, pp. 315-347, July 1992.

[16] Luby, M., Mitzenmacher, M., and Shokrollahi, M. “Analysis of Random Processes via And-Or Tree Evaluation,” Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. San Francisco, CA, 1998.

[17] Han, J. and Kamber, M. Data Mining: Concepts and Techniques 2nd Ed., Morgan Kaufmann 2006.

[18] Tan, P., Steinbach, M., and Kumar, V. Introduction to Data Mining, Addison Wesley 2006.

Copyright © 1996-2012 ASME International. All Rights Reserved.