Consider an online ad campaign run by an advertiser. The ad serving companies that handle such campaigns record users’ behavior that leads to impressions of campaign ads, as well as users’ responses to such impressions.
This is summarized and reported to the advertisers to help them evaluate the performance of their campaigns and make better budget allocation decisions. The most popular reporting statistics are the click-through rate and the conversion rate. While these are indicative of the effectiveness of an ad campaign, the advertisers often seek to understand more sophisticated long-term effects of their ads on the brand awareness and the user behavior that leads to the conversion, thus creating a need for the reporting measures that can capture both the duration and the frequency of the pathways to user conversions.
In this paper, we propose an alternative data mining framework for analyzing user-level advertising data. In the aggregation step, we compress individual user histories into a graph structure, called the adgraph, representing local correlations between ad events. For the reporting step, we introduce several scoring rules, called the adfactors (AF), that can capture the global role of ads and ad paths in the adgraph, in particular, the structural correlation between an ad impression and the user conversion.
We present scalable local algorithms for computing the adfactors; all algorithms were implemented using the MapReduce programming model and the Pregel framework. Using an anonymous user-level dataset of sponsored search campaigns for eight different advertisers, we evaluate our framework with different adgraphs and adfactors in terms of their statistical fit to the data, and show its value for mining the long-term behavioral patterns in the advertising data.

Keywords: sponsored search, ad auctions, online advertising, PageRank, user behavior models, clickthrough rate, conversion rate.
The Internet has become a major advertising medium. Although a number of different factors contributed to this, what distinguishes the Internet advertising from the offline advertising competitors is its inherently interactive nature. Measuring effectiveness of a particular advertising campaign and allocating the advertising budget optimally was and still remains a very challenging task, yet the Internet made the task easier by connecting ad impressions 1 to tangible user actions and artifacts such as posing a search query, clicking on an ad or converting 2.
The simplicity of measuring and attributing user clicks has established the clickthrough rate (CTR) 3 as the current de-facto standard of ad quality for sponsored search. It is now customary to define the advertiser’s optimization problem as maximization of the expected number of ad clicks given a certain budget constraint. The conversion rate (CR), defined similarly as the probability of the user conversion, is another popular ad effectiveness measure; together with the CTR it is frequently used by the advertisers to measure the return on investment of specific keywords in the advertising campaign.
Recent empirical studies show that the effects of online ads cannot be fully captured by the CTR or the CR. In particular, the sponsored search advertising, as well as the display advertising, can have a significant number of indirect effects such as building the brand awareness. For instance, Lewis and Reiley, in cooperation between Yahoo! and a major retailer, performed a randomized controlled experiment to measure the effect of the online advertising on sales. They found that the online advertising campaign had substantial impact not only on the users who clicked on the ads but also on those who merely viewed them.
The advertisers seek to understand the impact of their ad not just on the immediate click or conversion, but also on the likelihood of the eventual conversion in the long term and on other long-term effects. Users take specific trajectories in terms of the search queries they pose and the websites they browse, and this affects the sequence of ads they see; conversely, the sequence of ads they see affects their search and browsing behavior.
This interdependence results in structural patterns in users’ behavior; advertisers need new tools and concepts beyond simple aggregates (like the CTR and the CR) to understand them. What are the systematic ways to help the advertisers reason about structural correlations in the data? In this paper, we take an advertiser-centric data mining approach. We start with data that is directly pertinent to the advertiser’s campaign, that is, user trajectories that involved ads from the campaign of that advertiser, including ad impressions, clicks and conversions.
Such data can usually be reported to the advertiser, provided it is aggregated and anonymized appropriately. Next, we build a data mining framework that can help advertisers identify structural patterns in this data. Our contributions are as follows. We propose a graphical-model-based approach. We construct graphs from the data, called adgraphs, that capture co-occurrences of events adjacent to each other in users’ trajectories.
Then, we introduce a variety of adfactors, where every adfactor is a scoring rule for nodes in the adgraph: these are designed to capture the impact of ad nodes on eventual conversion. For example, we introduce adfactors based on random walks that, for every event, calculate the long-term probability that a certain random walk involving that event would eventually lead to conversion. Our paper presents highly efficient algorithms for constructing adgraphs and computing all adfactors we introduce. All algorithms were implemented using the MapReduce parallel programming model. Using data from the sponsored search campaigns of eight different advertisers, we study various adgraphs and adfactors. We validate the adgraph models by showing their statistical fit to the data.
Also, we show interesting empirical properties of the adfactors that provide insights into user behavior with respect to brand vs. non-brand ads. Moreover, we show various natural data mining queries on adgraphs and adfactors that may be of independent interest to advertisers. Finally, using adfactors, we show how to efficiently prune the adgraph to localize and depict the influence of any particular ad by its small neighborhood in the adgraph.
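To make the idea of a random-walk-based score concrete, the following is a minimal sketch, not the adfactor definitions used in this paper. It assumes a first-order random walk on pooled edge counts, with a hypothetical absorbing node named `CONVERSION` and a hypothetical terminal node named `NULL` for trajectories that end without converting. For each node, it estimates the probability that the walk eventually reaches the conversion node. The function name, node labels and counts are all illustrative.

```python
from collections import defaultdict


def conversion_probability(edge_counts, conversion="CONVERSION", iters=100):
    """Estimate, for each node, the probability that a first-order random
    walk started there eventually reaches the `conversion` node.

    edge_counts maps (src, dst) pairs to pooled transition counts.
    Nodes without outgoing edges are treated as absorbing with score 0.
    """
    # Group outgoing edges by source node and collect the node set.
    out_edges = defaultdict(list)
    nodes = {conversion}
    for (src, dst), count in edge_counts.items():
        out_edges[src].append((dst, count))
        nodes.update((src, dst))

    prob = {node: 0.0 for node in nodes}
    prob[conversion] = 1.0

    # Fixed-point iteration: each score is the transition-weighted
    # average of the scores of the node's successors.
    for _ in range(iters):
        updated = {}
        for node in nodes:
            if node == conversion or node not in out_edges:
                updated[node] = prob[node]
                continue
            total = sum(count for _, count in out_edges[node])
            updated[node] = sum(
                count / total * prob[dst] for dst, count in out_edges[node]
            )
        prob = updated
    return prob


# Illustrative counts only; they are not taken from the paper's dataset.
example_counts = {
    ("q:shoes", "q:brand shoes"): 2,
    ("q:shoes", "NULL"): 2,
    ("q:brand shoes", "CONVERSION"): 3,
    ("q:brand shoes", "NULL"): 1,
}
scores = conversion_probability(example_counts)
```

In this toy graph the generic query only leads to conversion through the brand query, so its score is lower than that of the brand query even though it never converts directly.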
Our approach works by transforming the dataset of users’ trajectories into graphs. This is achieved by pooling data from different users. As a result, adgraphs lose information about specific users or their trajectories, and only encode aggregate information. Also, because adgraphs pool data based only on adjacent events, they encode a certain independence assumption about user behavior over paths of multiple edges. We carefully study the statistical fit of adgraph models to the data to ensure that this assumption is reasonable.
Pooling this way also results in significant compression. For large advertisers in sponsored search, the number of different queries the advertiser can be matched with is often in the tens of thousands, and the number of viewing users can be in the millions. In contrast, adgraphs have more manageable sizes. Our approach generalizes to different settings. While in this paper we apply it to study user conversions in the sponsored search data, it can equally be used to study trajectories in sponsored search that lead to clicks only, or conversions when both sponsored search and content-based display ad data are available, etc.
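The pooling of adjacent events described above can be sketched as follows. This is a minimal illustration under assumed conventions, not the construction used in the paper: each trajectory is a list of event labels, the label format (`imp:`, `click:`, `CONVERSION`) is hypothetical, and an edge weight is simply the number of times one event immediately follows another across all users.

```python
from collections import Counter


def build_adgraph(trajectories):
    """Pool adjacent event pairs across all users into weighted edges.

    Returns a Counter mapping (src, dst) to the number of times `dst`
    immediately followed `src` in any trajectory. User identity is
    discarded, so only aggregate information survives.
    """
    edges = Counter()
    for events in trajectories:
        # zip pairs every event with its immediate successor.
        for src, dst in zip(events, events[1:]):
            edges[(src, dst)] += 1
    return edges


# Illustrative trajectories only; labels and contents are assumptions.
trajectories = [
    ["imp:shoes", "click:shoes", "CONVERSION"],
    ["imp:shoes", "imp:brand shoes", "click:brand shoes", "CONVERSION"],
    ["imp:shoes", "click:shoes", "CONVERSION"],
]
adgraph = build_adgraph(trajectories)
```

Here seven observed transitions collapse into five weighted edges. Because only adjacent pairs are recorded, the graph cannot tell which user took which multi-edge path, which is the independence assumption discussed above.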
There are empirical studies showing that the sponsored search advertising, as well as the display advertising, can have a significant number of the indirect effects such as building the brand awareness and lift in the cross-channel conversions. The prior research attempted to measure impact of an ad on the user conversion by the number of independent paths from the ad to the conversion event.
Our work here is a substantial generalization of the prior research, as we apply more advanced PageRank-based measures for analyzing pathways to the user conversion and, more importantly, introduce a generic adgraph/adfactor based framework for reasoning about the structural properties of the users’ behaviors. Mining patterns in the user behavior is not a novel idea. Market basket analysis using association rule mining is a popular tool used by retailers to discover actionable business intelligence from user level transaction data.
User behavior data has numerous applications, including, but not limited to, improving the web search ranking, fraud detection and personalization of the user search experience. In this paper, we mine the user behavior from the perspective of an advertiser running a sponsored search campaign. There are several standard mining techniques that can be applied to the user-level advertising data such as mining for the frequent itemsets or the frequent episodes in sequences of user actions.
However, these do not capture the structural correlations in the data such as the impact of multiple paths between events that our work captures. A different approach that attempts to capture structural correlations is mining the frequent substructures in graph data. Due to data pooling in our graph construction, patterns that we identify may or may not have high support, because some patterns may combine behavior of multiple users. Also, we are not interested simply in frequent patterns but in patterns that indicate a high likelihood of the designated action: user conversion.
Further, our approach of using the steady state probabilities of different random walks to capture structural correlations differs from the prior graph mining techniques. This approach is widely used in other areas including web search, personalized video recommendations and user signatures in communication graphs. Our research is closely related to the problem of modeling user search sessions, for which a wide number of solutions have been proposed in the literature, including advanced latent state models such as vl-HMM and Markov models that take into account transition time between user’s actions.
We intentionally refrained from using complex generative models for the underlying data for several reasons. First, generative models often require individual user sessions for training, while, due to privacy reasons, advertisers usually have access only to aggregate level data. Second, advertisers generally do not have access to information on user actions for which the advertiser’s ad was not shown (even at the aggregate level). Finally, the patterns of user behavior extracted from the data must be easy to communicate and represent, ideally in a graphical form. This is often not true for generative statistical models.
As the number of non-converted (yet) users is much larger than the number of converted users, we sampled from the pool of such users randomly with a sampling probability of 1%. Our sample covers 8 different anonymous advertisers, who were running a sponsored search campaign with Google. The dataset was collected at user level and includes information on a random sample of users who “converted” with the advertiser within a certain time period of several weeks.
The activity of every user was tracked by a cookie, thus any user who deleted the cookie was later identified as a different user. For every (anonymous) user, the data has the information on all actions of this user before the conversion, where we define an action as either a search query that was issued on Google and for which the advertiser’s ad was shown in the paid search results, or a click on the advertiser’s ad. In the rest of the paper, we will refer to the first action type as an “impression” event (as it resulted in the advertiser’s ad impression) and to the second action type as a “click” event.
We emphasize that search queries for which the advertiser’s ad was not shown, such as irrelevant search queries, queries on which the advertiser bid too low, or queries for which the advertiser was excluded from the auction due to a daily budget constraint, are not included in our dataset. While such data might be available in some form to the search engine, it is not reported to the advertisers, e.g., due to privacy issues, and therefore we intentionally refrain from including it as an input into our data mining process.
The same applies to the ad position and the competitors’ information in the sponsored search auction: advertisers do not observe their ad placement and competitors’ ads on an individual query basis. The information that we assume to be available for every event includes only the event timestamp, the search query issued by the user and the match type (exact or broad 4).
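The per-event information listed above could be represented with a small record type, as in the following sketch. The class and field names are hypothetical, the timestamp unit (seconds) is an assumption, and the event type field reflects the impression/click distinction made earlier rather than a separately reported attribute.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class EventType(Enum):
    IMPRESSION = "impression"  # query for which the advertiser's ad was shown
    CLICK = "click"            # click on the advertiser's ad


class MatchType(Enum):
    EXACT = "exact"
    BROAD = "broad"


@dataclass(frozen=True)
class AdEvent:
    timestamp: float        # assumed unit: seconds since an arbitrary origin
    query: str              # search query issued by the user
    match_type: MatchType
    event_type: EventType


def order_events(events: List[AdEvent]) -> List[AdEvent]:
    """Return one user's events in chronological order."""
    return sorted(events, key=lambda e: e.timestamp)


# Illustrative events only; the values are not taken from the dataset.
history = order_events([
    AdEvent(20.0, "brand shoes", MatchType.EXACT, EventType.CLICK),
    AdEvent(10.0, "shoes", MatchType.BROAD, EventType.IMPRESSION),
])
```

Ad position and competitor information are deliberately absent from the record, mirroring what the advertiser can observe.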
The number of data points for the converted users varied from approximately 30,000 impressions and 10,000 clicks for the smallest advertiser to about 5.8 million impressions and 2. million clicks for the largest advertiser. The number of different user queries in the data varied from approximately 500 for the smallest advertiser to about 27,000 for the largest advertiser. A special event in our dataset is a user conversion. User conversions were reported by the advertisers, therefore the exact definition of what constitutes the conversion event is advertiser-specific. In practice, it can vary from visiting a particular web site page or registering on the website to making an expensive purchase online.
We will use the term “conversion path” to represent an ordered sequence of events for a single user that ends in a conversion. Finally, our dataset also includes data on users who were exposed to at least one ad impression of the advertiser but have not (yet) converted.

4 Most search engines support broad match functionality, which allows for an imprecise match between a query issued by the user and the keyword the advertiser bids on.

5 Note that the high clickthrough rate implied by these numbers is due to the fact that we describe the sample of users who eventually converted with the advertiser.