Chapter 9 App Store Analytics

9.1 Motivation

In 2008, the first app stores became available [7][148]. These marketplaces have grown rapidly since then, with over 3.8 million apps in the Google Play store in the first quarter of 2018 [175]. These app stores, together with the large amount of user-generated data associated with them, present an unprecedented source of information regarding the ecosystems of modern software platforms such as Android and iOS. Software developers and researchers use these valuable data to gain new insights into how users use their apps, what they like, and what difficulties they encounter. The same data can also be used by app store operators to detect malicious or misbehaving apps, i.e., apps whose behavior does not match their app store description. Since app stores are relatively new, the research field of App Store Analytics is still not mature. However, because apps are so widely used nowadays, their investigation plays a vital role in the field of Software Engineering.

In 2017, Martin et al. published a survey of app store analysis for software engineering that covers the literature up to 2015 [123]. The paper divides the field of App Store Analytics into seven categories: API Analysis, Review Analytics, Security, Store Ecosystem, Size and Effort Prediction, Release Engineering, and Feature Analysis. Given that the relevant literature on App Store Analytics has already been gathered and analyzed, it makes sense for us to use this chapter to go a step further and delve deeper into the subfield of Review Analytics. We start with the literature collected by Martin et al. [123] and extend it with the relevant articles published after 2015. We then try to answer the following research questions:

  • RQ1 What is the state of the art in Review Analytics? Specifically:
    • Which topics are being explored?
    • Which research methods, tools, and datasets are being used?
    • What are the main research findings (aggregated)?
  • RQ2 What is the state of practice in Review Analytics? Specifically:
    • Which tools exist, and which companies create or employ them?
    • Are there any case studies, and what are their findings?
  • RQ3 What are the challenges that the field is facing or will face?

9.2 Research Protocol

In this section, we explain the process that we followed to systematically extract the appropriate facts from the articles for our literature survey. The search strategy section includes the queries that were used, as well as the criteria that were taken into account for the initial filtering of the articles. Article selection explains how the relevance of the papers was assessed and how the final filtering was done. The last section describes the data extracted from each article.

9.2.1 Search Strategy

As stated in the motivation, the survey by Martin et al. [123] gave us a starting point. Instead of gathering everything related to App Store Analytics, we decided to focus on extending the work done by the authors and answering the research questions specifically related to the subcategory of Review Analytics. We retrieved and inspected all the papers of the Review Analytics category mentioned in the aforementioned survey and identified specific keywords and topics for our survey.

After doing this, the following search queries were generated:

“app store analytics” 
“app store analytics” AND “mining” 
“app store analytics” AND “user reviews” 
“app store analytics” AND “reviews”, “app reviews”

Google Scholar, ACM Digital Library, and IEEE Xplore were used for the searching process with the queries mentioned above. In addition to the results obtained by searching with these queries, the first page of the “related articles” link for the top-cited articles was also inspected. In order to build a database containing only the relevant articles, the following inclusion criteria were applied.

Criteria                   Value
Time frame                 2015-present
Journals and conferences   TSE, EMSE, JSS, IST, ICSE, FSE, MSR, OSDI, MobileSoft
Keywords in title          user reviews OR app store reviews
Keywords in abstract       user reviews OR app store reviews

After the selection of the relevant papers, we gathered the metadata associated with them and entered it into a database. This database was the input for the next step of the survey: article selection.

9.2.2 Article Selection

Taking into account that the papers considered in this survey were published from 2015 onward, it is no surprise that most of them are not yet highly cited. As a consequence, the selection of the filtered articles does not take the number of citations into consideration. Instead, each member of the group was in charge of delving into one-third of the papers in our database and determining, for each study, its relevance with respect to Review Analytics and the proposed research questions. That score was then peer-reviewed by the other two authors of our survey to reach a consensus.

For the relevance score, we considered how extensively the paper dealt with the analysis of user reviews. Next, we present three examples: 1) a highly relevant paper (score=10), 2) a somewhat relevant paper (score=5), and 3) a non-relevant paper (score=0).

What would users change in my app? Summarizing app reviews for recommending software changes [59] - Relevance score: 10 - Remarks: the authors applied classification and summarization techniques to app reviews to reduce the effort required to analyze user feedback. As can be seen, the paper focused on using the reviews to improve the development process.

Fault in your stars: an analysis of Android app reviews [8] - Relevance score: 5 - Remarks: the authors analyzed the problem of a potential mismatch between the reviews and the star ratings that apps receive. Although the paper is related to app reviews, they are not its primary focus.

Why are Android apps removed from Google Play? A large-scale empirical study [190] - Relevance score: 0 - Remarks: in this case, the paper did not deal with app reviews, even though the title suggests that it does.

In the end, only the articles that had a score of 5 or more were used for the fact extraction and the subsequent investigation of the research questions.

9.2.3 Fact Extraction

As mentioned before, the articles were listed in a database in a structured fashion. The extracted data has the following fields:

id (for indexing)
title
year
relevance score
relevance description
category
author information
source (journal or conference)
complete reference

Additionally, for each of the articles, a systematic reading was carried out, during which bullet points covering the following items were generated:

Paper type
Research questions of the paper
Contributions
Datasets: size and sampling methodologies
Techniques used for the analysis

9.3 Answers

9.3.1 RQ1 What is the state of the art in Review Analytics? Specifically:

  • Which topics are being explored?
  • Which research methods, tools, and datasets are being used?
  • What are the main research findings (aggregated)?

To answer the questions at hand, we looked at the novel ideas and the research that has been done in the field of Review Analytics. In their survey, Martin et al. [123] proposed “Classification”, “Content”, “Requirements Engineering”, “Sentiment”, “Summarization” and “Surveys and Methodological Aspects of App Store Analysis” to categorize the existing literature. After analyzing the subsequent work, we suggest new categories that reflect the state of the art in Review Analytics. These are: “Review Manipulation”, “Requirements Engineering”, “Mapping user reviews to the source code”, “Privacy / App Permissions”, “Responding to reviews”, “Comparing Apps and App Stores” and “Wearables”. In the following sections, we describe each one of these categories using the new papers that we have found.

Review Manipulation

Recently, significant attention has been paid to how reviews and ratings can be used to influence the number of downloads of a particular app in an app store. The paper by Li et al. [114] analyzed the use of crowdsourcing platforms such as Microworkers to manipulate ratings. The authors merged data from two different sources, an app store and a crowdsourcing site, to identify manipulated reviews.

Chen et al. [42] proposed an approach to identify attackers behind collusive promotion groups in an app store. They use ranking changes and pairwise similarity measurements to form targeted app clusters (TACs) that they later use to pinpoint attackers. A different approach to the same problem was proposed in [194]. In that paper, Xie et al. identified manipulated app ratings based on so-called attack signatures and presented a graph-based algorithm that achieves this in linear time.
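
To make the intuition behind such similarity-based detection concrete, the following minimal sketch (with hypothetical daily rank histories and an arbitrary threshold; it is not the actual TAC algorithm of Chen et al.) flags app pairs whose chart positions move in suspicious lockstep:

```python
from itertools import combinations

import numpy as np


def rank_change_similarity(ranks_a, ranks_b):
    """Pearson correlation between two apps' day-over-day rank changes."""
    delta_a, delta_b = np.diff(ranks_a), np.diff(ranks_b)
    if delta_a.std() == 0 or delta_b.std() == 0:
        return 0.0
    return float(np.corrcoef(delta_a, delta_b)[0, 1])


def suspicious_pairs(rank_histories, threshold=0.9):
    """Flag app pairs whose ranking trajectories move in lockstep.

    Apps promoted by the same collusive group tend to rise and fall
    together, so unusually high pairwise similarity is a signal worth
    inspecting further (the threshold here is a hypothetical value).
    """
    pairs = []
    for (a, ranks_a), (b, ranks_b) in combinations(rank_histories.items(), 2):
        sim = rank_change_similarity(np.asarray(ranks_a), np.asarray(ranks_b))
        if sim >= threshold:
            pairs.append((a, b, sim))
    return pairs


# Hypothetical daily chart positions (lower is better).
histories = {
    "app_x": [80, 42, 15, 9, 30, 75],   # sharp synchronized rise and fall
    "app_y": [95, 50, 20, 12, 35, 90],  # moves in lockstep with app_x
    "app_z": [33, 31, 34, 32, 30, 33],  # organic, stable ranking
}
print(suspicious_pairs(histories))      # flags only (app_x, app_y)
```

A real detector would, of course, combine many more signals (review text, reviewer accounts, timing), as the papers above do.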

These papers also show that the percentage of apps that manipulate their reviews in the app stores is small: less than 1% of the apps were found to be suspicious [194]. Regarding the datasets used in this category, the work by Li et al. [114] used a smaller amount of app store data, but merged it with data from an external crowdsourcing site. In the other two papers, the authors considered a small number of apps (compared to the total number of apps in the main marketplaces), but the number of analyzed reviews was significant. That makes sense, considering that the main purpose of both studies was to examine app reviews.

Requirements Engineering

Users write reviews to express their attitude (positive or negative) towards an app. The information value of an individual review is usually low, but mined as a whole, reviews can provide useful insights for the improvement and advancement of apps. The survey by Martin et al. [123] described summarization, classification, and requirements engineering as categories of the subfield of Review Analytics. These areas are converging: in recent work (after 2015), summarization and classification, among other techniques, are used to support the development, maintenance, and evolution of apps.

In 2016, Di Sorbo et al. introduced SURF [59], an approach for condensing a large number of reviews and extracting useful information from them. In a subsequent paper, Panichella et al. [146] presented techniques for classifying useful feedback in the context of app maintenance and evolution. In their work, the authors combined supervised machine learning methods with natural language processing (NLP), sentiment analysis (SA), and text analysis (TA) techniques. Unsupervised methods for review clustering were explored by Anchiêta and Moura [6], who introduced a technique for categorizing reviews in order to generate a summary of the bugs and features of apps. Taking into account the high dimensionality of the datasets used for review mining, Jha et al. [92] proposed a semantic approach to app review classification; by using semantic role labeling, the authors could make the classification process more efficient. Gao et al. presented IDEA [70], a framework for identifying emerging app issues based on online review analysis. IDEA uses information from different time slices (versions) to identify issues and uses the changelogs of the studied apps as ground truth for validating the approach. It is the only paper in the reviewed literature that presents an industrial case study (a deployment at Tencent) as evidence of the viability of the approach.
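
As a rough illustration of the supervised-classification step that several of these works build on (a generic TF-IDF plus logistic-regression baseline on invented labels, not the actual SURF or ARdoc pipeline), consider the following sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labeled reviews; the studies above use thousands.
reviews = [
    "App crashes every time I open the camera",
    "Please add a dark mode, my eyes hurt at night",
    "Love it, five stars",
    "Freezes after the latest update, unusable",
    "Would be great to export notes as PDF",
    "Best app ever, thanks",
]
labels = ["bug", "feature_request", "other",
          "bug", "feature_request", "other"]

# TF-IDF features feeding a linear classifier: a common baseline in
# review classification before NLP, sentiment, and text-analysis
# features are layered on top.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(reviews, labels)

print(clf.predict(["the app keeps crashing on startup",
                   "please support landscape mode"]))
```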

Most of the papers presented in this section analyze a small number of apps but a substantial number of reviews. Di Sorbo et al. [59] used 17 apps and 3,439 reviews. Anchiêta and Moura [6] gathered more than 50,000 reviews from 3 apps, but after a preprocessing step, the dataset was reduced to 924 records. Jha et al. [92] used a structured sampling procedure to mix reviews from different datasets; the final size came to 2,917 reviews. Finally, Gao et al. [70] used reviews from 6 apps (2 from the App Store and 4 from Google Play), and the final dataset contained 164,026 reviews, one of the largest numbers of reviews ever studied.

Mapping user reviews to source code

Another research direction that has become active in recent years is combining the results of review mining with the source code, in order to provide developers directly with valuable and actionable information for the improvement of their products.

Palomba et al. [144] presented a new approach called ChangeAdvisor. It extracts useful feedback from reviews and recommends changes to developers for specific code artifacts. In the paper, similarity metrics such as the Dice coefficient are used to map a specific subset of reviews to a particular section of the source code. Complementary work was presented in [143]. In this study, Palomba et al. investigated the extent to which app developers take app reviews into account. To achieve this, they introduced CRISTAL, which pairs informative reviews with source code changes and monitors the extent to which developers accommodate crowd requests from the reviews.
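
The following sketch shows how a Dice coefficient over token sets can link a review to the most textually similar code artifact; the review, the artifact names, and the identifier lists are hypothetical, and ChangeAdvisor's actual pipeline first clusters feedback before computing such distances:

```python
import re


def tokens(text):
    """Lowercased word tokens; camelCase identifiers are split first."""
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    return set(re.findall(r"[a-z]+", text.lower()))


def dice(a, b):
    """Dice coefficient between token sets: 2|A∩B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0


# Hypothetical review and candidate source code artifacts
# (class names plus the identifiers they contain).
review = "the photo upload fails and the progress bar never updates"
artifacts = {
    "PhotoUploadService": "uploadPhoto retryUpload progressListener",
    "LoginActivity": "validateCredentials showLoginError",
}

review_tokens = tokens(review)
for name, identifiers in artifacts.items():
    score = dice(review_tokens, tokens(name + " " + identifiers))
    print(f"{name}: {score:.2f}")  # PhotoUploadService scores highest
```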

Linting mechanisms and their effectiveness have also been studied. Wei et al. [192] proposed OASIS, a method for prioritizing Android Lint warnings. It uses NLP techniques (tokenization, stop-word removal, word stemming, and TF-IDF distance) to find links between user reviews and Lint warnings. According to the paper, this is relevant because one of the main problems of linters is the large number of false alarms they raise.
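
A much-simplified sketch of this kind of review-to-warning linking (hypothetical reviews and warning messages, cosine similarity in a shared TF-IDF space, and no stemming, unlike OASIS itself) could look as follows:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical user reviews and Android Lint warning messages.
reviews = [
    "battery drains really fast since the update",
    "app leaks my location in the background",
]
warnings = [
    "Wakelock acquired but never released; may drain battery",
    "Unused resources increase APK size",
    "Location listener not unregistered in onPause",
]

# Build one TF-IDF space over both corpora; warnings textually close
# to real complaints would be prioritized over likely false alarms.
vec = TfidfVectorizer(stop_words="english")
matrix = vec.fit_transform(reviews + warnings)
sims = cosine_similarity(matrix[:len(reviews)], matrix[len(reviews):])

for i, review in enumerate(reviews):
    best = sims[i].argmax()
    print(f"{review!r} -> {warnings[best]!r} (sim={sims[i][best]:.2f})")
```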

Regarding the datasets used for validation here, the analyzed papers again use a limited number of apps (except for [143], which used 100) but, as stated in the previous section, numerous reviews (more than 20,000). Additionally, as the works in this section directly involve developers, the authors have attempted to complement their quantitative, review-based experiments with qualitative ones.

Privacy / app permissions

The permission models used by mobile software platforms have changed. Since Android Marshmallow, the Android operating system has used a run-time permission-based security system, and Apple’s iOS likewise uses run-time permissions on top of a set of permissions enabled by default. Scoccia et al. conducted a large-scale empirical study of the challenges of this new model by inspecting 4.3 million user reviews from 5,572 Google Play store apps [168]. Using different techniques, they extracted 3,574 user reviews that relate to system permissions. They found that users appreciate minimal permission use, and most apps only ask for the permissions they strictly need. Negative user concerns included apps asking for too many permissions or asking for permissions at inopportune moments.
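
The paper combines several extraction techniques; as a rough illustration only, a naive recall-oriented first pass with a hypothetical keyword list (followed in practice by manual or machine-learning refinement) might look like this:

```python
import re

# Hypothetical cue words/phrases; a keyword pass like this over-matches
# on purpose and is typically refined manually or with a classifier.
PERMISSION_PATTERN = re.compile(
    r"\b(permissions?|camera access|location access|contacts|microphone)\b",
    re.IGNORECASE,
)

reviews = [
    "Why does a flashlight app need access to my contacts?",
    "Great game, very addictive",
    "Asked for camera access before I even took a photo, creepy",
]

permission_related = [r for r in reviews if PERMISSION_PATTERN.search(r)]
print(permission_related)  # keeps the first and third review
```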

Responding to reviews

Developers have been able to respond to reviews in the Google Play store since 2013, and Apple introduced the same feature in 2017. In earlier work, McIlroy [127] studied whether responding to user reviews has a positive effect on the ratings users give. Building on that work, McIlroy et al. studied how likely users are to change their rating of an app when a developer responds to their review [128]. They found that users change their rating with a probability of 38.7 percent when a developer responds to their review, with a median increase of 20 percent in the rating.

Hassan et al. [83] used 2,328 top free apps from the Google Play store to study whether users are more likely to update their reviews if a developer responds to them. They extracted 126,686 dialogues between developers and users and concluded that responding to a review increases the chances of the user updating their rating for an app by up to six times compared to not responding. They also studied the characteristics, likelihood, and incentives of user-developer dialogues in app stores.

Comparing Apps and App Stores

This section discusses papers that compare apps or app stores. Li et al. mined user reviews from Google Play to find comparisons between apps [115]. They set out to identify comparative reviews in order to extract differences between apps on various topics; for example, a user may state in a review that an app is not as good as another app with regard to power consumption. Li et al. created a method that extracts these opinions with sufficient accuracy and provides comparisons between apps.
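
Li et al.'s method is considerably more sophisticated, but a minimal cue-phrase sketch (hypothetical patterns and reviews) illustrates how comparative opinions can be surfaced from review text:

```python
import re

# Hypothetical comparative cue phrases; a real extractor would also
# resolve which app is being compared and on which aspect.
COMPARATIVE_CUES = re.compile(
    r"(better than|worse than|not as good as|compared to|unlike)\s+(\w[\w ]*)",
    re.IGNORECASE,
)

reviews = [
    "Battery life is not as good as AppFoo",
    "Unlike AppBar, the dark mode here actually works",
    "Nice interface and fast sync",
]

for review in reviews:
    match = COMPARATIVE_CUES.search(review)
    if match:
        cue, rival = match.groups()
        print(f"cue={cue!r}, compared entity={rival.strip()!r}")
```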

Ali et al. did a comparative study on cross-platform apps. They took a sample of 80,000 app-pairs to quantitatively compare the same apps across different platforms and identify the differences among software platforms [5]. In a related study, Hu et al. compared app-pairs created using hybrid development tools such as PhoneGap [88]. With this approach, they found that in 33 of the 68 app-pairs the star rating was not consistent across platforms.

Wearables

User reviews can also be used to understand user complaints. Mujahid et al. did this in a case study on Android Wear [139]. More specifically, they sampled 589 reviews from 6 Android wearable apps and found that 15 unique complaint types could be extracted from the sample. The sample was created by mining the reviews of 4,722 wearable apps and selecting the apps that had more than 100 reviews with a rating of one or two stars. After manually assigning categories to the reviews in the sample, they concluded that the most frequent complaints in this relatively small sample were related to cost, functional errors, and a lack of functionality.

9.3.2 RQ2 What is the state of practice in Review Analytics? Specifically:

  • Which tools exist, and which companies create or employ them?
  • Are there any case studies, and what are their findings?

Although a large amount of work has been published in the field of app store analytics (the survey by Martin et al. covered over 250 papers, and our study additionally analyzed 80 papers), only a few of these works are case studies. After applying the article selection criteria, we were left with 30 papers, of which only two conducted a case study. The first, by Mujahid et al. [139], examines issues that bother the users of Android Wear. The second, by Gao et al. [70], reports on the performance of their review analytics tool as deployed at Tencent. This reveals a large gap between the number of solutions proposed by researchers and the number of studies of solutions that have been deployed in practice.

9.3.2.1 State of the app stores

We were not able to find any academic publication on the solutions used by the actual app stores (Google Play and the App Store), but we have found that both app stores automatically detect fake reviews. In 2016, Google deployed its review analytics solution, called Review Highlights, into production for both developers and customers, but at the time of writing this survey (October 2018), Review Highlights no longer appear in the public-facing part of the Google Play Store and are only accessible in the developer console.

9.3.2.2 State of the third-party tools

Many third-party services focus on app store analysis. They usually focus on showing aggregated statistics from the app stores, and some of them also analyze user reviews. Tools such as AppBot and TheTool perform sentiment analysis on the reviews and show it as an additional metric next to the star rating. AppBot also categorizes reviews based on the keywords they contain. This may seem like a low-tech approach, but it makes it easy for users to reason about why a certain review is assigned to a specific category. We were not able to find any third-party tools that use any of the advanced machine learning algorithms from the papers we analyzed. This finding, combined with Google hiding its Review Highlights, may indicate that most of the algorithms proposed by researchers do not generalize well to real-world applications and are hard for regular users to comprehend.

To this day, there are, to the best of our knowledge, no tools that compare apps or app stores using user reviews. Some tools analyze apps and app performance, but none perform such comparisons.

9.3.3 RQ3 What are the challenges that the field is facing or will face?

In this section, we present challenges and future research directions for the field of Review Analytics and the subcategories that were identified in previous sections.

A significant aspect of the research process is the validation of the proposed tools and frameworks and the assessment of how generalizable they are. To accomplish this, researchers need large datasets of reviews and more representative and sound samples. Machine learning has seen a steady increase in popularity; however, it has not yet been used extensively in the field of Review Analytics. Moreover, most of the studies rely on correlations to validate the effectiveness of their approaches; there is a need to apply causal-inference techniques so that confounding factors can be ruled out.

There is also a difficulty in transforming research into actual tools that are used in real-world settings. Of all the works considered here, only Gao et al. presented an industrial case study [70], and most of the tools from the other papers are either unavailable or not actively used.

Review Manipulation: It is important to combine multiple sources of data to identify suspicious apps: not just app stores alone, but also crowdsourcing sites and even social networks. Also, samples should be carefully selected, given that the number of suspicious apps is small (around 1% of all apps) relative to the size of the app stores.

Requirements Engineering: More case studies are needed to validate how useful the extracted requirements are in the context of software development. Also, as machine learning is starting to be heavily used, models tailored to review data are expected to emerge. The large amount of noise present in reviews remains a challenge that requires careful study of the preprocessing steps used while assembling the datasets.

Mapping user reviews to source code: Developers need better support for improving their programs by combining reviews with source code datasets. A likely future trend is the analysis of update-level changes; for this, update-categorized reviews are needed, which remains a challenge for current review-retrieval approaches.

Privacy/app permissions: Quantifying and understanding the effects of run-time permission requests on the user experience is still an open research area that can help developers increase the quality of their apps. One direction for further research is the idea of granting permissions to specific app functionalities instead of to the app as a whole. That could help users better understand why an app needs a certain permission and could reduce the number of permissions an app needs. Another possible future direction is researching and creating tools that help developers place permission requests correctly, avoiding bugs related to permissions and permission requests.

Responding to reviews: There is a need for an in-depth study of how developers and users are using the review mechanism, to find out how it can be improved. Currently, developers can respond to at most 500 user reviews per day using the Google Play API. It could be worth investigating whether this limit of 500 responses is sufficient to keep useless responses, such as thanking every user, out of the store.

Comparing Apps and App Stores: One of the open challenges is including indirect relationships in the comparisons, as only direct relationships were used in the work by Li et al. [115]. In addition, it has been noted that apps developed using hybrid development tools have relatively low ratings; future studies should investigate the quality of hybrid apps and compare it with the quality of native apps. Furthermore, to obtain more complete results, research approaches should move towards market-scale analysis using a large number of apps and reviews.

References

[7] AppleInsider 2008. Apple’s App Store launches with more than 500 apps. http://appleinsider.com/articles/08/07/10/apples_app_store_launches_with_more_than_500_apps.

[148] Perenson, M. 2008. Google launches Android Market. https://www.pcworld.com/article/152613/google_android_ships.html.

[175] Statista 2018. Number of apps available in leading app stores as of 1st quarter 2018. https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores.

[123] Martin, W. et al. 2017. A survey of app store analysis for software engineering. IEEE Transactions on Software Engineering. 43, 9 (2017), 817–847.

[59] Di Sorbo, A. et al. 2016. What would users change in my app? Summarizing app reviews for recommending software changes. Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (2016), 499–510.

[8] Aralikatte, R. et al. 2018. Fault in your stars: An analysis of Android app reviews. Proceedings of the ACM India Joint International Conference on Data Science and Management of Data (2018), 57–66.

[190] Wang, H. et al. 2018. Why are Android apps removed from Google Play? A large-scale empirical study. Proceedings of the 15th International Conference on Mining Software Repositories (2018), 231–242.

[114] Li, S. et al. 2017. Crowdsourced app review manipulation. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (2017), 1137–1140.

[42] Chen, H. et al. 2017. Toward detecting collusive ranking manipulation attackers in mobile app markets. Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security (2017), 58–70.

[194] Xie, Z. et al. 2016. You can promote, but you can’t hide: Large-scale abused app detection in mobile app stores. Proceedings of the 32nd Annual Conference on Computer Security Applications (2016), 374–385.

[146] Panichella, S. et al. 2016. ARdoc: App reviews development oriented classifier. Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (2016), 1023–1027.

[6] Anchiêta, R.T. and Moura, R.S. 2017. Exploring unsupervised learning towards extractive summarization of user reviews. Proceedings of the 23rd Brazilian Symposium on Multimedia and the Web (2017), 217–220.

[92] Jha, N. and Mahmoud, A. 2017. Mining user requirements from application store reviews using frame semantics. International Working Conference on Requirements Engineering: Foundation for Software Quality (2017), 273–287.

[70] Gao, C. et al. 2018. Online app review analysis for identifying emerging issues. 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE) (2018), 48–58.

[144] Palomba, F. et al. 2017. Recommending and localizing change requests for mobile apps based on user reviews. Proceedings of the 39th International Conference on Software Engineering (2017), 106–117.

[143] Palomba, F. et al. 2018. Crowdsourcing user reviews to support the evolution of mobile apps. Journal of Systems and Software. 137, (2018), 143–162.

[192] Wei, L. et al. 2017. OASIS: Prioritizing static analysis warnings for Android apps based on app user reviews. Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (2017), 672–682.

[168] Scoccia, G.L. et al. 2018. An investigation into Android run-time permissions from the end users’ perspective. (2018).

[127] McIlroy, S. 2014. Empirical studies of the distribution and feedback mechanisms of mobile app stores.

[128] McIlroy, S. et al. 2017. Is it worth responding to reviews? Studying the top free apps in Google Play. IEEE Software. 34, 3 (2017), 64–71.

[83] Hassan, S. et al. 2018. Studying the dialogue between users and developers of free apps in the Google Play store. Empirical Software Engineering. 23, 3 (2018), 1275–1312.

[115] Li, Y. et al. 2017. Mining user reviews for mobile app comparisons. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies. 1, 3 (2017), 75.

[5] Ali, M. et al. 2017. Same app, different app stores: A comparative study. Proceedings of the 4th International Conference on Mobile Software Engineering and Systems (2017), 79–90.

[88] Hu, H. et al. 2018. Studying the consistency of star ratings and reviews of popular free hybrid Android and iOS apps. Empirical Software Engineering. (2018), 1–26.

[139] Mujahid, S. et al. 2017. Examining user complaints of wearable apps: A case study on Android Wear. 2017 IEEE/ACM 4th International Conference on Mobile Software Engineering and Systems (MOBILESoft) (2017), 96–99.