Chapter 5 Ecosystem Analytics

5.1 Motivation

A popular form of software reuse is the use of distributable packages. Packages are modular libraries that bundle reusable and extensible functionality. For example, a software project that needs to parse JSON files can use a parsing package instead of developing and maintaining this functionality itself. Furthermore, developers can build new packages on top of existing packages: a developer building a general-purpose parser for the web can depend on and build upon the JSON parser package.

To make packages available in software projects, package managers provide an automated workflow to fetch remote packages, resolve compatibility constraints between them, and then integrate them into projects. Package managers can also be seen as “batteries included” for programming languages. For example, pip and Cargo, the official package managers for Python and Rust, provide their users with a vast collection of community-developed and tested packages. The use of package managers is also immensely popular; npm, the de facto package manager for JavaScript, recorded 18 billion package downloads in one month [174].

Additionally, there is an increasing trend towards creating open source projects. In the time span of one week, npm registered 4,685 newly created packages [174]. An enabling factor of this trend is social coding platforms such as GitHub, which provide an open environment for package communities to develop, review, and use newly developed projects. This leads to early adoption and widespread use within package communities.

As a consequence, software projects that import functionality from remotely developed packages, and at the same time export functionality as a package for others, are becoming increasingly interdependent. In a shared environment such as npm or GitHub, the interdependence between software projects forms a global graph-like structure known as a software ecosystem, where nodes represent software projects and edges represent the dependencies between them.

As the interdependence between software projects in a shared environment encompasses more than the code that they depend on, several attempts have been made to define the phenomenon of software ecosystems. Messerschmitt and Szyperski [134] state that a software ecosystem is “a collection of software products that have some given degree of symbiotic relationships”. Along similar lines, Lungu [117] adds a context to the symbiotic relationship between projects: “A software ecosystem is a collection of software projects which are developed and co-evolve in the same environment.” An overlooked element is the socio-technical relationship between dependent projects. Mens et al. [132] incorporate this aspect into Lungu’s [117] definition with the following extension: “by explicitly considering the communities involved (e.g. user and developer communities) as being part of the software ecosystem”. Stallman [173] opposes the use of the term software ecosystem to describe interdependent networks of software projects: “It is inadvisable to describe the free software community, or any human community, as an ecosystem, because that word implies the absence of ethical judgment.” Although software ecosystems are subject to ethical judgments, this does not prevent us from describing interconnected software projects as an ecosystem.

This chapter will use Mens et al.’s [132] definition of software ecosystems as it is the most modern and comprehensive definition (i.e., also including the social interaction) to describe and study interconnected software projects.

Software ecosystems are interesting from two analytics perspectives: the landscape perspective, which surveys contemporary software problems, and the network perspective, which studies how these problems propagate. Together they make up ecosystem analytics. The two perspectives can yield meaningful insights for identifying the spread of common inefficiencies in software development processes or critical deficiencies in source code. As an example, the network perspective of software ecosystems provides effective means to identify projects affected by a vulnerability originating from a distant software project.
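
The vulnerability example can be made concrete with a minimal sketch. It assumes a hypothetical dependency graph in which each project maps to the projects it directly depends on (all project names below are invented for illustration). A breadth-first traversal over the reversed edges then yields every project that directly or transitively depends on the vulnerable one.

```python
from collections import defaultdict, deque


def affected_projects(dependencies, vulnerable):
    """Return all projects that directly or transitively depend on `vulnerable`."""
    # Reverse the edges: map each project to the projects that depend on it.
    dependents = defaultdict(set)
    for project, deps in dependencies.items():
        for dep in deps:
            dependents[dep].add(project)

    # Breadth-first search over the reversed graph, starting at the vulnerable project.
    affected = set()
    queue = deque([vulnerable])
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# Hypothetical ecosystem: project -> direct dependencies.
ecosystem = {
    "web-parser": ["json-parser"],
    "json-parser": ["string-utils"],
    "cli-tool": ["web-parser"],
    "logger": [],
}
print(sorted(affected_projects(ecosystem, "string-utils")))
```

In this toy graph, "cli-tool" never declares "string-utils" as a dependency, yet it is reached through two intermediate packages, which is the kind of distant propagation the network perspective is meant to expose.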

Through the lens of a literature survey, this chapter reviews and consolidates recent research on ecosystem analytics from both a theoretical and a practical standpoint. By synthesizing recent studies, practices, open challenges and applications of ecosystem analytics, we aim to equip the reader with a comprehensive overview and recommendations for future research in this field. To this end, we have formulated the following three research questions:

  • RQ1: What is the current state of the art in software analytics for ecosystem analytics?
  • RQ2: What are the practical implications of the state of the art?
  • RQ3: What are the open challenges in ecosystem analytics, for which future research is required?

Each of these research questions will be answered using recent papers written in this field of research. This chapter is structured as follows: First, the research protocol is described in detail. This includes decisions on which papers are included in the review. After this, the research questions are answered using the previously stated set of papers.

5.2 Research Protocol

We follow Kitchenham’s [104] literature review protocol to systematically arrive at relevant publications for answering the research questions in the previous section. As suggested in [104], the search strategy should be iterative with consultations of experts in the field. Our search strategy consists of the following three steps:

  • Initial seed set from an expert in the field, MSc. Joseph Hejderup
  • Searching in a digital search engine, namely Google Scholar
  • Selection of referenced papers based on findings from the previous two steps

5.2.1 Initial seed

An expert in the field of ecosystem analytics, MSc. Joseph Hejderup, provided a list of thirteen papers, tabulated in Table 4.1. For each paper in the table, we evaluate its relevance against our research questions. In total, we consider three papers not to be relevant, as they do not strongly focus on the interconnections between projects (Table 4.2 presents a detailed explanation).

5.2.2 Digital Search Engine

By identifying common and recurring keywords from the initial seed set, we construct the following Google Scholar queries:

  • “engineering software ecosystems” (2014)
  • “software ecosystems” AND “empirical analysis” (2018)
  • “software ecosystem” AND “empirical” (2014)
  • “software ecosystem analytics” (2014)
  • “software ecosystem” AND “analysis” (2017)

Although we aim to uncover the state of the art in ecosystem analytics, some queries do not return any recent results. For example, restricting the query “software ecosystem analytics” to 2018 or 2016 does not yield any findings. For three of the queries, the most recent publications we could retrieve therefore date from 2014. We document this by adding, in brackets, the year of the most recent publication to each query string above.

To determine the relevance of a paper, we examine the title and the abstract of each candidate paper in a query result. Our selection process is the following: first, we read the title. If we find the title to be relevant, we select the paper and continue with the next one. If not, we read the abstract as the last step. If the abstract is not relevant either, we discard the paper and continue with the next one. By following this process for each query result, we arrive at the papers in Table 4.3.
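
The selection process can be summarized as a short sketch. In the review itself, relevance is a manual judgment; here it is stood in for by two hypothetical predicate functions, and the candidate papers are invented for illustration. The sketch reads the process as: select on a relevant title, otherwise fall back to the abstract, otherwise discard.

```python
def select_papers(candidates, title_is_relevant, abstract_is_relevant):
    """Apply the two-step screening: title first, abstract only as a fallback."""
    selected = []
    for paper in candidates:
        if title_is_relevant(paper["title"]):
            # A relevant title is sufficient; the abstract is not consulted.
            selected.append(paper)
        elif abstract_is_relevant(paper["abstract"]):
            # The title was inconclusive, but the abstract is relevant.
            selected.append(paper)
        # Otherwise the paper is discarded.
    return selected


# Hypothetical stand-in for the manual relevance judgment.
def mentions_ecosystem(text):
    return "ecosystem" in text.lower()


# Invented candidate papers, for illustration only.
candidates = [
    {"title": "Dependency issues in a software ecosystem", "abstract": "..."},
    {"title": "A study of library upgrades", "abstract": "We analyze an ecosystem of packages."},
    {"title": "Compiler optimizations", "abstract": "We speed up loops."},
]
selected = select_papers(candidates, mentions_ecosystem, mentions_ecosystem)
print([paper["title"] for paper in selected])
```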

5.2.3 Referenced papers

Based on the selected papers from the previous two steps, we extract their references and apply the same selection strategy as for the digital search engine: we first read the title and proceed with the abstract only if the title is unclear or not evidently relevant. The result is tabulated in Table 4.

5.3 Answers

In this section, we answer our research questions by synthesizing information from our selected set of papers.

5.3.1 RQ1: What is the current state of the art in software analytics for ecosystem analytics?

To answer this research question, we explore topics in ecosystem analytics. For each explored topic, we then summarize which research methods, tools, and datasets are being used.

5.3.1.1 Explored Topics

We find the usage of trivial packages, the impact of (breaking) changes in dependencies, the quality of dependencies, and dependency networks to be prominent topics in ecosystem analytics.

Notable incidents like the left-pad incident have sparked interest in studying trivial packages. Research on this topic explores trivial packages both quantitatively, by analyzing how they are used, and qualitatively, by examining why developers choose to use them. Schlueter [179] stresses that trivial packages can have a profound impact on software ecosystems.

Breaking changes are another widely researched topic. As with trivial packages, researchers use both quantitative and qualitative methods. The impact of breaking changes and the way in which developers react to them make this a very important topic.

Developing metrics to measure the health of an ecosystem is another popular area. Researchers are experimenting with metrics to measure aspects such as the quality of dependencies.

Dependency networks allow researchers to gain insights into the evolution of software ecosystems over time.

5.3.1.2 Research Methods

The studied papers cover a plethora of research methods. These methods can be divided into two categories: quantitative and qualitative.

Many quantitative research papers analyze data statistically, using software ecosystems as their data source. The types of data depend heavily on the ecosystem used for analysis. Some papers go as far as using the source code of packages [3], while other research focuses on the metadata of software ecosystems, such as dependency networks [99]. Another recurring research method is survival analysis, as used by [55], which can be used to estimate the survival rate of a population over time. In software engineering, this has been successfully applied to open source projects.
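
To illustrate how survival analysis works, the following sketch implements the Kaplan-Meier estimator, a standard non-parametric way to estimate a survival function. The surveyed papers do not necessarily use this exact procedure, and the durations below are invented. Each subject has a duration and a flag stating whether the event of interest was observed (True) or whether the observation ended before the event occurred (False, i.e. censored).

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] for each time at which an event occurred."""
    samples = sorted(zip(durations, observed))
    at_risk = len(samples)  # subjects still under observation
    survival = 1.0
    curve = []
    i = 0
    while i < len(samples):
        time = samples[i][0]
        events = 0   # events observed exactly at `time`
        leaving = 0  # subjects leaving the risk set at `time` (events and censored)
        while i < len(samples) and samples[i][0] == time:
            events += samples[i][1]
            leaving += 1
            i += 1
        if events:
            # Multiply by the conditional probability of surviving past `time`.
            survival *= 1 - events / at_risk
            curve.append((time, survival))
        at_risk -= leaving
    return curve


# Hypothetical data: months until a project was abandoned (True) or
# until the observation window ended with the project still active (False).
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
for time, probability in kaplan_meier(durations, observed):
    print(f"t={time}: S(t)={probability:.2f}")
```

Censored subjects never lower the curve directly, but they shrink the risk set, which is what lets the estimator use projects whose fate is still unknown at the end of the study period.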

Qualitative research within software ecosystems aims to gain a better understanding of the interplay between developers and software ecosystems. Certain papers solely rely on the results of qualitative research whereas some papers use both quantitative research and qualitative research to triangulate their findings.

5.3.1.3 Ecosystems

As previously stated, studying dependencies between different software projects is one of the most common topics. Therefore, researchers primarily study package managers and their centralized repositories.

The most studied ecosystem is npm, as used by Abdalkareem et al. [3], Bogart et al. [32], Decan, Mens, and Claes [54], Kikas et al. [99], and Decan, Mens, and Grosjean [55]. There are a few reasons why this is a widely studied ecosystem: npm is the largest software registry, containing more than twice as many packages as the next most populated package registry in 2016 [174]. Moreover, npm is the package manager of JavaScript, which is the most used programming language according to a RedMonk survey [180]. Another compelling reason is that the majority of npm packages are openly developed and hosted on GitHub [99]. This is also the case for RubyGems (Ruby) and Crates.io (Rust) [99].

Recently, Decan, Mens, and Grosjean [55] used a data service called libraries.io, which covers seven different packaging ecosystems: Cargo (Rust), CPAN (Perl), CRAN (R), npm (JavaScript), NuGet (.NET), Packagist (PHP) and RubyGems (Ruby). Instead of manually scraping dependency data from a single ecosystem, researchers should take advantage of such a unified data source, which allows them to study one problem across multiple ecosystems at once.

Apart from package-based ecosystems, several papers study other ecosystems. Bavota et al. [17] study the Apache ecosystem containing Java projects. For evaluating health metrics, Cox et al. [50] use Maven Central. Claes et al. [45] study ten years of package incompatibilities in testing and stable distributions of Debian i386. Robbes, Lungu, and Röthlisberger [161] opted for the Squeak/Pharo ecosystem. They state that this ecosystem would provide support for answering their research questions.

5.3.1.4 Main research findings

Based on the findings of Abdalkareem et al. [3], trivial packages make up 16.8% of npm. Although 10% of the studied applications use trivial packages, only 45% of the trivial packages have test suites.

Robbes et al. [161] mention the large impact API changes can have on an ecosystem. Bogart et al. [32] studied the attitude of developers towards breaking changes in dependencies. Their main findings were that an ecosystem plays an essential role in the way developers deal with breaking changes. Both papers conclude that developers generally do not respond in time to breaking changes and as a result breaking changes can have a large impact on a software ecosystem. This conclusion is reinforced by the findings of Decan et al. [55], where frequent changes can lead to an unstable dependency network due to transitive dependencies.

Not only do developers not react in a timely fashion to breaking changes, but Robbes et al. [161] also discover that developers are not quick to respond to API deprecation. To combat this issue, Bavota et al. [17] suggest that updates should only be done when they consist of bug fixes, not API changes.

Attempts have also been made to find a metric that establishes the health of a dependency. Cox et al. [50] contribute to this by providing a metric to establish the freshness of a dependency.
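
As an illustration of what a freshness measure might look like, the sketch below counts how many releases a used dependency version lags behind the newest release. This is a simplified measure assumed for illustration and not necessarily the metric defined by Cox et al. [50]; the package names and version histories are invented.

```python
def release_lag(released_versions, used_version):
    """Number of releases published after `used_version` (0 means up to date).

    `released_versions` is assumed to be in chronological order, oldest first.
    """
    index = released_versions.index(used_version)
    return len(released_versions) - 1 - index


def total_release_lag(histories, used):
    """Sum the release lag over all dependencies of a system."""
    return sum(release_lag(histories[name], version) for name, version in used.items())


# Hypothetical release histories and the versions a system currently uses.
histories = {
    "json-parser": ["1.0.0", "1.1.0", "2.0.0"],
    "string-utils": ["0.9.0", "1.0.0"],
}
used = {"json-parser": "1.0.0", "string-utils": "1.0.0"}
print(total_release_lag(histories, used))
```

Counting releases ignores how large each release is and how long ago it was published; a date-based or API-based distance would be a natural alternative under different assumptions.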

An interesting finding in the topic of package dependency networks by Kikas et al. [99] is that ecosystems, over time, become less dependent on a single popular package.

5.3.2 RQ2: What are the practical implications of the state of the art?

In this research question, we aim to find out the practical implications of the state of the art as discussed in the previous section. As many of the discussed papers are case studies, we summarize their findings in this section.

In a majority of the papers, we find that developers are slow to update their dependencies, or at times do not update them at all. Hora et al. [87] suggest that the main reason for this is that breaking changes cannot be solved in a uniform manner throughout an ecosystem, but rather need a specific implementation for each system. We have also found that breaking changes are constantly introduced when dependencies are updated. According to Raemaekers, van Deursen and Visser [156], about 33% of releases, either minor or major, contain a breaking change. Breaking changes can cause compilation errors, thereby breaking the systems that depend on the changed package.

Developers tend to react poorly to changes in their dependencies; Kula et al. [106] have found in a survey of 4600 projects that 81.5% of the projects contain outdated dependencies with potential security risks. Not only do developers not update their dependencies; according to an empirical study conducted by McDonnell, Ray and Kim [126] on the Android API, they also do not update their codebase with respect to the changes introduced by dependencies.

In the area of ecosystem health, Constantinou and Mens [47] have researched which factors indicate that a developer is likely to abandon an ecosystem. Their study, which analyzed GitHub issues and commits, found that developers are more likely to abandon an ecosystem when they 1) do not communicate with their fellow developers, 2) do not participate often in social or technical activities, and 3) do not react or commit for an extended period of time. Another interesting characteristic of ecosystem health, studied by Kula et al. [105], is the way in which projects age over time. Their study found that the usage over time of 81.7% of 4,659 popular GitHub projects can be fitted with a polynomial function of order higher than two.

Malloy and Power [119] have studied the transition from Python 2 to Python 3. Python 3 has been out since 2008, and the final version of Python 2 was released in 2010, so both are (almost) 10 years old. Even so, they find that most Python developers choose to maintain compatibility with both Python 2 and Python 3 by only using a subset of features from Python 3. Malloy and Power [119] state that developers are severely limiting themselves by not using the new language features of Python 3.

Another interesting topic of research is the impact tools can have on ecosystems. Among these tools are badges. Badges are annotations on software projects which display some information about a software project. One of these badges can warn developers about outdated packages. Based on the results of Trockman [185], badges can have a positive impact on the speed at which developers update their dependencies.

Overall, we can conclude that there are improvements to be made. The way most developers currently manage their dependencies is lacking: whether they update late or not at all, this carries many risks. Dietrich, Jezek, and Brada [60] have also found many problems in the Java ecosystem and have proposed a set of relatively minor changes to both development tools and the Java language itself that could be very effective. These improvements are highlighted by answering the last research question.

5.3.3 RQ3: What are the open challenges in ecosystem analytics, for which future research is required?

This research question gives insight into the current open challenges in the field of ecosystem analytics. It focuses on the challenges described in the studied papers.

The most common open challenge across almost all papers is the generalization of results. Most of the studied papers base their results on a single ecosystem. This, in turn, means that it is unclear whether the results hold for other ecosystems. For example, Claes et al. [45] state that a possibility for future work is to investigate to what extent findings are transferable between package-based software distributions.

However, Decan, Mens, and Grosjean [55] state, after researching dependency network evolution for seven ecosystems, that they do not claim that their results can be generalized beyond the main package managers for specific languages. This is because they do not expect similar results for networks such as WordPress, as these packages tend to be more high-level (e.g. used by end users instead of reused by other packages). The different results obtained by Bogart et al. [32], which show that values differ per ecosystem, point in the same direction. Overall, this shows that there is a lot of room for future research on generalizing findings beyond the already researched ecosystems.

Another persistent open challenge is the ability to determine the health of an ecosystem. Although Jansen [91] has provided OSEHO, “a framework that is used to establish the health of an open source ecosystem”, Jansen [91] notes that “there is surprisingly little literature available about open source ecosystem health”. Kikas et al. [99] agree, stating that a general goal should be to provide analytics for maintainers about the overall ecosystem trends.

Furthermore, this challenge is related to determining the health of a system. Kikas et al. [99] state that “a measure quantifying dependency health in an ecosystem should be developed”. Moreover, according to Jansen [91], determining the health of a system from an ecosystem perspective is required to determine which systems to use. This problem also ties into developing mechanisms for assisting developers in the selection of packages, in particular in finding the best dependency according to the functional needs of the existing application. Abdalkareem et al. [3] state that helping developers find the packages that best suit their needs is a problem that needs to be addressed. Kikas et al. [99] agree that another general goal is to provide maintainers with improved tooling to manage their dependencies.

Once dependencies are chosen, another open challenge is to assist maintainers in keeping them up to date. In order to find out when dependencies should be updated, there is a need for developing new metrics. Bavota et al. [17] state that their observations could be a starting point to build recommenders for supporting developers in complex dependency upgrades. Cox et al. [50] provide “a metric to aid stakeholders in deciding on whether the dependencies of a system should be updated”. However, Cox et al. [50] also propose multiple refinements to this metric which could still be researched.

References

[174] State of the union: Npm: 2016. https://www.linux.com/news/event/Nodejs/2016/state-union-npm. Accessed: 2018-10-11.

[134] Messerschmitt, D.G. and Szyperski, C. 2003. Software ecosystem: Understanding an indispensable technology and industry. The MIT Press.

[117] Lungu, M. 2009. Reverse engineering software ecosystems. University of Lugano.

[132] Mens, T. et al. 2013. Studying evolving software ecosystems based on ecological models. Evolving software systems. Springer Berlin Heidelberg. 297–326.

[173] Stallman, R. 2002. Free software, free society: Selected essays of Richard M. Stallman. Lulu.com.

[104] Kitchenham, B. 2004. Procedures for performing systematic reviews. Keele, UK, Keele University. 33, 2004 (2004), 1–26.

[179] The npm blog: Kik, left-pad and npm: 2016. https://blog.npmjs.org/post/141577284765/kik-left-pad-and-npm. Accessed: 2018-10-15.

[3] Abdalkareem, R. et al. 2017. Why do developers use trivial packages? An empirical case study on npm. Proceedings of the 2017 11th joint meeting on foundations of software engineering - ESEC/FSE 2017 (2017).

[99] Kikas, R. et al. 2017. Structure and evolution of package dependency networks. 2017 IEEE/ACM 14th international conference on mining software repositories (MSR) (May 2017).

[55] Decan, A. et al. 2018. An empirical comparison of dependency network evolution in seven software packaging ecosystems. Empirical Software Engineering. (Feb. 2018).

[32] Bogart, C. et al. 2016. How to break an API: Cost negotiation and community values in three software ecosystems. Proceedings of the 2016 24th ACM SIGSOFT international symposium on foundations of software engineering - FSE 2016 (2016).

[54] Decan, A. et al. 2017. An empirical comparison of dependency issues in OSS packaging ecosystems. 2017 IEEE 24th international conference on software analysis, evolution and reengineering (SANER) (Feb. 2017).

[180] The redmonk programming language rankings: January 2018: 2018. https://redmonk.com/sogrady/2018/03/07/language-rankings-1-18/. Accessed: 2018-10-11.

[17] Bavota, G. et al. 2014. How the apache community upgrades dependencies: An evolutionary study. Empirical Software Engineering. 20, 5 (Sep. 2014), 1275–1317.

[50] Cox, J. et al. 2015. Measuring dependency freshness in software systems. 2015 IEEE/ACM 37th IEEE international conference on software engineering (May 2015).

[45] Claes, M. et al. 2015. A historical analysis of debian package incompatibilities. 2015 IEEE/ACM 12th working conference on mining software repositories (May 2015).

[161] Robbes, R. et al. 2012. How do developers react to API deprecation? Proceedings of the ACM SIGSOFT 20th international symposium on the foundations of software engineering - FSE 12 (2012).

[87] Hora, A. et al. 2016. How do developers react to API evolution? A large-scale empirical study. Software Quality Journal. 26, 1 (Oct. 2016), 161–191.

[156] Raemaekers, S. et al. 2017. Semantic versioning and impact of breaking changes in the maven repository. Journal of Systems and Software. 129, (Jul. 2017), 140–158.

[106] Kula, R.G. et al. 2017. Do developers update their library dependencies? Empirical Software Engineering. 23, 1 (May 2017), 384–417.

[126] McDonnell, T. et al. 2013. An empirical study of API stability and adoption in the android ecosystem. 2013 IEEE international conference on software maintenance (Sep. 2013).

[47] Constantinou, E. and Mens, T. 2017. An empirical comparison of developer retention in the RubyGems and npm software ecosystems. Innovations in Systems and Software Engineering. 13, 2-3 (Aug. 2017), 101–115.

[105] Kula, R.G. et al. 2017. An exploratory study on library aging by monitoring client usage in a software ecosystem. 2017 IEEE 24th international conference on software analysis, evolution and reengineering (SANER) (Feb. 2017).

[119] Malloy, B.A. and Power, J.F. 2018. An empirical analysis of the transition from python 2 to python 3. Empirical Software Engineering. (Jul. 2018).

[185] Trockman, A. 2018. Adding sparkle to social coding. Proceedings of the 40th international conference on software engineering companion proceedings - ICSE 18 (2018).

[60] Dietrich, J. et al. 2014. Broken promises: An empirical study into evolution problems in java programs caused by library upgrades. 2014 software evolution week - IEEE conference on software maintenance, reengineering, and reverse engineering (CSMR-WCRE) (Feb. 2014).

[91] Jansen, S. 2014. Measuring the health of open source software ecosystems: Beyond the scope of project health. Information and Software Technology. 56, 11 (Nov. 2014), 1508–1519.