0-Day Continuous Integration (CI) Test Service Helps Ensure Linux Code Quality

Julie Du, Software Engineering Manager, & Philip Li, Software Engineer, Intel

Linux powers experiences around us every day — from 100% of the world’s top 500 supercomputers to 90% of public clouds, 82% of smartphones and 62% of IoT devices. Yet, the Linux kernel development that enables all of this isn’t necessarily top of mind: Linux “just works.” After all, this development model has had nearly three decades to become the well-oiled machine it is today. But this was no ordinary feat — Linux’ scale, development cadence, and stability make it amazing. Thousands of developers around the globe develop, test and integrate 13,000+ patches into the Linux kernel every 9–10 weeks.

With so much code contributed to each release, it’s impossible to avoid potential regressions. To eliminate them, you must first find the bugs that cause them — akin to finding a needle in the proverbial haystack. This is what makes the 0-Day Continuous Integration (CI) Test Service so important. 0-Day delivers comprehensive, automated and continuous integration testing that monitors the Linux mainline and 800+ developer trees for regressions, helping find bugs before they reach the Linux kernel so problems can be fixed before they impact users. Simply put, 0-Day helps ensure Linux kernel quality in a highly complex development environment.

We caught up with Intel software engineers, Julie Du and Philip Li, to learn about their current work in evolving this service to help support the needs of other developers across the Linux community and ensure quality code.

Q: Can you give us an overview of the 0-Day CI Test Service and what it does?

Philip: You can view 0-Day as a security guard for the Linux kernel with advanced features to alert you to bugs, or regressions, before they are merged into the kernel. Because it is an automated service, it is zero effort for kernel developers to use. Users receive an email from the 0-Day service if their code caused any issues, along with the data needed to find the root cause of these issues.

Julie: Each Linux kernel release integrates a lot of new code. The 0-Day Continuous Integration (CI) Test Service automatically compares this new code with existing code to detect any adverse differences in behavior, known as bugs or regressions, and alerts developers via email. This service monitors various kernel trees, spanning the mainline tree, the next tree, and key developers’ trees, to detect variances. It polls over 800 public git trees and scans posts to mailing lists. Then, it performs comprehensive build, static analysis, boot, functional, performance and power tests. This capability is critical when you think about the thousands of Linux kernel developers who submit huge numbers of patches every day. Each bug requires an average of two hours of debug time to resolve. Because Linux is such a massive project, the 0-Day CI Service saves developers a lot of debug time and helps ensure top quality code. This service has helped resolve approximately 40,000 bugs since its inception.

Q: How is the 0-Day service different from other tools that exist?

Philip: One of the advanced features that 0-Day offers is something we call code bisection. With this unique capability, the 0-Day CI Service can detect not only that a bug occurred but where it occurred, or which code caused the regression, with extreme accuracy. Linux source patches are required to be bisectable; 0-Day utilizes this requirement to automate tests and isolate the source of issues. Because each developer tree consists of the previous release, plus a set of nearly linear patches, 0-Day can automatically apply patchsets and retest until the offending patch is identified. In addition to reporting bugs and pinpointing their source, 0-Day also provides log information along with tools and test suites needed to reproduce the tests conducted and failures encountered.

Julie: Before the 0-Day service was created, there were other test systems that could detect regressions or bugs in new code, but no one knew who, or which patch, caused the regression. This brings two issues. First, because no one knew who was responsible among thousands of developers, there was no clear ownership, which made it very difficult and time-consuming for maintainers to resolve these issues. Second, even if the location of the regression was narrowed down to a specific sub-system, the developers then needed to pinpoint the bug among thousands of patches, much like searching for a needle in a haystack, which caused a lot of debug effort. With its ability to identify which code, or which patch, caused the problem, the bisect feature significantly reduces debug time for developers.
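
The bisection idea described above can be sketched as a binary search over a linear patch series. This is only an illustration under stated assumptions, not the 0-Day implementation: the function name `find_first_bad_patch` and the `is_good` callback (standing in for "apply the first n patches to the previous release, then build and test") are hypothetical, and the sketch assumes a strictly linear series in which a failure, once introduced, persists in every later state.

```python
from typing import Callable, Sequence


def find_first_bad_patch(patches: Sequence[str],
                         is_good: Callable[[int], bool]) -> int:
    """Return the index of the first patch that makes the test fail.

    is_good(n) is a hypothetical callback that tests the previous release
    with the first n patches applied. Assumes is_good(0) is True,
    is_good(len(patches)) is False, and that a failure persists once it
    has been introduced.
    """
    if not patches:
        raise ValueError("need at least one patch")
    good, bad = 0, len(patches)  # known-good and known-bad patch counts
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_good(mid):
            good = mid  # the culprit is somewhere after the first `mid` patches
        else:
            bad = mid   # the culprit is within the first `mid` patches
    return bad - 1      # index of the patch that turned good into bad


# Toy series of ten patches in which patch-06 introduces the regression.
patches = [f"patch-{i:02d}" for i in range(10)]
tested = []  # records which states were actually built and tested


def is_good(n: int) -> bool:
    tested.append(n)
    return n < 7  # any state that includes patch-06 (the 7th patch) fails


culprit = find_first_bad_patch(patches, is_good)
print(patches[culprit], "found after", len(tested), "test runs")
# prints "patch-06 found after 3 test runs"
```

Because each test run halves the remaining range, the number of builds grows with the logarithm of the series length rather than with its size. The `git bisect` command applies the same idea to commit history.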

Q: What has inspired you to work on, or contribute to, 0-Day?

Philip: I’m passionate about software development and working with the open source community. I also believe in the core promise of Linux and Intel technologies. My work on 0-Day aligns with my passions. Through 0-Day, I can give to the entire community and to my developer colleagues. 0-Day impacts many, many different projects: all of the projects related to Linux benefit from the work that I’m doing. It’s very special to contribute to two sides, the external community and internal, Linux-related projects. On one side, it’s very exciting to connect with, and contribute to, the whole community. On the other side, it’s gratifying to directly contribute to and assist internal projects. Over time, I’ve seen 0-Day grow to meet new demands, and it makes me proud and excited to continue working on it and contributing to it.

Julie: When I first joined this project, Fengguang Wu, who created 0-Day, showed me many emails he had received from the community that expressed appreciation for the 0-Day project, and I could really understand his passion in conceiving of and creating this valuable tool. My greatest inspiration comes from the recognition among customers and community of the value that 0-Day can provide to them. I really want to make sure that we continue to enhance this tool and drive greater adoption of it. How can we improve the documentation, or improve processes, for our broader community? It’s great to see 0-Day being used by more and more teams and projects. It was wonderful to see the 0-Day service mentioned as a top bug reporter in the 2017 Linux Kernel Development Report last year. I was also proud when 0-Day was able to pinpoint the source of an 8.8% regression in the 4.18 release, which was discussed in a recent LWN.net article. I’m happy to see that 0-Day is driving quality for Linux-based platforms (on Intel architecture) everywhere.

Q: How is Intel contributing to 0-Day? What are your current focuses?

Julie: This year, we’ve focused on higher service availability to better assist more and more users who have started to rely on 0-Day to detect and report issues before they are merged upstream. As the 0-Day service is used by more and more teams beyond the Linux kernel, such as the Sound Open Firmware team, higher service availability becomes increasingly important.

Philip: We’ve also been focused on making the 0-Day Service easier to use. Although I think the service is very powerful, it may have a long learning curve for new users. This presents a barrier for them — they may not be willing to learn a new system. So, we’ve built a bridge between the 0-Day system and Jenkins, another widely-used open source CI tool, so users can access 0-Day in ways they may be more familiar with. We’re hopeful that this will shorten the learning curve so that a greater number of developers can benefit from the 0-Day service.

Q: What are some of the challenges that you’ve observed? How do you think these challenges can be addressed or overcome?

Julie: 0-Day was originally created for Linux kernel patches. Today, many other projects beyond the Linux kernel are using it, including user space libraries and even firmware. As more and more projects rely on this service, one of our challenges is how to make the 0-Day infrastructure more scalable — and the system more robust — to meet the needs of these various projects. We’re focused on increasing the number of project types that 0-Day can support, ensuring the service is always online, and eliminating any single points of failure. A solid back-up storage system and 24/7 system uptime are basic requirements for many businesses, such as Internet-based companies and cloud service providers (CSPs).

Philip: Our approach to achieving greater scalability is to adopt more modern solutions. For example, we are considering the use of Ceph for system storage to complement the general network file system. Because Ceph offers a modern distributed storage solution, we can avoid a single point of failure. We are continuing to explore how to address these needs effectively, now and in the future.

Q: Can you tell us how you’ve collaborated with others across the community to further the development of the 0-Day CI Service?

Julie: Today, we are working with many colleagues across the community to enhance the Linux Kernel Performance (LKP) test tool (which can be used as part of the 0-Day service or run independently) from both the technical and business sides. I’m proud of our collaboration with the lab at Tsinghua University, a top university in China. Through our work, we’ve enhanced the tool’s performance analysis mechanism.

Q: Looking forward, give us a sneak peek into the roadmap for 0-Day. What’s next?

Julie: Our high-level roadmap includes increasing 0-Day’s effectiveness in finding issues. We want to adopt artificial intelligence (AI) to make the system smarter, do more work, assist human detection, and help differentiate good patches from bad ones. We’re developing a proof-of-concept that relies on log data to learn and predict bad patches, and determine whether or not an issue is truly related to a code change. If we can adopt a more intelligent way to detect issues, this will, in turn, benefit the project itself by pinpointing a greater number of bugs and shaping the introduction of new features.

Q: Before we close, is there anything else that you’d like to add?

Philip: We view 0-Day as an engine that can speed any project that is based on the Linux kernel, since it provides very comprehensive test coverage and a unique bisection capability to help significantly reduce debug time by pinpointing culprit code changes. It is a very collaborative tool that incorporates industry test suites to ensure top quality code within the Linux kernel, and we’re continually incorporating more test capabilities into the service. The initial effort to leverage all of this is very simple: with only the URL of a GitHub repository, 0-Day will automatically add coverage for build, boot and runtime testing to the code without any further interaction.

Q: For those interested in learning more, where can they go for additional information?

Julie: If you are looking to improve Linux code quality, we think you’ll find great value in the 0-Day service and LKP test tool. We invite you to use them and share your experiences with us. We also welcome your contributions to the LKP test tool. We hope you’ll visit 01.org/lkp where you can find technical articles, test reports, blog posts and more.
