This looks like a disagreement about the meaning of the word “bias”. Bias means “prejudice in favor of or against one thing, person, or group compared with another”. In statistics, bias enters when you discriminate between one group of things and another, and you do that by being selective about which samples you include.
You are arguing in favor of being selective about which samples you include. You go on to say that we should select “top software” samples from “people who ‘know’ how to do software”.
In order to do that, you have to define “top software” and “people who ‘know’ how to do software”, and you have to favor those groups over other groups to select your samples.
That is a textbook example of bias.
I think what you’re really suggesting is that we need to filter out projects that may have been created by people who don’t know what they’re doing.
That filtering process, though, is itself subject to bias, which could invalidate the results. The way around that is an indiscriminate algorithm: for example, one that only includes projects with active maintainers who have x years of experience contributing to active projects on GitHub.
By indiscriminate, I mean that instead of a person making the judgement about who qualifies as “people who ‘know’ how to do software”, you let the algorithm judge the experience of the developer based on their actual contribution history.
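To make that concrete, here is a rough sketch of what such a rule-based filter might look like. Everything in it is an assumption for illustration: the `Maintainer` and `Project` records, their field names, and the default thresholds (5 years of history, active within the last 90 days) are hypothetical, and it works on contribution dates you have already collected rather than calling any real GitHub API. The point is only that the rule is written down once and applied identically to every project.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Maintainer:
    # Hypothetical record; the dates would come from contribution history.
    login: str
    first_contribution: date  # earliest contribution to any active project
    last_contribution: date   # most recent contribution


@dataclass(frozen=True)
class Project:
    # Hypothetical record: a project and its current maintainers.
    name: str
    maintainers: tuple[Maintainer, ...]


def is_experienced(m: Maintainer, today: date,
                   min_years: float, active_within_days: int) -> bool:
    """True if the maintainer has a long enough history and is still active."""
    years = (today - m.first_contribution).days / 365.25
    still_active = (today - m.last_contribution) <= timedelta(days=active_within_days)
    return years >= min_years and still_active


def select_projects(projects: list[Project], today: date,
                    min_years: float = 5.0,          # assumed threshold
                    active_within_days: int = 90     # assumed threshold
                    ) -> list[Project]:
    """Keep every project with at least one experienced, active maintainer.

    The same rule is applied to every project; nobody picks names by hand.
    """
    return [
        p for p in projects
        if any(is_experienced(m, today, min_years, active_within_days)
               for m in p.maintainers)
    ]
```

You would call it as `select_projects(all_projects, date.today())` over the full list of candidate projects. The thresholds are still a choice somebody makes, so they should be fixed and published before anyone looks at which projects pass.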
If you’d like to come up with that algorithm and share the source and data results, I’ll be happy to take a look.
Just don’t hand-select projects, because that is bias.