In five text corpora (Forums, Newsgroups, UFO, eBird Checklist, and eBird Species), totaling 516,556 documents, each document was represented in a 100-dimensional space and assigned a local density value based on its nearest neighbors. Using a fixed threshold per corpus, documents were split into two groups: dense (high local density) and sparse (the remainder). The fraction of documents classified as dense lies between 10.0% and 10.5% in every corpus, while the fraction classified as sparse lies between 90.0% and 90.5%. The five corpora differ in size (from 17,242 to 217,587 documents) and in domain. The observation is documented in data/observation_table.csv; the table and the proportions figure can be reproduced from the JSON files in data/ using code/reproduce_observation.py. This report is limited to documenting these proportions and does not interpret causes or generality.
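The dense/sparse split described above can be illustrated with a minimal sketch. It is not the code in code/reproduce_observation.py: the density definition (inverse of the mean distance to the k nearest neighbors), the value of k, the synthetic 5-dimensional data, the median threshold, and the function names `local_density` and `split_fractions` are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random


def local_density(vectors, k=5):
    """Assumed density: inverse of the mean distance to the k nearest neighbors.

    The report only says density is "based on nearest neighbors"; this
    particular formula is a hypothetical stand-in.
    """
    densities = []
    for i, v in enumerate(vectors):
        # Brute-force distances to every other vector (fine for a small demo).
        dists = sorted(math.dist(v, w) for j, w in enumerate(vectors) if j != i)
        mean_knn = sum(dists[:k]) / k
        densities.append(1.0 / (mean_knn + 1e-12))
    return densities


def split_fractions(densities, threshold):
    """Return (dense_fraction, sparse_fraction) for a fixed threshold."""
    n = len(densities)
    dense = sum(1 for d in densities if d > threshold)
    return dense / n, (n - dense) / n


# Synthetic stand-in data: 200 points in 5 dimensions (not the report's corpora).
rng = random.Random(0)
vectors = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]

densities = local_density(vectors, k=5)

# Arbitrary demo threshold (the median density); the report's per-corpus
# thresholds are not given in the abstract.
threshold = sorted(densities)[len(densities) // 2]

dense_frac, sparse_frac = split_fractions(densities, threshold)
print(f"dense: {dense_frac:.3f}, sparse: {sparse_frac:.3f}")
```

The two fractions always sum to 1, because every document falls into exactly one group; the proportions reported above come from the corpus data and thresholds, not from this sketch.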
Miguel Pavón
Miguel Pavón (Wed,) studied this question.
www.synapsesocial.com/papers/69a75bbbc6e9836116a239e2 — DOI: https://doi.org/10.5281/zenodo.18407378