March 3, 2026 · Open Access

Dense and sparse partition by local density in five text corpora: observed proportions (10.0–10.5% and 90.0–90.5%)

Key Points

Dense documents make up 10.0–10.5% of each of the five text corpora; sparse documents make up the remaining 90.0–90.5%.
Local density values were derived from nearest neighbors in a 100-dimensional representation of 516,556 documents.
The five corpora vary in size, with document counts ranging from 17,242 to 217,587.
Classification uses a fixed local density threshold per corpus; the report documents the resulting proportions without interpreting causes or generality.

Abstract

In five text corpora (Forums, Newsgroups, UFO, eBird Checklist, and eBird Species), totaling 516,556 documents, each document was represented in a 100-dimensional space and assigned a local density value based on nearest neighbors. Using a fixed threshold per corpus, documents were split into two groups: dense (high local density) and sparse (the remainder). The fraction of documents classified as dense lies between 10.0% and 10.5% in every corpus, while the fraction classified as sparse lies between 90.0% and 90.5%. The five corpora differ in size (from 17,242 to 217,587 documents) and domain. The observation is documented in data/observation_table.csv; the table and the proportions figure can be reproduced from the JSON files in data/ using code/reproduce_observation.py. This report is limited to documenting these proportions and does not interpret causes or generality.
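The partition described in the abstract can be illustrated with a minimal sketch. It is not the author's code/reproduce_observation.py. The abstract does not specify the density definition, the number of neighbors, the distance metric, or the threshold values, so everything below is an assumption made for illustration: density is taken as the inverse mean Euclidean distance to the k nearest neighbors, and the function names, `k`, and `threshold` are hypothetical.

```python
import numpy as np


def local_density(X: np.ndarray, k: int = 10) -> np.ndarray:
    """Assumed definition: inverse mean Euclidean distance to the k nearest neighbors."""
    # Pairwise squared distances via the Gram matrix (brute force, small samples only).
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)      # clip tiny negatives caused by rounding
    np.fill_diagonal(d2, np.inf)     # a point is not its own neighbor
    # The first k columns after partitioning hold the k smallest squared distances.
    knn_d2 = np.partition(d2, k - 1, axis=1)[:, :k]
    mean_dist = np.sqrt(knn_d2).mean(axis=1)
    return 1.0 / (mean_dist + 1e-12)  # small constant avoids division by zero


def dense_sparse_fractions(density: np.ndarray, threshold: float) -> tuple[float, float]:
    """Split by a fixed threshold: dense if density > threshold, sparse otherwise."""
    frac_dense = float((density > threshold).mean())
    return frac_dense, 1.0 - frac_dense


if __name__ == "__main__":
    # Synthetic stand-in for one corpus; the real inputs are the JSON files in data/.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 100))          # 500 documents, 100 dimensions
    rho = local_density(X, k=10)
    hypothetical_threshold = float(rho.mean())  # placeholder, not the paper's value
    dense, sparse = dense_sparse_fractions(rho, hypothetical_threshold)
    print(f"dense: {dense:.3f}, sparse: {sparse:.3f}")
```

The brute-force distance matrix needs memory quadratic in the number of documents, so a corpus of the sizes reported here would need an indexed or approximate neighbor search instead. The threshold is passed in as a fixed number rather than computed as a quantile of the densities, since a quantile-based cut would fix the dense fraction by construction.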

Authors

Miguel Pavón
