CSD-RkNN: Conic Section Discriminances for Large Scale Reverse k Nearest Neighbors Queries

Overview

CSD-RkNN: Conic Section Discriminances for Large Scale Reverse k Nearest Neighbors Queries

Overview

The neighborhood relationships between spatial objects can help to understand their essential features, most geographic information systems (GIS) and location based services (LBS) therefore provide two types of spatial queries for nearest neighbor analysis: k nearest neighbors (kNN) query and reverse k nearest neighbors (Rk*NN) query. The aim of kNN queries is to find the k points closest to the query point, while for RkNN queries, all the points that consider the query point as one of their k closest points are required to be found. For some nearest neighbor analyses in geospatial applications, kNN queries are practical and easy to implement. However, when faced with some special scenarios, such as facility location, influential domain analysis and potential customer analysis, the query of RkNN is more suitable. The existing RkNN search algorithms can generally return correct results in a relatively short time (a few seconds) when faced with small-scale (which means the value of k is relatively small) query tasks. Unfortunately, once the value of k becomes relatively large, e.g., k is 1000 or greater, they can no longer respond in a tolerable time. Accordingly, the main goal of this work is to improve the performance of RkNN algorithms in the face of large-scale query tasks. According to the characteristics of conic sections, we propose a verification approach, named Conic Section Discriminance (CSD), to determine whether points belong to the RkNN set. With CSD, only a small fraction of candidates need to be verified by costly kNN queries, while the verification cost of the vast majority of candidates is only O(1). Furthermore, we propose a Voronoi based candidate generation approach to reduce the candidate set. Based on VoR-tree, we combined two proposed approaches to form a novel RkNN algorithm, termed CSD-Rk*NN. A series of experiments were conducted to compare CSD-RkNN with SLICE, the state-of-the-art RkNN algorithm, and VR-R*k*NN, the original RkNN algorithm on VoR-tree. The experimental results indicate that CSD-RkNN significantly outperforms the other two algorithms, especially when the value of k is relatively large.

Project structure

CSD-RkNN/:
├── data/: [data set]
│    ├── North America-1.txt
│    ├── North America-2.txt
│    ├── restaurant.csv
│    ├── mall.csv
│    ├── hospital.csv
│    ├── school.csv
│    └── residence.csv
├── common/: [common data structures, ploting functions and persistent dictionary]
│    ├── data_structure.py
│    ├── fig.py
│    └── persistence.py
├── index/: [spatial indices (including R-tree and VoR-tree)]
│    ├── rtree.py
│    └── vortree.py
├── rknn/: [RkNN algorithms (including CSD-RkNN, SLICE and VR-RkNN)]
│    ├── csd.py
│    ├── slice.py
│    └── vr.py
├── experiments.py [experiments (including benchmark experiments and case study experiments)]
└── test.py

Usage

Generate the facility set and user set:

>>> import numpy as np
>>> from shapely.geometry import Point
>>> from uuid import uuid1 as generate_uuid
>>> users = [(str(generate_uuid()), Point(np.random.uniform(0, 1), np.random.uniform(0, 1)))
             for i in range(1000)]
>>> facilities = [(str(generate_uuid()), Point(np.random.uniform(0, 1), np.random.uniform(0, 1)))
                  for i in range(1000)]

Index the facilities and users:

>>> from index.vortree import VoRtreeIndex
>>> user_index = VoRtreeIndex(data=users)
>>> facility_index = VoRtreeIndex(data=facilities)

Choose one of these facilities as the query facility:

>>> from random import choice
>>> q = facility_index.nodes[choice(facilities)[0]]

Retrieve the Bi-RkNNs of the query facility from the user set (k=50):

>>> from rknn import csd
>>> bi_rknn = [(node.uuid,node.geom) for node in csd.BiRkNN(q, 50, facility_index, user_index)]
>>> print(bi_rknn)
[('454cfe0a-527f-11ec-87c0-80e650182120', <shapely.geometry.point.Point object at 0x7ff5f106a8b0>), 
 ('454cff04-527f-11ec-87c0-80e650182120', <shapely.geometry.point.Point object at 0x7ff5f361ee80>), 
 ...,
 ('454d02b0-527f-11ec-87c0-80e650182120', <shapely.geometry.point.Point object at 0x7ff5f361efd0>)]

Retrieve the Mono-RkNNs of the query facility from the facility set(k=50):

>>> mono_rknn=list(csd.MonoRkNN(q,50,facility_index))
>>> print(mono_rknn)
[('454d0e40-527f-11ec-87c0-80e650182120', <shapely.geometry.point.Point object at 0x7ff5f3621460>), 
 ('454d158e-527f-11ec-87c0-80e650182120', <shapely.geometry.point.Point object at 0x7ff5f36215e0>), 
 ...,
 ('454d77b8-527f-11ec-87c0-80e650182120', <shapely.geometry.point.Point object at 0x7ff5f3629a60>)]

Plot the result:

>>> import matplotlib.pyplot as plt
>>> fig = plt.figure(figsize=(10, 5))bi_rknn_ax = fig.add_subplot(121)
>>> bi_rknn_ax.scatter([u[1].x for u in users], [u[1].y for u in users], marker='.', c='gray', s=8,
                       linewidths=0,label='User')
>>> bi_rknn_ax.scatter([f[1].x for f in users], [f[1].y for f in facilities], marker='.', c='blue',
                       s=8, linewidths=0, label='Facility')
>>> bi_rknn_ax.scatter(q.geom.x, q.geom.y, marker='*', c='red', s=80, linewidths=0, label='Query facility')
>>> bi_rknn_ax.scatter([b[1].x for b in bi_rknn], [b[1].y for b in bi_rknn], marker='.', c='green',
                       s=30, linewidths=0, label='RkNN')
>>> bi_rknn_ax.legend(framealpha=1,loc=1)
>>> bi_rknn_ax.set_title('Bi-R$k$NN')
>>> mono_rknn_ax = fig.add_subplot(122)
>>> mono_rknn_ax.scatter([f[1].x for f in users], [f[1].y for f in facilities], marker='.', c='gray',
                         s=8, linewidths=0,label='User')
>>> mono_rknn_ax.scatter(q.geom.x, q.geom.y, marker='*', c='red', s=80, linewidths=0, label='Query facility')
>>> mono_rknn_ax.scatter([m[1].x for m in mono_rknn], [m[1].y for m in mono_rknn], marker='.', c='green',
                         s=30, linewidths=0, label='RkNN')
>>> mono_rknn_ax.legend(framealpha=1, loc=1)
>>> mono_rknn_ax.set_title('Mono-R$k$NN')
>>> plt.show()

Evaluation

If you want to evaluate the performance of CSD-RkNN against other algorithms (e.g., SLICE and VR-RkNN), you can run the following codes.

>>> # effect of data size on Mono-RkNN
>>> time_cost, io_cost = experiments.BenchmarkExperiments.evaluate_effect_of_data_size_on_MonoRkNN(10)
>>> plot_dual_distribution(time_cost)
>>> plot_dual_distribution(io_cost)
>>> time_cost, io_cost = experiments.BenchmarkExperiments.evaluate_effect_of_data_size_on_MonoRkNN(1000)
>>> plot_dual_distribution(time_cost)
>>> plot_dual_distribution(io_cost)
>>> 
>>> # effect of data size on Bi-RkNN
>>> time_cost, io_cost = experiments.BenchmarkExperiments.evaluate_effect_of_data_size_on_BiRkNN(10)
>>> plot_dual_distribution(time_cost)
>>> plot_dual_distribution(io_cost)
>>> time_cost, io_cost = experiments.BenchmarkExperiments.evaluate_effect_of_data_size_on_BiRkNN(1000)
>>> plot_dual_distribution(time_cost)
>>> plot_dual_distribution(io_cost)
>>> 
>>> # effect of k on Mono-RkNN
>>> time_cost, io_cost = experiments.BenchmarkExperiments.evaluate_effect_of_k_on_MonoRkNN('Synthetic')
>>> plot_dual_distribution(time_cost)
>>> plot_dual_distribution(io_cost)
>>> time_cost, io_cost = experiments.BenchmarkExperiments.evaluate_effect_of_k_on_MonoRkNN('Real')
>>> plot_single_distribution(time_cost)
>>> plot_single_distribution(io_cost)
>>> 
>>> # effect of k on Bi-RkNN
>>> time_cost, io_cost = experiments.BenchmarkExperiments.evaluate_effect_of_k_on_BiRkNN('Synthetic')
>>> plot_dual_distribution(time_cost)
>>> plot_dual_distribution(io_cost)
>>> time_cost, io_cost = experiments.BenchmarkExperiments.evaluate_effect_of_k_on_BiRkNN('Real')
>>> plot_single_distribution(time_cost)
>>> plot_single_distribution(io_cost)
>>> 
>>> # effect of number of users relative to number of facilities
>>> time_cost, io_cost = experiments.BenchmarkExperiments.evaluate_effect_of_user_num_relative_to_facility_num(10)
>>> plot_dual_distribution(time_cost)
>>> plot_dual_distribution(io_cost)
>>> time_cost, io_cost = experiments.BenchmarkExperiments.evaluate_effect_of_user_num_relative_to_facility_num(1000)
>>> plot_dual_distribution(time_cost)
>>> plot_dual_distribution(io_cost)

>>> # effect of data distribution
>>> time_cost, io_cost = experiments.BenchmarkExperiments.evaluate_effect_of_data_distribution(10)
>>> plot_single_distribution(time_cost)
>>> plot_single_distribution(io_cost)
>>> time_cost, io_cost = experiments.BenchmarkExperiments.evaluate_effect_of_data_distribution(1000)
>>> plot_single_distribution(time_cost)
>>> plot_single_distribution(io_cost)

>>> # evaluate RkNN queries for restaurant
>>> time_cost, io_cost = experiments.CaseStudyExperiments.evaluate_RkNN_for_restaurant()
>>> plot_single_distribution(time_cost)
>>> plot_single_distribution(io_cost)

>>> # evaluate RkNN queries for mall
>>> time_cost, io_cost = experiments.CaseStudyExperiments.evaluate_RkNN_for_mall()
>>> plot_single_distribution(time_cost)
>>> plot_single_distribution(io_cost)

>>> # evaluate RkNN queries for hospital
>>> time_cost, io_cost = experiments.CaseStudyExperiments.evaluate_RkNN_for_hospital()
>>> plot_single_distribution(time_cost)
>>> plot_single_distribution(io_cost)

>>> # evaluate RkNN queries for school
>>> time_cost, io_cost = experiments.CaseStudyExperiments.evaluate_RkNN_for_school()
>>> plot_single_distribution(time_cost)
>>> plot_single_distribution(io_cost)

Alternatively, you can run test.py on the terminal directly:

 python3 test.py
You might also like...

MODDA: a drug repositioning method based on a large-scale multi-omics heterogeneous network

MODDA: a drug repositioning method based on a large-scale multi-omics heterogeneous network

MODDA Code and Dataset for "MODDA: a drug repositioning method based on a large-scale multi-omics heterogeneous network". Reference If you make advant

Jun 12, 2022

SIMS: Scalable, Interpretable Models for Cell Annotation of large scale single-cell RNA-seq data

SIMS: Scalable, Interpretable Modeling for Single-Cell RNA-Seq Data Classification SIMS is a pipeline for building interpretable and accurate classifi

May 12, 2022

HSC4D: Human-centered 4D Scene Capture in Large-scale Indoor-outdoor Space Using Wearable IMUs and LiDAR. CVPR 2022

HSC4D: Human-centered 4D Scene Capture in Large-scale Indoor-outdoor Space Using Wearable IMUs and LiDAR. CVPR 2022

HSC4D: Human-centered 4D Scene Capture in Large-scale Indoor-outdoor Space Using Wearable IMUs and LiDAR. CVPR 2022 [Project page | Video] Getting sta

Sep 26, 2022

PyTorch implementation of the method presented in "Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation"

PyTorch implementation of the method presented in

DeepViewAgg [CVPR 2022 Oral] Official repository for Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation paper 📄 sel

Sep 22, 2022

BigDetection: A Large-scale Benchmark for Improved Object Detector Pre-training

BigDetection: A Large-scale Benchmark for Improved Object Detector Pre-training

BigDetection: A Large-scale Benchmark for Improved Object Detector Pre-training By Likun Cai, Zhi Zhang, Yi Zhu, Li Zhang, Mu Li, Xiangyang Xue. This

Sep 17, 2022

Personal codes for Large-scale Structures measurement

lssbox personal codes for Large-scale Structures measurement Installation create a clean conda environment conda create -n lssbox python=3.8 conda act

Jun 26, 2022

Repo for external large-scale work

Metaseq A codebase for working with Open Pre-trained Transformers. Using OPT with 🤗 Transformers The OPT 125M--30B models are now available in Huggin

Oct 1, 2022

Brain Agent for Large-Scale and Multi-Task Agent Learning

Brain Agent for Large-Scale and Multi-Task Agent Learning

Brain Agent Brain Agent is a distributed agent learning system for large-scale and multi-task reinforcement learning, developed by Kakao Brain. Brain

Sep 20, 2022

DeepGNN is a framework for training machine learning models on large scale graph data.

DeepGNN Overview DeepGNN is a framework for training machine learning models on large scale graph data. DeepGNN contains all the necessary features in

Sep 24, 2022
Owner
High-performance Spatial Computational Intelligence Lab @ China University of Geosciences (Wuhan)
High-performance Spatial Computational Intelligence Lab @ China University of Geosciences (Wuhan)
A simple python library that can be used to run large Web3 queries on Ethereum blockchain concurrently as per Ethereum JSON-RPC specification.

aio-eth - Asynchronous JSON-RPC client for Ethereum A simple python library that can be used to run large Web3 queries on Ethereum blockchain concurre

Narasimha Prasanna HN 5 Jul 13, 2022
This app lets you create your own gamebook, without the hassle of assigning the random section numbers yourself!

Make Your Own Gamebook This app lets you create your own gamebook, without the hassle of assigning the random section numbers yourself! Make Your Own

null 1 Apr 23, 2022
Data analytics of Z-boson interaction given energy (GeV), cross section (nb) and uncertainty on the measurement.

Data analytics of Z-boson interaction given energy (GeV), cross section (nb) and uncertainty on the measurement. A series of chi squared and minimization techniques are used to generate optimal values of lifetime and reduced chi squared as well as plotting and generating a mesh of uncertainties. Further description is provided in the header of the python script.

Elise Acher 1 May 29, 2022
Automatically fetch latest content of "The world in brief" section of The Economist.

English|中文 The world in brief 2022-07-16 Catch up quickly on the global stories that matter Origin: https://www.economist.com/the-world-in-brief After

Ariel 36 Sep 24, 2022
A Fast Calculator for (+) (-) (*) (%) (/) (sq) (abs) (sin, cos, tan, cot) and more much section taht is developing

a Fast Calculator for (+) (-) (*) (%) (/) (sq) (abs) (sin, cos, tan, cot) and more much section taht is developing. We need to you for developing. read trems of edit code in Version File of Program.

Am J 1 Sep 10, 2022
Official implementation for the paper "On deceiving malware classification with section injection"

On deceiving malware classification with section injection This repo provides the official implementation for "On deceiving malware classification wit

Adeilson Silva 22 Sep 12, 2022
Source code for the 2022 CIVEMSA paper "Biliary atresia detection using color clustering and nearest neighbor classification: A user interactive approach"

AVB Python/PyTorch source code for the paper: A. Genovese, X. Bushi, L. D'Antiga, M. Lazzaroni, G. Mawi, E. Nicastro, V. Piuri, A. Scocciolini, F. Sc

Angelo Genovese 1 May 3, 2022
Code for paper "Nearest Neighbor Knowledge Distillation for Neural Machine Translation" by Zhixian Yang, Renliang Sun, and Xiaojun Wan. This paper is accepted by NAACL 2022 Main Conference

Nearest Neighbor Knowledge Distillation for Neural Machine Translation Code for our NAACL 2022 paper Nearest Neighbor Knowledge Distillation for Neura

null 23 Sep 25, 2022
[AIM & ECCVW 2022] Fast Nearest Convolution for Real-Time Image Super-Resolution

Fast Nearest Convolution for Real-Time Image Super-Resolution, AIM & ECCV Workshops 2022 Update [2022.08.25] We have uploaded the pretrained model in

balabala 33 Sep 26, 2022
A CLI Python project to create an efficient mail truck delivery route between three trucks and two drivers. The route is determined by the Nearest Neighbor algorithm.

SCENARIO The Western Governors University Parcel Service (WGUPS) needs to determine an efficient route and delivery distribution for their Daily Local

null 1 Aug 31, 2022