# Sampling Method for Big Data Join Queries

**T2015-309**

The Need

Tables of information are related using JOIN operators: a particular data point from table A is related to a data point from table B. This connectivity is a fundamental part of data analysis, allowing the user to run statistical analyses to find trends. With large data sets, too much information is presented at once, and conclusions cannot easily be drawn from the data. An often-used technique is to take a random subset of the data, similar to polling a small group of people about how they would vote on an issue and then extrapolating the results to the whole population. As with voting polls, a random subset may not accurately represent the whole population, because the sample may happen to cover only one group. Stratified random sampling is the process of dividing the entire data set into multiple strata (smaller, equally sized subsets). A random sample is then taken from each stratum, and within a stratum each data point is equally likely to be selected. This prevents certain groups from being left out of the sample.
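To make the idea of stratified random sampling concrete, here is a minimal Python sketch. It is an illustration only, not the method described in this brief: the function name `stratified_sample`, its parameters, and the choice to form strata by slicing the data into consecutive equal-sized chunks are all assumptions made for the example.

```python
import random


def stratified_sample(data, num_strata, per_stratum, seed=None):
    """Draw `per_stratum` random points from each of `num_strata` equal strata.

    Assumption for this sketch: strata are consecutive, equally sized slices
    of `data`; any leftover rows (when the length is not divisible by
    `num_strata`) are ignored.
    """
    rng = random.Random(seed)
    size = len(data) // num_strata
    strata = [data[i * size:(i + 1) * size] for i in range(num_strata)]

    sample = []
    for stratum in strata:
        # Within a stratum, every point is equally likely to be chosen.
        sample.extend(rng.sample(stratum, per_stratum))
    return sample


if __name__ == "__main__":
    population = list(range(100))
    print(stratified_sample(population, num_strata=4, per_stratum=3, seed=0))
```

Because every stratum contributes the same number of points, no slice of the data can be left out of the sample, which is the property the paragraph above describes.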

The Technology

Dr. Arnab Nandi and his research colleagues at The Ohio State University have combined the process of random sampling and joining two large datasets.

For example, consider an analysis of two large sets of numbers in which points in one set are to be paired with points in the other. The data from both tables are divided into equal strata. Random sampling is then applied to each stratum, meaning each data point has an equal chance of being selected relative to the other data points in its stratum. To form a relationship, a data point selected randomly from a stratum in the first table is paired with a randomly selected data point from a stratum in the second table. These points are then removed from the larger data set, eliminating the possibility of their being selected again. The process is repeated until the subset is large enough to be analyzed.
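The steps in this example can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the researchers' implementation: the names `split_into_strata` and `sample_pairs` are hypothetical, strata are assumed to be consecutive equal-sized slices, and stratum *i* of the first table is assumed to be paired with stratum *i* of the second, since the brief does not say how strata are matched.

```python
import random


def split_into_strata(rows, num_strata):
    """Split rows into equal consecutive strata (leftover rows are ignored)."""
    size = len(rows) // num_strata
    return [list(rows[i * size:(i + 1) * size]) for i in range(num_strata)]


def sample_pairs(table_a, table_b, num_strata, target_size, seed=None):
    """Build a sample of (a, b) pairs by sampling without replacement.

    Assumption for this sketch: stratum i of table_a is matched with
    stratum i of table_b, and strata are visited in round-robin order.
    """
    rng = random.Random(seed)
    strata_a = split_into_strata(table_a, num_strata)
    strata_b = split_into_strata(table_b, num_strata)

    pairs = []
    while len(pairs) < target_size:
        progressed = False
        for stratum_a, stratum_b in zip(strata_a, strata_b):
            if len(pairs) == target_size:
                break
            if stratum_a and stratum_b:
                # Pick one random point from each stratum and remove it,
                # so it cannot be selected again.
                a = stratum_a.pop(rng.randrange(len(stratum_a)))
                b = stratum_b.pop(rng.randrange(len(stratum_b)))
                pairs.append((a, b))
                progressed = True
        if not progressed:
            break  # every stratum pair is exhausted
    return pairs


if __name__ == "__main__":
    table_a = list(range(100))
    table_b = list(range(1000, 1100))
    for pair in sample_pairs(table_a, table_b, num_strata=4, target_size=8, seed=1):
        print(pair)
```

Visiting the strata in round-robin order keeps the paired sample evenly spread across strata, and the loop stops early if the tables run out of points before `target_size` is reached.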

Commercial Applications

- Big data
- Better database querying

Benefits/Advantages

- Evenly distributed data sampling
- Analysis of large data sets on the time scale of small data sets