
Conversation

@a-rodin
Contributor

a-rodin commented Dec 5, 2024

This PR is a part of work on pathscale/WorkTable#24.

let mut result: Vec<IndexValue<T>> = vec![];
for interval in intervals.iter() {
    for index in interval.0..=interval.1 {
        let mut index_records = parse_index_page::<T, PAGE_SIZE>(file, index as u32)?;
Collaborator
Hm, it looks like for every Link we currently seek to the page start, re-read the header, and then seek to the data start. I don't think that's the best way to do it.

Ok(result)
}

fn read_links<DataType, const PAGE_SIZE: usize>(
Collaborator

This part also looks suspicious. I don't think it will work for big tables.

@Handy-caT
Collaborator

@a-rodin I think we can use an iterator-based approach. You'll need to implement custom iterators for this task; I think there will be two of them: an index iterator and a data iterator.

The index iterator will load one index page and yield its links. When the page has been fully read, the next page is loaded into memory and parsed, so a single page load serves many next calls.

For the data iterator you need to add a function that reads multiple data records from a single page. It will take a Vec of Links as its argument instead of a single Link. The array is expected to be already sorted, so this function must only be called with a sorted array of Links, or panic at runtime. This way we parse the header only once and then read the data records one by one, seeking only forward.

The data iterator is harder, I think. The issue we have now is frequent re-reading of data and too many seeks. The data iterator can use the following logic: it uses the index iterator to get a Vec of Links for a full page. Then we sort the Links by their offset from the file start. That way our seeks always move forward, but we get the data back unsorted. To solve this we can use the same approach as for the index: load a full page of data into memory and yield from it until we need to load the next one. So the resulting flow is:

  • load a page of index Links
  • allocate a vector for the data
  • sort the Links vector while preserving the original order
  • divide the Links by page
  • use the function from the previous part to read the data records from each page
  • collect the data records into a buffer
  • yield data from the buffer until it is exhausted

@a-rodin
Contributor Author

a-rodin commented Dec 25, 2024

@Handy-caT I have added three iterators, PageIterator, LinkIterator, and DataIterator.

Collaborator

@the2pizza left a comment

Looks good to me, with small notes.

use super::{Interval, SpaceInfo};

pub fn map_index_pages_to_general<T>(pages: Vec<IndexData<T>>) -> Vec<General<IndexData<T>>> {
let mut header = &mut GeneralHeader::new(0.into(), PageType::Index, 0.into());
Collaborator
According to my compiler's warnings, it shouldn't be mut.

- for index in interval.0..interval.1 {
+ for index in interval.0..=interval.1 {
      let index_page = parse_index_page::<T, PAGE_SIZE>(file, index as u32)?;
      result.push(index_page.inner);
Collaborator
I'm a bit doubtful when I see nested loops. To improve performance, I suppose we could merge the intervals to reduce the number of iterations.

