Eden @eden@queer.cloud

There is nothing as satisfying in programming as coming back to an old project and getting very slightly further.

Last year I was trying to create new vectors from non-sequential patterns of input. And I hated it. (Because if you read in a CSV and then want to work with the data, at some point you need to access a column of the data, and putting vectors in vectors is *very bad*.)


And now I have an okay-ish way of dealing with that problem, because I'm now a lot more comfortable with using unsafe.


@eden bit of a naive answer, but wouldn't that be best served by converting the CSV into a database? that way, you could look up by both row (index) and key (column).

@trwnh I think you could? I'm trying to think of what kind of database would work, and how that'd play out.

Python's (pandas) and R's dataframes do the Vector of Vector approach, which is how they do semi-fast lookup but keep types separate. It's why R describes a data.frame as a list of equal-length vectors. It works out to be fairly slow in Rust.

Data.table in R uses something similar to what I'm doing, but adds in partitions so that you can look up sections of data more quickly, like index/partition in a standard db. I don't think I have indexes.

And SQL is really way out because the time taken to do query -> interpret -> validate -> return would be considerably longer than the lookup.

@trwnh So I think, definitely, yeah, you could, but because I'm going for speed over usability I'm happier throwing it into 2D vectors and working in custom interfaces to look up the values I need.

@eden alternate naive answer: take your CSV and create arrays for your columns instead of your rows? That way, you can access all key-value pairs by referring to column1var[n] and column2var[n], i think. If you need to find n, then you simply search through rowlabelvar[] for the index of the specific row
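A minimal sketch of the idea in this post, assuming a table with exactly two float columns and string row labels. The names `Table`, `row_labels`, `column1` and `column2` are hypothetical stand-ins for rowlabelvar[], column1var[] and column2var[]; nothing here is taken from Eden's actual project.

```rust
/// Column-oriented table: one vector per column, plus a vector of row labels.
/// The two f64 columns are an assumption made for illustration.
struct Table {
    row_labels: Vec<String>,
    column1: Vec<f64>,
    column2: Vec<f64>,
}

impl Table {
    /// Linear search for the row index n. This is O(rows), not constant time.
    fn row_index(&self, label: &str) -> Option<usize> {
        self.row_labels.iter().position(|l| l.as_str() == label)
    }

    /// Once n is known, each column lookup is a constant-time index.
    fn row(&self, label: &str) -> Option<(f64, f64)> {
        let n = self.row_index(label)?;
        Some((self.column1[n], self.column2[n]))
    }
}
```

Note that the number of columns is baked into the struct definition, so this only works when the column count is known when the code is written.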

@trwnh You can do that if you know how many columns you have ahead of time, as Rust won't allow you to dynamically create additional variables at runtime. That's why you need to either go with Vector of Vector or a really long Vector.

But I do currently go with the vectors for the columns not the rows, it cuts down on a lot of work :)
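As a rough illustration of the "really long Vector" option with columns stored together, here is a sketch that keeps every value in one `Vec<f64>` and computes offsets by hand. `FlatTable` and its methods are hypothetical names, all values are assumed to be floats, and every row is assumed to have the same length as the first; this is not Eden's actual implementation.

```rust
/// Column-major table stored in one long vector.
struct FlatTable {
    n_rows: usize,
    n_cols: usize,
    data: Vec<f64>, // column 0's values, then column 1's, and so on
}

impl FlatTable {
    /// Build from row-oriented input; the column count is read from the
    /// first row at runtime. Panics if a later row is shorter.
    fn from_rows(rows: &[Vec<f64>]) -> FlatTable {
        let n_rows = rows.len();
        let n_cols = rows.first().map_or(0, |r| r.len());
        let mut data = Vec::with_capacity(n_rows * n_cols);
        for c in 0..n_cols {
            for row in rows {
                data.push(row[c]);
            }
        }
        FlatTable { n_rows, n_cols, data }
    }

    /// Borrow a whole column as one contiguous slice.
    fn column(&self, c: usize) -> &[f64] {
        assert!(c < self.n_cols, "column index out of range");
        &self.data[c * self.n_rows..(c + 1) * self.n_rows]
    }

    /// Look up a single value by row and column index.
    fn get(&self, row: usize, c: usize) -> f64 {
        assert!(row < self.n_rows && c < self.n_cols, "index out of range");
        self.data[c * self.n_rows + row]
    }
}
```

Because a column is a contiguous slice, pulling out a whole column needs no copying and no nested vectors.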

@eden why would you need to know how many columns there are? maybe how many rows there are...

also i only meant array in an organizational sense, so if Vector is an array that can grow, then yeah, i'd use that to prevent having to make new arrays every time the dataset added a new row. but the important bit is being able to use the row index for constant lookup times. (of course, getting the row index might not be constant time...)

@trwnh You said to create arrays for your columns, which implies a type of

let x: Vec<i32> = (column one data)

Which would fit with extending your array every time you add a row.

But you'd need to repeat that line for each column. Which would require you to know how many columns you have ahead of time, as you can't metaprogram assigning a variable. You also struggle with putting all rows into one variable, as CSVs allow multiple types. I went with two vectors, one for integers and one for floats.

@eden hmm, couldn't you use a for loop? for every row, store the column values in the appropriate structure. That could get unwieldy if you have many columns, yeah... but i was thinking you could get the number of columns in the first row, then use a nested for loop. type inference could be done later, if you didn't know the dataset beforehand.
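The nested loop described here can be sketched roughly as follows, assuming a simple comma-separated input with no header row and no quoted fields. `read_columns` is a hypothetical helper, and it keeps every field as a raw string so that deciding on types can happen in a later step.

```rust
/// Split CSV text into columns of raw strings.
/// The column count is taken from the first row at runtime.
fn read_columns(text: &str) -> Vec<Vec<String>> {
    let mut lines = text.lines();
    let first = match lines.next() {
        Some(line) => line,
        None => return Vec::new(),
    };
    let n_cols = first.split(',').count();
    // One (initially empty) vector per column, held inside an outer vector.
    let mut columns: Vec<Vec<String>> = vec![Vec::new(); n_cols];
    // Outer loop over rows, inner loop over the fields of each row.
    for line in std::iter::once(first).chain(lines) {
        // Fields beyond n_cols are ignored; short rows leave columns uneven.
        for (c, field) in line.split(',').take(n_cols).enumerate() {
            columns[c].push(field.trim().to_string());
        }
    }
    columns
}
```

This sidesteps the need for one named variable per column, at the cost of going back to vectors inside a vector.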

@eden i should reiterate that i'm only offering naive answers though, because i don't know enough about the problem and i'm not really a programmer, i just studied computer science for 4 years

@trwnh You can't really do type inference at runtime in Rust! You have to know every type upfront, at compile time.

Let's imagine it was a CSV of just ints and floats, which is what it is. There are six columns. So I want a structure that can fit that:

struct {
    col 6
}

But I don't have any type annotations, so Rust will throw an error. I could say it's generic using <T>, but then *all* the columns would have to be type T, and some might be int and some are float.
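One commonly used alternative to a single generic <T>, sketched here as an assumption and not as what this thread settled on, is an enum with one variant per column type. The names `Column` and `infer_column` are hypothetical. The compiler still knows every possible type upfront, but which variant a given column ends up as is decided while the program runs.

```rust
/// A column is either all integers or all floats.
#[derive(Debug, PartialEq)]
enum Column {
    Int(Vec<i64>),
    Float(Vec<f64>),
}

/// Decide a column's type at runtime from its raw text fields.
/// Returns None if some field is neither an integer nor a float.
fn infer_column(raw: &[&str]) -> Option<Column> {
    // First try to read every field as an integer.
    let ints: Result<Vec<i64>, _> =
        raw.iter().map(|s| s.trim().parse::<i64>()).collect();
    if let Ok(values) = ints {
        return Some(Column::Int(values));
    }
    // Otherwise fall back to reading every field as a float.
    let floats: Result<Vec<f64>, _> =
        raw.iter().map(|s| s.trim().parse::<f64>()).collect();
    floats.ok().map(Column::Float)
}
```

A six-column table could then be a `Vec<Column>`, where some entries are `Column::Int` and others are `Column::Float`.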

So the nested for loop issue runs afoul of metaprogramming.

ideally you want something like

for col in columns {
    'col' + col.name = col.values
}

And the names of col{1..6} are given by the number of columns I pass.
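Rust can't create variables named at runtime, but a rough stand-in for the pseudocode above is a map from generated names to column data. This is only a sketch: `name_columns` is a hypothetical helper, and it assumes every column holds floats.

```rust
use std::collections::HashMap;

/// Give each column a generated name ("col1", "col2", ...) and store the
/// columns in a map, so the number of names is decided at runtime.
fn name_columns(columns: Vec<Vec<f64>>) -> HashMap<String, Vec<f64>> {
    columns
        .into_iter()
        .enumerate()
        .map(|(i, values)| (format!("col{}", i + 1), values))
        .collect()
}
```

Looking up `named["col1"]` then plays the role of the variable col1, with a hash lookup on every access as the price.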

@trwnh But Rust hates this too.

It doesn't know how many columns are going to be allocated like this, it certainly doesn't know what types to give them, and it doesn't know when they'll go away.

The big thing for Rust is memory safety, so if you allocate something, the compiler has to work out when it'll be de-allocated.

The compiler right now can't work out how many things will be allocated, so it can't work out the opposite either. So it's not passable in Rust at all.