Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ concat tables to create null columns #2452

Open
joocer opened this issue Feb 24, 2025 · 0 comments
Open

✨ concat tables to create null columns #2452

joocer opened this issue Feb 24, 2025 · 0 comments

Comments

@joocer
Copy link
Contributor

joocer commented Feb 24, 2025

If you concatenate a table with 100 rows and 3 columns with a table with 0 rows and 5 columns using pyarrow.concat_tables, the result will be as follows:

What Happens:

  1. Column Matching by Name:

Columns are matched by name across both tables.

If a column is present in one table but not the other, it is filled with null values in the table that lacks it.

  1. Schema Unification:

The resulting table will have columns from both tables.

Columns unique to the second table (0 rows) will be added to the first table with null values.

  1. Row Count:

The resulting table will have the total number of rows from all concatenated tables, which in this case is 100 rows (since the second table has 0 rows).

Example:

import pyarrow as pa

Table with 100 rows and 3 columns

table1 = pa.table({
'a': list(range(100)),
'b': [x * 2 for x in range(100)],
'c': [x % 2 for x in range(100)]
})

Table with 0 rows and 5 columns

table2 = pa.table({
'a': pa.array([], type=pa.int32()),
'b': pa.array([], type=pa.int32()),
'c': pa.array([], type=pa.int32()),
'd': pa.array([], type=pa.string()),
'e': pa.array([], type=pa.float32())
})

Concatenate tables

result = pa.concat_tables([table1, table2])

print(result.schema)
print(result)

Output:

a: int64
b: int64
c: int64
d: string
e: float32

a: [[0, 1, 2, ..., 99]]
b: [[0, 2, 4, ..., 198]]
c: [[0, 1, 0, ..., 1]]
d: [[null, null, ..., null]]
e: [[null, null, ..., null]]

Key Observations:

  1. Schema Resolution:

The resulting schema is a union of the schemas from both tables.

Columns d and e are added from table2, even though they don't exist in table1.

  1. Filling with Nulls:

Columns d and e are filled with null values in table1's rows.

Columns a, b, and c are unchanged and retain their values.

  1. Row Count:

The resulting table has 100 rows, matching the non-empty table's row count.

Why This Happens:

pyarrow.concat_tables aligns the columns by name and fills in null values where columns are missing.

This allows you to concatenate tables with different schemas without losing any columns.

Practical Use Case:

This behavior is useful when:

You have sparse data where not all columns are present in every partition.

You're combining data from multiple sources with slightly different schemas.

Would you like more details or examples?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant