Skip to content

HIVE-29551: Avoid quadratic runtime in ColumnStatsSemanticAnalyzer#ge…#6443

Open
tanishq-chugh wants to merge 8 commits intoapache:masterfrom
tanishq-chugh:HIVE-29551
Open

HIVE-29551: Avoid quadratic runtime in ColumnStatsSemanticAnalyzer#ge…#6443
tanishq-chugh wants to merge 8 commits intoapache:masterfrom
tanishq-chugh:HIVE-29551

Conversation

@tanishq-chugh
Copy link
Copy Markdown
Contributor

…tColumnTypes

What changes were proposed in this pull request?

Improve time complexity in ColumnStatsSemanticAnalyzer#getColumnTypes

Why are the changes needed?

Performance Improvement

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manual Testing + CI

if (typeInfo.getCategory() != ObjectInspector.Category.PRIMITIVE) {
logTypeWarning(colName, type);
} else {
nonPrimColNames.add(colName);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the variable name should be PrimColNames instead of nonPrimColNames. As the primitve type will enter the else flow.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Aggarwal-Raghav My bad, i validated the columnTypes/names being returned for primitive types and used the wrong variable name. Updated in commit - 4a6804d .
Thanks for pointing this out !

} else {
colTypes.add(type);
}
Map<String, String> colTypeMap = new HashMap<>();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! When I created HIVE-29551, I had in mind do it without a HashMap if possible. There are two types of usages, depending on where the column names came from:

  • ColumnStatsSemanticAnalyzer#getColumnName
  • Utilities.getColumnNamesFromFieldSchema
    The latter iterates over a list of FieldSchema, so the type info can be obtained from these items as well.

The HashMap is only needed when the ASTNode has 3 children.

Copy link
Copy Markdown
Contributor

@thomasrebele thomasrebele left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the refactoring! I've got some ideas to simplify the code, aiming to make it easier to maintain the code of ColumnStatsSemanticAnalyzer in the future.

return rwt;
}

private record StatsEligibleColumns(List<String> columnNames, List<String> columnTypes) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of creating a new type, could you please use List<FieldSchema>, which contains both the name and the type of the column?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this change in commit - 84d81f9

return new StatsEligibleColumns(colNames, colTypes);
}

private List<String> getColumnName(ASTNode tree) throws SemanticException {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to rename the function, maybe "getExplictColumnNames", though there may be a better name.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed the function to getExplicitColumnNamesFromAst in commit - 84d81f9

Comment on lines +245 to +246
colNames.clear();
colNames.addAll(primColNames);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modifying the argument can be avoided when implementing my other comments.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the code has been updated such that modifying this argument is avoided, in commit - 84d81f9

}

protected static List<String> getColumnTypes(Table tbl, List<String> colNames) {
protected static List<String> getColumnTypesByName(Table tbl, List<String> colNames) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend to refactor getColumnTypesByName to return List<FieldSchema>.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this change in commit - 84d81f9

colNames = statsCols.columnNames();
} else {
colNames = getColumnName(ast);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The handling of the AST should stay at once place to avoid code duplication here and in #rewriteAST. Maybe a new method List<FieldSchema> getColumns(ASTNode). To keep the behavior the same, I would do roughly the following:

  1. Collect the column names as string using the original method
  2. Verify the names with checkForPartitionColumns and validateSpecifiedColumnNames (and removing the calls to these functions in ColumnStatsSemanticAnalyzer#rewriteAST and ColumnStatsSemanticAnalyzer#analyze)
  3. Collect the columns as List<FieldSchema>
  4. The caller extracts the names (with org.apache.hadoop.hive.ql.exec.Utilities#getColumnNamesFromFieldSchema) and the types (I don't know of an existing function, at least I couldn't find one in Utilities).

This approach avoids the need to modify the column names later, and should make the code easier to understand. It would be nice (if that optimization does not make the code too complex) to optimize the case ast.getChildCount() == 2, so that step 1 and 3 only collect the columns once.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out @thomasrebele !
And, yes this definitely makes more sense and helps to keep code clean. I have made all these changes in commit - 84d81f9

default:
if (tree.getChildCount() != 3) {
throw new SemanticException("Internal error. Expected number of children of ASTNode to be"
+ " either 2 or 3. Found : " + tree.getChildCount());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we modify the method that way, the expected number of children is 3, so the exception message would need to be changed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the exception message in commit - 84d81f9

@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants