FIX: [PERF] Handle utf16 strings using simdutf and std::u16string#526
FIX: [PERF] Handle utf16 strings using simdutf and std::u16string#526ffelixg wants to merge 6 commits intomicrosoft:mainfrom
Conversation
|
I have been evaluating the external library simdutf as a high performance replacement for utf16 -> utf8 conversions, i.e. the functions SQLWCHARToWString, WideToUTF8 and Utf8ToWString. Rather than only using it for the arrow fetch path, I have been trying to make the switch for every location where one of these three functions is used, as the applications follow similar patterns. I didn't use simdutf in every case, another way to eliminate these function calls was to use std::u16string instead of std::wstring when passing strings to/from python. I think this avoids the whole issue where wchars are defined as 32 bit on some OSes but SQLWCHARs are always 16 bit. This brings arrow performance on linux for nvarchars in line with what it should be. If the std::u16string type works as well as I hope it does (will have to see what CI says about mac), there are some more spots where it could be used, for example to replace |
There was a problem hiding this comment.
Pull request overview
This PR introduces a higher-performance and more platform-consistent UTF-16LE → UTF-8 conversion path in the pybind ODBC layer by adopting simdutf and shifting several wide-string interfaces from std::wstring to std::u16string (to avoid wchar_t width differences across OSes).
Changes:
- Add
simdutf(viafind_packageorFetchContent) and use it for UTF-16LE → UTF-8 conversions in diagnostics and data fetch paths. - Replace a number of
std::wstringusages withstd::u16stringfor connection strings, queries, and SQL_C_WCHAR parameter buffers. - Remove legacy SQLWCHAR→wstring conversion utilities that are no longer used.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
mssql_python/pybind/CMakeLists.txt |
Adds simdutf dependency resolution and links it into the ddbc_bindings module. |
mssql_python/pybind/ddbc_bindings.h |
Introduces UTF-16 helpers (dupeSqlWCharAsUtf16Le, utf16LeToUtf8Alloc) and changes ErrorInfo to store UTF-8 std::string. |
mssql_python/pybind/ddbc_bindings.cpp |
Switches parameter binding, diagnostics, query execution, and fetch conversions to UTF-16 + simdutf. |
mssql_python/pybind/connection/connection.h |
Updates connection APIs/state to store connection strings as std::u16string. |
mssql_python/pybind/connection/connection.cpp |
Uses UTF-16 connection string/query handling and returns UTF-8 error messages directly. |
mssql_python/pybind/connection/connection_pool.h |
Updates pooling key types and APIs to use std::u16string. |
mssql_python/pybind/connection/connection_pool.cpp |
Implements pooling with std::u16string connection string keys. |
mssql_python/pybind/unix_utils.h |
Removes the SQLWCHARToWString declaration. |
mssql_python/pybind/unix_utils.cpp |
Removes the SQLWCHARToWString implementation. |
Comments suppressed due to low confidence (1)
mssql_python/pybind/ddbc_bindings.cpp:583
- In the SQL_C_WCHAR non-DAE path, the bound buffer is a std::u16string but the bind call uses
data()+SQL_NTSand setsbufferLengthtosize() * sizeof(SQLWCHAR). ForSQL_NTS, BufferLength should include the null terminator, and the pointer should be a null-terminated buffer (preferc_str()). As written, drivers that validate BufferLength may treat this as truncated or read past the provided length.
Suggested fix: use sqlwcharBuffer->c_str() and set bufferLength to (sqlwcharBuffer->size() + 1) * sizeof(SQLWCHAR), or alternatively set *strLenOrIndPtr to the explicit byte length (excluding terminator) and keep BufferLength consistent.
std::u16string* sqlwcharBuffer = AllocateParamBuffer<std::u16string>(
paramBuffers, param.cast<std::u16string>());
LOG("BindParameters: param[%d] SQL_C_WCHAR - String "
"length=%zu characters, buffer=%zu bytes",
paramIndex, sqlwcharBuffer->size(), sqlwcharBuffer->size() * sizeof(SQLWCHAR));
dataPtr = sqlwcharBuffer->data();
bufferLength = sqlwcharBuffer->size() * sizeof(SQLWCHAR);
strLenOrIndPtr = AllocateParamBuffer<SQLLEN>(paramBuffers);
*strLenOrIndPtr = SQL_NTS;
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
📊 Code Coverage Report
Diff CoverageDiff: main...HEAD, staged and unstaged changes
Summary
mssql_python/pybind/ddbc_bindings.cppLines 1487-1495 1487 LOG("SQLCheckError: Checking ODBC errors - handleType=%d, retcode=%d", handleType, retcode);
1488 ErrorInfo errorInfo;
1489 if (retcode == SQL_INVALID_HANDLE) {
1490 LOG("SQLCheckError: SQL_INVALID_HANDLE detected - handle is invalid");
! 1491 errorInfo.ddbcErrorMsg = "Invalid handle!";
1492 return errorInfo;
1493 }
1494 assert(handle != 0);
1495 SQLHANDLE rawHandle = handle->get();Lines 1562-1570 1562 return records;
1563 }
1564
1565 // Wrap SQLExecDirect
! 1566 SQLRETURN SQLExecDirect_wrap(SqlHandlePtr StatementHandle, const std::u16string& Query) {
1567 LOG("SQLExecDirect: Executing query directly - statement_handle=%p, "
1568 "query_length=%zu chars",
1569 (void*)StatementHandle->get(), Query.length());
1570 if (!SQLExecDirect_ptr) {Lines 1579-1587 1579 SQLSetStmtAttr_ptr(StatementHandle->get(), SQL_ATTR_CONCURRENCY,
1580 (SQLPOINTER)SQL_CONCUR_READ_ONLY, 0);
1581 }
1582
! 1583 SQLWCHAR* queryPtr = reinterpretU16stringAsSqlWChar(Query);
1584 SQLRETURN ret = SQLExecDirect_ptr(StatementHandle->get(), queryPtr, SQL_NTS);
1585 if (!SQL_SUCCEEDED(ret)) {
1586 LOG("SQLExecDirect: Query execution failed - SQLRETURN=%d", ret);
1587 }Lines 3624-3632 3624 sizeof(DateTimeOffset) * fetchSize,
3625 buffers.indicators[col - 1].data());
3626 break;
3627 default:
! 3628 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
3629 std::ostringstream errorString;
3630 errorString << "Unsupported data type for column - " << columnName.c_str()
3631 << ", Type - " << dataType << ", column ID - " << col;
3632 LOG("SQLBindColums: %s", errorString.str().c_str());Lines 3633-3641 3633 ThrowStdException(errorString.str());
3634 break;
3635 }
3636 if (!SQL_SUCCEEDED(ret)) {
! 3637 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
3638 std::ostringstream errorString;
3639 errorString << "Failed to bind column - " << columnName.c_str() << ", Type - "
3640 << dataType << ", column ID - " << col;
3641 LOG("SQLBindColums: %s", errorString.str().c_str());Lines 3961-3969 3961 break;
3962 }
3963 default: {
3964 const auto& columnMeta = columnNames[col - 1].cast<py::dict>();
! 3965 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
3966 std::ostringstream errorString;
3967 errorString << "Unsupported data type for column - " << columnName.c_str()
3968 << ", Type - " << dataType << ", column ID - " << col;
3969 LOG("FetchBatchData: %s", errorString.str().c_str());Lines 4063-4071 4063 case SQL_SS_TIMESTAMPOFFSET:
4064 rowSize += sizeof(DateTimeOffset);
4065 break;
4066 default:
! 4067 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
4068 std::ostringstream errorString;
4069 errorString << "Unsupported data type for column - " << columnName.c_str()
4070 << ", Type - " << dataType << ", column ID - " << col;
4071 LOG("calculateRowSize: %s", errorString.str().c_str());mssql_python/pybind/ddbc_bindings.h📋 Files Needing Attention📉 Files with overall lowest coverage (click to expand)mssql_python.pybind.build._deps.simdutf-src.src.haswell.implementation.cpp: 0.8%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.error.h: 2.1%
mssql_python.pybind.build._deps.simdutf-src.src.implementation.cpp: 6.9%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.implementation.h: 10.4%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16.h: 18.2%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16_to_utf8.utf16_to_utf8.h: 28.7%
mssql_python.pybind.build._deps.simdutf-src.src.simdutf.haswell.simd16-inl.h: 33.3%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.build._deps.simdutf-src.src.generic.utf16.utf8_length_from_utf16_bytemask.h: 61.5%
mssql_python.pybind.ddbc_bindings.h: 64.6%🔗 Quick Links
|
|
Only issue seems to be that some Linux CI containers don't have git. I'm trying to fetch simdutf via url instead. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@gargsaumya The second CI error was due to the ubuntu image using an older version of cmake. Should be fine now I hope. I have also replaced the remaining std::wstring occurences with std::u16string and eliminated helper functions as well as platform specific code accordingly. Let me know if you are happy with this holistic update to utf16 string handling or if you would prefer to keep it contained to the arrow fetch path. |
|
Thanks for the update @ffelixg. I actually prefer this and it LGTM overall. ButI’ll still take a closer look at the changes and get back to you. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| static_assert(sizeof(SQLWCHAR) == sizeof(char16_t), "SQLWCHAR must be 16-bit"); | ||
|
|
||
| if (length > 0) { | ||
| std::memcpy(utf16.data(), value, length * sizeof(SQLWCHAR)); |
Work Item / Issue Reference
Summary