Skip to content

FIX: [PERF] Handle utf16 strings using simdutf and std::u16string#526

Open
ffelixg wants to merge 6 commits intomicrosoft:mainfrom
ffelixg:simdutf
Open

FIX: [PERF] Handle utf16 strings using simdutf and std::u16string#526
ffelixg wants to merge 6 commits intomicrosoft:mainfrom
ffelixg:simdutf

Conversation

@ffelixg
Copy link
Copy Markdown
Contributor

@ffelixg ffelixg commented Apr 15, 2026

Work Item / Issue Reference

GitHub Issue: #514


Summary

  • Use external dependency simdutf to decode utf16 when decoding in c++
  • Use pybind11 support for std::u16string to decode utf16 when interacting with python
  • Remove SQLWCHARToWString, WideToUTF8 and Utf8ToWString, refactor call sites

Copilot AI review requested due to automatic review settings April 15, 2026 23:50
@ffelixg
Copy link
Copy Markdown
Contributor Author

ffelixg commented Apr 15, 2026

I have been evaluating the external library simdutf as a high performance replacement for utf16 -> utf8 conversions, i.e. the functions SQLWCHARToWString, WideToUTF8 and Utf8ToWString. Rather than only using it for the arrow fetch path, I have been trying to make the switch for every location where one of these three functions is used, as the applications follow similar patterns. I didn't use simdutf in every case, another way to eliminate these function calls was to use std::u16string instead of std::wstring when passing strings to/from python. I think this avoids the whole issue where wchars are defined as 32 bit on some OSes but SQLWCHARs are always 16 bit.

This brings arrow performance on linux for nvarchars in line with what it should be.

If the std::u16string type works as well as I hope it does (will have to see what CI says about mac), there are some more spots where it could be used, for example to replace WStringToSQLWCHAR

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a higher-performance and more platform-consistent UTF-16LE → UTF-8 conversion path in the pybind ODBC layer by adopting simdutf and shifting several wide-string interfaces from std::wstring to std::u16string (to avoid wchar_t width differences across OSes).

Changes:

  • Add simdutf (via find_package or FetchContent) and use it for UTF-16LE → UTF-8 conversions in diagnostics and data fetch paths.
  • Replace a number of std::wstring usages with std::u16string for connection strings, queries, and SQL_C_WCHAR parameter buffers.
  • Remove legacy SQLWCHAR→wstring conversion utilities that are no longer used.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
mssql_python/pybind/CMakeLists.txt Adds simdutf dependency resolution and links it into the ddbc_bindings module.
mssql_python/pybind/ddbc_bindings.h Introduces UTF-16 helpers (dupeSqlWCharAsUtf16Le, utf16LeToUtf8Alloc) and changes ErrorInfo to store UTF-8 std::string.
mssql_python/pybind/ddbc_bindings.cpp Switches parameter binding, diagnostics, query execution, and fetch conversions to UTF-16 + simdutf.
mssql_python/pybind/connection/connection.h Updates connection APIs/state to store connection strings as std::u16string.
mssql_python/pybind/connection/connection.cpp Uses UTF-16 connection string/query handling and returns UTF-8 error messages directly.
mssql_python/pybind/connection/connection_pool.h Updates pooling key types and APIs to use std::u16string.
mssql_python/pybind/connection/connection_pool.cpp Implements pooling with std::u16string connection string keys.
mssql_python/pybind/unix_utils.h Removes the SQLWCHARToWString declaration.
mssql_python/pybind/unix_utils.cpp Removes the SQLWCHARToWString implementation.
Comments suppressed due to low confidence (1)

mssql_python/pybind/ddbc_bindings.cpp:583

  • In the SQL_C_WCHAR non-DAE path, the bound buffer is a std::u16string but the bind call uses data() + SQL_NTS and sets bufferLength to size() * sizeof(SQLWCHAR). For SQL_NTS, BufferLength should include the null terminator, and the pointer should be a null-terminated buffer (prefer c_str()). As written, drivers that validate BufferLength may treat this as truncated or read past the provided length.

Suggested fix: use sqlwcharBuffer->c_str() and set bufferLength to (sqlwcharBuffer->size() + 1) * sizeof(SQLWCHAR), or alternatively set *strLenOrIndPtr to the explicit byte length (excluding terminator) and keep BufferLength consistent.

                    std::u16string* sqlwcharBuffer = AllocateParamBuffer<std::u16string>(
                        paramBuffers, param.cast<std::u16string>());
                    LOG("BindParameters: param[%d] SQL_C_WCHAR - String "
                        "length=%zu characters, buffer=%zu bytes",
                        paramIndex, sqlwcharBuffer->size(), sqlwcharBuffer->size() * sizeof(SQLWCHAR));
                    dataPtr = sqlwcharBuffer->data();
                    bufferLength = sqlwcharBuffer->size() * sizeof(SQLWCHAR);
                    strLenOrIndPtr = AllocateParamBuffer<SQLLEN>(paramBuffers);
                    *strLenOrIndPtr = SQL_NTS;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mssql_python/pybind/ddbc_bindings.h
@gargsaumya
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread mssql_python/pybind/ddbc_bindings.h Fixed
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 16, 2026

📊 Code Coverage Report

🔥 Diff Coverage

94%


🎯 Overall Coverage

25%


📈 Total Lines Covered: 6902 out of 26791
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/pybind/connection/connection.cpp (100%)
  • mssql_python/pybind/ddbc_bindings.cpp (94.3%): Missing lines 1491,1566,1583,3628,3637,3965,4067
  • mssql_python/pybind/ddbc_bindings.h (95.5%): Missing lines 47

Summary

  • Total: 150 lines
  • Missing: 8 lines
  • Coverage: 94%

mssql_python/pybind/ddbc_bindings.cpp

Lines 1487-1495

  1487     LOG("SQLCheckError: Checking ODBC errors - handleType=%d, retcode=%d", handleType, retcode);
  1488     ErrorInfo errorInfo;
  1489     if (retcode == SQL_INVALID_HANDLE) {
  1490         LOG("SQLCheckError: SQL_INVALID_HANDLE detected - handle is invalid");
! 1491         errorInfo.ddbcErrorMsg = "Invalid handle!";
  1492         return errorInfo;
  1493     }
  1494     assert(handle != 0);
  1495     SQLHANDLE rawHandle = handle->get();

Lines 1562-1570

  1562     return records;
  1563 }
  1564 
  1565 // Wrap SQLExecDirect
! 1566 SQLRETURN SQLExecDirect_wrap(SqlHandlePtr StatementHandle, const std::u16string& Query) {
  1567     LOG("SQLExecDirect: Executing query directly - statement_handle=%p, "
  1568         "query_length=%zu chars",
  1569         (void*)StatementHandle->get(), Query.length());
  1570     if (!SQLExecDirect_ptr) {

Lines 1579-1587

  1579         SQLSetStmtAttr_ptr(StatementHandle->get(), SQL_ATTR_CONCURRENCY,
  1580                            (SQLPOINTER)SQL_CONCUR_READ_ONLY, 0);
  1581     }
  1582 
! 1583     SQLWCHAR* queryPtr = reinterpretU16stringAsSqlWChar(Query);
  1584     SQLRETURN ret = SQLExecDirect_ptr(StatementHandle->get(), queryPtr, SQL_NTS);
  1585     if (!SQL_SUCCEEDED(ret)) {
  1586         LOG("SQLExecDirect: Query execution failed - SQLRETURN=%d", ret);
  1587     }

Lines 3624-3632

  3624                                      sizeof(DateTimeOffset) * fetchSize,
  3625                                      buffers.indicators[col - 1].data());
  3626                 break;
  3627             default:
! 3628                 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
  3629                 std::ostringstream errorString;
  3630                 errorString << "Unsupported data type for column - " << columnName.c_str()
  3631                             << ", Type - " << dataType << ", column ID - " << col;
  3632                 LOG("SQLBindColums: %s", errorString.str().c_str());

Lines 3633-3641

  3633                 ThrowStdException(errorString.str());
  3634                 break;
  3635         }
  3636         if (!SQL_SUCCEEDED(ret)) {
! 3637             std::string columnName = columnMeta["ColumnName"].cast<std::string>();
  3638             std::ostringstream errorString;
  3639             errorString << "Failed to bind column - " << columnName.c_str() << ", Type - "
  3640                         << dataType << ", column ID - " << col;
  3641             LOG("SQLBindColums: %s", errorString.str().c_str());

Lines 3961-3969

  3961                     break;
  3962                 }
  3963                 default: {
  3964                     const auto& columnMeta = columnNames[col - 1].cast<py::dict>();
! 3965                     std::string columnName = columnMeta["ColumnName"].cast<std::string>();
  3966                     std::ostringstream errorString;
  3967                     errorString << "Unsupported data type for column - " << columnName.c_str()
  3968                                 << ", Type - " << dataType << ", column ID - " << col;
  3969                     LOG("FetchBatchData: %s", errorString.str().c_str());

Lines 4063-4071

  4063             case SQL_SS_TIMESTAMPOFFSET:
  4064                 rowSize += sizeof(DateTimeOffset);
  4065                 break;
  4066             default:
! 4067                 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
  4068                 std::ostringstream errorString;
  4069                 errorString << "Unsupported data type for column - " << columnName.c_str()
  4070                             << ", Type - " << dataType << ", column ID - " << col;
  4071                 LOG("calculateRowSize: %s", errorString.str().c_str());

mssql_python/pybind/ddbc_bindings.h

  43 #endif
  44 
  45 inline std::string utf16LeToUtf8Alloc(std::u16string utf16) {
  46     if (utf16.empty()) {
! 47         return {};
  48     }
  49 
  50     simdutf::result utf8Length =
  51         simdutf::utf8_length_from_utf16le_with_replacement(utf16.data(), utf16.size());


📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)
mssql_python.pybind.build._deps.simdutf-src.src.haswell.implementation.cpp: 0.8%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.error.h: 2.1%
mssql_python.pybind.build._deps.simdutf-src.src.implementation.cpp: 6.9%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.implementation.h: 10.4%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16.h: 18.2%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16_to_utf8.utf16_to_utf8.h: 28.7%
mssql_python.pybind.build._deps.simdutf-src.src.simdutf.haswell.simd16-inl.h: 33.3%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.build._deps.simdutf-src.src.generic.utf16.utf8_length_from_utf16_bytemask.h: 61.5%
mssql_python.pybind.ddbc_bindings.h: 64.6%

🔗 Quick Links

⚙️ Build Summary 📋 Coverage Details

View Azure DevOps Build

Browse Full Coverage Report

@ffelixg
Copy link
Copy Markdown
Contributor Author

ffelixg commented Apr 16, 2026

Only issue seems to be that some Linux CI containers don't have git. I'm trying to fetch simdutf via url instead.

@gargsaumya
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@ffelixg ffelixg changed the title PERF: Use simdutf for decoding utf16 PERF: Handle utf16 strings using simdutf and std::u16string Apr 19, 2026
@ffelixg
Copy link
Copy Markdown
Contributor Author

ffelixg commented Apr 19, 2026

@gargsaumya The second CI error was due to the ubuntu image using an older version of cmake. Should be fine now I hope.

I have also replaced the remaining std::wstring occurences with std::u16string and eliminated helper functions as well as platform specific code accordingly. Let me know if you are happy with this holistic update to utf16 string handling or if you would prefer to keep it contained to the arrow fetch path.

@gargsaumya
Copy link
Copy Markdown
Contributor

Thanks for the update @ffelixg. I actually prefer this and it LGTM overall. ButI’ll still take a closer look at the changes and get back to you.
I’ve approved the workflow for now so we can run the pipeline and get a passing test run.

@gargsaumya
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

static_assert(sizeof(SQLWCHAR) == sizeof(char16_t), "SQLWCHAR must be 16-bit");

if (length > 0) {
std::memcpy(utf16.data(), value, length * sizeof(SQLWCHAR));
@gargsaumya gargsaumya changed the title PERF: Handle utf16 strings using simdutf and std::u16string FIX: [PERF] Handle utf16 strings using simdutf and std::u16string Apr 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants