Skip to content

C#: Improvements to buildless extraction#3136

Merged
hvitved merged 5 commits intogithub:masterfrom
calumgrant:cs/buildless-extraction
Apr 7, 2020
Merged

C#: Improvements to buildless extraction#3136
hvitved merged 5 commits intogithub:masterfrom
calumgrant:cs/buildless-extraction

Conversation

@calumgrant
Copy link
Copy Markdown
Contributor

  • Fixes buildless extraction on .NET Core, which was broken due to differences in the way that NuGet packages were downloaded and referenced. The extractor now performs a dotnet restore to a temp directory, then reads the downloaded DLLs from that directory.
  • Reads the structure of .NET Core .csproj files.
  • A few robustness improvements - fixes a few NullReferenceExceptions
  • Extract unknown qualified expressions as "property access" instead of "unknown" expressions because this finds more results in queries.
  • Log more timing information, particularly for build analysis which can be fairly slow due to dotnet restore being slow.
  • Does not log missing types from DLLs because this is normal and does not affect source code extraction.

Fixes https://github.com/github/codeql-csharp-team/issues/14

@calumgrant calumgrant requested a review from a team as a code owner March 26, 2020 19:23
@calumgrant calumgrant added the C# label Mar 26, 2020
@calumgrant calumgrant force-pushed the cs/buildless-extraction branch from e7222c7 to 09616d1 Compare March 26, 2020 19:43
@calumgrant calumgrant force-pushed the cs/buildless-extraction branch from 09616d1 to b94b4b7 Compare March 26, 2020 20:40
@github github deleted a comment from lgtm-com bot Mar 27, 2020
@github github deleted a comment from lgtm-com bot Mar 27, 2020
Copy link
Copy Markdown
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome stuff, Calum. Excited to see how far we can get with buildless extraction.

Comment thread csharp/extractor/Semmle.Extraction.CSharp.Standalone/BuildAnalysis.cs Outdated
Comment thread csharp/extractor/Semmle.Extraction.CSharp.Standalone/BuildAnalysis.cs Outdated
Comment thread csharp/extractor/Semmle.Extraction.CSharp.Standalone/BuildAnalysis.cs Outdated
Comment thread csharp/extractor/Semmle.Extraction.CSharp.Standalone/CsProjFile.cs
return new MemberAccess(info.SetKind(ExprKind.UNKNOWN), expression, symbol);
// Default to property access - this can still give useful results but
// the target of the expression should be checked in QL.
return new MemberAccess(info.SetKind(ExprKind.PROPERTY_ACCESS), expression, symbol);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels wrong to me... It could just as well be a field or something else.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If anything, it feels more like a dynamic member access, but that is also a bit of a hack. Perhaps we need new classes of unknown expressions, i.e.

| 122 = @unknown_member_access_expr
| 123 = @unknown_binary_expr
...

and then have

@member_access_expr = ... | @unknown_member_access_expr;
@bin_op = ... | @unknown_binary_expr;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The choice is between wrong some of the time or missing all of the time. So I would argue that the change I made is a step in the right direction that is unlikely to make things worse. I think your changes seem reasonable but I don't think this PR is the place for them, particularly since dataflow libraries and queries would need to be reworked.

Comment thread csharp/extractor/Semmle.Extraction.CSharp.Standalone/CsProjFile.cs Outdated
Comment thread csharp/extractor/Semmle.Extraction.CSharp.Standalone/CsProjFile.cs Outdated
Comment thread csharp/extractor/Semmle.Extraction.CSharp.Standalone/DotNet.cs Outdated
Comment thread csharp/extractor/Semmle.Extraction.CSharp.Standalone/ProgressMonitor.cs Outdated
Copy link
Copy Markdown
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a ConcurrentHashSet wrapper to Semmle.Util, and it also appears that you forgot to add the new files.

private readonly HashSet<string> usedSources = new HashSet<string>();
private readonly HashSet<string> missingSources = new HashSet<string>();
private readonly Dictionary<string, string> unresolvedReferences = new Dictionary<string, string>();
private readonly IDictionary<string, bool> usedReferences = new ConcurrentDictionary<string, bool>();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should add this to Semmle.Util and use instead:

using System.Collections;
using System.Collections.Concurrent;
using System.Collections.Generic;

namespace Semmle.Util
{
    public class ConcurrentHashSet<T> : IEnumerable<T>
    {
        private readonly ConcurrentDictionary<T, object> dict;
        public ConcurrentHashSet()
        {
            dict = new ConcurrentDictionary<T, object>();
        }

        public void Add(T t) => dict[t] = null;

        public bool Contains(T t) => dict.ContainsKey(t);

        public IEnumerator<T> GetEnumerator() => dict.Keys.GetEnumerator();

        IEnumerator IEnumerable.GetEnumerator() => dict.Keys.GetEnumerator();
    }
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reservation with this is that this would be a half-baked implementation, that should also implement

[System.Serializable]
public class ConcurrentHashSet<T> : System.Collections.Generic.ICollection<T>, System.Collections.Generic.IEnumerable<T>, System.Collections.Generic.IReadOnlyCollection<T>, System.Collections.Generic.ISet<T>, System.Runtime.Serialization.IDeserializationCallback, System.Runtime.Serialization.ISerializable

So we should really just reference the nuget package ConcurrentHashSet which I presume is totally correct and bug-free.....

Then it means updating resources/lib which is a bit of a pain.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a half-baked implementation suffices, as we only need to add the things that we actually need. If we ever need, say, to use it as an ICollection<T>, then we can implement it then.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's literally one place where we use ConcurrentDictionary instead of ConcurrentHashSet so I'm really not convinced it's worth the effort. Particularly because we'd need a unit test suite for the new ConcurrentHashSet class.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK 👍

lock(mutex)
missingSources.Add(sourceFile.FullName);
}
sources[sourceFile.FullName] = sourceFile.Exists;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice rewrite.

@hvitved hvitved merged commit 6685a5e into github:master Apr 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants